Abstract
Speaker adaptation of neural TTS models has recently received significant interest, and several studies on the topic have been published. All of them explore adaptation of an initial multi-speaker model trained on a corpus containing tens to hundreds of individual speaker voices.

In this work we focus on the challenging task of TTS voice conversion, where an initial system is trained on single-speaker data and then needs to be adapted to a variety of external speaker voices. The TTS voice conversion setup represents a very important use case: transcribed multi-speaker datasets might be unavailable for many languages, while any TTS technology provider is expected to have at least one suitable single-speaker dataset per supported language.

We present a neural TTS system comprising separate prosody generator and synthesizer DNN models. The system is trained on a high-quality proprietary male speaker dataset. We show that the system models can be converted to a variety of external ordinary male and female voices, as well as to an extremely expressive artist's voice, and we present crowd-based subjective evaluation results.