Publication
ICASSP 2018
Conference paper
Emphatic Speech Prosody Prediction with Deep LSTM Networks
Abstract
Controllable generation of emphasis in speech is desirable for expressive TTS systems utilized in various dialog applications. Usually such models remain voice-specific, and the strength of emphasis can't be readily controlled. In this work we present a flexible emphatic prosody generation model based on Deep Recurrent Neural Networks (DRNN) for controllable word-level emphasis realization. The word emphasis DRNN model was trained on syllable-level piecewise linear prosodic trajectory parameters. A special data preprocessing technique was introduced to enable emphasis strength control, allowing the generation of emphatic prosody trajectories of various strengths. Additionally, we trained a DRNN model generating sentence-level emphasis, i.e., producing whole sentences in a forceful, decisive manner. Both models preserve the quality and naturalness of the baseline TTS output.