Emphatic Speech Prosody Prediction with Deep LSTM Networks
Controllable generation of emphasis in speech is desirable for expressive TTS systems used in various dialog applications. Usually, such models remain voice-specific, and the strength of emphasis cannot be readily controlled. In this work we present a flexible emphatic prosody generation model based on Deep Recurrent Neural Networks (DRNN) for controllable word-level emphasis realization. The word emphasis DRNN model was trained on syllable-level piecewise linear prosodic trajectory parameters. A special data preprocessing technique was introduced to enable emphasis strength control, allowing the generation of emphatic prosody trajectories of various strengths. Additionally, we trained a DRNN model that generates sentence-level emphasis, i.e., produces whole sentences in a forceful, decisive manner. Both models preserve the quality and naturalness of the baseline TTS output.