Emphatic speech prosody prediction with deep LSTM networks

Slava Shechtman; Moran Mordechay

doi:10.1109/ICASSP.2018.8462473

ICASSP 2018

Conference paper

10 Sep 2018

Abstract

Controllable generation of emphasis in speech is desirable for expressive TTS systems used in various dialog applications. Usually, such models remain voice-specific and the strength of emphasis cannot be readily controlled. In this work, we present a flexible emphatic prosody generation model based on Deep Recurrent Neural Networks (DRNN) for controllable word-level emphasis realization. The word emphasis DRNN model was trained on syllable-level piecewise linear prosodic trajectory parameters. A special data preprocessing technique was introduced to enable emphasis strength control, making it possible to generate emphatic prosody trajectories of various strengths. Additionally, we trained a DRNN model that generates sentence-level emphasis, i.e., produces whole sentences in a forceful, decisive manner. Both models preserve the quality and naturalness of the baseline TTS output.