Measuring the Effect of Linguistic Resources on Prosody Modeling for Speech Synthesis

Andrew Rosenberg; Raul Fernandez; Bhuvana Ramabhadran

doi:10.1109/ICASSP.2018.8461719

ICASSP 2018

Conference paper

10 Sep 2018

Abstract

The generation of natural and expressive prosodic contours is an important component of a text-to-speech (TTS) system. In most classical architectures, it relies on a text-analysis processor that extracts prosody-predictive features and passes them to a statistical learning model. These features can range from basic properties of the input string to rich high-level features that may not always be available when developing a TTS system in a new language with sparse computational resources. In this work we investigate how the prosody model of a speech-synthesis system performs as a function of different predictive feature sets that assume access to a certain amount of rich resources. Using objective metrics, we investigate the effect of relaxing the assumptions on input representations for prosody prediction for five languages, and we evaluate the perceptual implications for US English.
