Weakly-supervised phrase assignment from text in a speech-synthesis system using noisy labels
The proper segmentation of an input text string into meaningful intonational phrase units is a fundamental task in the text-processing component of a text-to-speech (TTS) system that generates intelligible and natural synthesis. In this work we look at the creation of a symbolic, phrase-assignment model within the front end (FE) of a North American English TTS system when high-quality labels for supervised learning are unavailable and/or potentially mismatched to the target corpus and domain. We explore a labeling scheme that merges heuristics derived from (i) automatic high-quality phonetic alignments, (ii) linguistic rules, and (iii) a legacy acoustic phrase-labeling system to arrive at a ground truth that can be used to train a bidirectional recurrent neural network model. We evaluate the performance of this model in terms of objective metrics describing categorical phrase assignment within the FE proper, as well as on the effect that these intermediate labels carry onto the TTS back end for the task of continuous prosody prediction (i.e., intonation and duration contours, and pausing). For this second task, we rely on subjective listening tests and demonstrate that the proposed system significantly outperforms a linguistic rules-based baseline for two different synthetic voices.