Combining the flexibility of speech synthesis with the naturalness of pre-recorded audio: A comparison of two approaches to phrase-splicing TTS
Abstract
Many applications of TTS incorporate both unpredictable words, which require the flexibility of TTS, and static phrases, for which the quality of recorded speech is unmatched by TTS. "Phrase-splicing" TTS attempts to provide the optimal combination of the two: it customizes concatenative TTS to such applications by incorporating application-specific recordings at the word or phrase level, while resorting to smaller-unit synthesis to fill the gaps not covered by those recordings. In the past, we have achieved this by using a word-level search on the application-specific recordings followed by a general-purpose TTS search, in our case using sub-phonetic units, to fill the gaps. However, recent trends toward a larger role for large units in general-purpose TTS suggest a single-search approach for phrase splicing. A listening test shows that we achieve at least as high quality with the new one-search algorithm as with the two-search algorithm.
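As a rough illustration of the two-search idea described above, the following sketch first covers the input with application-specific recordings at the word level and then marks the uncovered words as gaps for general-purpose synthesis. It is a minimal sketch under stated assumptions, not the system evaluated in the paper: the greedy longest-match strategy, the function name `plan_two_search`, and the `"recorded"`/`"synthesized"` labels are all hypothetical, and the second (sub-phonetic) search is only stubbed as a label.

```python
from typing import List, Set, Tuple

# (source, words): source is "recorded" or "synthesized" (hypothetical labels)
Segment = Tuple[str, List[str]]


def plan_two_search(words: List[str], recorded: Set[Tuple[str, ...]]) -> List[Segment]:
    """Sketch of a two-pass plan (assumed greedy longest match, not the paper's search).

    Pass 1: cover the input with the longest recorded word sequences available.
    Pass 2 (stubbed): any uncovered words are grouped into gaps that a
    general-purpose TTS search would have to fill.
    """
    max_len = max((len(p) for p in recorded), default=0)
    plan: List[Segment] = []
    i = 0
    while i < len(words):
        match_len = 0
        # Try the longest candidate first, then shorter ones.
        for n in range(min(max_len, len(words) - i), 0, -1):
            if tuple(words[i:i + n]) in recorded:
                match_len = n
                break
        if match_len:
            plan.append(("recorded", words[i:i + match_len]))
            i += match_len
        else:
            # Merge consecutive uncovered words into one gap.
            if plan and plan[-1][0] == "synthesized":
                plan[-1][1].append(words[i])
            else:
                plan.append(("synthesized", [words[i]]))
            i += 1
    return plan


# Hypothetical application inventory and prompt.
recorded_phrases = {("your", "balance", "is"), ("dollars",)}
plan = plan_two_search("your balance is forty two dollars".split(), recorded_phrases)
# plan == [("recorded", ["your", "balance", "is"]),
#          ("synthesized", ["forty", "two"]),
#          ("recorded", ["dollars"])]
```

A one-search variant would instead place the application-specific recordings and the general-purpose units in a single candidate pool and let one search choose among them. That variant is not sketched here.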