Current Status of the IBM Trainable Speech Synthesis System

R. Donovan; A. Ittycheriah; Martin Franz; B. Ramabhadran; E. Eide; M. Viswanathan; R. Bakis; W. Hamza; M.A. Picheny; P. Gleason; T. Rutherfoord; P. Cox; D. Green; E. Janke; S. Revelin; C. Waast; B. Zeller; C. Guenther; J. Kunzmann

SSW 2001

Conference paper

29 Aug 2001

Abstract

This paper describes the current status of the IBM Trainable Speech Synthesis System. The system is a state-of-the-art, trainable, unit-selection-based concatenative speech synthesiser. It uses hidden Markov models (HMMs) to provide a phonetic transcription and HMM state alignment of a database of single-speaker continuous-speech training data. The runtime synthesiser uses the resulting HMM-state-sized segments as its basic synthesis units. To produce a target sentence, it determines which segments to concatenate using decision trees built from the training data and a dynamic programming search that optimises a perceptually motivated cost function. The synthesiser can operate both in general-domain Text-to-Speech mode and in Phrase Splicing mode, which provides higher-quality synthesis in limited domains. Systems have been built in at least 10 different languages and over 70 voices.
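The dynamic programming search mentioned in the abstract can be illustrated with a minimal sketch. This is not the IBM system's implementation: the function name `select_units`, the integer segment identifiers, and both cost functions are hypothetical placeholders. In particular, the abstract does not specify the perceptually motivated cost function, so the sketch assumes a generic per-segment target cost plus a join cost between neighbouring segments, and assumes the candidate lists have already been produced (in the paper, by decision trees).

```python
from typing import Callable, List, Sequence, Tuple


def select_units(
    candidates: Sequence[Sequence[int]],
    target_cost: Callable[[int, int], float],
    join_cost: Callable[[int, int], float],
) -> Tuple[float, List[int]]:
    """Viterbi-style search: choose one candidate segment per position
    so that the summed target and join costs are minimal."""
    if not candidates or any(len(c) == 0 for c in candidates):
        raise ValueError("every position needs at least one candidate")

    # costs[j]: cheapest total cost of any path ending in candidate j
    costs = [target_cost(0, seg) for seg in candidates[0]]
    back: List[List[int]] = []  # back[t-1][j]: best predecessor index

    for t in range(1, len(candidates)):
        prev = candidates[t - 1]
        new_costs, pointers = [], []
        for seg in candidates[t]:
            best_i = min(
                range(len(prev)),
                key=lambda i: costs[i] + join_cost(prev[i], seg),
            )
            pointers.append(best_i)
            new_costs.append(
                costs[best_i] + join_cost(prev[best_i], seg) + target_cost(t, seg)
            )
        costs = new_costs
        back.append(pointers)

    # Trace the cheapest path backwards from the best final candidate.
    j = min(range(len(costs)), key=costs.__getitem__)
    total = costs[j]
    path = [candidates[-1][j]]
    for t in range(len(candidates) - 1, 0, -1):
        j = back[t - 1][j]
        path.append(candidates[t - 1][j])
    path.reverse()
    return total, path


# Hypothetical toy data: segment ids are positions in the training database.
TOY_TARGET = {10: 0.5, 40: 0.0, 11: 0.5, 70: 0.0, 12: 0.5, 71: 0.0}
TOY_CANDIDATES = [[10, 40], [11, 70], [12, 71]]


def toy_target_cost(position: int, seg: int) -> float:
    return TOY_TARGET[seg]


def toy_join_cost(left: int, right: int) -> float:
    # Assumption: segments contiguous in the database join for free.
    return 0.0 if right == left + 1 else 2.0
```

With the toy data, `select_units(TOY_CANDIDATES, toy_target_cost, toy_join_cost)` returns `(1.5, [10, 11, 12])`: the search prefers a run of segments that were contiguous in the database even though each has a slightly worse target cost, which is the usual motivation for including a join term.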
