Publication
ICSLP 1998
Conference paper

THE IBM TRAINABLE SPEECH SYNTHESIS SYSTEM

Abstract

The speech synthesis system described in this paper uses a set of speaker-dependent decision-tree state-clustered hidden Markov models to automatically generate a leaf-level segmentation of a large single-speaker continuous-read-speech database. During synthesis, the phone sequence to be synthesised is converted to an acoustic leaf sequence by descending the HMM decision trees. Duration, energy and pitch values are predicted using separate trainable models. To determine the segment sequence to concatenate, a dynamic programming (d.p.) search is performed over all the waveform segments aligned to each leaf in training. The d.p. attempts to ensure that the selected segments join smoothly in the spectral domain, and have durations, energies and pitches such that the degradation introduced by the subsequent use of TD-PSOLA is minimised. Algorithms embedded within the d.p. can alter the required acoustic leaf sequence, duration and energy values to ensure high-quality synthetic speech. The selected segments are concatenated and modified to have the required prosodic values using the TD-PSOLA algorithm. Through the d.p., the system effectively selects variable-length units, built on its leaf-level framework.
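The segment search described above can be viewed as a Viterbi-style dynamic programme: one candidate waveform segment is chosen per acoustic leaf so that the total cost of spectral joins between consecutive segments, plus the cost of the prosodic modification TD-PSOLA would have to apply, is minimised. The sketch below illustrates that search shape only; the function names, cost functions and data representation are illustrative assumptions, not the paper's actual implementation.

```python
def select_segments(candidates, target_cost, join_cost):
    """Viterbi-style d.p. over candidate segments (illustrative sketch).

    candidates[i]   -- list of training segments aligned to acoustic leaf i
    target_cost(s)  -- penalty for the prosodic (duration/energy/pitch)
                       modification segment s would need
    join_cost(a, b) -- penalty for a poor spectral join between
                       consecutive segments a and b
    Returns the minimum-total-cost sequence, one segment per leaf.
    """
    # cost[j]: best total cost of any path ending in candidates[i][j]
    cost = [target_cost(s) for s in candidates[0]]
    back = []  # back[i-1][j]: best predecessor index for candidates[i][j]
    for i in range(1, len(candidates)):
        new_cost, back_i = [], []
        for s in candidates[i]:
            best_j = min(range(len(candidates[i - 1])),
                         key=lambda j: cost[j] + join_cost(candidates[i - 1][j], s))
            new_cost.append(cost[best_j]
                            + join_cost(candidates[i - 1][best_j], s)
                            + target_cost(s))
            back_i.append(best_j)
        cost = new_cost
        back.append(back_i)
    # Trace back the optimal path from the cheapest final candidate.
    j = min(range(len(cost)), key=lambda j: cost[j])
    path = [j]
    for back_i in reversed(back):
        j = back_i[j]
        path.append(j)
    path.reverse()
    return [candidates[i][j] for i, j in enumerate(path)]
```

With toy numeric "segments", a target cost of distance from an ideal value 5 and a join cost of the gap between neighbours, the search picks the all-5 path: `select_segments([[1, 5], [5, 9], [5]], lambda s: abs(s - 5), lambda a, b: abs(a - b))` returns `[5, 5, 5]`.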
