Conference paper
SYLLABLE-LEVEL DURATION DETERMINATION.
Abstract
Accurate prediction of duration in a text-to-speech system is essential to natural-sounding intonation. Klatt [1] proposed a set of phoneme-based rules to perform this task, but an adaptation of the rule-set to British English [2] accounted for only 68% of the variance in the duration observed in a 4000-syllable test text. Modification of these rules to incorporate foot-level effects [3,4] improved the prediction slightly, to account for 71% of the variance. A similar degree of prediction can be attained, with minimum reference to segment specifics, by modelling duration at the level of the syllable, with sensitivity to stress, position in phrase and foot, and number of segments in onset, peak and coda. This supposes that micro-durational features, such as shortening of segments in clusters and lengthening of vowels to cue voicing, operate at a phonetic level, within the constraints of a syllable frame, and that higher-level features determine factors of lengthening or compression for the framework into which they are to fit. In support of this view, a connectionist implementation is described, with eight input features, one layer of hidden units and one analog output unit, that accounts for an equivalent 70% of the variance in the duration.
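The network shape named in the abstract (eight input features, one layer of hidden units, one analog output unit) can be illustrated with a minimal sketch. Only that shape comes from the abstract; the hidden-layer size, the tanh activation, the random initial weights, the feature encoding and all function names below are assumptions made for illustration, and the sketch is not the authors' implementation. The `variance_accounted_for` helper uses one common definition of the measure (one minus the ratio of residual to total sum of squares); the abstract does not say which definition was used.

```python
import math
import random

N_INPUTS = 8   # from the abstract: eight input features
N_HIDDEN = 4   # assumption: the abstract does not give the hidden-layer size


def init_network(n_inputs=N_INPUTS, n_hidden=N_HIDDEN, seed=0):
    """Build an untrained network with small random weights (hypothetical)."""
    rng = random.Random(seed)
    return {
        "w_hidden": [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs)]
                     for _ in range(n_hidden)],
        "b_hidden": [0.0] * n_hidden,
        "w_out": [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)],
        "b_out": 0.0,
    }


def predict_duration(net, features):
    """Forward pass: inputs -> tanh hidden layer -> one linear (analog) output."""
    if len(features) != len(net["w_hidden"][0]):
        raise ValueError("unexpected number of input features")
    hidden = [
        math.tanh(sum(w * x for w, x in zip(row, features)) + b)
        for row, b in zip(net["w_hidden"], net["b_hidden"])
    ]
    return sum(w * h for w, h in zip(net["w_out"], hidden)) + net["b_out"]


def variance_accounted_for(observed, predicted):
    """Proportion of variance explained: 1 - SS_residual / SS_total."""
    mean = sum(observed) / len(observed)
    ss_total = sum((y - mean) ** 2 for y in observed)
    ss_resid = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    return 1.0 - ss_resid / ss_total
```

In use, each syllable would be encoded as an eight-element feature vector (for example stress, position in phrase and foot, and segment counts for onset, peak and coda, in some numeric coding that is assumed here), passed to `predict_duration`, and the predictions over a test text compared with the observed durations via `variance_accounted_for`. Training the weights is left out of the sketch.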