Publication
IEEE Transactions on Audio, Speech and Language Processing
Paper

Statistical text-to-speech synthesis based on segment-wise representation with a norm constraint

Abstract

In statistical HMM-based text-to-speech (STTS) systems, speech feature dynamics are modeled by first- and second-order frame differences of the features, which typically do not adequately represent the frame-to-frame dynamics present in natural speech. The reduced dynamics result in over-smoothing of the speech features, which is often perceived as muffled synthesized speech. In this correspondence, we propose a method to enhance a baseline STTS system by introducing a segment-wise model representation with a norm constraint. The segment-wise representation provides additional degrees of freedom in determining the speech features. We exploit these degrees of freedom to increase the speech feature vector norm so that it matches a norm constraint. As a result, the statistically generated speech features are less over-smoothed, yielding more natural-sounding speech, as judged by listening tests. © 2006 IEEE.
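
To make the two ingredients named in the abstract concrete, the following is a minimal illustrative sketch, not the paper's algorithm. It assumes a one-dimensional feature trajectory, simple central differences with edge replication for the first- and second-order frame differences, and plain scalar rescaling as a stand-in for meeting a norm constraint. The function names `frame_differences` and `rescale_segment_to_norm` are hypothetical and chosen only for this example.

```python
import math


def frame_differences(frames):
    """Return (delta, delta_delta) for a 1-D feature trajectory.

    Assumption: central differences with the first and last frames
    replicated at the edges. Practical systems often use wider
    regression windows instead.
    """
    n = len(frames)
    padded = [frames[0]] + list(frames) + [frames[-1]]
    # First-order difference: half the gap between the next and previous frame.
    delta = [(padded[t + 2] - padded[t]) / 2.0 for t in range(n)]
    # Second-order difference: next - 2 * current + previous.
    delta_delta = [
        padded[t + 2] - 2.0 * padded[t + 1] + padded[t] for t in range(n)
    ]
    return delta, delta_delta


def rescale_segment_to_norm(segment, target_norm):
    """Scale a segment so that its Euclidean norm equals target_norm.

    Assumption: uniform scaling is used purely for illustration; the
    paper's segment-wise model may satisfy its constraint differently.
    An all-zero segment is returned unchanged.
    """
    current = math.sqrt(sum(x * x for x in segment))
    if current == 0.0:
        return list(segment)
    factor = target_norm / current
    return [x * factor for x in segment]


if __name__ == "__main__":
    trajectory = [1.0, 2.0, 4.0, 7.0]
    delta, delta_delta = frame_differences(trajectory)
    print("delta:", delta)
    print("delta-delta:", delta_delta)
    print("rescaled:", rescale_segment_to_norm([3.0, 4.0], 10.0))
```

An over-smoothed trajectory tends to have a smaller norm than its natural counterpart, which is the intuition behind raising the norm toward a target value; how that target is chosen and enforced is specific to the paper and is not reproduced here.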
