Publication
ICASSP 2010
Conference paper
An autoencoder neural-network based low-dimensionality approach to excitation modeling for HMM-based text-to-speech
Abstract
HMM-TTS synthesis is a popular approach toward flexible, low-footprint, data-driven systems that produce highly intelligible speech. In spite of these strengths, speech generated by these systems exhibits some degradation in quality, attributable to an inadequacy in modeling the excitation signal that drives the parametric models of the vocal tract. This paper proposes a novel method for modeling the excitation as a low-dimensional set of coefficients, based on a non-linear map learned through an autoencoder. Through analysis-and-resynthesis experiments and a formal listening test, we show that this model produces speech of higher perceptual quality than conventional pulse-excited speech signals at the p < 0.01 significance level. ©2010 IEEE.
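
To illustrate the general idea of compressing signal frames into a few coefficients through a learned non-linear map, here is a minimal sketch of a one-hidden-layer autoencoder in Python with NumPy. It is not the paper's model. The frame length (16), code dimension (2), tanh bottleneck, linear decoder, learning rate, and synthetic stand-in "frames" are all assumptions chosen for illustration, and the helper names (`init_autoencoder`, `encode`, `decode`, `train`, `demo`) are hypothetical.

```python
import numpy as np


def init_autoencoder(frame_len, code_dim, rng):
    """Small random weights for a one-hidden-layer autoencoder (assumed layout)."""
    scale = 0.1
    return {
        "W_enc": rng.normal(0.0, scale, (frame_len, code_dim)),
        "b_enc": np.zeros(code_dim),
        "W_dec": rng.normal(0.0, scale, (code_dim, frame_len)),
        "b_dec": np.zeros(frame_len),
    }


def encode(params, frames):
    """Non-linear map from frames to low-dimensional codes in [-1, 1]."""
    return np.tanh(frames @ params["W_enc"] + params["b_enc"])


def decode(params, codes):
    """Linear map from codes back to reconstructed frames."""
    return codes @ params["W_dec"] + params["b_dec"]


def mse(params, frames):
    """Mean squared reconstruction error over all frame samples."""
    recon = decode(params, encode(params, frames))
    return float(np.mean((recon - frames) ** 2))


def train(params, frames, lr=0.01, epochs=2000):
    """Full-batch gradient descent on 0.5 * squared error, averaged over frames."""
    n = frames.shape[0]
    for _ in range(epochs):
        codes = encode(params, frames)
        recon = decode(params, codes)
        err = (recon - frames) / n                 # dLoss/drecon
        grad_W_dec = codes.T @ err
        grad_b_dec = err.sum(axis=0)
        # Backpropagate through the decoder and the tanh bottleneck.
        d_pre = (err @ params["W_dec"].T) * (1.0 - codes ** 2)
        grad_W_enc = frames.T @ d_pre
        grad_b_enc = d_pre.sum(axis=0)
        params["W_enc"] -= lr * grad_W_enc
        params["b_enc"] -= lr * grad_b_enc
        params["W_dec"] -= lr * grad_W_dec
        params["b_dec"] -= lr * grad_b_dec
    return params


def demo():
    """Train on synthetic frames that depend non-linearly on 2 hidden factors."""
    rng = np.random.default_rng(0)
    latent = rng.uniform(-1.0, 1.0, (256, 2))      # hypothetical hidden factors
    mixing = rng.normal(0.0, 1.0, (2, 16))
    frames = 0.5 * np.tanh(latent @ mixing)        # stand-in for real frames
    params = init_autoencoder(frame_len=16, code_dim=2, rng=rng)
    before = mse(params, frames)
    train(params, frames)
    after = mse(params, frames)
    return before, after, encode(params, frames)


if __name__ == "__main__":
    before, after, codes = demo()
    print(f"reconstruction MSE before training: {before:.5f}")
    print(f"reconstruction MSE after training:  {after:.5f}")
    print("code shape per batch:", codes.shape)
```

In this toy setting each 16-sample frame is summarized by 2 coefficients, and the decoder reconstructs the frame from them. A real excitation model would train on residual frames extracted from recorded speech rather than on synthetic data.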