Unfolded recurrent neural networks for speech recognition
George Saon, Hagen Soltau, et al.
INTERSPEECH 2014
Deep Neural Networks (DNNs) have been shown to provide state-of-the-art performance over other baseline models in the task of predicting prosodic targets from text in a speech-synthesis system. However, prosody prediction can be affected by an interaction of short- and long-term contextual factors that a static model relying on a fixed-size context window can fail to capture properly. In this work, we look at a recurrent formulation of neural networks (RNNs) that are deep in time and can store state information from an arbitrarily large input history when making a prediction. We show that RNNs provide improved performance over DNNs of comparable size in terms of various objective metrics for a variety of prosodic streams (notably, a relative reduction of about 6% in F0 mean-square error accompanied by a relative increase of about 14% in F0 variance), as well as in terms of perceptual quality assessed through mean-opinion-score listening tests.
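To make the contrast with a fixed-context DNN concrete, here is a minimal sketch (not the authors' code; all names and dimensions are illustrative assumptions) of the recurrent formulation the abstract describes: a vanilla RNN cell, unfolded in time, whose hidden state summarizes the entire input history rather than a fixed-size window.

    # Minimal illustrative sketch of an RNN unfolded in time.
    # Dimensions, weights, and names are assumptions, not the paper's setup.
    import numpy as np

    rng = np.random.default_rng(0)
    input_dim, hidden_dim, output_dim = 16, 32, 4   # assumed sizes

    W_xh = rng.standard_normal((hidden_dim, input_dim)) * 0.1
    W_hh = rng.standard_normal((hidden_dim, hidden_dim)) * 0.1
    W_hy = rng.standard_normal((output_dim, hidden_dim)) * 0.1

    def rnn_forward(inputs):
        """Run the cell over a sequence; h carries state across all timesteps."""
        h = np.zeros(hidden_dim)
        outputs = []
        for x_t in inputs:                      # unfolding the network in time
            h = np.tanh(W_xh @ x_t + W_hh @ h)  # state depends on the full history
            outputs.append(W_hy @ h)            # per-frame prediction
        return outputs

    # The prediction at step t reflects all frames 0..t, not a fixed window.
    sequence = [rng.standard_normal(input_dim) for _ in range(100)]
    predictions = rnn_forward(sequence)
    print(len(predictions), predictions[-1].shape)  # 100 (4,)

A DNN with a fixed context window would instead see only a constant-size slice of the sequence at each step, which is the limitation the recurrent state is meant to overcome.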