Publication: INTERSPEECH 2012 (conference paper)
Discriminative feature-space transforms using deep neural networks
Abstract
We present a deep neural network (DNN) architecture that learns time-dependent offsets to acoustic feature vectors according to a discriminative objective function such as maximum mutual information (MMI) between the reference words and the transformed acoustic observation sequence. A key ingredient in this technique is a greedy layer-wise pretraining of the network based on minimum squared error between the DNN outputs and the offsets provided by a linear feature-space MMI (FMMI) transform. Next, the weights of the pretrained network are updated with stochastic gradient ascent by backpropagating the MMI gradient through the DNN layers. Experiments on a 50-hour English broadcast news transcription task show a 4% relative improvement using a 6-layer DNN transform over a state-of-the-art speaker-adapted system with FMMI and model-space discriminative training.
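As a rough illustration of the recipe the abstract outlines, the sketch below pretrains a 6-layer offset network greedily, layer by layer, with a minimum-squared-error loss toward FMMI-style offset targets. It is a minimal sketch, not the paper's implementation: PyTorch is assumed (the paper predates it), and the feature dimension, context window, hidden width, learning rate, and the random tensors standing in for spliced features and FMMI offsets are all illustrative assumptions. The lattice-based MMI fine-tuning stage is only indicated in a comment.

```python
# A minimal sketch of greedy layer-wise MSE pretraining of an offset DNN.
# All dimensions and the random stand-in data below are assumptions.
import torch
import torch.nn as nn

FEAT_DIM = 40          # assumed per-frame acoustic feature dimension
CONTEXT = 9            # assumed spliced context window (frames)
IN_DIM = FEAT_DIM * CONTEXT
HIDDEN = 512           # assumed hidden-layer width
N_LAYERS = 6           # matches the 6-layer transform in the abstract

# Random tensors stand in for real spliced features and the per-frame
# offsets produced by a trained linear FMMI transform.
spliced = torch.randn(1024, IN_DIM)
fmmi_offsets = torch.randn(1024, FEAT_DIM)

# Greedy layer-wise pretraining: each new hidden layer (plus a temporary
# linear output head) is trained to regress the FMMI offsets from the
# fixed activations of the layers below it.
stack = nn.Sequential()
h = spliced
for _ in range(N_LAYERS):
    new = nn.Sequential(nn.Linear(h.shape[1], HIDDEN), nn.Sigmoid())
    head = nn.Linear(HIDDEN, FEAT_DIM)
    opt = torch.optim.SGD([*new.parameters(), *head.parameters()], lr=0.1)
    for _ in range(200):
        opt.zero_grad()
        loss = nn.functional.mse_loss(head(new(h)), fmmi_offsets)
        loss.backward()
        opt.step()
    stack.append(new)
    with torch.no_grad():
        h = new(h)   # frozen inputs for the next layer's training

offset_dnn = nn.Sequential(stack, head)   # pretrained offset network

# The transformed observation fed to the acoustic model is the center
# frame plus the learned time-dependent offset.
mid = CONTEXT // 2
center = spliced[:, mid * FEAT_DIM : (mid + 1) * FEAT_DIM]
transformed = center + offset_dnn(spliced)

# Stage 2 (not shown): replace the MSE loss with the lattice-based MMI
# objective and ascend its gradient through all layers jointly.
```

Because the network outputs an additive correction rather than the transformed feature itself, it only needs to learn a deviation from the identity transform, which is what makes regressing toward the offsets of an existing linear FMMI transform a natural pretraining target.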