Deep neural network training emphasizing central frames

Gakuto Kurata; Daniel Willett

INTERSPEECH 2015

Conference paper

06 Sep 2015

Deep neural network training emphasizing central frames

Abstract

It is common practice to concatenate several consecutive frames of acoustic features as input of a Deep Neural Network (DNN) for speech recognition. A DNN is trained to map the concatenated frames as a whole to the HMM state corresponding to the center frame while the side frames close to both ends of the concatenated frames and the remaining central frames are treated as equally important. Though the side frames are relevant to the HMM state of the center frame, this relationship may not be fully generalized to unseen data. Thus putting more emphasis on the central frames than on the side frames avoids overfitting to the DNN training data. We propose a new DNN training method to emphasize the central frames. We first conduct pre-training and fine-tuning with only the central frames and then conduct fine-tuning with all of the concatenated frames. In large vocabulary continuous speech recognition experiments with more than 1,000 hours of data for DNN training, we obtained a relative error rate reduction of 1.68%, which was statistically significant.

Conference paper