Investigating factor analysis features for deep neural networks in noisy speech recognition
Speaker and channel adaptation in deep neural network (DNN) based automatic speech recognition (ASR) systems is of substantial interest for improving the performance of these systems. Recently, speaker identity vectors (i-vectors) have been shown to improve ASR systems in matched conditions. In this paper, we propose applying the general factor analysis framework to noisy speech recognition tasks. We explore several methods for deriving speaker and channel factors, including joint factor analysis (JFA) and i-vectors derived from DNN posteriors instead of the traditional universal background model (UBM) approach. We also experiment with the late fusion of i-vector features with bottleneck (BN) features obtained from a previously trained convolutional neural network (CNN) system. The ASR experiments are performed on the Aspire challenge test data, which contains noisy far-field speech, while the acoustic models are trained on conversational telephone speech (CTS) data from the Fisher corpus. In these experiments, the factor analysis based methods provide significant improvements in word error rate (relative improvements of about 11% compared to the baseline DNN system trained with speaker-adapted features).
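As an illustration of the two quantities mentioned above, the following is a minimal sketch, not the authors' implementation. It assumes that fusing an utterance-level i-vector with frame-level BN features amounts to appending the same i-vector to every frame, which is one common way to combine the two; the abstract does not specify the fusion mechanism. The function names `fuse_ivector_with_frames` and `relative_wer_improvement` are hypothetical, and plain Python lists are used so that the sketch has no dependencies.

```python
def fuse_ivector_with_frames(frame_feats, ivector):
    """Append one utterance-level vector to every frame-level feature row.

    ASSUMPTION: fusion by per-frame concatenation. The source abstract does
    not say how the i-vector and bottleneck features are combined.
    frame_feats: list of frames, each a list of floats (e.g. BN features).
    ivector: list of floats, constant over the whole utterance.
    """
    ivector = list(ivector)
    # Each output frame is the original frame followed by the i-vector.
    return [list(frame) + ivector for frame in frame_feats]


def relative_wer_improvement(baseline_wer, new_wer):
    """Relative word error rate improvement, in percent of the baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


if __name__ == "__main__":
    # Toy, hypothetical values: 3 frames of 4-dim features, 2-dim i-vector.
    frames = [[0.1, 0.2, 0.3, 0.4],
              [0.5, 0.6, 0.7, 0.8],
              [0.9, 1.0, 1.1, 1.2]]
    ivec = [7.0, 8.0]
    fused = fuse_ivector_with_frames(frames, ivec)
    print(len(fused), len(fused[0]))  # number of frames, fused dimension
```

Under this assumption, the fused feature dimension is the BN dimension plus the i-vector dimension, and the number of frames is unchanged. A relative improvement of about 11% means the absolute error reduction equals roughly 11% of the baseline word error rate.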