A variational approach to robust maximum likelihood estimation for speech recognition
Abstract
In many automatic speech recognition (ASR) applications, the data used to estimate the class-conditional feature probability density function (PDF) is noisy, and the test data is mismatched with the training data. Previous research has shown that the effect of this problem may be reduced by using models that account for the effect of the noise, and by transforming the features or the models used in the classifier to adapt to new environments and speakers. This paper addresses the degradation in the performance of ASR systems caused by small, possibly time-varying, perturbations of the training data. To approach this problem, we provide a computationally efficient algorithm for estimating the model parameters that maximize the sum of the log likelihood and the negative of a measure of the sensitivity of the estimated likelihood to these perturbations. This approach does not make any assumptions about the noise model during training. We present several large vocabulary speech recognition experiments that show significant improvements in recognition accuracy over the baseline maximum likelihood (ML) models.
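
To make the form of this objective concrete, the following is a minimal sketch for a single one-dimensional Gaussian; it is not the algorithm developed in the paper. It assumes, purely for illustration, that the sensitivity measure is the squared derivative of the log likelihood with respect to each observation, weighted by a hypothetical trade-off parameter `lam`. The names `robust_gaussian_fit` and `objective` are likewise illustrative.

```python
import math


def objective(xs, mu, v, lam):
    """Penalized objective: log likelihood minus lam times a sensitivity term.

    The sensitivity term is an ASSUMPTION made for this sketch: the squared
    derivative of log N(x; mu, v) with respect to x, i.e. ((x - mu) / v) ** 2.
    """
    total = 0.0
    for x in xs:
        log_lik = -0.5 * math.log(2.0 * math.pi * v) - (x - mu) ** 2 / (2.0 * v)
        sensitivity = ((x - mu) / v) ** 2
        total += log_lik - lam * sensitivity
    return total


def robust_gaussian_fit(xs, lam):
    """Closed-form maximizer of `objective` over the mean mu and variance v.

    Assumes the samples are not all identical (sample variance > 0).
    """
    n = len(xs)
    # Both the likelihood and the penalty are decreasing in (x - mu)**2,
    # so the sample mean is still the maximizer in mu.
    mu = sum(xs) / n
    s2 = sum((x - mu) ** 2 for x in xs) / n  # ML variance estimate
    # Setting the derivative in v to zero gives v**2 - s2*v - 4*lam*s2 = 0;
    # the positive root is the unique maximizer.
    v = 0.5 * (s2 + math.sqrt(s2 * s2 + 16.0 * lam * s2))
    return mu, v


if __name__ == "__main__":
    data = [1.0, 2.0, 3.0, 4.0]
    print(robust_gaussian_fit(data, lam=0.0))  # reduces to the ML estimate
    print(robust_gaussian_fit(data, lam=0.5))  # same mean, larger variance
```

Under these assumptions, setting `lam` to zero recovers the ML estimate, while a positive `lam` leaves the mean unchanged and inflates the variance, which flattens the estimated likelihood and makes it less sensitive to small perturbations of the observations.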