N-best based stochastic mapping on stereo HMM for noise robust speech recognition
Abstract
In this paper we present an extension of our previously proposed feature space stereo-based stochastic mapping (SSM). As distinct froman auxiliary stereo Gaussian mixture model in the front-end in our previous work, a stereo HMM model in the back-end is used. The basic idea, as in feature space SSM, is to form a joint space of the clean and noisy features, but to train a Gaussian mixture HMM in the new space. The MMSE estimation, which is the conditional expectation of the clean speech given the sequence of noisy observations, leads to clean speech predictors at the granularity of the Gaussian distributions in the HMM model. Because the Gaussians are not known during decoding, N-best hypotheses are employed. This results in a clean speech predictorwhich is a weighted (by posteriors) sum of the estimates from different Gaussian distributions. In experimental evaluation of the proposed method on the Aurora 2 database it gives better performance over the MST model, particularly, about 10%-20% relative improvement under unseen noise conditions. Copyright © 2008 ISCA.