Dimensional reduction, covariance modeling, and computational complexity in ASR systems
Abstract
In this paper, we study acoustic modeling for speech recognition using mixtures of exponential models with linear and quadratic features tied across all context dependent states. These models are one version of the SPAM models introduced in [1]. They generalize diagonal covariance, MLLT, EMLLT, and full covariance models. Reducing the dimension of the acoustic vectors using LDA/HDA projections corresponds to a special case of reducing the exponential model feature space. On one speech recognition task, we find that SPAM models built on LDA-projected spaces of varying dimensions achieve a significant fraction of the WER improvement obtained in going from MLLT to full covariance modeling, while maintaining the low computational cost of the MLLT models. Further, the feature precomputation cost can be minimized using the hybrid feature technique of [2], and the number of Gaussians that must be evaluated can be greatly reduced using hierarchical clustering of the Gaussians (with the feature space held fixed). Finally, we show that reducing the quadratic and linear feature spaces separately produces models with better accuracy than, and computational complexity comparable to, LDA/HDA based models.
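As a point of reference for the model class summarized above, the following is a minimal sketch of the tied exponential-model (SPAM-style) parameterization, assuming the standard form introduced in [1]; the symbols $S_k$, $\ell_k$, $\lambda_g^k$, and $\nu_g^k$ are illustrative names rather than notation taken from this paper.

% Sketch (assumed, following [1]): each Gaussian g has a precision matrix and a
% linear term constrained to lie in subspaces shared (tied) across all states.
\begin{align*}
  \log p(x \mid g) &= \kappa_g \;-\; \tfrac{1}{2}\, x^{\top} P_g\, x \;+\; \psi_g^{\top} x
      && \text{(exponential model in quadratic and linear features of } x\text{)} \\
  P_g &= \sum_{k=1}^{D} \lambda_g^{k}\, S_k
      && \text{(precisions constrained to the span of tied basis matrices } S_k\text{)} \\
  \psi_g &= \sum_{k=1}^{E} \nu_g^{k}\, \ell_k
      && \text{(linear terms constrained to the span of tied vectors } \ell_k\text{)}
\end{align*}
% Diagonal covariance, MLLT, EMLLT, and full covariance models correspond to
% particular choices of the bases {S_k} and {\ell_k}; here \kappa_g is the
% per-Gaussian normalization constant determined by P_g and \psi_g.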