Publication
IEEE/ACM TASLP
Paper
Maximum likelihood nonlinear transformations based on deep neural networks
Abstract
Feature transformations are commonly used in speech recognition to account for distribution mismatches between the source and target domains (also referred to as covariate shift). Linear (affine) or piecewise linear transformations are typically considered. In this paper, we present deep neural network (DNN) based nonlinear feature transformations estimated under the maximum likelihood criterion. We use the hidden Markov model (HMM) to model speech feature sequences, and the features in each HMM state are assumed to follow a Gaussian mixture model (GMM) distribution. The network is pre-trained to be close to a linear transformation and then fine-tuned using the gradient descent algorithm. Because of the nonlinearity, the gradients and the partition functions of the GMM-HMM state distributions are evaluated using the Monte Carlo (MC) method based on importance sampling. In addition, a deep stacked architecture is proposed to hierarchically build a DNN as a series of sub-networks, each of which is itself a nonlinear transformation and can be learned using a block-wise learning strategy. Applications of the proposed nonlinear transformations to speaker/environment adaptation and acoustic modeling in large vocabulary continuous speech recognition tasks show that they outperform the widely used constrained maximum likelihood linear regression (CMLLR).
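As a rough illustration of the importance-sampling step mentioned in the abstract, the sketch below estimates a partition function Z = ∫ p(f(x)) dx for a one-dimensional GMM density p and a nonlinear transform f. Everything here is an assumption made for illustration and is not taken from the paper: the toy transform `transform` (x + a·tanh x), the zero-mean Gaussian proposal, the mixture parameters, and the function names are all hypothetical, and the paper's actual models are multivariate GMM-HMM state distributions with a DNN transform.

```python
import math
import random


def gmm_pdf(y, weights, means, stds):
    """Density of a 1-D Gaussian mixture evaluated at y."""
    return sum(
        w * math.exp(-0.5 * ((y - m) / s) ** 2) / (s * math.sqrt(2.0 * math.pi))
        for w, m, s in zip(weights, means, stds)
    )


def transform(x, a):
    """Hypothetical nonlinear feature transform; a = 0 gives the identity."""
    return x + a * math.tanh(x)


def estimate_partition(a, weights, means, stds,
                       n_samples=20000, proposal_std=3.0, seed=0):
    """Importance-sampling estimate of Z = integral of gmm_pdf(transform(x)) dx.

    Samples are drawn from a zero-mean Gaussian proposal q (an assumption),
    and Z is approximated by the average of gmm_pdf(transform(x)) / q(x).
    """
    rng = random.Random(seed)
    norm = proposal_std * math.sqrt(2.0 * math.pi)
    total = 0.0
    for _ in range(n_samples):
        x = rng.gauss(0.0, proposal_std)
        q = math.exp(-0.5 * (x / proposal_std) ** 2) / norm
        total += gmm_pdf(transform(x, a), weights, means, stds) / q
    return total / n_samples


# Illustrative mixture parameters (not from the paper).
WEIGHTS = [0.4, 0.6]
MEANS = [-1.0, 1.0]
STDS = [0.5, 0.8]

z_identity = estimate_partition(0.0, WEIGHTS, MEANS, STDS)
z_nonlinear = estimate_partition(0.5, WEIGHTS, MEANS, STDS)
```

With the identity transform the integrand is already a normalized density, so the estimate should be close to 1. With a = 0.5 the transform stretches the feature axis (its derivative is at least 1 everywhere), so the estimate should fall below 1. Under this sketch's assumptions, a normalized log-likelihood for an observation x would be log p(f(x)) − log Z.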