Discriminative re-ranking for automatic speech recognition by leveraging invariant structures
The invariant structure, proposed in Minematsu (2004) and Minematsu et al. (2010), is a long-span feature designed to suppress non-linguistic factors. In contrast to frame-based features such as Mel-Frequency Cepstral Coefficients (MFCC), an invariant structure is extracted as contrasts between the speech events in a given utterance. Because the invariant structure is not a time series of short-term features, it is difficult to use directly in the general framework of Automatic Speech Recognition (ASR), even though its robustness against non-linguistic factors is desirable for ASR. To introduce the invariant structure into ASR effectively, we are developing a method that leverages it in a discriminative re-ranking paradigm. In this paradigm, a baseline ASR system generates N-best lists with hypothesized phoneme-level alignments, so that one invariant structure can be extracted for each hypothesis. We also propose methods to convert an extracted invariant structure into a fixed-dimensional feature vector for use in discriminative re-ranking. Experimental results on three tasks (continuous digit recognition, digit recognition in noisy environments, and large vocabulary continuous speech recognition) showed significant error reductions and improved robustness in noisy environments.
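
As a rough illustration of the re-ranking paradigm described above, the following minimal Python sketch scores each hypothesis in an N-best list by combining its baseline score with a linear function of a fixed-dimensional structure vector. Everything specific here is an assumption made for illustration and is not taken from the paper: the `Hypothesis` container, the use of one mean vector per phoneme as a "speech event", plain Euclidean distance as a stand-in for the contrast measure, zero-filling of missing phoneme pairs as the way to obtain a fixed dimension, and the hand-set `weights` (which a discriminative trainer would normally learn).

```python
from dataclasses import dataclass
from itertools import combinations
from math import dist


@dataclass
class Hypothesis:
    text: str
    baseline_score: float  # log-score from the baseline ASR system
    segments: dict         # hypothetical: phoneme label -> mean feature vector


def structure_features(segments, inventory):
    """Pairwise contrasts between speech events over a fixed phoneme inventory.

    Assumption: pairs involving a phoneme absent from the hypothesis are set
    to 0.0, so every hypothesis yields a vector of the same length.
    """
    feats = []
    for a, b in combinations(inventory, 2):
        if a in segments and b in segments:
            # Euclidean distance is a simple stand-in for the contrast measure.
            feats.append(dist(segments[a], segments[b]))
        else:
            feats.append(0.0)
    return feats


def rerank(nbest, weights, inventory, scale=1.0):
    """Return the N-best list sorted by combined score, best hypothesis first."""
    def score(h):
        feats = structure_features(h.segments, inventory)
        return scale * h.baseline_score + sum(w * x for w, x in zip(weights, feats))
    return sorted(nbest, key=score, reverse=True)
```

With an inventory of K phonemes the vector has K(K-1)/2 entries regardless of hypothesis length, which is what allows a single weight vector to be applied to every entry of the N-best list.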