IEEE Transactions on Audio, Speech and Language Processing

Exemplar-based sparse representation features: From TIMIT to LVCSR

View publication


The use of exemplar-based methods, such as support vector machines (SVMs), k-nearest neighbors (kNNs) and sparse representations (SRs), in speech recognition has thus far been limited. Exemplar-based techniques utilize information about individual training examples and are computationally expensive, making it particularly difficult to investigate these methods on large-vocabulary continuous speech recognition (LVCSR) tasks. While research in LVCSR provides a good testbed to tackle real-world speech recognition problems, research in this area suffers from two main drawbacks. First, the overall complexity of an LVCSR system makes error analysis quite difficult. Second, exploring new research ideas on LVCSR tasks involves training and testing state-of-the-art LVCSR systems, which can render a large turnaround time. This makes a small vocabulary task such as TIMIT more appealing. TIMIT provides a phonetically rich and hand-labeled corpus that allows easy insight into new algorithms. However, research ideas explored for small vocabulary tasks do not always provide gains on LVCSR systems. In this paper, we combine the advantages of using both small and large vocabulary tasks by taking well-established techniques used in LVCSR systems and applying them on TIMIT to establish a new baseline. We then utilize these existing LVCSR techniques in creating a novel set of exemplar-based sparse representation (SR) features. Using these existing LVCSR techniques, we achieve a phonetic error rate (PER) of 19.4% on the TIMIT task. The additional use of SR features reduce the PER to 18.6%. We then explore applying the SR features to a large vocabulary Broadcast News task, where we achieve a 0.3% absolute reduction in word error rate (WER). © 2011 IEEE.