Use of micro-modulation features in large vocabulary continuous speech recognition tasks

Dimitrios Dimitriadis; Enrico Bocchieri

doi:10.1109/TASLP.2015.2430815

IEEE/ACM TASLP

Paper

01 Jan 2015

Abstract

Most state-of-the-art ASR systems take as input a single type of acoustic feature, typically one of the traditional schemes, i.e., MFCCs or PLPs. However, these features cannot model the rapid, intra-frame phenomena present in actual speech signals. Micro-modulation components, inspired by the AM-FM speech model, can capture these important characteristics of speech and have previously yielded significant performance improvements in small-vocabulary ASR tasks. Yet they have seen limited use in large-vocabulary ASR applications, where feature post-processing schemes are usually employed. To enable the successful application of these frequency measures in real-life tasks, we investigate their combination with traditional cepstral features when employing linear (e.g., HDA) and nonlinear (i.e., bottleneck neural net, BN) feature transforms. This feature combination is also investigated in the context of the hybrid DNN-HMM framework. The experimental results reveal that integrating micro-modulation and cepstral features using neural nets can greatly improve ASR performance relative to using the cepstral features alone. We apply this novel feature extraction approach to different tasks: a clean-speech task (DARPA-WSJ), the Aurora-4 task, and a real-life, open-vocabulary mobile search task (Speak4it). Performance improves in every case, with relative word error rate (WER) reductions ranging from 7% to 21% depending on the task, e.g., 18% for the Speak4it task and up to 21% for the WSJ task.
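
The relative WER figures quoted in the abstract are ratios against a baseline rather than absolute differences. As a minimal sketch of that arithmetic: the function name and the numbers below are hypothetical, chosen only to illustrate the calculation, and are not results or code from the paper.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER reduction as a fraction of the baseline WER."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical values for illustration only (not taken from the paper).
baseline = 10.0  # WER (%) of an assumed cepstral-only system
combined = 8.2   # WER (%) of an assumed combined-feature system
print(f"{relative_wer_reduction(baseline, combined):.1%}")
```

With this definition, a drop from a 10.0% to an 8.2% WER is an absolute reduction of 1.8 points but a relative reduction of about 18%.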