Improving ASR Robustness in Noisy Conditions Through VAD Integration
Abstract
Automatic speech recognition (ASR) systems are often deployed together with a voice activity detection (VAD) system so that ASR runs only on voiced acoustic signals. Although this can maintain ASR performance by removing unnecessary non-speech parts from the input audio during inference, errors propagate when VAD fails to separate speech and non-speech segments correctly. Specifically, because ASR systems are commonly trained on segmented speech utterances only, many unexpected insertion errors can occur when VAD-segmented utterances contain a long non-speech part or consist entirely of non-speech. Note that VAD is more prone to failure in noisy environments or in unknown acoustic domains, which makes insertion errors in ASR even more prominent. In this paper, we focus on explicitly incorporating VAD information into the training of a recurrent neural network transducer (RNN-T) based ASR model to make it more robust to noisy conditions, through feature integration and a multi-task learning strategy. We also explore a technique that exploits untranscribed, audio-only data by distilling VAD-related knowledge into the ASR part of the model. By combining the multi-task learning approach with the feature integration architecture, our system yields up to 10% relative improvement in very low signal-to-noise ratio (SNR) conditions compared with a system simply trained on mixed data consisting of speech and long non-speech segments.
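As a rough illustration of the multi-task learning strategy mentioned above, one plausible formulation (the interpolation weight $\lambda$ and the exact form of the auxiliary VAD loss are assumptions, not details given in this abstract) combines the RNN-T transcription loss with a frame-level speech/non-speech classification loss computed on shared encoder outputs:

% Hedged sketch of a joint objective: RNN-T loss plus an auxiliary
% frame-level VAD cross-entropy loss on shared encoder features;
% lambda is an assumed interpolation weight, v_t an assumed frame label.
\begin{equation}
  \mathcal{L}_{\text{total}}
    = \mathcal{L}_{\text{RNN-T}}
    + \lambda \, \mathcal{L}_{\text{VAD}},
  \qquad
  \mathcal{L}_{\text{VAD}}
    = -\frac{1}{T} \sum_{t=1}^{T} \log p\!\left(v_t \mid \mathbf{x}_{1:T}\right),
\end{equation}

where $v_t \in \{\text{speech}, \text{non-speech}\}$ denotes the frame-level VAD label for frame $t$ and $\mathbf{x}_{1:T}$ the input acoustic features. This is only a sketch of the general multi-task setup; the paper's actual loss definition and feature integration architecture are specified in the body of the paper.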