Conference paper

Knowledge Distillation Based Training of Unified Conformer CTC Models for Multi-form ASR

Abstract

There is an ongoing body of research on training separate dedicated models for either short-form or long-form utterances. Multi-form acoustic models that are simply trained on combined data from long-form and short-form utterances often suffer from serious deletion and insertion problems during inference. The deletion problem arises when the model predicts sequences of blanks instead of words for long, continuous speaker utterances, while the insertion problem is often caused by short audio segments contaminated by additive noise. These problems are especially pronounced when the input utterance comes from an acoustically unseen domain. In this paper, we investigate novel techniques for training unified Conformer-based models on mixed-form speech data obtained from diverse domains and sources so that a single model can serve multiple downstream applications. Our approach incorporates chunk-wise short-term discriminative knowledge distillation and mitigates the aforementioned problems that arise in single unified models. We demonstrate the benefit of the proposed technique on several ASR test sets by comparing our models against those trained by simply mixing long- and short-form utterances. The proposed technique provides a significant improvement of up to 6% relative WER reduction over baseline systems operating at a similar decoding cost.
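The abstract names chunk-wise short-term knowledge distillation but does not spell out the training objective. The sketch below shows one plausible way such a loss could be combined with standard CTC training in PyTorch; the chunk size, interpolation weight, and the helper name `chunkwise_kd_ctc_loss` are illustrative assumptions rather than the paper's actual recipe.

```python
# A minimal sketch of a chunk-wise distillation objective for a CTC student,
# assuming the teacher and student emit frame-level log-posteriors of the same
# shape. Chunking, KL weighting, and the loss combination are illustrative
# assumptions, not the paper's exact formulation; padding masks are omitted.
import torch
import torch.nn.functional as F


def chunkwise_kd_ctc_loss(
    student_log_probs,   # (T, B, V) log-softmax outputs of the student
    teacher_log_probs,   # (T, B, V) log-softmax outputs of the teacher
    targets,             # (B, S) padded label indices
    input_lengths,       # (B,) valid frames per utterance
    target_lengths,      # (B,) valid labels per utterance
    chunk_size=40,       # frames per chunk (hypothetical value)
    kd_weight=0.5,       # interpolation weight between CTC and KD terms
):
    # Standard CTC loss on the student outputs.
    ctc = F.ctc_loss(
        student_log_probs, targets, input_lengths, target_lengths,
        blank=0, zero_infinity=True,
    )

    # Chunk-wise KL divergence: split the time axis into fixed-size chunks and
    # distill the teacher's frame-level posteriors chunk by chunk, so that
    # short-term behaviour (e.g. blank emission inside long utterances) is
    # matched locally rather than only on average over the whole utterance.
    T = student_log_probs.size(0)
    kd_terms = []
    for start in range(0, T, chunk_size):
        s_chunk = student_log_probs[start:start + chunk_size]
        t_chunk = teacher_log_probs[start:start + chunk_size]
        kd_terms.append(
            F.kl_div(s_chunk, t_chunk, log_target=True, reduction="batchmean")
        )
    kd = torch.stack(kd_terms).mean()

    return (1.0 - kd_weight) * ctc + kd_weight * kd
```

In this reading, the per-chunk KL term penalizes local divergence from the teacher, which is one way to discourage long runs of blank emissions (deletions) on long-form input while a long-form-aware teacher regularizes the unified student.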
