Saurabh Paul, Christos Boutsidis, et al.
JMLR
There is an ongoing body of research on training separate, dedicated models for either short-form or long-form utterances. Multi-form acoustic models that are simply trained on combined data from long-form and short-form utterances often suffer from various negative effects due to the diversity of speaking styles, accents, and recording conditions. In addition, the linguistic mismatch that arises from differing utterance lengths is another source of degradation. In this paper we investigate novel techniques for training unified Conformer-based models on multi-form speech data obtained from diverse domains and sources to serve multiple downstream applications with a single model. Our approach incorporates chunk-wise short-term discriminative knowledge distillation with encoder embedding masking, and mitigates the aforementioned problems that arise for single unified models. We show the benefit of the proposed technique on long-form and short-form ASR test sets by comparing our models against several variants trained by mixing utterances of various audio lengths. The proposed technique yields a significant improvement of up to 8.5% relative WER reduction over baseline systems that operate at a similar decoding cost.
Joxan Jaffar
Journal of the ACM
Cristina Cornelio, Judy Goldsmith, et al.
JAIR
Erik Altman, Jovan Blanusa, et al.
NeurIPS 2023