Knowledge distillation based training of universal ASR source models for cross-lingual transfer
Abstract
In this paper we introduce a novel knowledge-distillation-based framework for training universal source models. In our proposed approach for automatic speech recognition (ASR), multilingual source models are first trained using multiple language-dependent resources before being used to initialize language-specific target models in low-resource settings. For the proposed source models to be effective in cross-lingual transfer to novel target languages, the training framework encourages the models to perform accurate universal phone classification while ignoring any language-dependent characteristics present in the training data set. These two goals are achieved by applying knowledge distillation to improve the models' universal phone classification performance, along with a shuffling mechanism that alleviates any language-specific dependencies that might be learned. The benefits of the proposed technique are demonstrated in several practical settings, where either large amounts or only limited quantities of unbalanced multilingual data resources are available for source model creation. Compared to a conventional knowledge transfer learning method, the proposed approaches achieve a relative word error rate (WER) reduction of 8-10% in streaming ASR settings for various low-resource target languages.
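To make the abstract's central idea concrete, the sketch below illustrates a generic knowledge-distillation training objective of the kind referred to above: a student model is trained on a mix of temperature-softened teacher posteriors and hard phone labels drawn from multilingual batches. This is a minimal, hypothetical PyTorch example, not the paper's actual implementation; the names `teacher`, `student`, `temperature`, and `alpha` are assumptions, and the paper's language-shuffling mechanism is not depicted.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      phone_labels: torch.Tensor,
                      temperature: float = 2.0,
                      alpha: float = 0.5) -> torch.Tensor:
    """Hypothetical KD objective: blend of soft-target KL and hard-label CE.

    student_logits, teacher_logits: (batch, num_universal_phones)
    phone_labels: (batch,) integer universal-phone targets
    """
    # Temperature-softened distributions; teacher is treated as fixed.
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    soft_teacher = F.softmax(teacher_logits.detach() / temperature, dim=-1)

    # KL term scaled by T^2, as is standard in knowledge distillation.
    kd_term = F.kl_div(soft_student, soft_teacher,
                       reduction="batchmean") * (temperature ** 2)

    # Hard-label cross-entropy on universal phone targets.
    ce_term = F.cross_entropy(student_logits, phone_labels)

    return alpha * kd_term + (1.0 - alpha) * ce_term

# Example usage with a batch assumed to pool frames from several languages.
student_logits = torch.randn(8, 100)   # 100 universal phone classes (assumed)
teacher_logits = torch.randn(8, 100)
phone_labels = torch.randint(0, 100, (8,))
loss = distillation_loss(student_logits, teacher_logits, phone_labels)
loss.backward
```

In this reading, pooling frames from multiple languages in each batch is one plausible way to discourage language-specific dependencies; the paper's specific shuffling mechanism may differ.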