Knowledge distillation across ensembles of multilingual models for low-resource languages
This paper investigates the effectiveness of knowledge distillation for multilingual models. We show that, with knowledge distillation, Long Short-Term Memory (LSTM) models can serve as teachers to train standard feed-forward Deep Neural Network (DNN) models for a variety of low-resource languages. We then examine how the agreement between the teacher's best labels and the original labels affects the student model's performance. Next, we show that knowledge distillation can be readily applied to semi-supervised learning to improve model performance, and we propose a promising data selection method for filtering untranscribed data. We then focus on knowledge transfer among DNN models with multilingual features derived from CNN+DNN, LSTM, VGG, CTC, and attention models. We show that a student model equipped with better input features not only learns better from the teacher's labels but also outperforms the teacher. Further experiments suggest that, by learning from each other, the models in the original ensemble can evolve into a new ensemble with even better combined performance.
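For readers unfamiliar with the teacher-student setup, the following is a minimal sketch of a generic distillation objective for a single frame. It is not taken from the paper. It assumes the common formulation in which the student is trained on a weighted sum of cross-entropy against the teacher's temperature-softened posteriors and cross-entropy against the original hard label. The function names and the `temperature` and `alpha` parameters are hypothetical choices for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities, softened by a temperature (assumed hyperparameter)."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target_probs, predicted_probs, eps=1e-12):
    """Cross-entropy between a target distribution and a predicted distribution."""
    return -sum(t * math.log(p + eps) for t, p in zip(target_probs, predicted_probs))


def distillation_loss(student_logits, teacher_logits, hard_label,
                      temperature=2.0, alpha=0.5):
    """Illustrative loss: alpha * soft-label term + (1 - alpha) * hard-label term.

    This weighting scheme is an assumption, not the paper's stated recipe.
    """
    teacher_soft = softmax(teacher_logits, temperature)
    student_soft = softmax(student_logits, temperature)
    soft_term = cross_entropy(teacher_soft, student_soft)

    one_hot = [1.0 if i == hard_label else 0.0 for i in range(len(student_logits))]
    hard_term = cross_entropy(one_hot, softmax(student_logits))

    return alpha * soft_term + (1.0 - alpha) * hard_term


def teacher_agrees(teacher_logits, hard_label):
    """True when the teacher's best (argmax) label matches the original label."""
    best = max(range(len(teacher_logits)), key=teacher_logits.__getitem__)
    return best == hard_label
```

In this sketch, setting `alpha=1.0` trains the student on the teacher's soft labels alone, which is the setting that would apply to untranscribed data where no original label exists. The `teacher_agrees` helper illustrates one simple way to measure per-frame agreement between the teacher's best label and the original label.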