Network architectures for multilingual speech representation learning
Abstract
Multilingual (ML) representations play a key role in building speech recognition systems for low-resource languages. The IARPA-sponsored BABEL program focuses on building automatic speech recognition (ASR) and keyword search (KWS) systems in over 24 languages with limited training data. The most common mechanism for deriving ML representations in the BABEL program has been a two-stage network: the first stage is a convolutional neural network (CNN) from which multilingual features are extracted, contextually expanded, and used as input to the second stage, which can be a feed-forward deep neural network (DNN) or a CNN. The final multilingual representations are derived from the second network. This paper presents two novel methods for deriving ML representations: the first is based on Long Short-Term Memory (LSTM) networks, and the second on a very deep CNN (VGG-net). We demonstrate that ML features extracted from both models yield significant improvements over the baseline CNN-DNN based ML representations, in terms of both speech recognition and keyword search performance. We also compare the LSTM model itself with the ML representations derived from it on Georgian, the surprise language for the OpenKWS evaluation.
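For concreteness, the sketch below illustrates the baseline two-stage extraction pipeline described above: a first-stage CNN whose bottleneck output provides per-frame ML features, a contextual expansion (frame splicing) step, and a second-stage feed-forward DNN from which the final representation is taken. This is not the authors' implementation; the use of PyTorch, the 40-dimensional input features, the +/-5 frame splicing window, and all layer sizes are assumptions chosen only to make the example concrete.

```python
# Minimal sketch of a two-stage multilingual feature extractor (assumed setup,
# not the paper's exact configuration).
import torch
import torch.nn as nn

class Stage1CNN(nn.Module):
    """Stage 1: a small CNN; its bottleneck output serves as per-frame ML features."""
    def __init__(self, feat_dim=40, bottleneck_dim=512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, feat_dim // 2)),
        )
        self.bottleneck = nn.Linear(128 * (feat_dim // 2), bottleneck_dim)

    def forward(self, x):                 # x: (frames, 1, context, feat_dim)
        h = self.conv(x).flatten(1)
        return self.bottleneck(h)         # (frames, bottleneck_dim)

def splice(features, left=5, right=5):
    """Contextual expansion: stack each frame with its +/- context neighbours.
    Edges wrap around here for brevity; a real pipeline would pad instead."""
    shifted = [torch.roll(features, shifts=-s, dims=0) for s in range(-left, right + 1)]
    return torch.cat(shifted, dim=1)      # (frames, bottleneck_dim * (left + right + 1))

class Stage2DNN(nn.Module):
    """Stage 2: feed-forward DNN on spliced stage-1 features; its output layer
    provides the final multilingual representation."""
    def __init__(self, in_dim, hidden=1024, out_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, x):
        return self.net(x)

# Usage on dummy data: 100 frames, an 11-frame input patch of 40-dim features.
stage1 = Stage1CNN()
stage2 = Stage2DNN(in_dim=512 * 11)
frames = torch.randn(100, 1, 11, 40)
ml_features = stage2(splice(stage1(frames)))   # final ML representations
```

In the paper's LSTM- and VGG-based variants, the same overall recipe applies with the first-stage (or single-stage) network replaced by the corresponding architecture.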