Publication
ICASSP 2017
Conference paper
Network architectures for multilingual speech representation learning
Abstract
Multilingual (ML) representations play a key role in building speech recognition systems for low-resource languages. The IARPA-sponsored BABEL program focuses on building automatic speech recognition (ASR) and keyword search (KWS) systems in over 24 languages with limited training data. The most common mechanism for deriving ML representations in the BABEL program has been a two-stage network: the first stage is a convolutional neural network (CNN) from which multilingual features are extracted, contextually expanded, and used as input to the second stage, which can be a feed-forward DNN or a CNN. The final multilingual representations are derived from the second network. This paper presents two novel methods for deriving ML representations. The first is based on Long Short-Term Memory (LSTM) networks and the second on a very deep CNN (VGG-net). We demonstrate that ML features extracted from both models yield significant improvements over the baseline CNN-DNN ML representations in both speech recognition and keyword search performance, and we compare the LSTM model itself against the ML representations derived from it on Georgian, the surprise language of the OpenKWS evaluation.
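The abstract does not specify the exact configuration of the baseline two-stage pipeline, but the sketch below illustrates the general idea in PyTorch: a first-stage CNN with shared hidden layers, a bottleneck from which ML features are read out, and one softmax head per language; the bottleneck features are then contextually expanded (spliced) and fed to a second-stage DNN whose own bottleneck provides the final ML representation. All layer sizes, the 40-dimensional bottleneck, the ±5-frame context window, and the per-language target counts are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def splice(feats, context=5):
    """Contextual expansion: stack each frame with its +/- `context`
    neighbours. feats: (T, D) -> (T, D * (2*context + 1)); edge frames
    are handled by replicating the first/last frame."""
    T, D = feats.shape
    padded = F.pad(feats.t().unsqueeze(0), (context, context), mode="replicate")
    padded = padded.squeeze(0).t()                         # (T + 2c, D)
    return torch.cat([padded[i:i + T] for i in range(2 * context + 1)], dim=1)

class Stage1CNN(nn.Module):
    """First stage: CNN trained on pooled data from all languages, with
    shared layers, a bottleneck, and a separate softmax head per language
    (sizes are hypothetical)."""
    def __init__(self, n_mels=40, context=5, bottleneck=40,
                 targets_per_lang=(1000, 1000)):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((1, 2)),                          # pool over frequency
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.Flatten(),
        )
        flat = 128 * (2 * context + 1) * (n_mels // 2)
        self.bottleneck = nn.Sequential(
            nn.Linear(flat, 1024), nn.ReLU(),
            nn.Linear(1024, bottleneck),                   # ML features read here
        )
        self.heads = nn.ModuleList(
            nn.Linear(bottleneck, n) for n in targets_per_lang)

    def forward(self, x, lang):
        # x: (batch, 1, 2*context+1 frames, n_mels)
        bn = self.bottleneck(self.conv(x))
        return self.heads[lang](F.relu(bn)), bn

class Stage2DNN(nn.Module):
    """Second stage: feed-forward DNN over the contextually expanded
    stage-1 bottleneck features; the final ML representation is its own
    bottleneck output."""
    def __init__(self, bottleneck=40, context=5, targets_per_lang=(1000, 1000)):
        super().__init__()
        d_in = bottleneck * (2 * context + 1)
        self.body = nn.Sequential(
            nn.Linear(d_in, 1024), nn.ReLU(),
            nn.Linear(1024, bottleneck),
        )
        self.heads = nn.ModuleList(
            nn.Linear(bottleneck, n) for n in targets_per_lang)

    def forward(self, x, lang):
        rep = self.body(x)
        return self.heads[lang](F.relu(rep)), rep

# Toy usage: 100 frames of 40-dim log-mel input for language 0.
stage1 = Stage1CNN()
frames = torch.randn(100, 1, 11, 40)      # input already spliced to +/-5 frames
_, bn = stage1(frames, lang=0)            # (100, 40) stage-1 ML features
stage2 = Stage2DNN()
_, ml_rep = stage2(splice(bn), lang=0)    # (100, 40) final ML representation
```

The paper's contribution is to replace this first stage with an LSTM or a VGG-style very deep CNN; the bottleneck extraction and contextual expansion steps carry over unchanged, which is why the sketch factors them out as a standalone `splice` helper.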