End-to-end speech recognition and keyword search on low-resource languages
In recent years, so-called 'end-to-end' speech recognition systems have emerged as viable alternatives to traditional ASR frameworks. Keyword search (KWS), localizing an orthographic query in a speech corpus, is typically performed by using automatic speech recognition (ASR) to generate an index. Previous work has evaluated end-to-end systems for ASR on well-known corpora (WSJ, Switchboard, TIMIT, etc.) in high-resource languages such as English and Mandarin. In this work, we investigate two end-to-end ASR systems, Connectionist Temporal Classification (CTC) networks and recurrent encoder-decoders with attention, for keyword search and speech recognition on low-resource languages. We find that end-to-end systems can generate high-quality 1-best transcripts on low-resource languages, but, because they produce very sharp posteriors, their utility for KWS is limited. We explore a number of ways to address this limitation, with modest success. Experimental results are reported on the IARPA BABEL OP3 languages and evaluation framework. This paper presents the first results using end-to-end techniques for speech recognition and keyword search on low-resource languages.
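As a brief sketch of the first technique named above (the exact notation here is standard CTC, not taken from this abstract): a CTC network outputs a per-frame distribution over labels plus a blank symbol, and the probability of a transcript is obtained by marginalizing over all frame-level alignments that collapse to it.

```latex
% Standard CTC objective (a background sketch; symbols are the usual
% conventions, not notation defined in this abstract):
%   x        -- input acoustic sequence of T frames
%   y        -- output label sequence
%   \pi      -- a frame-level alignment path over labels plus blank
%   B^{-1}(y) -- the set of paths that collapse to y after removing
%               repeats and blanks
p(\mathbf{y} \mid \mathbf{x})
  = \sum_{\pi \in \mathcal{B}^{-1}(\mathbf{y})}
    \prod_{t=1}^{T} p(\pi_t \mid \mathbf{x})
```

Training drives most frame posteriors toward the blank symbol with brief, confident spikes at label emissions; this is the "very sharp posteriors" behavior the abstract identifies as the obstacle for KWS indexing.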