Supervised and unsupervised active learning for automatic speech recognition of low-resource languages
Abstract
Automatic speech recognition (ASR) systems rely on large quantities of transcribed acoustic data. The collection of audio data is relatively cheap, whereas the transcription of that data is relatively expensive. Thus there is an interest in the ASR community in active learning, in which only a small subset of highly representative data chosen from a large pool of untranscribed audio need be transcribed in order to approach the performance of the system trained with much larger amounts of transcribed audio. In this paper, we compare two basic approaches to active learning: a supervised approach in which we build a speech recognition system from a small amount of seed data in order to make the selection of a limited amount of additional audio for transcription, and an unsupervised approach in which no intermediate system recognition system built from seed data is necessary. Our best unsupervised approach performs quite close to our supervised approach, with both outperforming a random selection scheme.