Data-driven approach to designing compound words for continuous speech recognition
Abstract
In this paper, we present a new approach to deriving compound words from a training corpus. The motivation for making compound words is because under some assumptions, speech recognition errors occur less frequently in longer words. Furthermore, they also enable more accurate modeling of pronunciation variability at the boundary between adjacent words in a continuously spoken utterance. We introduce a measure based on the product between the direct and the reverse bigram probability of a pair of words for finding candidate pairs in order to create compound words. Our experimental results show that by augmenting both the acoustic vocabulary and the language model with these new tokens, the word recognition accuracy can be improved by absolute 2.8% (7% relative) on a voicemail continuous speech recognition task. We also compare the proposed measure for selecting compound words with other measures that have been described in the literature.