Audio-visual word spotting
Abstract
Word spotting is the task of detecting and verifying specific words embedded in unconstrained speech. Most Hidden Markov Model (HMM)-based word spotters share the noise robustness problem of speech recognizers: their performance drops significantly in noisy environments. Visual speech information has been shown to improve the noise robustness of speech recognizers [1][2][3]. In this paper, we incorporate visual speech information to improve the noise robustness of the word spotter. In the visual front-end processing, the Information-Based Maximum Discrimination (IBMD) [4] algorithm is used to detect the face/mouth corners. For audio-visual fusion, feature-level fusion is adopted. We compare the audio-visual word spotter with the audio-only spotter and show the advantage of the former over the latter.
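To make the idea of feature-level fusion concrete, the following is a minimal sketch and not the implementation used in this paper. It assumes that fusion means concatenating the audio and visual feature vectors frame by frame, and that the visual features, which typically arrive at a lower frame rate, are first linearly interpolated to the audio frame rate. The function names `upsample_visual` and `fuse_features` and all feature dimensions are hypothetical.

```python
from typing import List, Sequence

Frame = Sequence[float]


def upsample_visual(visual: Sequence[Frame], n_audio_frames: int) -> List[List[float]]:
    """Linearly interpolate visual frames so there is one per audio frame.

    Assumption: both streams cover the same time span, so the first and
    last visual frames align with the first and last audio frames.
    """
    if not visual:
        raise ValueError("visual stream must contain at least one frame")
    if n_audio_frames <= 0:
        return []
    n_vis = len(visual)
    if n_vis == 1 or n_audio_frames == 1:
        # Nothing to interpolate between: repeat the first visual frame.
        return [list(visual[0]) for _ in range(n_audio_frames)]

    out: List[List[float]] = []
    for t in range(n_audio_frames):
        # Fractional position of audio frame t on the visual time axis.
        pos = t * (n_vis - 1) / (n_audio_frames - 1)
        lo = int(pos)
        hi = min(lo + 1, n_vis - 1)
        w = pos - lo
        out.append([(1.0 - w) * a + w * b for a, b in zip(visual[lo], visual[hi])])
    return out


def fuse_features(audio: Sequence[Frame], visual: Sequence[Frame]) -> List[List[float]]:
    """Concatenate audio and rate-matched visual features for each frame."""
    visual_up = upsample_visual(visual, len(audio))
    return [list(a) + v for a, v in zip(audio, visual_up)]


if __name__ == "__main__":
    # Toy data: 5 audio frames (2-dim) and 3 visual frames (1-dim).
    audio = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8], [0.9, 1.0]]
    visual = [[0.0], [1.0], [2.0]]
    for frame in fuse_features(audio, visual):
        print(frame)
```

Under these assumptions, each fused vector would serve as a single observation for the HMM-based spotter, so the audio and visual streams are modeled jointly rather than combined at the decision level.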