Publication
ICASSP 2004
Conference paper
Audio visual word spotting
Abstract
The task of word spotting is to detect and verify specific words embedded in unconstrained speech. Most Hidden Markov Model (HMM)-based word spotters suffer from the same noise-robustness problem as speech recognizers: the performance of a word spotter drops significantly in noisy environments. Visual speech information has been shown to improve the noise robustness of speech recognizers [1][2][3]. In this paper, we incorporate visual speech information to improve the noise robustness of a word spotter. In visual front-end processing, the Information-Based Maximum Discrimination (IBMD) [4] algorithm is used to detect the face and mouth corners. For audio-visual fusion, feature-level fusion is adopted. We compare the audio-visual word spotter with an audio-only spotter and show the advantage of the former approach over the latter.
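The abstract does not specify the exact feature streams used. As a minimal sketch of feature-level (early) fusion, assuming per-frame MFCC-style audio features and lower-dimensional mouth-region visual features, the two streams can be aligned to a common frame rate and concatenated before being passed to the HMM:

```python
import numpy as np

def feature_level_fusion(audio_feats, visual_feats):
    """Concatenate per-frame audio and visual feature vectors.

    audio_feats:  (T, Da) array, e.g. MFCCs at the audio frame rate.
    visual_feats: (Tv, Dv) array, e.g. mouth-region features at video rate.
    The visual stream (typically ~30 fps) is upsampled by nearest-neighbour
    selection to match the audio frame rate (typically ~100 fps).
    """
    T = audio_feats.shape[0]
    # Map each audio frame index to the nearest preceding visual frame.
    idx = np.minimum(
        (np.arange(T) * visual_feats.shape[0]) // T,
        visual_feats.shape[0] - 1,
    )
    # Fused vectors have dimension Da + Dv per frame.
    return np.concatenate([audio_feats, visual_feats[idx]], axis=1)

# Hypothetical sizes: 100 audio frames of 13-dim MFCCs, 30 video frames
# of 10-dim visual features.
fused = feature_level_fusion(np.random.randn(100, 13),
                             np.random.randn(30, 10))
print(fused.shape)  # (100, 23)
```

The function names and dimensions here are illustrative, not taken from the paper; the key point of feature-level fusion is that a single joint observation vector per frame is modeled by one HMM, in contrast to decision-level fusion, which combines separate audio and visual model scores.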