Publication
INTERSPEECH - Eurospeech 2003
Conference paper
Audio-visual speech recognition in challenging environments
Abstract
Visual speech information is known to improve the accuracy and noise robustness of automatic speech recognizers. However, to date, all audio-visual ASR work has concentrated on "visually clean" data with limited variation in the speaker's frontal pose, lighting, and background. In this paper, we investigate audio-visual ASR in two practical environments that present significant challenges to robust visual processing: (a) typical offices, where data are recorded by means of a portable PC equipped with an inexpensive web camera, and (b) automobiles, with data collected at three approximate speeds. The performance of all components of a state-of-the-art audio-visual ASR system is reported on these two sets, benchmarked against "visually clean" data recorded in a studio-like environment. Not surprisingly, both audio-only and visual-only ASR degrade, more than doubling their respective word error rates. Nevertheless, visual speech remains beneficial to ASR.