ICIP 2001
Conference paper

Improved ROI and within frame discriminant features for lipreading


We study three aspects of designing appearance-based visual features for automatic lipreading: (a) the choice of the video region of interest (ROI) from which image transform features are obtained; (b) the extraction of speech discriminant features at each frame; and (c) the use of temporal information to improve visual speech modeling. With respect to (a), we propose an ROI that includes the speaker's jaw and cheeks in addition to the traditionally used mouth/lip region. With respect to (b) and (c), we propose a two-stage linear discriminant analysis, applied first within each frame and then across a large number of frames. On a large-vocabulary, continuous speech audio-visual database, the proposed visual features yield a 13% absolute reduction in visual-only word error rate over a baseline visual front end, and an additional 28% relative improvement in audio-visual over audio-only phonetic classification accuracy.
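
The two-stage discriminant analysis described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that each frame already has an image transform feature vector and a speech class label, and that the across-frame stage operates on the concatenation of each frame with its J neighbors on either side; the abstract states none of these details. The names `fit_lda`, `stack_frames`, and `two_stage_lda`, and the parameters `d1`, `J`, and `d2`, are hypothetical.

```python
import numpy as np

def fit_lda(X, y, n_components):
    """Estimate an LDA projection matrix of shape (dim, n_components)."""
    dim = X.shape[1]
    mean = X.mean(axis=0)
    Sw = np.zeros((dim, dim))  # within-class scatter
    Sb = np.zeros((dim, dim))  # between-class scatter
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        d = (mc - mean)[:, None]
        Sb += len(Xc) * (d @ d.T)
    Sw += 1e-6 * np.eye(dim)  # small ridge for numerical stability (assumption)
    # Directions that maximize between-class over within-class scatter.
    evals, evecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    order = np.argsort(-evals.real)[:n_components]
    return evecs[:, order].real

def stack_frames(F, J):
    """Concatenate each frame with J neighbors on each side (edges repeated)."""
    T = F.shape[0]
    padded = np.pad(F, ((J, J), (0, 0)), mode="edge")
    return np.hstack([padded[k:k + T] for k in range(2 * J + 1)])

def two_stage_lda(X, y, d1, J, d2):
    """Stage 1: LDA within each frame. Stage 2: LDA across stacked frames."""
    W1 = fit_lda(X, y, d1)          # within-frame projection
    stacked = stack_frames(X @ W1, J)
    W2 = fit_lda(stacked, y, d2)    # across-frame projection
    return stacked @ W2
```

Here `X` would hold one row of ROI transform features per video frame and `y` the per-frame class labels. Since LDA yields at most one fewer useful direction than the number of classes, `d1` and `d2` would normally be chosen below that limit.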





