Publication
ICSLP 2000
Conference paper
Stream confidence estimation for audio-visual speech recognition
Abstract
We investigate the use of single-modality confidence measures as a means of estimating adaptive, local weights for improved audio-visual automatic speech recognition. We limit our work to the toy problem of audio-visual phonetic classification by means of a two-stream Gaussian mixture model (GMM), where each stream models the class-conditional audio-only or visual-only observation probability, raised to an appropriate exponent. We consider such stream exponents as two-dimensional piecewise-constant functions of the local audio and visual stream confidences, and we estimate them by minimizing the misclassification error on a held-out data set. Three stream confidence measures are investigated, namely the stream entropy, the n-best likelihood ratio average, and an n-best stream likelihood dispersion measure. The latter results in superior audio-visual phonetic classification, as indicated by our experiments on a 260-subject, 40-hour, large-vocabulary, continuous-speech audio-visual dataset. By using local, dispersion-based stream exponents, we achieve an additional 20% phone classification accuracy improvement over the improvement that global stream exponents add to clean audio-only phonetic classification. The performance of the algorithm, however, still falls significantly short of an "oracle" (cheating) confidence estimation scheme.
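To make the abstract's formulation concrete, the sketch below illustrates the two-stream GMM score combination and one plausible reading of the three stream confidence measures. This is not the paper's implementation: the function names, the exact definitions of the confidence measures, and the example exponent values are assumptions based on the standard two-stream formulation (where stream exponents are often constrained to sum to one); the paper's precise formulas may differ.

```python
import numpy as np

def two_stream_log_score(log_p_audio, log_p_visual, lam_audio, lam_visual):
    """Per-class score of a two-stream GMM: each stream's class-conditional
    likelihood is raised to its stream exponent, i.e. the per-class
    log-likelihoods are combined linearly with weights lam_audio, lam_visual."""
    return (lam_audio * np.asarray(log_p_audio, dtype=float)
            + lam_visual * np.asarray(log_p_visual, dtype=float))

def stream_entropy(log_p):
    """Entropy of the class posterior implied by one stream's likelihoods
    (assuming uniform class priors); low entropy suggests a confident stream."""
    log_p = np.asarray(log_p, dtype=float)
    m = log_p.max()
    log_post = log_p - (m + np.log(np.exp(log_p - m).sum()))  # normalize stably
    return -np.sum(np.exp(log_post) * log_post)

def nbest_likelihood_ratio_average(log_p, n):
    """Average log-likelihood ratio between the best class and the
    remaining n-1 best classes."""
    top = np.sort(np.asarray(log_p, dtype=float))[::-1][:n]
    return np.mean(top[0] - top[1:])

def nbest_dispersion(log_p, n):
    """Average pairwise log-likelihood difference among the n best classes;
    larger dispersion suggests a more discriminative stream."""
    top = np.sort(np.asarray(log_p, dtype=float))[::-1][:n]
    pairs = [top[i] - top[j] for i in range(n) for j in range(i + 1, n)]
    return np.mean(pairs)

# Toy usage: per-class log-likelihoods from each stream for one frame.
log_p_a = np.log([0.60, 0.25, 0.10, 0.05])   # confident audio stream
log_p_v = np.log([0.30, 0.28, 0.22, 0.20])   # ambiguous visual stream
print(stream_entropy(log_p_a), stream_entropy(log_p_v))
print(nbest_dispersion(log_p_a, 3), nbest_dispersion(log_p_v, 3))

# A local scheme as described would look up (lam_audio, lam_visual) in a
# piecewise-constant table indexed by the two confidences, then classify:
scores = two_stream_log_score(log_p_a, log_p_v, lam_audio=0.8, lam_visual=0.2)
print(int(np.argmax(scores)))  # predicted phonetic class
```

In this toy setting the audio stream yields lower entropy and higher dispersion than the visual one, so a confidence-driven table would assign it the larger exponent, which is the intuition behind the adaptive weighting the abstract describes.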