W.H. Adams, Giridharan Iyengar, et al.
EURASIP Journal on Applied Signal Processing
We investigate the use of single-modality confidence measures as a means of estimating adaptive, local weights for improved audio-visual automatic speech recognition. We limit our work to the toy problem of audio-visual phonetic classification by means of a two-stream Gaussian mixture model (GMM), where each stream models the class-conditional audio- or visual-only observation probability, raised to an appropriate exponent. We consider such stream exponents as two-dimensional piecewise-constant functions of the audio and visual stream local confidences, and we estimate them by minimizing the misclassification error on a held-out data set. Three stream confidence measures are investigated, namely the stream entropy, the n-best likelihood ratio average, and an n-best stream likelihood dispersion measure. The latter results in superior audio-visual phonetic classification, as indicated by our experiments on a 260-subject, 40-hour-long, large-vocabulary, continuous-speech audio-visual dataset. By using local, dispersion-based stream exponents, we achieve an additional 20% phone classification accuracy improvement over the improvement that global stream exponents provide over clean audio-only phonetic classification. The performance of the algorithm, however, still falls significantly short of an "oracle" (cheating) confidence estimation scheme.
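The exponent-weighted two-stream GMM combination described above can be sketched in a few lines. This is a hypothetical toy illustration, not the paper's implementation: each class-conditional stream likelihood comes from a diagonal-covariance GMM, its log-likelihood is scaled by a stream exponent (lambda_a for audio, lambda_v for video), and the class maximizing the weighted sum is chosen. The function names and GMM representation are assumptions for illustration.

```python
import math

def gmm_log_likelihood(x, components):
    """log p(x | class) under a diagonal-covariance Gaussian mixture.

    components: list of (weight, means, variances) tuples, one per
    mixture component; x, means, variances are same-length sequences.
    """
    log_terms = []
    for w, mu, var in components:
        ll = math.log(w)
        for xi, mi, vi in zip(x, mu, var):
            # Per-dimension diagonal Gaussian log-density.
            ll += -0.5 * (math.log(2.0 * math.pi * vi) + (xi - mi) ** 2 / vi)
        log_terms.append(ll)
    # Log-sum-exp over mixture components for numerical stability.
    m = max(log_terms)
    return m + math.log(sum(math.exp(t - m) for t in log_terms))

def classify(x_audio, x_video, audio_gmms, video_gmms, lam_a, lam_v):
    """argmax_c  lam_a * log p_A(x_a | c) + lam_v * log p_V(x_v | c).

    audio_gmms / video_gmms: dicts mapping class label -> GMM components.
    lam_a, lam_v: stream exponents (in the paper, functions of the local
    stream confidences; here, plain constants for illustration).
    """
    best_class, best_score = None, -math.inf
    for c in audio_gmms:
        score = (lam_a * gmm_log_likelihood(x_audio, audio_gmms[c])
                 + lam_v * gmm_log_likelihood(x_video, video_gmms[c]))
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```

In the paper, lam_a and lam_v are piecewise-constant functions of the local stream confidences, estimated discriminatively on held-out data; substituting such a lookup for the constants above recovers the adaptive-weight scheme.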
E. Eide, B. Maison, et al.
ICSLP 2000
Djamel Mostefa, Nicolas Moreau, et al.
Language Resources and Evaluation
Iain Matthews, Gerasimos Potamianos, et al.
ICME 2001