Efficient likelihood computation in multi-stream HMM based audio-visual speech recognition
Abstract
Multi-stream hidden Markov models have recently been introduced in the field of automatic speech recognition as an alternative to single-stream modeling of sequences of speech-informative features. In particular, they have been very successful in audio-visual speech recognition, where features extracted from video of the speaker's lips are also available. However, in contrast to single-stream modeling, their use during decoding is computationally intensive, as it requires calculating class-conditional likelihoods of the observations in the added stream. In this paper, we propose a technique that reduces this overhead by drastically limiting the number of observation probabilities computed for the visual stream. The algorithm estimates a joint co-occurrence mapping between the Gaussian mixture components that separately model the audio and visual observations, and uses it to select which visual mixture components to evaluate, given the already selected audio ones. We report experiments with this scheme on a connected-digits audio-visual database, demonstrating significant decoding speed gains with only about 5% of the visual Gaussian components requiring evaluation, as compared to independent evaluation of the audio and visual likelihoods.
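To make the selection idea concrete, the following is a minimal Python sketch, not the paper's actual implementation: it assumes diagonal-covariance Gaussian mixtures, a hypothetical co-occurrence table `cooc` (estimated offline on training data, mapping each audio component index to the visual component indices observed to co-occur with it), and a list `audio_top` of the best-scoring audio components already found during decoding. Only the visual components retrieved through `cooc` are evaluated.

```python
import numpy as np


def log_gaussian(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at point x."""
    return -0.5 * np.sum(np.log(2.0 * np.pi * var) + (x - mean) ** 2 / var)


def visual_log_likelihood(x_v, weights, means, variances, audio_top, cooc):
    """Approximate visual-stream GMM log-likelihood.

    Evaluates only the visual mixture components that co-occur with the
    highest-scoring audio components, per the co-occurrence mapping.

    x_v       : visual observation vector
    weights   : mixture weights of the visual GMM
    means     : per-component mean vectors (rows)
    variances : per-component diagonal variances (rows)
    audio_top : indices of the best-scoring audio mixture components
    cooc      : dict mapping an audio component index to a sequence of
                co-occurring visual component indices (illustrative name)
    """
    selected = set()
    for a in audio_top:
        selected.update(cooc.get(a, ()))
    if not selected:
        # Fall back to full evaluation if no co-occurrence entry exists.
        selected = range(len(weights))
    log_terms = [
        np.log(weights[k]) + log_gaussian(x_v, means[k], variances[k])
        for k in selected
    ]
    # Log-sum-exp over the selected components only.
    return np.logaddexp.reduce(log_terms)
```

Under the reported operating point, `selected` would cover roughly 5% of the visual components per frame, so the visual likelihood cost shrinks proportionally while the audio stream is evaluated as usual.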