Frank R. Libsch, Takatoshi Tsujimura
Active Matrix Liquid Crystal Displays Technology and Applications 1997
Audio-Visual Speech Recognition (AVSR) combines lip-based video with audio and can improve performance in noise, but most methods are trained only on English data. One limitation is the lack of large-scale multilingual video data, which makes it hard to train models from scratch. In this work, we propose mWhisper-Flamingo for multilingual AVSR which combines the strengths of a pre-trained audio model (Whisper) and video model (AV-HuBERT). To enable better multi-modal integration and improve the noisy multilingual performance, we introduce decoder modality dropout where the model is trained both on paired audio-visual inputs and separate audio/visual inputs. mWhisper-Flamingo achieves state-of-the-art WER on MuAViC, an AVSR dataset of 9 languages. Audio-visual mWhisper-Flamingo consistently outperforms audio-only Whisper on all languages in noisy conditions.
Frank R. Libsch, Takatoshi Tsujimura
Active Matrix Liquid Crystal Displays Technology and Applications 1997
Martin Charles Golumbic, Renu C. Laskar
Discrete Applied Mathematics
Richard M. Karp, Raymond E. Miller
Journal of Computer and System Sciences
Jonathan Ashley, Brian Marcus, et al.
Ergodic Theory and Dynamical Systems