Conference paper

UniAVLM: Unified Large Audio-Visual Language Models for Comprehensive Video Understanding

Abstract

Modern video understanding requires integrating multimodal signals, but current Multimodal Large Language Models (MLLMs) often process audio and visual streams separately, missing key cross-modal relationships and yielding a fragmented, disjointed audio-visual representation. In this work, we propose UniAVLM, a large audio-visual language model for comprehensive video understanding. UniAVLM first employs Whisper-style audio feature extraction to capture relevant auditory information. We then introduce spatiotemporal position encoding to enrich the video representation with temporal dynamics. Finally, we implement cross-modal attention mechanisms that explicitly fuse the audio and visual features, allowing the model to learn the intricate relationships between these modalities and to form a cohesive multimodal representation. We conduct extensive experiments on the Audio-Visual Scene-Aware Dialogue (AVSD) benchmark, comparing our model against seven representative multimodal baselines and demonstrating state-of-the-art performance: UniAVLM achieves 48.91% accuracy and 89.93 BERTScore-F1. Specifically, it outperforms the best vision-language model by 6.79% in accuracy and surpasses the state-of-the-art full multimodal model by 4.07% in accuracy, while using only parameter-efficient fine-tuning. Comprehensive ablation studies highlight the critical impact of lightweight integration strategies and thorough cross-modal fusion on comprehensive video understanding.
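
The sketch below illustrates, in simplified form, the kind of pipeline the abstract describes: audio features (e.g., projected Whisper encoder states) and spatiotemporally position-encoded video features fused through cross-modal attention. It is a minimal, hedged illustration, not the authors' implementation; all module names, dimensions, and hyperparameters are assumptions for exposition only.

```python
# Minimal sketch of an audio-visual fusion pipeline of the kind described in the
# abstract. NOT the UniAVLM implementation; shapes and hyperparameters are assumed.
import torch
import torch.nn as nn


class SpatioTemporalPositionEncoding(nn.Module):
    """Adds learned temporal and spatial position embeddings to video patch features."""

    def __init__(self, dim: int, max_frames: int = 64, max_patches: int = 256):
        super().__init__()
        self.temporal = nn.Parameter(torch.zeros(max_frames, dim))
        self.spatial = nn.Parameter(torch.zeros(max_patches, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, patches, dim)
        b, t, p, d = x.shape
        return x + self.temporal[:t, None, :] + self.spatial[None, :p, :]


class CrossModalFusion(nn.Module):
    """Visual tokens attend to audio tokens and vice versa via cross-attention."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.v2a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.a2v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_v = nn.LayerNorm(dim)
        self.norm_a = nn.LayerNorm(dim)

    def forward(self, vis: torch.Tensor, aud: torch.Tensor) -> torch.Tensor:
        # vis: (batch, n_vis, dim); aud: (batch, n_aud, dim)
        vis = vis + self.v2a(self.norm_v(vis), aud, aud, need_weights=False)[0]
        aud = aud + self.a2v(self.norm_a(aud), vis, vis, need_weights=False)[0]
        # Joint audio-visual token sequence, e.g. to be fed to an LLM backbone.
        return torch.cat([vis, aud], dim=1)


if __name__ == "__main__":
    dim = 512
    vis = torch.randn(2, 8, 16, dim)   # 8 frames x 16 patches of visual features
    aud = torch.randn(2, 50, dim)      # hypothetical projected audio-encoder states
    vis = SpatioTemporalPositionEncoding(dim)(vis).flatten(1, 2)
    fused = CrossModalFusion(dim)(vis, aud)
    print(fused.shape)                 # torch.Size([2, 178, 512])
```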