CVPRW 2019
Conference paper

Grounding spoken words in unlabeled video


In this paper, we explore deep learning models that learn joint multi-modal embeddings from videos in which the audio and visual streams are only loosely synchronized. Specifically, we consider cooking show videos from the YouCook2 dataset and a subset of the YouTube-8M dataset. We introduce varying levels of supervision into the learning process to guide the sampling of audio-visual pairs for training the models. These include (1) a fully unsupervised approach that samples audio-visual segments uniformly from an entire video, and (2) an approach that samples audio-visual segments using weak supervision from off-the-shelf automatic speech and visual recognition systems. Although these models are preliminary, they are capable of learning cross-modal correlations even with no supervision, and with weak supervision we observe a significant amount of cross-modal learning.
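The two sampling regimes described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the fixed segment length, and the idea of centering weakly supervised segments on timestamps flagged by external recognizers are all assumptions made for illustration. Each returned window is meant to be cut from both the audio and the visual stream to form one training pair.

```python
import random


def sample_uniform_segments(duration, segment_len, num_segments, seed=None):
    """Unsupervised case: draw (start, end) windows uniformly from the whole video."""
    if segment_len > duration:
        raise ValueError("segment_len must not exceed duration")
    rng = random.Random(seed)
    segments = []
    for _ in range(num_segments):
        start = rng.uniform(0.0, duration - segment_len)
        segments.append((start, start + segment_len))
    return segments


def sample_anchored_segments(duration, segment_len, anchor_times,
                             num_segments, seed=None):
    """Weakly supervised case (hypothetical): centre windows on timestamps
    that speech and visual recognizers have flagged as informative."""
    if segment_len > duration:
        raise ValueError("segment_len must not exceed duration")
    if not anchor_times:
        raise ValueError("anchor_times must not be empty")
    rng = random.Random(seed)
    segments = []
    for _ in range(num_segments):
        center = rng.choice(anchor_times)
        # Clamp the window so it stays inside the video.
        start = min(max(center - segment_len / 2.0, 0.0), duration - segment_len)
        segments.append((start, start + segment_len))
    return segments


if __name__ == "__main__":
    # All times are in seconds; the anchor timestamps are made-up values.
    print(sample_uniform_segments(120.0, 10.0, 3, seed=0))
    print(sample_anchored_segments(120.0, 10.0, [2.0, 60.0, 119.0], 3, seed=0))
```

In this sketch the only difference between the two regimes is where the windows come from: anywhere in the video, or near moments that a recognizer has marked. How recognizer outputs are actually turned into sampling decisions is not specified in the abstract.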