Conference paper

Spoken question answering for visual queries

Abstract

Visual question answering models are textual language models with additional visual input. These models can provide various types of information about images. In this work, we take an existing visual question answering model (LLaVA) and extend it to support spoken questions. This is done by adding a speech encoder and aligning its output to the input space of the language model. The resulting model accepts textual, visual, and spoken inputs and provides spoken visual question answering functionality.
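
As a rough illustration of the alignment step, the sketch below projects speech-encoder features into the language model's embedding dimension and concatenates them with LLaVA's visual tokens. The module names, dimensions, and the choice of a single linear projection are assumptions for illustration, not the paper's exact implementation.

```python
# Minimal sketch of the extended architecture (shapes and module names are
# illustrative assumptions, not the paper's exact implementation).
import torch
import torch.nn as nn

class SpeechAdapter(nn.Module):
    """Projects speech-encoder features into the language model's embedding space."""
    def __init__(self, speech_dim: int = 768, lm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Linear(speech_dim, lm_dim)

    def forward(self, speech_features: torch.Tensor) -> torch.Tensor:
        # speech_features: (batch, num_frames, speech_dim)
        return self.proj(speech_features)  # (batch, num_frames, lm_dim)

# The projected speech tokens are concatenated with the visual tokens
# (from the vision encoder and projector) and any text-prompt embeddings,
# and the combined sequence is fed to the language model.
def build_lm_input(visual_tokens, speech_tokens, text_embeds):
    return torch.cat([visual_tokens, speech_tokens, text_embeds], dim=1)
```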

Training and testing this model requires appropriate datasets, but such datasets are currently not available. We address this problem by generating several datasets using human speech and by converting other datasets using synthesized speech.
We examine the impact of synthesized speech on training and testing by using two high-quality TTS models.
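
A possible conversion pipeline for the synthesized datasets is sketched below; the TTS interface (`tts.synthesize`), the dataset fields, and the output format are hypothetical placeholders, since the paper does not specify the exact loaders or TTS systems used.

```python
# Sketch of converting a text-based VQA dataset into a spoken one with TTS.
# `tts.synthesize` is a placeholder for whatever TTS system is actually used.
import soundfile as sf  # assumes waveforms are saved as standard audio files

def synthesize_spoken_questions(samples, tts, out_dir, sample_rate=16000):
    spoken_samples = []
    for i, sample in enumerate(samples):
        waveform = tts.synthesize(sample["question"])  # text -> audio array
        path = f"{out_dir}/question_{i:06d}.wav"
        sf.write(path, waveform, sample_rate)
        spoken_samples.append({
            "image": sample["image"],
            "question_audio": path,
            "answer": sample["answer"],
        })
    return spoken_samples
```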

We find that even when trained on synthetic data, the model achieves good accuracy, although not as high as when the questions are text-only. We also show that the choice of TTS model has only a small impact on accuracy.