Publication
INTERSPEECH 2022
Conference paper
WAVPROMPT: Towards Few-Shot Spoken Language Understanding with Frozen Language Models
Abstract
Large-scale auto-regressive language models pretrained on massive text corpora have demonstrated an impressive ability to perform new natural language tasks from only a few text examples, without the need for fine-tuning. Recent studies further show that this few-shot learning ability can be extended to the text-image setting by training an encoder to encode images into embeddings that function like the text embeddings of the language model. Interested in exploring the possibility of transferring the few-shot learning ability to the audio-text setting, we propose a novel speech understanding framework, WAVPROMPT, in which we finetune a wav2vec model to generate a sequence of audio embeddings understood by the language model. We show that WAVPROMPT is a few-shot learner that can perform speech understanding tasks better than a naïve text baseline. We conduct detailed ablation studies on different components and hyperparameters to empirically identify the best model configuration. In addition, we conduct a non-speech understanding experiment to show that WAVPROMPT can extract more information than just the transcriptions. The source code is available at https://github.com/Hertin/WavPrompt.
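To make the framework concrete, the following is a minimal sketch of the core idea described in the abstract: a wav2vec encoder produces audio embeddings that are projected into the embedding space of a frozen language model and prepended to the text prompt. The specific checkpoints (facebook/wav2vec2-base, gpt2), the linear projection, and the downsampling factor are illustrative assumptions, not the paper's exact configuration; consult the linked repository for the actual implementation.

```python
# Illustrative sketch of the WavPrompt idea (assumed components, not the paper's exact setup).
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model, GPT2LMHeadModel

class WavPromptSketch(nn.Module):
    def __init__(self, downsample=4):
        super().__init__()
        # Audio encoder is finetuned; language model stays frozen.
        self.audio_encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
        self.lm = GPT2LMHeadModel.from_pretrained("gpt2")
        for p in self.lm.parameters():
            p.requires_grad = False
        self.downsample = downsample
        # Project wav2vec features into the LM's token-embedding space.
        self.proj = nn.Linear(self.audio_encoder.config.hidden_size,
                              self.lm.config.n_embd)

    def audio_embeddings(self, waveform):
        # waveform: (batch, samples) of 16 kHz audio
        feats = self.audio_encoder(waveform).last_hidden_state   # (B, T, hidden)
        feats = feats[:, ::self.downsample, :]                    # temporal downsampling
        return self.proj(feats)                                   # (B, T', n_embd)

    def forward(self, waveform, prompt_ids):
        # Prepend audio embeddings to the embedded text prompt and let the
        # frozen LM continue the sequence (e.g. to emit a class label).
        audio_emb = self.audio_embeddings(waveform)
        text_emb = self.lm.transformer.wte(prompt_ids)            # (B, L, n_embd)
        inputs = torch.cat([audio_emb, text_emb], dim=1)
        return self.lm(inputs_embeds=inputs)
```

In a few-shot setting, demonstration examples (audio embeddings followed by their answers) could be concatenated in front of the query in the same way text demonstrations are used for in-context learning with a frozen language model.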