Publication
ICMI 2024
Conference paper
DoubleDistillation: Enhancing LLMs for Informal Text Analysis using Multistage Knowledge Distillation from Speech and Text
Abstract
Traditional large language models (LLMs) leverage extensive text corpora but lack access to the acoustic and paralinguistic cues present in speech. There is growing interest in enhancing text-based models with audio information. However, current approaches often require an aligned audio-text dataset, which is typically much smaller than standard language model training corpora, and they often require both text and audio streams during inference/testing. In this study, we introduce a novel two-stage knowledge distillation (KD) approach that enables language models to (a) incorporate rich acoustic and paralinguistic information from speech, (b) utilize text corpora comparable in size to typical language model training data, and (c) support text-only analysis without requiring an audio stream during inference/testing. Specifically, we employ a pre-trained speech embedding teacher model (OpenAI Whisper) to train a Teacher Assistant (TA) model on an aligned audio-text dataset in the first stage. In the second stage, the TA's knowledge is transferred to a student language model trained on a conventional text dataset. Thus, our two-stage KD method leverages both the acoustic and paralinguistic cues in the aligned audio-text data and the nuanced linguistic knowledge in a large text-only dataset. Based on our evaluation, this DoubleDistillation system consistently outperforms traditional LLMs on 15 informal text understanding tasks.
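To make the two-stage setup concrete, below is a minimal PyTorch sketch of the training flow described in the abstract. It is not the paper's implementation: the abstract does not specify the architectures, losses, or hyperparameters, so the toy encoders, the mean-squared-error embedding-matching loss, the `alpha` weighting, and the classification head are all assumptions made for illustration; the Whisper speech embeddings are stood in for by random tensors.

```python
# Hedged sketch of two-stage knowledge distillation (speech teacher -> TA -> student).
# Architectures, loss forms, and hyperparameters are assumptions, not the paper's.

import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    """Toy stand-in for a text encoder (used for both the TA and the student LM)."""
    def __init__(self, vocab_size=30522, dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)

    def forward(self, token_ids):
        h, _ = self.rnn(self.embed(token_ids))
        return h.mean(dim=1)  # one pooled embedding per utterance

def stage1_step(speech_embed, token_ids, ta, optimizer):
    """Stage 1: on aligned audio-text pairs, the TA learns to reproduce the
    frozen speech teacher's embedding from the text alone."""
    pred = ta(token_ids)
    loss = F.mse_loss(pred, speech_embed)   # distillation loss (assumed form)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()

def stage2_step(token_ids, labels, ta, student, head, optimizer, alpha=0.5):
    """Stage 2: on a large text-only corpus, the student matches the frozen TA's
    embeddings while also optimizing its downstream task loss."""
    with torch.no_grad():
        target = ta(token_ids)               # the TA now acts as the teacher
    emb = student(token_ids)
    kd_loss = F.mse_loss(emb, target)
    task_loss = F.cross_entropy(head(emb), labels)
    loss = alpha * kd_loss + (1 - alpha) * task_loss
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()

# Usage with random tensors standing in for real batches:
ta, student = Encoder(), Encoder()
head = nn.Linear(256, 2)                     # e.g. a binary informal-text task
tokens = torch.randint(0, 30522, (4, 16))
speech_emb = torch.randn(4, 256)             # placeholder for Whisper speech features
stage1_step(speech_emb, tokens, ta, torch.optim.Adam(ta.parameters()))
ta.eval()
stage2_step(tokens, torch.randint(0, 2, (4,)), ta, student, head,
            torch.optim.Adam(list(student.parameters()) + list(head.parameters())))
```

Note that the second stage needs no audio at all, which is what allows the resulting student model to run text-only at inference time.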