Publication
INTERSPEECH 2024
Conference paper
SALSA: Speedy ASR-LLM Synchronous Aggregation
Abstract
Automatic speech recognition (ASR) systems still lag in performance on low-resource languages. The rise of multilingual large language models (LLMs) offers the potential for effective integration with ASR systems to improve their performance on low-resource languages. One major challenge toward achieving this goal is that the LLM and the ASR system use different tokenizations. In this work, we propose SALSA – a synchronous, lightweight solution to merge pretrained ASR and LLM systems with varying token vocabularies. The LLM’s predictions are tokenized using the ASR system to unroll its decoder; the last ASR decoder state is then mapped using a learnable projection and added as a residual connection to the LLM’s representations. SALSA is parameter-efficient, using learned projection layers for only a select set of layers in the ASR and LLM decoders. We evaluate SALSA on more than 10 low-resource languages in the FLEURS benchmark, yielding substantial WER reductions of up to 36%.
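The core coupling the abstract describes — projecting the last ASR decoder state with a learned matrix and adding it as a residual to the LLM's hidden representation — can be sketched minimally as below. All dimensions, names, and the NumPy stand-in for real model tensors are illustrative assumptions, not details from the paper.

```python
import numpy as np

# Hypothetical hidden sizes (illustrative, not from the paper):
# ASR decoder width 256, LLM width 1024.
d_asr, d_llm = 256, 1024

rng = np.random.default_rng(0)
# Learnable projection: per the abstract, these projection layers are the
# only new parameters, attached to a select set of decoder layers.
W = rng.normal(scale=0.02, size=(d_asr, d_llm))

def salsa_couple(asr_state, llm_hidden):
    """Map the last ASR decoder state into the LLM's representation
    space and add it as a residual connection."""
    return llm_hidden + asr_state @ W

asr_state = rng.normal(size=(d_asr,))   # last ASR decoder state
llm_hidden = rng.normal(size=(d_llm,))  # LLM layer representation
fused = salsa_couple(asr_state, llm_hidden)
print(fused.shape)
```

In the actual system the two decoders run synchronously: the LLM's predicted tokens are re-tokenized with the ASR vocabulary to advance the ASR decoder, whose state then feeds this residual path.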