Improving customization of neural transducers by mitigating acoustic mismatch of synthesized audio

Gakuto Kurata; George Saon; Brian Kingsbury; David Haws; Zoltan Tuske

doi:10.21437/Interspeech.2021-1656

INTERSPEECH 2021

Conference paper

30 Aug 2021

Abstract

Customization of automatic speech recognition (ASR) models using text data from a target domain is essential to deploying ASR in various domains. End-to-end (E2E) modeling for ASR has made remarkable progress, but the advantage of E2E modeling, where all neural network parameters are jointly optimized, is offset by the challenge of customizing such models. In conventional hybrid models, it is easy to directly modify a language model or a lexicon using text data, but this is not true for E2E models. One popular approach for customizing E2E models uses audio synthesized from the target domain text, but the acoustic mismatch between the synthesized and real audio can be problematic. We propose a method that avoids the negative effect of synthesized audio by (1) adding a mapping network before the encoder network to map the acoustic features of the synthesized audio to those of the source domain, (2) training the added mapping network using text and synthesized audio from the source domain while freezing all layers in the E2E model, (3) training the E2E model with text and synthesized audio from the target domain, and (4) removing the added mapping network when decoding real audio from the target domain. Experiments on customizing RNN Transducer and Conformer Transducer models demonstrate the advantage of the proposed method over encoder freezing, a popular customization method for E2E models.
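The four steps in the abstract can be read as a schedule of which components sit in the signal path and which are updated at each stage. The following is a minimal, framework-free Python sketch of that schedule, not the authors' implementation. The names `Stage`, `customization_stages`, and `forward_path` are hypothetical and introduced only for illustration. The abstract does not say whether the mapping network stays frozen during step (3), or which E2E layers are updated there; both are assumptions in this sketch.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Stage:
    """One stage of the customization procedure (hypothetical structure)."""
    step: int
    description: str
    data: Optional[str]       # training data used; None when nothing is trained
    mapping_present: bool     # is the mapping network placed before the encoder?
    mapping_trainable: bool   # are the mapping network's parameters updated?
    e2e_trainable: bool       # are the E2E model's parameters updated?


def customization_stages() -> List[Stage]:
    """Return the four steps from the abstract as an explicit schedule."""
    return [
        # (1) Add a mapping network in front of the encoder.
        Stage(1, "add mapping network before the encoder",
              None, True, False, False),
        # (2) Train only the mapping network; all E2E layers are frozen.
        Stage(2, "train mapping network, E2E model frozen",
              "source-domain text + synthesized audio", True, True, False),
        # (3) Train the E2E model on the target domain.
        # ASSUMPTION: the mapping network is kept frozen here, and the E2E
        # model is treated as one trainable unit; the abstract states neither.
        Stage(3, "train E2E model",
              "target-domain text + synthesized audio", True, False, True),
        # (4) Remove the mapping network when decoding real audio.
        Stage(4, "decode real target-domain audio without mapping network",
              None, False, False, False),
    ]


def forward_path(stage: Stage) -> List[str]:
    """List the components the acoustic features pass through at a stage."""
    path = ["acoustic features"]
    if stage.mapping_present:
        path.append("mapping network")
    path.append("E2E model")
    return path


if __name__ == "__main__":
    for s in customization_stages():
        print(s.step, s.description, "|", " -> ".join(forward_path(s)))
```

The point the sketch makes explicit is that the mapping network only ever sees synthesized audio: it is present during both training stages and absent at decoding time, so real audio goes directly to the encoder.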
