Publication
INTERSPEECH 2021
Conference paper
Improving customization of neural transducers by mitigating acoustic mismatch of synthesized audio
Abstract
Customization of automatic speech recognition (ASR) models using text data from a target domain is essential to deploying ASR in various domains. End-to-end (E2E) modeling for ASR has made remarkable progress, but the advantage of E2E modeling, where all neural network parameters are jointly optimized, is offset by the challenge of customizing such models. In conventional hybrid models, it is easy to directly modify a language model or a lexicon using text data, but this is not true for E2E models. One popular approach for customizing E2E models uses audio synthesized from the target domain text, but the acoustic mismatch between the synthesized and real audio can be problematic. We propose a method that avoids the negative effect of synthesized audio by (1) adding a mapping network before the encoder network to map the acoustic features of the synthesized audio to those of the source domain, (2) training the added mapping network using text and synthesized audio from the source domain while freezing all layers in the E2E model, (3) training the E2E model with text and synthesized audio from the target domain, and (4) removing the added mapping network when decoding real audio from the target domain. Experiments on customizing RNN Transducer and Conformer Transducer models demonstrate the advantage of the proposed method over encoder freezing, a popular customization method for E2E models.
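The four steps above can be illustrated with a deliberately tiny sketch. It is not the paper's implementation. Everything in it is an assumption made for illustration: features are scalars, the "E2E model" is a single weight `w`, the mapping network is an affine map `a * s + b`, the function `synthesize` stands in for the acoustic mismatch of synthesized audio, and the mapping network is kept frozen during step (3), which the abstract does not state. A naive baseline that trains `w` directly on synthesized features is included for comparison.

```python
# Toy sketch of the four-step procedure (all names and numbers are hypothetical).

def synthesize(x):
    """Assumed acoustic mismatch: synthesized features differ from real ones."""
    return 0.5 * x + 1.0

def fit(params, trainable, inputs, labels, lr=0.02, steps=5000):
    """Full-batch gradient descent on MSE; only keys in `trainable` are updated."""
    p = dict(params)
    n = len(inputs)
    for _ in range(steps):
        grad = {k: 0.0 for k in trainable}
        for s, y in zip(inputs, labels):
            mapped = p["a"] * s + p["b"]      # mapping network
            err = p["w"] * mapped - y         # toy E2E model output minus label
            if "a" in grad:
                grad["a"] += 2 * err * p["w"] * s / n
            if "b" in grad:
                grad["b"] += 2 * err * p["w"] / n
            if "w" in grad:
                grad["w"] += 2 * err * mapped / n
        for k in grad:
            p[k] -= lr * grad[k]
    return p

real = [0.0, 1.0, 2.0, 3.0, 4.0]              # real acoustic features
synth = [synthesize(x) for x in real]         # synthesized counterparts
source_labels = [2.0 * x for x in real]       # source domain: y = 2x
target_labels = [3.0 * x for x in real]       # target domain: y = 3x

# (1) Add a mapping network (initialized to identity) before the pretrained model.
pretrained = {"w": 2.0, "a": 1.0, "b": 0.0}

# (2) Train only the mapping network on source-domain synthesized data;
#     the E2E model weight w stays frozen.
stage2 = fit(pretrained, {"a", "b"}, synth, source_labels)

# (3) Train the E2E model on target-domain synthesized data
#     (mapping network frozen here, an assumption of this sketch).
stage3 = fit(stage2, {"w"}, synth, target_labels)

# (4) Remove the mapping network and decode real target-domain audio with w alone.
w_proposed = stage3["w"]

# Naive baseline: train w directly on synthesized features, no mapping network.
w_naive = fit(pretrained, {"w"}, synth, target_labels)["w"]

def real_audio_mse(w):
    """Error when decoding real target-domain features without any mapping."""
    return sum((w * x - y) ** 2 for x, y in zip(real, target_labels)) / len(real)

print("mapping network:", stage2["a"], stage2["b"])
print("proposed MSE on real audio:", real_audio_mse(w_proposed))
print("naive MSE on real audio:", real_audio_mse(w_naive))
```

In this toy setting the mapping network learns to undo the synthetic distortion, so the model weight learned in step (3) transfers to real features once the mapping network is removed, whereas the naive baseline absorbs the distortion into `w`. Whether the same holds for real transducer models is what the paper's experiments evaluate; the sketch makes no claim about that.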