Global RNN Transducer Models For Multi-dialect Speech Recognition
Constructing a single, unified automatic speech recognition (ASR) model that works effectively across the dialects of a language is a challenging problem. Although many recently proposed approaches are effective, they are computationally more expensive than the conventional approach of training a separate ASR model for each dialect. In this paper, we propose a novel modeling technique for constructing accurate multi-dialect speech recognition systems with a single unified model, based on recurrent neural network transducers (RNN-T), which incurs no extra computational cost at decoding time. Once the model has been trained, the same decoding settings can be used across all dialects. In our proposed approach, an RNN-T model with a shared encoder, a common joint network, and multi-branch prediction networks is first constructed. After training each prediction network on the ASR task for its corresponding dialect, an effective interpolation step combines the multi-branch prediction networks back into a computationally efficient single branch. The effectiveness of the proposed technique is demonstrated on ASR tasks covering major English dialects. The proposed method approaches oracle performance and improves by 15-30% relative over dialect-specific models under dialect-agnostic conditions.
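The interpolation step described above, which folds the per-dialect prediction-network branches back into one branch, can be sketched as a weighted average over the branches' parameters. The function name, the dictionary-of-arrays representation of network weights, and the mixture weights below are illustrative assumptions, not the paper's exact procedure:

```python
import numpy as np


def interpolate_prediction_networks(branch_params, mixture):
    """Combine per-dialect prediction-network parameters into a single
    branch by weighted averaging (a hypothetical sketch of the paper's
    interpolation step).

    branch_params: list of dicts mapping parameter name -> np.ndarray,
                   one dict per dialect branch (all with the same keys
                   and shapes).
    mixture:       per-branch interpolation weights summing to 1.
    """
    assert abs(sum(mixture) - 1.0) < 1e-9, "mixture weights must sum to 1"
    combined = {}
    for name in branch_params[0]:
        # Weighted sum of the same parameter tensor across all branches.
        combined[name] = sum(w * p[name] for w, p in zip(mixture, branch_params))
    return combined


# Toy usage: two dialect branches, equal-weight interpolation.
us_branch = {"embedding": np.array([2.0, 0.0])}
uk_branch = {"embedding": np.array([4.0, 2.0])}
single_branch = interpolate_prediction_networks([us_branch, uk_branch], [0.5, 0.5])
```

Because the result is a single ordinary prediction network, the decoder runs with the same cost and settings as a standard dialect-specific RNN-T model, which is the efficiency property the abstract emphasizes.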