Improving RNN-Transducers with Acoustic Lookahead
Abstract
RNN-Transducers (RNN-Ts) have gained widespread acceptance as an end-to-end model for speech-to-text conversion. A typical RNN-T generates text autoregressively by encoding the audio and the previously emitted text independently, then combining the two encodings with a thin joint network. While the existing RNN-T architecture provides state-of-the-art streaming accuracy, its outputs often rely too heavily on the language context, generating the next word without support from the acoustic evidence. We address this inherent limitation of RNN-T models by conditioning the text encoder on (noisy) hypotheses derived purely from the audio encoder for a fixed number of future time steps. This technique yields a significant 30% relative reduction in word error rate on the LibriSpeech benchmark, with no degradation relative to baseline performance on out-of-domain evaluation sets.
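To make the idea concrete, the sketch below shows one possible way to wire acoustic lookahead into an RNN-T. It is a minimal illustration, not the authors' implementation: the module names (`LookaheadRNNT`, `ctc_head`, `lookahead_k`), layer sizes, and the simplistic choice of which future frames feed the lookahead are all assumptions made for exposition. The key pattern it demonstrates is that the text (prediction) network receives, in addition to the previous tokens, greedy token hypotheses produced from the audio encoder alone for a few future frames, before the standard thin joint network combines the two encodings.

```python
# Minimal sketch (assumed names and shapes) of an RNN-T whose text encoder is
# conditioned on acoustic-only lookahead hypotheses. Illustrative only.
import torch
import torch.nn as nn


class LookaheadRNNT(nn.Module):
    def __init__(self, vocab_size, audio_dim=80, hidden=256, lookahead_k=2):
        super().__init__()
        self.lookahead_k = lookahead_k
        # Audio (transcription) encoder
        self.audio_enc = nn.LSTM(audio_dim, hidden, batch_first=True)
        # Acoustic-only classifier used to produce noisy lookahead hypotheses
        self.ctc_head = nn.Linear(hidden, vocab_size)
        # Text (prediction) network: input = previous token + K lookahead tokens
        self.embed = nn.Embedding(vocab_size, hidden)
        self.text_enc = nn.LSTM(hidden * (1 + lookahead_k), hidden, batch_first=True)
        # Thin joint network combining the audio and text encodings
        self.joint = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.Tanh(), nn.Linear(hidden, vocab_size)
        )

    def forward(self, audio, prev_tokens):
        # audio: (B, T, audio_dim); prev_tokens: (B, U) previously emitted tokens
        f, _ = self.audio_enc(audio)                       # (B, T, H)
        # Greedy, acoustic-only hypotheses per frame (the "noisy" lookahead labels)
        acoustic_hyp = self.ctc_head(f).argmax(-1)         # (B, T)
        # Toy lookahead: take hypotheses for the first K frames; a real system
        # would select the K frames ahead of the current decoding position.
        look = acoustic_hyp[:, : self.lookahead_k]         # (B, K)
        look_emb = self.embed(look).flatten(1)             # (B, K*H)
        # Condition the text encoder on previous tokens plus lookahead embeddings
        tok_emb = self.embed(prev_tokens)                  # (B, U, H)
        look_rep = look_emb.unsqueeze(1).expand(-1, tok_emb.size(1), -1)
        g, _ = self.text_enc(torch.cat([tok_emb, look_rep], dim=-1))  # (B, U, H)
        # Standard RNN-T joint over all (t, u) pairs
        logits = self.joint(torch.cat(
            [f.unsqueeze(2).expand(-1, -1, g.size(1), -1),
             g.unsqueeze(1).expand(-1, f.size(1), -1, -1)], dim=-1))  # (B, T, U, V)
        return logits
```

In this sketch the lookahead hypotheses are produced by an auxiliary frame-level classifier on top of the audio encoder; during training they are noisy, which matches the paper's framing of conditioning on "(noisy) hypotheses purely based on the audio encoder."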