Alignment-Length Synchronous Decoding for RNN Transducer
We present a beam decoding strategy for recurrent neural network transducers in which all competing hypotheses within the beam have the same alignment length (the number of output symbols plus BLANK symbols). We contrast the proposed technique with time-synchronous decoding, in which the competing hypotheses within the beam correspond to the same input frames but can have output sequences of different lengths. Experiments on the Switchboard 2000-hour corpus show that alignment-length synchronous decoding (ALSD) is 25% faster than time-synchronous decoding (TSD) at the same accuracy because ALSD performs 42% fewer joint network evaluations and hypothesis expansions during the search. Additionally, we discuss the benefits of caching and batching the prediction and joint network evaluations, of using prefix trees instead of full output vocabulary expansions, and of performing hypothesis recombination after pruning. With open beam decoding, we reach word error rates of 6.2% and 10.9% on the Switchboard and CallHome test sets of the Hub5 2000 evaluation, respectively, which compares favorably with other published single-model results on this corpus.
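To make the synchronization criterion concrete, the following is a minimal Python sketch of a search in which every hypothesis in the beam has consumed the same number of alignment symbols. It is an illustration under stated assumptions, not the implementation evaluated in the experiments: the scoring interface `joint(t, labels)`, the helper `log_add`, the toy model `toy_joint`, the label cap `max_labels`, and the choice to merge duplicate hypotheses by summing their probabilities are all hypothetical. Caching, batching, and prefix trees are omitted.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

BLANK = 0  # assumption: index 0 of every score vector is the BLANK symbol

Labels = Tuple[int, ...]
# Hypothetical interface: log-probabilities over the vocabulary for
# input frame t, given the labels emitted so far.
JointFn = Callable[[int, Labels], Sequence[float]]


def log_add(a: float, b: float) -> float:
    """Return log(exp(a) + exp(b)), treating -inf as probability zero."""
    if a == -math.inf:
        return b
    if b == -math.inf:
        return a
    hi, lo = max(a, b), min(a, b)
    return hi + math.log1p(math.exp(lo - hi))


def alsd(joint: JointFn, num_frames: int, max_labels: int, beam: int):
    """Return (labels, log score) of the best finished hypothesis."""
    hyps: Dict[Labels, float] = {(): 0.0}
    final: Dict[Labels, float] = {}
    # Each iteration appends exactly one symbol (label or BLANK) to every
    # hypothesis, so all hypotheses share the alignment length `step`.
    for step in range(num_frames + max_labels):
        candidates: List[Tuple[Labels, float]] = []
        for labels, score in hyps.items():
            t = step - len(labels)  # BLANKs emitted so far = current frame
            logp = joint(t, labels)
            if t == num_frames - 1:
                # BLANK on the last frame completes the alignment.
                final[labels] = score + logp[BLANK]
            else:
                candidates.append((labels, score + logp[BLANK]))
            if len(labels) < max_labels:
                for k in range(1, len(logp)):
                    candidates.append((labels + (k,), score + logp[k]))
        # Prune first, then merge candidates with identical label sequences.
        # Summing probabilities on merge is an assumption of this sketch.
        candidates.sort(key=lambda c: c[1], reverse=True)
        hyps = {}
        for labels, score in candidates[:beam]:
            hyps[labels] = log_add(hyps.get(labels, -math.inf), score)
        if not hyps:
            break
    return max(final.items(), key=lambda kv: kv[1])


TARGET = (1, 2)  # toy reference labels, for illustration only


def toy_joint(t: int, labels: Labels) -> List[float]:
    """Toy model: favors TARGET[u] once frame index t has reached u."""
    u = len(labels)
    wanted = TARGET[u] if u <= t and u < len(TARGET) else BLANK
    return [math.log(0.8 if k == wanted else 0.1) for k in range(3)]


best_labels, best_score = alsd(toy_joint, num_frames=3, max_labels=3, beam=4)
```

In this sketch the frame index of a hypothesis is derived from the step counter and its label count, which is what distinguishes it from a time-synchronous search, where the outer loop runs over frames and hypotheses of different alignment lengths compete. The loop bound `num_frames + max_labels` is the longest alignment the sketch allows.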