INTERSPEECH - Eurospeech 2003
Conference paper

An architecture for rapid decoding of large vocabulary conversational speech


This paper addresses the question of how to design a large vocabulary recognition system so that it can simultaneously handle a sophisticated language model, perform state-of-the-art speaker adaptation, and run in one times real time (1×RT). The architecture we propose is based on classical HMM Viterbi decoding, but uses an extremely fast initial speaker-independent decoding to estimate VTL warp factors and feature-space and model-space MLLR transformations, which are then used in a final speaker-adapted decoding. We present results on past Switchboard evaluation data that indicate that this strategy compares favorably to published unlimited-time systems (running in several hundred times real time). Coincidentally, this is the system that IBM fielded in the 2003 EARS Rich Transcription evaluation.
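
As a reading aid for the "1×RT" and "several hundred times real time" figures, the real-time factor is conventionally the ratio of processing time to audio duration. The sketch below is only an illustration of that ratio: the function name `real_time_factor` and the durations used are hypothetical and are not taken from the paper.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Return processing time divided by audio duration (the xRT figure)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


# Hypothetical durations, for illustration only (not results from the paper).
audio = 300.0  # seconds of speech

# A system that finishes in the time it takes to play the audio runs at 1xRT.
print(real_time_factor(300.0, audio))

# A system that needs 90,000 s for the same audio runs at 300xRT.
print(real_time_factor(90000.0, audio))
```

Under this definition, every decoding pass and adaptation step counts toward the processing time, so a multi-pass system meets 1×RT only if the passes together take no longer than the audio itself.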