Publication
INTERSPEECH - Eurospeech 2005
Conference paper
Using random forest language models in the IBM RT-04 CTS system
Abstract
One of the challenges in large vocabulary speech recognition is the availability of large amounts of data for training language models. In most state-of-the-art speech recognition systems, n-gram models with Kneser-Ney smoothing still prevail due to their simplicity and effectiveness. In this paper, we study the performance of a new language model, the random forest language model, in the IBM conversational telephone speech recognition system. We show that although random forest language models are designed to deal with the data sparseness problem, they also achieve statistically significant improvements over n-gram models when the training data contains over 500 million words.
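To make the contrast with Kneser-Ney n-grams concrete, the sketch below illustrates the general idea behind a random forest language model: a collection of randomized decision-tree language models, each of which partitions word histories into equivalence classes, whose probability estimates are averaged. This is a minimal, hypothetical illustration, not the paper's implementation; the history-clustering rule (here, a random hash of the last word) and the smoothing constant are simplifying assumptions made purely for exposition.

```python
import random
from collections import defaultdict

class RandomTreeLM:
    """A toy decision-tree LM: histories are mapped to random equivalence
    classes, and word probabilities are estimated per class with add-one
    smoothing. Stands in for a single randomized tree in the forest."""

    def __init__(self, num_classes, vocab, seed):
        self.rng = random.Random(seed)
        self.num_classes = num_classes
        self.vocab = vocab
        # Random assignment of each possible history word to a class.
        self.class_of = {w: self.rng.randrange(num_classes) for w in vocab}
        self.counts = defaultdict(lambda: defaultdict(int))
        self.totals = defaultdict(int)

    def train(self, sentences):
        for sent in sentences:
            for prev, word in zip(sent, sent[1:]):
                c = self.class_of[prev]
                self.counts[c][word] += 1
                self.totals[c] += 1

    def prob(self, word, prev):
        c = self.class_of[prev]
        # Add-one smoothing within the equivalence class.
        return (self.counts[c][word] + 1) / (self.totals[c] + len(self.vocab))

class RandomForestLM:
    """Averages the predictions of several randomized tree LMs; the
    averaging smooths the sparse per-tree estimates."""

    def __init__(self, num_trees, num_classes, vocab):
        self.trees = [RandomTreeLM(num_classes, vocab, seed=i)
                      for i in range(num_trees)]

    def train(self, sentences):
        for tree in self.trees:
            tree.train(sentences)

    def prob(self, word, prev):
        return sum(t.prob(word, prev) for t in self.trees) / len(self.trees)

# Tiny usage example on made-up data.
corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["a", "cat", "ran"]]
vocab = {w for sent in corpus for w in sent}
forest = RandomForestLM(num_trees=10, num_classes=2, vocab=vocab)
forest.train(corpus)
print(forest.prob("sat", "cat"))
```

In the paper's setting each tree is grown on bigram-style histories with data-driven (rather than purely random) splits, but the averaging step shown here is what lets the forest generalize better than a single sparse model.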