Publication
ICSLP 1998
Conference paper
A BOOTSTRAP TECHNIQUE FOR BUILDING DOMAIN-DEPENDENT LANGUAGE MODELS
Abstract
In this paper, we propose a new bootstrap technique for building domain-dependent language models. We assume that a seed corpus, consisting of a small amount of data relevant to the new domain, is available and is used to build a reference language model. We also assume the availability of an external corpus, consisting of a large amount of data from various sources, which need not be directly relevant to the domain of interest. We use the reference language model and a suitable metric, such as the perplexity measure, to select sentences from the external corpus that are relevant to the domain. Once a sufficient number of new sentences has been selected, we rebuild the reference language model, then continue selecting additional sentences from the external corpus, iterating until a satisfactory termination point is reached. We also describe several methods to further enhance the bootstrap technique, such as combining it with mixture modeling and class-based modeling. The performance of the proposed approach was evaluated through a set of experiments, and the results are discussed. An analysis of the convergence properties of the approach, and of the conditions that the external corpus and the seed corpus need to satisfy, is highlighted, but detailed work on these issues is deferred to future work.
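As a concrete illustration of the selection loop described in the abstract, the following Python sketch builds a reference language model from the seed corpus, scores external sentences by perplexity, keeps those below a relevance threshold, and iterates. The add-one-smoothed bigram model, the perplexity threshold, and the iteration cap are illustrative assumptions, not details taken from the paper.

```python
import math
from collections import Counter


class BigramLM:
    """Simple add-one-smoothed bigram LM, standing in for the reference
    language model described in the abstract (illustrative only)."""

    def __init__(self, sentences):
        self.unigrams = Counter()
        self.bigrams = Counter()
        for sent in sentences:
            tokens = ["<s>"] + sent.split() + ["</s>"]
            self.unigrams.update(tokens)
            self.bigrams.update(zip(tokens, tokens[1:]))
        self.vocab_size = len(self.unigrams)

    def perplexity(self, sentence):
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        log_prob = 0.0
        for prev, cur in zip(tokens, tokens[1:]):
            # Add-one smoothing keeps unseen bigrams from zeroing the product.
            p = (self.bigrams[(prev, cur)] + 1) / (self.unigrams[prev] + self.vocab_size)
            log_prob += math.log(p)
        return math.exp(-log_prob / (len(tokens) - 1))


def bootstrap_lm(seed_corpus, external_corpus, threshold=200.0, max_iters=5):
    """Grow a domain corpus iteratively: score external sentences with the
    current reference LM, keep those whose perplexity falls below the
    threshold, then rebuild the LM. The threshold and iteration cap are
    hypothetical values chosen for illustration."""
    domain_corpus = list(seed_corpus)
    remaining = list(external_corpus)
    for _ in range(max_iters):
        lm = BigramLM(domain_corpus)
        selected, rest = [], []
        for sent in remaining:
            (selected if lm.perplexity(sent) < threshold else rest).append(sent)
        if not selected:  # termination: no further relevant sentences found
            break
        domain_corpus.extend(selected)
        remaining = rest
    return BigramLM(domain_corpus), domain_corpus
```

In the paper, the relevance metric could be any suitable measure; perplexity under a simple bigram model is used here only to keep the sketch self-contained.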