Publication
ICSLP 2004
Conference paper
Word N-gram probability estimation from a Japanese raw corpus
Abstract
Statistical language modeling plays an important role in state-of-the-art speech recognizers. The most widely used language model (LM) is the word n-gram model, which is based on the frequency of words and word sequences in a corpus. In various Asian languages, however, words are not delimited by whitespace, so sentences must be annotated with word boundary information to prepare a statistically reliable large corpus. In this paper, we propose a method for building an LM directly from a raw corpus. In this method, sentences in the raw corpus are regarded as sentences annotated with stochastic word boundary information. In the experiments, we compared the predictive power of an LM built only from a segmented corpus with that of an LM built from the segmented corpus and a raw corpus. The results showed that using a raw corpus with our method reduced the perplexity by 42.9%.
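The idea of treating a raw sentence as annotated with stochastic word boundary information can be illustrated with a small sketch. The abstract does not give the estimation formula, so the following is an assumption made for illustration only: each position between characters carries an independent probability of being a word boundary, and a character span counts as a word with weight equal to the probability that both of its edges are boundaries and no boundary falls inside it. The function name `expected_unigram_counts`, the `max_len` cutoff, and the example probabilities are hypothetical.

```python
from collections import defaultdict


def expected_unigram_counts(chars, boundary_probs, max_len=4):
    """Expected word counts for one unsegmented sentence.

    chars          -- the sentence as a string of n characters
    boundary_probs -- n + 1 values; boundary_probs[i] is the assumed
                      probability of a word boundary just before chars[i]
                      (the two sentence edges are normally 1.0)
    max_len        -- longest candidate word, in characters (hypothetical cutoff)
    """
    n = len(chars)
    if len(boundary_probs) != n + 1:
        raise ValueError("need one boundary probability per gap, plus both edges")

    counts = defaultdict(float)
    for start in range(n):
        # Probability that no boundary lies strictly inside the current span.
        inside = 1.0
        for end in range(start + 1, min(n, start + max_len) + 1):
            # Both edges must be boundaries, with no boundary in between.
            weight = boundary_probs[start] * inside * boundary_probs[end]
            if weight > 0.0:
                counts[chars[start:end]] += weight
            # Extending the span turns position `end` into an interior gap.
            inside *= 1.0 - boundary_probs[end]
    return dict(counts)


if __name__ == "__main__":
    # Toy sentence "abc": the gap after "a" is a boundary half the time,
    # the gap after "b" never is.
    print(expected_unigram_counts("abc", [1.0, 0.5, 0.0, 1.0]))
```

Under these assumptions the fractional counts play the role that integer word frequencies play in a segmented corpus, so they could be added to counts from the segmented corpus before estimating n-gram probabilities. How the boundary probabilities are obtained is outside the scope of this sketch.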