Publication
ICSLP 2004
Conference paper

Word N-gram Probability Estimation from a Japanese Raw Corpus

Abstract

Statistical language modeling plays an important role in state-of-the-art speech recognizers. The most widely used language model (LM) is the word n-gram model, which is based on the frequencies of words and word sequences in a corpus. In various Asian languages, however, words are not delimited by whitespace, so sentences must be annotated with word boundary information to prepare a statistically reliable large corpus. In this paper, we propose a method for building an LM directly from a raw corpus. In this method, sentences in the raw corpus are regarded as sentences annotated with stochastic word boundary information. In our experiments, we compared the predictive power of an LM built only from a segmented corpus with that of an LM built from both the segmented corpus and a raw corpus. The results showed that our method reduced the perplexity by 42.9% by exploiting the raw corpus.
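The abstract does not spell out the estimation procedure, so the sketch below illustrates one plausible reading of the "stochastic word boundary" idea: each inter-character position in a raw-corpus sentence carries a boundary probability, and word n-gram frequencies are replaced by their expected values. The function name, the independence assumption between boundary positions, and the max_word_len cutoff are illustrative assumptions, not details taken from the paper.

from collections import defaultdict

def expected_word_bigram_counts(sentence, boundary_prob, max_word_len=4):
    """Expected word unigram and bigram counts from one unsegmented sentence.

    sentence      : string of characters with no word delimiters.
    boundary_prob : list of length len(sentence)+1; boundary_prob[i] is the
                    probability of a word boundary before sentence[i]
                    (positions 0 and len(sentence) should be 1.0, since the
                    sentence edges are certain boundaries).
    Boundary positions are treated as independent in this sketch.
    """
    n = len(sentence)
    unigram = defaultdict(float)
    bigram = defaultdict(float)

    def p_word(i, j):
        # Probability that sentence[i:j] forms exactly one word:
        # boundaries at i and j, and no boundary strictly inside.
        p = boundary_prob[i] * boundary_prob[j]
        for k in range(i + 1, j):
            p *= 1.0 - boundary_prob[k]
        return p

    for i in range(n):
        for j in range(i + 1, min(i + max_word_len, n) + 1):
            p1 = p_word(i, j)
            if p1 == 0.0:
                continue
            unigram[sentence[i:j]] += p1
            # Extend with a second, adjacent candidate word sentence[j:k];
            # the shared boundary at j is counted once, so divide it out.
            for k in range(j + 1, min(j + max_word_len, n) + 1):
                p2 = p1 * p_word(j, k) / boundary_prob[j]
                bigram[(sentence[i:j], sentence[j:k])] += p2
    return unigram, bigram

# Example: the 3-character string "東京都" with uncertain internal boundaries
# (hypothetical probabilities, purely for illustration).
uni, bi = expected_word_bigram_counts("東京都", [1.0, 0.4, 0.6, 1.0])
# uni now holds expected frequencies for candidate words such as "東京" and "都".

These expected counts could then be combined with the exact counts from the segmented corpus before the usual n-gram probability estimation; the combination scheme used by the authors is not described in the abstract.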
