Mining bilingual topic hierarchies from unaligned text
Abstract
Recent years have seen rapid growth in the amount of multilingual text available on the web. This growth raises the need for novel applications for organizing and accessing multilingual content. Common examples of such applications include multilingual topic tracking and cross-language information retrieval. Most of these applications rely on multilingual lexical resources, which require significant effort to create. In this paper, we present an unsupervised method for building bilingual topic hierarchies. In a bilingual topic hierarchy, topics (where a topic is a distribution over words) are arranged hierarchically, with abstract topics near the root and more concrete topics near the leaves. Such hierarchies can be useful for organizing a bilingual corpus by common topics, for cross-lingual information retrieval, and for cross-lingual text classification. Our method builds on prior work on Bayesian nonparametric inference of topic hierarchies and on multilingual topic modeling to extract bilingual topic hierarchies from unaligned text. We demonstrate the effectiveness of our algorithm in extracting such topic hierarchies from a collection of bilingual text passages and FAQs.
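To make the definition concrete, the following minimal Python sketch shows one way a bilingual topic hierarchy could be represented. It is an illustration only, not the method of this paper: the names TopicNode, top_words and depth are hypothetical, the language codes "en" and "fr" are an assumed language pair, and the word probabilities are invented toy values rather than results.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TopicNode:
    """One topic in the hierarchy (hypothetical structure for illustration).

    word_dists maps a language code to that language's distribution over
    words, so a single node carries one distribution per language.
    """
    word_dists: Dict[str, Dict[str, float]]
    children: List["TopicNode"] = field(default_factory=list)

    def top_words(self, lang: str, k: int = 3) -> List[str]:
        """Return the k most probable words of this topic in one language."""
        dist = self.word_dists[lang]
        return sorted(dist, key=dist.get, reverse=True)[:k]


def depth(node: TopicNode) -> int:
    """Number of levels from this node down to its deepest leaf."""
    return 1 + max((depth(child) for child in node.children), default=0)


# Toy data (invented): a concrete topic placed under a more abstract one.
leaf = TopicNode({
    "en": {"goal": 0.5, "match": 0.3, "team": 0.2},
    "fr": {"but": 0.5, "match": 0.3, "équipe": 0.2},
})
root = TopicNode(
    {
        "en": {"sport": 0.6, "game": 0.4},
        "fr": {"sport": 0.6, "jeu": 0.4},
    },
    children=[leaf],
)

print(root.top_words("en", 2), leaf.top_words("fr", 2), depth(root))
```

Storing both languages' distributions on the same node reflects the idea that a topic is shared across the two languages while its vocabulary differs. How those distributions are inferred from unaligned text is outside the scope of this sketch.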