Publication
CIKM 2010
Workshop paper

PTM: Probabilistic topic mapping model for mining parallel document collections

View publication

Abstract

Many applications generate a large volume of parallel document collections. A parallel document collection consists of two sets of documents where the documents in each set correspond to each other and form semantic pairs (e.g., pairs of problem and solution descriptions in a help-desk setting). Although much work has been done on text mining, little previous work has attempted to mine such a novel kind of text data. In this paper, we propose a new probabilistic topic model, called Probabilistic Topic Mapping (PTM) model, to mine parallel document collections to simultaneously discover latent topics in both sets of documents as well as the mapping of topics in one set to those in the other. We evaluate the PTM model on a parallel document collection in IT service domain. We show that PTM can effectively discover meaningful topics, as well as their mappings, and it's also useful for improving text matching and retrieval when there's a vocabulary gap. © 2010 ACM.

Date

01 Dec 2010

Publication

CIKM 2010

Authors

Share