Probabilistic Modeling for Information Retrieval with Unsupervised Training Data

Ernest P. Chan; Santiago Garcia; Salim Roukos

KDD 1998

Conference paper

27 Aug 1998


Abstract

We apply a well-known Bayesian probabilistic model to textual information retrieval, that is, the classification of documents by their relevance to a query. This model was previously used with supervised training data for a fixed query. When only noisy, unsupervised training data generated by a heuristic relevance-scoring formula are available, two adaptations are crucial: (1) severe smoothing of the models built on the training data; and (2) the addition of a prior probability to the models. We show that, with these adaptations, the probabilistic model improves on the retrieval precision of the heuristic model. The experiment was performed using the TREC-5 corpus and queries, and the evaluation of the model was submitted as an official entry (ibms96b) to TREC-5.
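
The two adaptations named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the model's form, so the unigram (naive Bayes style) class models, the linear interpolation with a background model, the use of the heuristic score as prior log-odds, and all names and parameters (`rerank`, `top_k`, `lam`, `prior_weight`) are assumptions made for illustration only.

```python
import math
from collections import Counter


def unigram(docs):
    """Maximum-likelihood unigram distribution over a list of token lists."""
    counts = Counter(tok for doc in docs for tok in doc)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def smoothed(model, background, lam):
    """Interpolate a class model with the background (collection) model.

    lam must lie in (0, 1]; values near 1 mean severe smoothing, i.e. the
    class model is pulled almost entirely toward the background model.
    """
    return {w: (1 - lam) * model.get(w, 0.0) + lam * p
            for w, p in background.items()}


def rerank(docs, heuristic_scores, top_k=2, lam=0.9, prior_weight=1.0):
    """Re-score documents with a smoothed Bayesian log-odds model.

    Assumption: the top_k documents under the heuristic score serve as
    noisy "relevant" training data and the rest as "non-relevant".
    """
    order = sorted(range(len(docs)),
                   key=lambda i: heuristic_scores[i], reverse=True)
    pseudo_rel = [docs[i] for i in order[:top_k]]
    pseudo_non = [docs[i] for i in order[top_k:]]

    background = unigram(docs)
    p_rel = smoothed(unigram(pseudo_rel), background, lam)
    p_non = smoothed(unigram(pseudo_non), background, lam)

    scores = []
    for doc, h in zip(docs, heuristic_scores):
        # Prior term (assumed form): the heuristic score acts as prior log-odds.
        log_odds = prior_weight * h
        # Likelihood term: per-token log ratio of the two smoothed models.
        for tok in doc:
            log_odds += math.log(p_rel[tok] / p_non[tok])
        scores.append(log_odds)
    return scores
```

In this sketch, setting `lam=1.0` collapses both class models onto the background model, so the ranking falls back to the heuristic score alone; smaller values let the noisy training data shift the ranking more.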
