Publication
JDIQ
Paper
To clean or not to clean: Document preprocessing and reproducibility
Abstract
Web document collections such as WT10G, GOV2, and ClueWeb are widely used for text retrieval experiments. Documents in these collections contain a fair amount of non-content-related markup in the form of tags, hyperlinks, and so on. Published articles that use these corpora generally do not provide specific details about how this markup information is handled during indexing. However, this question turns out to be important: Through experiments, we find that including or excluding metadata in the index can produce significantly different results with standard IR models. More importantly, the effect varies across models and collections. For example, metadata filtering is found to be generally beneficial when using BM25, or language modeling with Dirichlet smoothing, but can significantly reduce retrieval effectiveness if language modeling is used with Jelinek-Mercer smoothing. We also observe that, in general, the performance differences become more noticeable as the amount of metadata in the test collections increases. Given this variability, we believe that the details of document preprocessing are significant from the point of view of reproducibility. In a second set of experiments, we also study the effect of preprocessing on query expansion using RM3. In this case, once again, we find that it is generally better to remove markup before using documents for query expansion.