Publication
JDIQ
Paper
To clean or not to clean: Document preprocessing and reproducibility
Abstract
Web document collections such as WT10G, GOV2, and ClueWeb are widely used for text retrieval experiments. Documents in these collections contain a fair amount of non-content-related markup in the form of tags, hyperlinks, and so on. Published articles that use these corpora generally do not provide specific details about how this markup information is handled during indexing. However, this question turns out to be important: Through experiments, we find that including or excluding metadata in the index can produce significantly different results with standard IR models. More importantly, the effect varies across models and collections. For example, metadata filtering is found to be generally beneficial when using BM25, or language modeling with Dirichlet smoothing, but can significantly reduce retrieval effectiveness if language modeling is used with Jelinek-Mercer smoothing. We also observe that, in general, the performance differences become more noticeable as the amount of metadata in the test collections increases. Given this variability, we believe that the details of document preprocessing are significant from the point of view of reproducibility. In a second set of experiments, we also study the effect of preprocessing on query expansion using RM3. In this case, once again, we find that it is generally better to remove markup before using documents for query expansion.