Using word embeddings for information retrieval: How collection and term normalization choices affect performance

Dwaipayan Roy; Debasis Ganguly; Sumit Bhatia; Srikanta Bedathur; Mandar Mitra

doi:10.1145/3269206.3269277

CIKM 2018

Conference paper

17 Oct 2018

Abstract

Neural word embedding approaches, due to their ability to capture the semantic meanings of vocabulary terms, have recently gained the attention of the information retrieval (IR) community and have shown promising results in improving ad hoc retrieval performance. It has been observed that these approaches are sensitive to various choices made during the learning of word embeddings and their usage, often leading to poor reproducibility. We study the effect of varying the following two parameters, viz., i) the term normalization and ii) the choice of training collection, on ad hoc retrieval performance with word2vec and fastText embeddings. We present quantitative estimates of the similarity of word vectors obtained under different settings, and use an embedding-based query expansion task to understand the effects of these parameters on IR effectiveness.
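
As a rough illustration of what embedding-based query expansion can look like, the sketch below adds to a query the terms whose vectors are closest, by cosine similarity, to the centroid of the query term vectors. This is a generic nearest-neighbour formulation and is not necessarily the expansion method used in the paper. The function names, the parameter `k`, and the three-dimensional toy vectors are all assumptions made up for this example; real word2vec or fastText vectors would be learned from a training collection.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def expand_query(query_terms, embeddings, k=2):
    """Append the k vocabulary terms closest to the query centroid.

    `embeddings` maps a (normalized) term to its vector. Terms missing
    from the vocabulary are kept in the query but ignored when the
    centroid is computed.
    """
    known = [t for t in query_terms if t in embeddings]
    if not known:
        return list(query_terms)
    dim = len(embeddings[known[0]])
    centroid = [
        sum(embeddings[t][i] for t in known) / len(known) for i in range(dim)
    ]
    candidates = [
        (cosine(centroid, vec), term)
        for term, vec in embeddings.items()
        if term not in query_terms
    ]
    candidates.sort(reverse=True)
    return list(query_terms) + [term for _, term in candidates[:k]]


# Hypothetical toy vectors, invented for illustration only.
TOY_EMBEDDINGS = {
    "car": [0.9, 0.1, 0.0],
    "automobile": [0.85, 0.15, 0.05],
    "vehicle": [0.8, 0.2, 0.1],
    "banana": [0.0, 0.1, 0.9],
    "fruit": [0.05, 0.2, 0.85],
}

print(expand_query(["car"], TOY_EMBEDDINGS, k=2))
```

In a setup like this, both parameters studied in the paper act through the `embeddings` mapping: term normalization (for example, stemming) determines which strings appear as keys, and the training collection determines the vectors and therefore which neighbours are selected.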
