Towards Robust Neural Retrieval with Source Domain Synthetic Pre-Finetuning

Revanth Gangi Reddy; Vikas Yadav; Md Arafat Sultan; Martin Franz; Vittorio Castelli; Heng Ji; Avirup Sil

COLING 2022

Conference paper

12 Oct 2022

Abstract

Research on neural IR has so far focused primarily on standard supervised learning settings, where it outperforms traditional term-matching baselines. Many practical use cases of such models, however, may involve previously unseen target domains. In this paper, we propose to improve the out-of-domain generalization of Dense Passage Retrieval (DPR), a popular choice for neural IR, through synthetic data augmentation only in the source domain. We empirically show that pre-finetuning DPR with additional synthetic data in its source domain (Wikipedia), which we generate using a fine-tuned sequence-to-sequence generator, can be a low-cost yet effective first step towards its generalization. Across five different test sets, our augmented model shows more robust performance than DPR in both in-domain and zero-shot out-of-domain evaluation.
