Multi-stage Training with Improved Negative Contrast for Neural Passage Retrieval
Abstract
In this paper we explore the effects of negative sampling in dual encoder models used to retrieve passages in automatic question answering tasks. We explore four negative sampling strategies that complement the straightforward random sampling of negatives, typically used to train dual encoder models. Out of the four strategies, three are based on retrieval and one on heuristics. Of the three retrieval based strategies, two are based on the semantic similarity between the actual passage and its alternatives and another one is based on the lexical overlap between them. In our experiments we train the dual encoder models in two stages: pre-training with synthetic data and fine tuning with domain-specific data. Negative sampling is applied in both stages. Our negative sampling is particularly useful when we augment the generic data for pre-training with synthetic examples. We evaluate our approach in three passage retrieval tasks for open-domain question answering. Even though it is not evident that there is one single sampling strategy that works best in all three tasks, it is clear that they all contribute to improving the contrast between the actual retrieval and its alternatives. Furthermore, mixing the negatives from different strategies can achieve performance on par with the best performing strategy in all tasks. Our results establish a new state-of-the-art level of performance on two of the open-domain question answering tasks that we evaluated.