Learning Sparse Lexical Representations Over Expanded Vocabularies for Retrieval

Proceedings of the 32nd ACM International Conference on Information and Knowledge Management (CIKM '23) (2023)

Abstract

A recent line of work in first-stage neural information retrieval has focused on learning sparse lexical representations instead of dense embeddings.
One such model is SPLADE, which has been shown to achieve state-of-the-art results in both in-domain and zero-shot settings, can leverage inverted indices for efficient retrieval, and offers enhanced interpretability.
However, existing SPLADE models are fundamentally limited to learning sparse representations over the native BERT WordPiece vocabulary.
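For context, the sketch below shows how a SPLADE-style model turns per-token MLM logits into a single sparse vector over the WordPiece vocabulary. It is a minimal illustration (using the log-saturation and max-pooling aggregation of later SPLADE variants); tensor names and shapes are illustrative, not the authors' code.

```python
import torch

def splade_term_weights(mlm_logits: torch.Tensor,
                        attention_mask: torch.Tensor) -> torch.Tensor:
    """Aggregate per-token MLM logits into one vocabulary-sized weight vector.

    mlm_logits:     (batch, seq_len, vocab_size) logits from BERT's MLM head.
    attention_mask: (batch, seq_len), 1 for real tokens, 0 for padding.
    Returns:        (batch, vocab_size) non-negative term weights; most entries
                    are (near) zero, yielding a sparse lexical representation.
    """
    # Log-saturation keeps weights non-negative and dampens dominant terms.
    saturated = torch.log1p(torch.relu(mlm_logits))
    # Zero out padding positions, then max-pool over the sequence dimension.
    saturated = saturated * attention_mask.unsqueeze(-1)
    return saturated.max(dim=1).values
```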

In this work, we extend SPLADE to support learning sparse representations over arbitrary sets of tokens to improve flexibility and aid integration with existing retrieval systems.
As an illustrative example, we focus on learning a sparse representation over a large (300k) set of unigrams.
We add an unsupervised pretraining task on C4 to learn internal representations for the new tokens.
Our experiments show that our Expanded-SPLADE model maintains the performance of WordPiece-SPLADE on both in-domain and zero-shot retrieval while allowing for custom output vocabularies.
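One plausible way to realize such an expanded output vocabulary is to replace the WordPiece-sized MLM projection with a new head over the custom token set, whose weights are then learned via the unsupervised C4 pretraining task mentioned above. The sketch below is a hypothetical illustration under that assumption; the class name, layer layout, and vocabulary size are not taken from the paper.

```python
import torch.nn as nn

class ExpandedVocabHead(nn.Module):
    """Hypothetical projection head mapping encoder hidden states to scores
    over a custom output vocabulary (e.g. ~300k unigrams) instead of the
    native BERT WordPiece vocabulary."""

    def __init__(self, hidden_size: int, expanded_vocab_size: int):
        super().__init__()
        # Transform block mirroring a standard BERT MLM prediction head.
        self.transform = nn.Sequential(
            nn.Linear(hidden_size, hidden_size),
            nn.GELU(),
            nn.LayerNorm(hidden_size),
        )
        # New output projection: one row per token in the expanded vocabulary.
        self.decoder = nn.Linear(hidden_size, expanded_vocab_size)

    def forward(self, hidden_states: "torch.Tensor") -> "torch.Tensor":
        # (batch, seq_len, hidden) -> (batch, seq_len, expanded_vocab_size);
        # these logits can then be aggregated as in the SPLADE sketch above.
        return self.decoder(self.transform(hidden_states))
```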