Learning Sparse Lexical Representations Over Expanded Vocabularies for Retrieval

Jeffrey Dudek; Weize Kong; Cheng Li; Mingyang Zhang; Michael Bendersky

Proceedings of the 32nd ACM International Conference on Information and Knowledge Management (CIKM '23) (2023)

Abstract

A recent line of work in first-stage Neural Information Retrieval has focused on learning sparse lexical representations instead of dense embeddings.
One such work is SPLADE, which has been shown to achieve state-of-the-art results in both in-domain and zero-shot settings, can leverage inverted indices for efficient retrieval, and offers enhanced interpretability.
However, existing SPLADE models are fundamentally limited to learning a sparse representation based on the native BERT WordPiece vocabulary.

In this work, we extend SPLADE to support learning sparse representations over arbitrary sets of tokens to improve flexibility and aid integration with existing retrieval systems.
As an illustrative example, we focus on learning a sparse representation over a large (300k) set of unigrams.
We add an unsupervised pretraining task on C4 to learn internal representations for new tokens.
Our experiments show that our Expanded-SPLADE model maintains the performance of WordPiece-SPLADE on both in-domain and zero-shot retrieval while allowing for custom output vocabularies.
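
To make the idea of a sparse lexical representation over a custom output vocabulary concrete, here is a minimal sketch. It is not the authors' implementation. It assumes the commonly published SPLADE pooling (a log-saturated ReLU of per-token logits, max-pooled over input positions) and scores with a dot product over an inverted index. The four-term unigram vocabulary, the logit values, and all function names are hypothetical; in a real model the logits would come from an output layer sized to the chosen vocabulary.

```python
import math

def splade_pool(token_logits):
    """Max-pool log(1 + ReLU(logit)) over input positions, per vocabulary entry."""
    vocab_size = len(token_logits[0])
    weights = [0.0] * vocab_size
    for row in token_logits:
        for j, logit in enumerate(row):
            weights[j] = max(weights[j], math.log1p(max(logit, 0.0)))
    return weights

def to_sparse(weights, vocab):
    """Keep only terms with non-zero weight, keyed by the term string."""
    return {vocab[j]: w for j, w in enumerate(weights) if w > 0.0}

def build_inverted_index(docs):
    """Map each term to a postings list of (doc_id, weight) pairs."""
    index = {}
    for doc_id, sparse in docs.items():
        for term, w in sparse.items():
            index.setdefault(term, []).append((doc_id, w))
    return index

def score(query, index):
    """Dot product between the query and every document sharing a term with it."""
    scores = {}
    for term, qw in query.items():
        for doc_id, dw in index.get(term, []):
            scores[doc_id] = scores.get(doc_id, 0.0) + qw * dw
    return scores

# Hypothetical custom output vocabulary (unigrams rather than WordPieces).
vocab = ["retrieval", "sparse", "index", "banana"]

# Made-up per-token logits over that vocabulary: one row per input token.
doc_logits = {
    "d1": [[2.0, -1.0, 0.5, -3.0], [0.1, 1.5, -0.2, -2.0]],
    "d2": [[-1.0, -0.5, -2.0, 3.0]],
}
docs = {doc_id: to_sparse(splade_pool(rows), vocab)
        for doc_id, rows in doc_logits.items()}
index = build_inverted_index(docs)

query = to_sparse(splade_pool([[1.0, 0.8, -1.0, -1.0]]), vocab)
ranking = sorted(score(query, index).items(), key=lambda kv: -kv[1])
print(ranking)
```

Because negative logits map to zero weight, each document touches only a few postings lists, and a document sharing no terms with the query (here `d2`) is never visited during scoring. The keys of each sparse vector are the vocabulary terms themselves, which is what makes the representation directly inspectable.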

Research Areas

Information retrieval
