Jeffrey M Dudek
Jeffrey Dudek is a software engineer at Google Research working on Natural Language Processing and Information Retrieval. Before joining Google, he received his Ph.D. from Rice University. His full list of publications can be found on Google Scholar.
Authored Publications
SparseEmbed: Learning Sparse Lexical Representations with Contextual Embeddings for Retrieval
Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '23), ACM (2023) (to appear)
In dense retrieval, prior work has largely improved retrieval effectiveness using multi-vector dense representations, exemplified by ColBERT. In sparse retrieval, more recent work, such as SPLADE, demonstrated that one can also learn sparse lexical representations to achieve comparable effectiveness while enjoying better interpretability. In this work, we combine the strengths of both the sparse and dense representations for first-stage retrieval. Specifically, we propose SparseEmbed – a novel retrieval model that learns sparse lexical representations with contextual embeddings. Compared with SPLADE, our model leverages the contextual embeddings to improve model expressiveness. Compared with ColBERT, our sparse representations are trained end-to-end to optimize both efficiency and effectiveness.
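To make the hybrid scoring idea concrete, below is a minimal sketch, not the published SparseEmbed implementation, of how a query and a document might be scored when each activated lexical term carries a small contextual embedding. The dict-based representation, the term ids, the embedding dimension, and the exact scoring rule (summing embedding dot products over matching terms) are illustrative assumptions rather than details taken from the paper.

```python
# Hypothetical sketch (not the authors' code): scoring with sparse lexical
# representations where each activated term carries a contextual embedding.
import numpy as np

def score(query_rep, doc_rep):
    """query_rep / doc_rep: dict mapping a vocabulary term id -> contextual
    embedding (np.ndarray). Only terms activated by the encoder are present,
    so the representation stays sparse and can live in an inverted index."""
    total = 0.0
    for term_id, q_emb in query_rep.items():
        d_emb = doc_rep.get(term_id)
        if d_emb is not None:
            # Matching terms contribute an embedding dot product, refining
            # the exact-match signal of a purely lexical (SPLADE-style) model.
            total += float(np.dot(q_emb, d_emb))
    return total

# Toy example: term ids are arbitrary; the embedding dim (4) is illustrative.
rng = np.random.default_rng(0)
q = {101: rng.normal(size=4), 2054: rng.normal(size=4)}
d = {101: rng.normal(size=4), 3000: rng.normal(size=4)}
print(score(q, d))  # only term 101 overlaps, so only it contributes
```

Because only activated terms are stored, retrieval can still be served from an inverted index, with the per-term embeddings providing the extra expressiveness described in the abstract.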
Learning Sparse Lexical Representations Over Expanded Vocabularies for Retrieval
Proceedings of the 32nd ACM International Conference on Information and Knowledge Management (CIKM '23) (2023)
A recent line of work in first-stage Neural Information Retrieval has focused on learning sparse lexical representations instead of dense embeddings. One such work is SPLADE, which has been shown to lead to state-of-the-art results in both the in-domain and zero-shot settings, can leverage inverted indices for efficient retrieval, and offers enhanced interpretability. However, existing SPLADE models are fundamentally limited to learning a sparse representation based on the native BERT WordPiece vocabulary. In this work, we extend SPLADE to support learning sparse representations over arbitrary sets of tokens to improve flexibility and aid integration with existing retrieval systems. As an illustrative example, we focus on learning a sparse representation over a large (300k) set of unigrams. We add an unsupervised pretraining task on C4 to learn internal representations for new tokens. Our experiments show that our Expanded-SPLADE model maintains the performance of WordPiece-SPLADE on both in-domain and zero-shot retrieval while allowing for custom output vocabularies.
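As a rough illustration of the expanded-vocabulary idea, the sketch below shows a SPLADE-style output head that projects encoder hidden states onto a custom vocabulary and max-pools log-saturated activations into sparse term weights. The class name, the dimensions, and the use of random tensors in place of a real BERT encoder are assumptions for illustration; the paper's actual architecture and its unsupervised pretraining task on C4 are not reproduced here.

```python
# Hypothetical sketch (not the paper's implementation): SPLADE-style pooling
# over a custom output vocabulary instead of the native WordPiece vocabulary.
import torch
import torch.nn as nn

class ExpandedVocabHead(nn.Module):
    def __init__(self, hidden_dim: int, expanded_vocab_size: int):
        super().__init__()
        # Projection from encoder hidden states onto the custom vocabulary
        # (e.g. a large unigram list); stands in for the native MLM head.
        self.proj = nn.Linear(hidden_dim, expanded_vocab_size)

    def forward(self, hidden_states, attention_mask):
        # hidden_states: (batch, seq_len, hidden_dim) from a BERT-style encoder
        logits = self.proj(hidden_states)                       # (B, L, V_exp)
        # SPLADE-style log-saturated activation, then max-pool over positions.
        weights = torch.log1p(torch.relu(logits))
        weights = weights.masked_fill(attention_mask.unsqueeze(-1) == 0, 0.0)
        return weights.max(dim=1).values                        # (B, V_exp) sparse term weights

# Toy usage with random hidden states standing in for encoder output; the
# paper targets a ~300k-unigram vocabulary, shrunk here to keep the toy light.
head = ExpandedVocabHead(hidden_dim=768, expanded_vocab_size=30_000)
h = torch.randn(2, 16, 768)
mask = torch.ones(2, 16, dtype=torch.long)
doc_rep = head(h, mask)  # mostly zeros after ReLU; nonzeros go into an inverted index
```

Swapping the output projection in this way is what lets the vocabulary be chosen to match an existing retrieval system, which is the flexibility the abstract emphasizes.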