
MUVERA: Making multi-vector retrieval as fast as single-vector search
June 25, 2025
Rajesh Jayaram and Laxman Dhulipala, Research Scientists, Google Research
We introduce MUVERA, a state-of-the-art retrieval algorithm that reduces complex multi-vector retrieval back to single-vector maximum inner product search.
Neural embedding models have become a cornerstone of modern information retrieval (IR). Given a query from a user (e.g., “How tall is Mt Everest?”), the goal of IR is to find information relevant to the query from a very large collection of data (e.g., the billions of documents, images, or videos on the Web). Embedding models transform each datapoint into a single-vector “embedding”, such that semantically similar datapoints are transformed into mathematically similar vectors. The embeddings are generally compared via the inner-product similarity, enabling efficient retrieval through optimized maximum inner product search (MIPS) algorithms. However, recent advances, particularly the introduction of multi-vector models like ColBERT, have demonstrated significantly improved performance in IR tasks.
Unlike single-vector embeddings, multi-vector models represent each data point with a set of embeddings, and leverage more sophisticated similarity functions that can capture richer relationships between datapoints. For example, the popular Chamfer similarity measure used in state-of-the-art multi-vector models captures when the information in one multi-vector embedding is contained within another multi-vector embedding. While this multi-vector approach boosts accuracy and enables retrieving more relevant documents, it introduces substantial computational challenges. In particular, the increased number of embeddings and the complexity of multi-vector similarity scoring make retrieval significantly more expensive.
In “MUVERA: Multi-Vector Retrieval via Fixed Dimensional Encodings”, we introduce a novel multi-vector retrieval algorithm designed to bridge the efficiency gap between single- and multi-vector retrieval. We transform multi-vector retrieval into a simpler problem by constructing fixed dimensional encodings (FDEs) of queries and documents, which are single vectors whose inner product approximates multi-vector similarity, thus reducing complex multi-vector retrieval back to single-vector maximum inner product search (MIPS). This new approach allows us to leverage the highly-optimized MIPS algorithms to retrieve an initial set of candidates that can then be re-ranked with the exact multi-vector similarity, thereby enabling efficient multi-vector retrieval without sacrificing accuracy. We have provided an open-source implementation of our FDE construction algorithm on GitHub.
The challenge of multi-vector retrieval
Multi-vector models generate multiple embeddings per query or document, often one embedding per token. Similarity between a query and a document is typically computed with Chamfer matching, the standard multi-vector similarity: for each query embedding, find the most similar document embedding, then sum these maximum similarities over all query embeddings. The Chamfer similarity therefore provides a "holistic" measure of how each part of the query relates to some part of the document.
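Concretely, if Q is the set of query embeddings and D the set of document embeddings, the score is Chamfer(Q, D) = Σ_{q ∈ Q} max_{d ∈ D} ⟨q, d⟩. Below is a minimal NumPy sketch of this scoring function; the shapes and names are illustrative rather than taken from any released implementation.

```python
import numpy as np

def chamfer_similarity(query_embs: np.ndarray, doc_embs: np.ndarray) -> float:
    """Chamfer similarity: for each query embedding, take its maximum
    inner product over all document embeddings, then sum over query embeddings.

    query_embs: (num_query_tokens, dim)
    doc_embs:   (num_doc_tokens, dim)
    """
    sims = query_embs @ doc_embs.T        # (num_query_tokens, num_doc_tokens)
    return float(sims.max(axis=1).sum())  # best document match per query token, summed
```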
While multi-vector representations offer advantages like improved interpretability and generalization, they pose significant retrieval challenges:
- Increased embedding volume: Generating embeddings per token drastically increases the number of embeddings to be processed.
- Complex and compute-intensive similarity scoring: Chamfer matching is a non-linear operation requiring a full matrix product between the query and document embeddings, which is considerably more expensive than a single dot product between two vectors (see the back-of-the-envelope comparison after this list).
- Lack of efficient sublinear search methods: Single-vector retrieval benefits from highly optimized algorithms (e.g., based on space partitioning) that simultaneously achieve high accuracy and sublinear search times, avoiding exhaustive comparisons. The complex nature of multi-vector similarity prevents the direct application of these fast geometric techniques, hindering efficient retrieval at scale.
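To make the cost in the second point concrete, here is a back-of-the-envelope comparison; the token counts and embedding dimension are assumptions chosen purely for illustration, not numbers from the paper.

```python
# Illustrative per-document scoring cost (assumed sizes, not from the paper).
query_tokens, doc_tokens, dim = 32, 128, 128

chamfer_mults = query_tokens * doc_tokens * dim  # 524,288 multiply-adds (full matrix product)
single_vector_mults = dim                        # 128 multiply-adds (one dot product)

print(chamfer_mults // single_vector_mults)      # 4096x more work per scored document
```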
Unfortunately, traditional single-vector MIPS algorithms cannot be directly applied to multi-vector retrieval — for example, a document might have a token with high similarity to a single query token, but overall, the document might not be very relevant. This problem necessitates more complex and computationally intensive retrieval methods.
MUVERA: A solution with fixed dimensional encodings
MUVERA offers an elegant solution by reducing multi-vector similarity search to single-vector MIPS, making retrieval over complex multi-vector data much faster. Imagine you have a large dataset of "multi-vector sets" (i.e., sets of vectors) where each set describes some datapoint, but searching through each of these sets is slow. MUVERA's trick is to take each set of multi-vectors and squeeze it into a single, easier-to-handle vector that we call a fixed dimensional encoding (FDE). The key property is that comparing these simplified FDEs closely matches what you'd get if you compared the original, more complex multi-vector sets. This lets us use much quicker search methods designed for single vectors.
Here's a simplified breakdown of how MUVERA works (a sketch of the full pipeline follows the list):
- FDE generation: MUVERA employs mappings to convert query and document multi-vector sets into FDEs. These mappings are designed to capture the essential similarity information in a fixed-length vector.
- MIPS-based retrieval: The FDEs of documents are indexed using a standard MIPS solver. Given a query, its FDE is computed, and the MIPS solver efficiently retrieves the most similar document FDEs.
- Re-ranking: The initial candidates retrieved by MIPS are re-ranked using the original Chamfer similarity for improved accuracy.
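Putting the three steps together, a minimal end-to-end sketch might look like the following. It reuses the chamfer_similarity function from the earlier snippet, substitutes brute-force inner products for a real MIPS index, and takes the query FDE construction as a function argument (a possible implementation is sketched in the next section); all names and defaults are illustrative, not from the open-source release.

```python
import numpy as np

def muvera_retrieve(query_embs, doc_emb_sets, doc_fdes, query_fde_fn,
                    k_candidates=100, k_final=10):
    """Sketch of MUVERA's retrieve-then-re-rank pipeline.

    query_embs:   (num_query_tokens, dim) query multi-vector
    doc_emb_sets: list of (num_doc_tokens_i, dim) original document multi-vectors
    doc_fdes:     (num_docs, fde_dim) matrix of precomputed document FDEs
    query_fde_fn: maps a (num_query_tokens, dim) array to a (fde_dim,) query FDE
    """
    # Step 1: FDE generation for the query (document FDEs are built offline).
    query_fde = query_fde_fn(query_embs)

    # Step 2: MIPS over document FDEs. Brute force here; at scale this would be
    # an optimized MIPS / approximate nearest-neighbor library.
    fde_scores = doc_fdes @ query_fde
    candidates = np.argsort(-fde_scores)[:k_candidates]

    # Step 3: re-rank the candidates with the exact Chamfer similarity.
    reranked = sorted(candidates,
                      key=lambda i: chamfer_similarity(query_embs, doc_emb_sets[i]),
                      reverse=True)
    return reranked[:k_final]
```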
A key advantage of MUVERA is that the FDE transformation is data-oblivious. This means it doesn't depend on the specific dataset, making it both robust to changes in data distribution and suitable for streaming applications. Additionally, unlike the single-vector embeddings produced by a model, FDEs are guaranteed to approximate the true Chamfer similarity to within a specified error. Thus, after the re-ranking stage, MUVERA is guaranteed to find the most similar multi-vector representations.
Illustration of the construction of query FDEs. Each token (shown as a word in this example) is mapped to a high-dimensional vector (2-D in the example for simplicity). The high-dimensional space is randomly partitioned by hyperplane cuts. Each piece of space is assigned a block of coordinates in the output FDE, which is set to the sum of the coordinates of the query vectors that land in that piece.
Illustration of the construction of document FDEs. The construction is the same as the query construction, except that the vectors falling in a given piece of the partitioned space are averaged together instead of summed, which accurately captures the asymmetric nature of the Chamfer similarity.
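As a minimal sketch of what the two figures above describe, the snippet below implements a single random partition: random hyperplanes (a SimHash) split the space into 2^k pieces, query vectors landing in a piece are summed while document vectors are averaged, and the per-piece vectors are concatenated into one fixed-length encoding. The full construction in the paper adds multiple repetitions, inner projections, and handling of empty pieces, all omitted here; names and parameter choices are illustrative.

```python
import numpy as np

def build_fde(embs: np.ndarray, hyperplanes: np.ndarray, is_query: bool) -> np.ndarray:
    """Single-repetition FDE sketch.

    embs:        (num_tokens, dim) multi-vector representation
    hyperplanes: (k_sim, dim) random directions defining 2**k_sim pieces of space
    """
    k_sim, dim = hyperplanes.shape
    num_pieces = 2 ** k_sim

    # SimHash piece id: the sign pattern of the projections onto the hyperplanes.
    bits = (embs @ hyperplanes.T > 0).astype(int)  # (num_tokens, k_sim)
    piece_ids = bits @ (1 << np.arange(k_sim))     # (num_tokens,)

    fde = np.zeros((num_pieces, dim))
    for piece in range(num_pieces):
        members = embs[piece_ids == piece]
        if len(members) == 0:
            continue
        # Queries sum their vectors per piece; documents average them,
        # mirroring the asymmetric nature of the Chamfer similarity.
        fde[piece] = members.sum(axis=0) if is_query else members.mean(axis=0)
    return fde.reshape(-1)                         # (num_pieces * dim,)

# Usage: the same random hyperplanes must be shared by queries and documents.
rng = np.random.default_rng(0)
dim, k_sim = 128, 4
planes = rng.standard_normal((k_sim, dim))
query_fde = build_fde(rng.standard_normal((32, dim)), planes, is_query=True)
doc_fde = build_fde(rng.standard_normal((80, dim)), planes, is_query=False)
approx_chamfer = float(query_fde @ doc_fde)  # approximates Chamfer(query, document)
```

In the pipeline sketch above, query_fde_fn could simply be lambda q: build_fde(q, planes, is_query=True), with document FDEs precomputed once using is_query=False.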
Theoretical foundations
Our approach is inspired by techniques used in probabilistic tree embeddings, a powerful tool in the theory of geometric algorithms. However, we adapt these techniques to work with inner products and Chamfer similarity.
The core idea behind FDE generation is to partition the embedding space into sections (illustrated in the figure above). If similar vectors from a query and a document fall into the same section, we can approximate their similarity efficiently. However, since we don't know the optimal matching between query and document vectors beforehand, we use a randomized partitioning scheme.
We also provide theoretical guarantees for MUVERA, proving that FDEs offer a strong approximation of Chamfer similarity (you can read more in the paper). This is a significant result, as it provides a principled way to perform multi-vector retrieval using single-vector proxies with provable accuracy.
Experimental results
We evaluated MUVERA on several information retrieval datasets from the BEIR benchmarks. Our experiments demonstrate that MUVERA consistently achieves high retrieval accuracy with significantly reduced latency compared to the previous state-of-the-art method known as PLAID.
Our key findings include:
Improved recall: MUVERA outperforms the single-vector heuristic, a common approach used in multi-vector retrieval (which PLAID also employs), achieving better recall while retrieving significantly fewer candidate documents (shown in the figure below). For instance, FDEs retrieve 5–20x fewer candidates to achieve a fixed recall.

Recall of fixed dimensional encodings (FDE) of varying dimensions vs. a single-vector heuristic (SV). Note that 10240-dimensional FDEs have nearly the same representation size as the original multi-vector (MV) representation (used in the SV heuristic), while requiring significantly fewer comparisons in the search (true even for 20k-dimensional FDEs).
Reduced latency: Compared to PLAID, a highly optimized multi-vector retrieval system based on the single-vector heuristic, MUVERA achieves an average of 10% higher recall with a remarkable 90% reduction in latency across the BEIR datasets (shown in the figure below).

MUVERA vs. PLAID over BEIR benchmarks.
Moreover, we found that MUVERA's FDEs can be effectively compressed using product quantization, reducing memory footprint by 32x with minimal impact on retrieval quality.
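For intuition on where a 32x figure can come from, here is the arithmetic for one natural product-quantization configuration; the block size and codebook size below are assumptions for illustration rather than the exact settings used in our experiments.

```python
fde_dim = 10240                   # dimensions of one FDE (as in the figure above)
float32_bytes = fde_dim * 4       # 40,960 bytes per uncompressed FDE
# Product quantization: each block of 8 coordinates is replaced by one 8-bit
# code indexing 256 learned centroids, so 8 floats (32 bytes) become 1 byte.
pq_bytes = fde_dim // 8           # 1,280 bytes of codes per FDE
print(float32_bytes // pq_bytes)  # 32x smaller
```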
These results highlight MUVERA's potential to significantly accelerate multi-vector retrieval, making it more practical for real-world applications.
Conclusion
We have presented MUVERA, a novel and efficient multi-vector retrieval algorithm with provable guarantees on its approximation quality and good practical performance. By reducing multi-vector search to single-vector MIPS, MUVERA leverages existing optimized search techniques and achieves state-of-the-art performance with significantly improved efficiency. Interested readers can find an open-source implementation of our FDE construction algorithm on GitHub.
Our work opens up new avenues for efficient multi-vector retrieval, which is crucial for various applications, including search engines, recommendation systems, and natural language processing. We believe that further research and optimization of MUVERA will lead to even greater performance gains and broader adoption of multi-vector retrieval techniques.
Acknowledgements
The work summarized in this blog post was done in collaboration with Majid Hadian, Jason Lee, and Vahab Mirrokni. Lastly, we thank Kimberly Schwede for their valuable help with making the animation in this blog post.