December 3, 2025
Ehsan Variani, Senior Staff Research Scientist, Google Research
The Massive Sound Embedding Benchmark (MSEB) is the definitive, open-source platform for measuring machine sound intelligence, unifying eight core capabilities — from retrieval and classification to reconstruction — to drive research past the current performance ceiling of sound-based AI.
Sound is a critical part of multimodal perception. For a system — be it a voice assistant, a next-generation security monitor, or an autonomous agent — to behave naturally, it must demonstrate a full spectrum of auditory capabilities. These capabilities include transcription, classification, retrieval, reasoning, segmentation, clustering, reranking, and reconstruction.
These diverse functions rely on transforming raw sound into an intermediate representation, or embedding. But research into improving the auditory capabilities of multimodal perception models has been fragmented, and there remain important unanswered questions: How do we compare performance across domains like human speech and bioacoustics? What is the true performance potential we are leaving on the table? And could a single, general-purpose sound embedding serve as the foundation for all these capabilities?
To investigate these queries and accelerate progress toward robust machine sound intelligence, we created the Massive Sound Embedding Benchmark (MSEB), presented at NeurIPS 2025.
MSEB provides the structure needed to answer these questions.
Our initial experiments confirm that current sound representations are far from universal, revealing substantial performance “headroom” (i.e., maximum improvement possible) across all eight tasks.
MSEB is built on three foundational pillars designed to provide the community with the tools needed to build the next generation of sound understanding models.
A benchmark is only as strong as its data. MSEB includes a curated collection of accessible datasets that better reflect our diverse global user community. The cornerstone of our benchmark is the Simple Voice Questions (SVQ) dataset, a new resource featuring 177,352 short, spoken queries across 26 locales and 17 languages. These recordings were captured in four distinct acoustic environments (clean, background speech, traffic noise, and media noise), and include rich metadata on speaker attributes and time-aligned salient terms. We collected and open-sourced this resource, available on Hugging Face.
MSEB also integrates high-quality public datasets covering a variety of sound domains.
We’re actively working on creating and adding more relevant and large-scale datasets to MSEB. We invite the community to share their suggestions and express interest in collaboration through our GitHub repo.
The design of MSEB is built on the premise that the future of AI-based sound interaction is multimodal. Every task uses sound as the critical input, but also incorporates information from other modalities (like text context or knowledge bases) to simulate realistic scenarios.
MSEB is structured around eight core “super-tasks”, each representing a capability vital for an intelligent system. These tasks range from information access (retrieval, reranking, reasoning), to fundamental core perception (classification, transcription, segmentation), to higher-level organization and generation (clustering, reconstruction).
Future development is focused on practical, multimodal tasks in new domains, like music or combinations with images.
The primary goal of MSEB is to establish strong baselines and reveal the headroom in current AI models by evaluating them across two main task categories: semantic tasks, which depend on linguistic content, and non-semantic tasks, which do not.
The model-agnostic design of the MSEB library is built to evaluate a range of models — from cascade systems to novel end-to-end audio encoders — all within a standardized, comparative framework.
We used the MSEB framework to test current sound embedding models and gauge how close they are to being truly intelligent and universal.
For semantic tasks, the models were compared against the ground-truth text input. For non-semantic tasks, they were compared against the best current dedicated solution, setting a solid performance baseline that any new general-purpose model must surpass.
The results show that existing AI models have measurable flaws across all key sound-understanding capabilities, underscoring the need for an evaluation framework like MSEB.
This evaluation reveals five major problems that currently limit the capability of sound-processing AI:
For tasks relying on language content (retrieval, reasoning, reranking), the ASR stage consistently bottlenecks performance, resulting in a loss of semantic fidelity.
The standard practice in speech technology is a cascade model: transcribe speech to text, then rely on that text for all downstream tasks. This is fundamentally misguided because it forces optimization onto the wrong metric. The ASR component is trained solely to minimize word error rate, a goal that is severely misaligned with the needs of real-world applications, which often require maximizing the relevance, accuracy, or reasoning capability of the output, independent of a perfect transcription.
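A toy sketch of why a low word error rate can still sink a downstream task: one substituted salient term changes which document a bag-of-words retriever returns, even though most words were transcribed correctly. The example strings and the exact-term retrieval rule are illustrative, not an MSEB task definition.

```python
# Toy illustration of the cascade bottleneck: a single ASR substitution on
# a salient term flips the retrieval result, even though the word error
# rate is low. Strings and the matching rule are purely illustrative.

def retrieve(transcript, documents):
    """Return indices of documents sharing any term with the query."""
    terms = set(transcript.split())
    return [i for i, doc in enumerate(documents) if terms & set(doc.split())]

documents = ["boarding pass for flight", "recipe for bass fish"]

spoken = "find my pass"     # ground-truth words
asr_out = "find my bass"    # 1 of 3 words wrong: ~33% WER

hits_truth = retrieve(spoken, documents)   # matches the boarding-pass doc
hits_asr = retrieve(asr_out, documents)    # the one error retrieves fish
```

Minimizing WER treats the substitution of "pass" with "bass" as just one more error, while for the retrieval task it is the only error that matters, which is exactly the metric misalignment described above.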
Models exhibit a severe lack of reliability: performance varies drastically by language. The systems work well only for major, widely spoken languages. When tested on less common languages, transcription quality collapses, causing critical task failures in search, ranking, and segmentation.
The quality of sound reconstruction degrades sharply under noise. When background noise is introduced, the model struggles to accurately interpret the original sound and its environment. These conditions are among the benchmark's most challenging, highlighting the difficulty of handling the complex environmental sounds found in real-world settings (like a busy office or a noisy street).
For simple tasks that don't involve understanding meaning (like identifying who is speaking), complex pre-trained AI models are surprisingly no better than the raw representation of the sound waves. This often leads developers to waste effort on overly complex models when simple features work just as well.
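The point can be illustrated with a toy experiment: for a non-semantic task like speaker identification, a raw spectral magnitude with nearest-centroid matching already succeeds, with no learned embedding involved. The synthetic "speakers" (pure tones at distinct fundamentals) and the setup are illustrative assumptions, not an MSEB baseline.

```python
import numpy as np

# Toy illustration: speaker ID from raw spectral magnitudes, no learned
# model. Synthetic "speakers" are tones at distinct fundamental
# frequencies; everything here is illustrative, not an MSEB baseline.

SR, N = 8000, 1024
t = np.arange(N) / SR

def utterance(f0, phase):
    """A toy 'utterance': a tone at the speaker's fundamental frequency."""
    return np.sin(2 * np.pi * f0 * t + phase)

def raw_feature(x):
    """Raw spectral magnitude of the waveform, no learned model involved."""
    return np.abs(np.fft.rfft(x))

# Enroll three speakers from one utterance each.
speakers = {0: 110.0, 1: 220.0, 2: 330.0}
enroll = {s: raw_feature(utterance(f0, 0.0)) for s, f0 in speakers.items()}

def identify(x):
    """Nearest-centroid speaker ID on raw spectral features."""
    f = raw_feature(x)
    return min(enroll, key=lambda s: np.linalg.norm(f - enroll[s]))

# New utterances with a different phase are still identified correctly.
preds = [identify(utterance(f0, 1.3)) for f0 in speakers.values()]
```

If a hand-crafted spectral feature suffices here, a heavyweight pre-trained encoder has to demonstrate a clear margin over it to justify its cost, which is precisely the comparison a shared benchmark makes visible.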
The results demonstrate a substantial performance gap in existing general sound-based approaches across all eight super-tasks. This widespread underperformance relative to the available headroom underscores the critical need for more research into unified and robust sound representations that can close the gap in machine auditory intelligence.
We envision MSEB as a dynamic and growing platform for the entire sound processing community. We invite you to contribute to this effort by using MSEB to evaluate your own sound representation techniques, contributing new tasks and datasets to the benchmark to help it grow, and joining the collaborative effort to push the boundaries of what's possible in machine sound intelligence.
This project was led by Ehsan Variani, Georg Heigold, Tom Bagby, and Cyril Allauzen. The authors sincerely thank all who contributed to this project, whose critical input made it possible. We are especially grateful to our colleagues Hawi Abraham, Shankar Kumar, Ji Ma, Michael Riley, Sunil Vemuri, and Travis Trekel. We also wish to acknowledge those who helped prepare this post: Mark Simborg for his extensive editing, Kimberly Schwede for the wonderful illustrations, and Mickey Wurts for his valuable assistance.