
From Waveforms to Wisdom: The New Benchmark for Auditory Intelligence

December 3, 2025

Ehsan Variani, Senior Staff Research Scientist, Google Research

The Massive Sound Embedding Benchmark (MSEB) is the definitive, open-source platform for measuring machine sound intelligence, unifying eight core capabilities — from retrieval and classification to reconstruction — to drive research past the current performance ceiling of sound-based AI.

Sound is a critical part of multimodal perception. For a system — be it a voice assistant, a next-generation security monitor, or an autonomous agent — to behave naturally, it must demonstrate a full spectrum of auditory capabilities. These capabilities include transcription, classification, retrieval, reasoning, segmentation, clustering, reranking, and reconstruction.

These diverse functions rely on transforming raw sound into an intermediate representation, or embedding. But research into improving the auditory capabilities of multimodal perception models has been fragmented, and there remain important unanswered questions: How do we compare performance across domains like human speech and bioacoustics? What is the true performance potential we are leaving on the table? And could a single, general-purpose sound embedding serve as the foundation for all these capabilities?

To investigate these questions and accelerate progress toward robust machine sound intelligence, we created the Massive Sound Embedding Benchmark (MSEB), presented at NeurIPS 2025.

MSEB provides the necessary structure to answer these questions by:

  • Standardizing evaluation for a comprehensive suite of eight real-world capabilities that we believe every human-like intelligent system must possess.
  • Providing an open and extensible framework that allows researchers to seamlessly integrate and evaluate any model type — from conventional downstream uni-modal models to cascade models to end-to-end multimodal embedding models.
  • Establishing clear performance goals to objectively highlight research opportunities beyond current state-of-the-art approaches.

Our initial experiments confirm that current sound representations are far from universal, revealing substantial performance “headroom” (i.e., maximum improvement possible) across all eight tasks.

The three pillars of MSEB: A unified framework

MSEB is built on three foundational pillars designed to provide the community with the tools needed to build the next generation of sound understanding models.

1. Diverse datasets for real-world scenarios

A benchmark is only as strong as its data. MSEB includes a curated collection of accessible datasets that better reflect our diverse global user community. The cornerstone of our benchmark is the Simple Voice Questions (SVQ) dataset, a new resource featuring 177,352 short, spoken queries across 26 locales and 17 languages. These recordings were captured in four distinct acoustic environments (clean, background speech, traffic noise, and media noise), and include rich metadata on speaker attributes and time-aligned salient terms. We collected and open-sourced this resource, available on Hugging Face.
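
For readers who want to experiment with SVQ directly, here is a minimal loading sketch using the Hugging Face datasets library. The dataset path, split name, and field names below are assumptions made for illustration, so check the Hugging Face page and the MSEB GitHub repo for the exact identifiers.

from datasets import load_dataset

# Minimal sketch: load SVQ from Hugging Face. The path "google/svq" and the
# split/field names are illustrative assumptions, not confirmed identifiers.
svq = load_dataset("google/svq", split="train")
example = svq[0]
print(example.keys())  # expected fields such as audio, locale, environment, salient terms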

MSEB also integrates high-quality, public datasets covering a variety of sound domains:

  • Speech-MASSIVE: For multilingual spoken language understanding and intent classification.
  • FSD50K: A large dataset for multi-label environmental sound event recognition (200 classes from the AudioSet Ontology).
  • BirdSet: A massive-scale benchmark for avian bioacoustics, including complex soundscape recordings.

We’re actively working on creating and adding more relevant and large-scale datasets to MSEB. We invite the community to share their suggestions and express interest in collaboration through our GitHub repo.

2. A comprehensive suite of eight core capabilities

The design of MSEB is built on the premise that the future of AI-based sound interaction is multimodal. Every task uses sound as the critical input, but also incorporates information from other modalities (like text context or knowledge bases) to simulate realistic scenarios.

MSEB is structured around eight core “super-tasks”, each representing a capability vital to an intelligent system:

  • Retrieval (voice search): Simulates voice search by finding relevant documents or passages in a knowledge base from a spoken query (a minimal sketch of this task appears below).
  • Reasoning (intelligent assistants): Tests the ability to find a precise answer within a given document or passage based on a spoken question.
  • Classification (monitoring/security): Categorizes sounds based on speaker attributes, user intent, recording environment, or specific sound events.
  • Transcription: Converts the audio signal into a verbatim text representation (like automatic speech recognition, or ASR, for spoken languages).
  • Segmentation (indexing): Identifies the most important terms within a sound clip and localizes them with precise start and end times.
  • Clustering (organization): Groups a collection of sound samples based on shared attributes (like speaker identity or environment) without relying on predefined labels.
  • Reranking (hypothesis refinement): Reorders a list of ambiguous text hypotheses (e.g., ASR output) to better match the original spoken query.
  • Reconstruction (generative AI): Tests the quality of the embedding by measuring the fidelity with which the original audio waveform can be regenerated from it.

Infographic titled Massive Sound Embedding Benchmark (MSEB) displaying icons for eight audio tasks, such as Retrieval, Classification, and Transcription.

MSEB tasks range from information access (retrieval, reranking, reasoning), to core perception (classification, transcription, segmentation), to higher-level organization and generation (clustering, reconstruction).
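
To make the retrieval super-task concrete, the following minimal sketch ranks documents by cosine similarity between a query embedding and document embeddings, then scores the ranking with mean reciprocal rank (MRR). The random embeddings are placeholders for real model outputs; none of this is MSEB code.

import numpy as np

def cosine_rank(query_emb, doc_embs):
    """Return document indices sorted from most to least similar to the query."""
    q = query_emb / np.linalg.norm(query_emb)
    d = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    return np.argsort(-(d @ q))

def mean_reciprocal_rank(rankings, gold):
    """Average of 1 / (rank of the relevant document) over all queries."""
    ranks = [1 + int(np.where(r == g)[0][0]) for r, g in zip(rankings, gold)]
    return float(np.mean([1.0 / r for r in ranks]))

rng = np.random.default_rng(0)
doc_embs = rng.standard_normal((100, 32))   # 100 documents in a toy knowledge base
query_embs = rng.standard_normal((5, 32))   # 5 embedded spoken queries
gold = [3, 17, 42, 7, 99]                   # relevant document index per query

rankings = [cosine_rank(q, doc_embs) for q in query_embs]
print(mean_reciprocal_rank(rankings, gold))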

Future development is focused on practical, multimodal tasks in new domains, like music or combinations with images.

3. A robust evaluation framework and headroom baselines

The primary goal of MSEB is to establish strong baselines and reveal the headroom in current AI models by evaluating them across two main task categories:

  • Semantic (e.g., voice search, reasoning): Do the models correctly understand the meaning and intent of the spoken words, even when the audio is noisy?
  • Acoustic (e.g., classification, clustering): Do the models accurately identify who is speaking or what the environmental sound is, regardless of meaning?

The model-agnostic design of the MSEB library is built to evaluate a range of models — from cascade systems to novel end-to-end audio encoders — all within a standardized, comparative framework.
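
As an illustration of what a model-agnostic evaluation interface can look like, here is a hypothetical sketch; the class and method names are invented for this post and are not the actual MSEB API.

from typing import Protocol
import numpy as np

class SoundEncoder(Protocol):
    def encode(self, waveform: np.ndarray, sample_rate: int) -> np.ndarray:
        """Maps a raw waveform to a fixed-size embedding vector."""
        ...

class MeanWaveformEncoder:
    """Trivial example encoder: chunk the waveform and average each chunk."""

    def __init__(self, dim: int = 64):
        self.dim = dim

    def encode(self, waveform: np.ndarray, sample_rate: int) -> np.ndarray:
        chunks = np.array_split(waveform, self.dim)
        return np.array([c.mean() for c in chunks])

# Any object exposing this `encode` method, whether a cascade system wrapping an
# ASR model or an end-to-end audio encoder, could be plugged into one evaluation loop.
encoder: SoundEncoder = MeanWaveformEncoder()
embedding = encoder.encode(np.zeros(16000), 16000)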

Comparison methodology

We used the MSEB framework to evaluate current sound embedding models and see how close they are to being truly intelligent and universal.

For semantic tasks, the models were compared against the same task performed on the ground-truth text input, which serves as an upper bound. For non-semantic tasks, the models were compared against the best current dedicated solution, which sets a solid performance baseline that any new, general-purpose model must surpass.
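
As a small worked example of how headroom can be read off such a comparison (the numbers are made up for illustration and are not MSEB results):

# Toy headroom calculation for a higher-is-better metric (e.g., MRR).
ceiling = 0.80   # baseline on ground-truth text (semantic) or best dedicated system (non-semantic)
model = 0.55     # the sound-embedding model being evaluated

absolute_headroom = ceiling - model                # 0.25 metric points left on the table
relative_headroom = absolute_headroom / ceiling    # ~31% of the ceiling is unrealized
print(absolute_headroom, round(relative_headroom, 2))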

Core limitations of existing sound representations

The results show that existing AI models have measurable flaws across all key sound-understanding capabilities, underscoring the need for an evaluation framework like MSEB.

Bar chart comparing performance metrics for text and sound inputs across super-tasks like retrieval, reasoning, and classification.

MSEB’s evaluation of AI models across key tasks shows important deficiencies and room for improvement. The metrics used for comparison include mean reciprocal rank (MRR), F1, mean average precision (mAP), accuracy (ACC), word error rate (WER), normalized discounted cumulative gain (NDCG), V-measure, and Fréchet audio distance (FAD).

This evaluation reveals five major problems that currently limit the capability of sound-processing AI:

1. Semantic bottlenecks

For tasks relying on language content (retrieval, reasoning, reranking), the ASR stage consistently and universally bottlenecks performance, resulting in loss of semantic fidelity.

2. Misaligned objectives

The standard practice in speech technology involves a cascade model: transcribing speech to text, and then relying on that text for all downstream tasks. This is fundamentally wrong because it forces optimization onto the wrong metric. The ASR component is trained solely to minimize word error rate, a goal that is severely misaligned with the needs of real-world applications, which often require maximizing the relevance, accuracy, or reasoning capability of the output, independent of perfect transcription.
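
The contrast between the two architectures can be sketched minimally as below; the ASR and encoder functions are placeholder stubs invented for illustration, not real models or MSEB components.

import numpy as np

def asr(waveform):
    """Placeholder ASR: in a real cascade this is a speech recognizer trained to minimize WER."""
    return "recognized text with possible transcription errors"

def embed_text(text):
    """Placeholder text encoder used by the downstream (e.g., retrieval) model."""
    return np.random.default_rng(abs(hash(text)) % 2**32).standard_normal(16)

def embed_audio(waveform):
    """Placeholder end-to-end encoder mapping sound directly to an embedding."""
    return waveform[:16] if waveform.size >= 16 else np.resize(waveform, 16)

waveform = np.random.default_rng(0).standard_normal(16000)  # 1 s of fake 16 kHz audio

# Cascade: audio -> text -> embedding. ASR errors propagate downstream, and the
# ASR stage is optimized for word error rate rather than the downstream task.
cascade_embedding = embed_text(asr(waveform))

# End-to-end: audio -> embedding. The encoder can be optimized directly for the
# downstream objective (e.g., retrieval relevance) instead of perfect transcription.
e2e_embedding = embed_audio(waveform)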

3. Non-universality

Model performance varies drastically by language: the systems work well only for major, widely spoken languages. When tested on less common languages, transcription quality collapses, causing critical task failures in search, ranking, and segmentation.

4. Lack of robustness

The quality of sound reconstruction degrades sharply under noise. When background noise is introduced, the model’s ability to accurately recover the original sound and its environment suffers significantly. These noisy conditions are among the most challenging benchmarks for the system, highlighting its difficulty in handling the complex, general environmental sounds found in real-world settings (like a busy office or a noisy street).

5. Over-complexity

For simple tasks that don't involve understanding meaning (like identifying who is speaking), complicated, pre-trained AI models are surprisingly no better than simple representations of the raw sound waves. This can lead developers to waste effort on overly complex models when a basic signal-level representation works just as well.
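
To illustrate what such a basic signal-level representation might look like, here is a minimal NumPy sketch of a mean-pooled log-magnitude spectrogram; the framing parameters are arbitrary choices for illustration, not MSEB settings.

import numpy as np

def spectral_embedding(waveform, frame_len=400, hop=160):
    """Mean-pooled log-magnitude spectrum over 25 ms frames (assuming 16 kHz audio)."""
    n_frames = 1 + max(0, (len(waveform) - frame_len) // hop)
    frames = np.stack([waveform[i * hop : i * hop + frame_len] for i in range(n_frames)])
    window = np.hanning(frame_len)
    spectra = np.abs(np.fft.rfft(frames * window, axis=-1))
    return np.log(spectra + 1e-8).mean(axis=0)  # one fixed-size vector per clip

# Toy usage: embed a fake 1-second clip, e.g., as input to speaker or environment clustering.
clip = np.random.default_rng(0).standard_normal(16000)
print(spectral_embedding(clip).shape)  # (201,)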

Conclusion

The results demonstrate a substantial performance gap in existing general sound-based approaches across all eight super-tasks. This widespread underperformance, relative to the maximum potential defined by the headroom baselines, underscores the critical need for more research into unified and robust sound representations that can close the gap in machine auditory intelligence.

We envision MSEB as a dynamic and growing platform for the entire sound processing community. We invite you to contribute to this effort by using MSEB to evaluate your own sound representation techniques, contributing new tasks and datasets to the benchmark to help it grow, and joining the collaborative effort to push the boundaries of what's possible in machine sound intelligence.

Acknowledgements

This project was led by Ehsan Variani, Georg Heigold, Tom Bagby, and Cyril Allauzen. The authors sincerely thank all who contributed to this project, whose critical input made it possible. We are especially grateful to our colleagues Hawi Abraham, Shankar Kumar, Ji Ma, Michael Riley, Sunil Vemuri, and Travis Trekel. We also wish to acknowledge those who helped prepare this post: Mark Simborg for his extensive editing, Kimberly Schwede for the wonderful illustrations, and Mickey Wurts for his valuable assistance.
