Abstract

Although sound information extraction appears distinct across a spectrum of sound classes and technologies, all approaches inherently involve creating some form of "embedding" (whether discrete, as in textual tokens, or a continuous vector) to encapsulate relevant information from the audio signal for downstream use. This unifying framework allows us to re-evaluate sound information extraction by investigating the optimality of current task-specific representations, the available quality headroom, and the potential for a single, robust sound embedding to generalize across diverse applications and sound types. To expedite research in these directions, a standardized evaluation benchmark is indispensable, mirroring the established benchmarks in the text and image domains. We present the Massive Sound Embedding Benchmark (MSEB) to serve this purpose. MSEB encompasses realistic tasks and datasets that reflect practical applications across diverse technologies and sound categories. Initial experimental findings indicate substantial headroom for improving prevalent information extraction methods. We encourage the sound processing community to contribute data and tasks to MSEB and to employ it to assess their algorithms for improved overall sound encoding.