Sound Retrieval and Ranking Using Sparse Auditory Representations

Martin Rehn
Samy Bengio
Thomas C. Walters
Gal Chechik
Neural Computation, 22 (2010), pp. 2390-2416

Abstract

To create systems that understand the sounds that humans are exposed
to in everyday life, we need to represent sounds with features that
can discriminate among many different sound classes. Here, we use a
sound-ranking framework to quantitatively evaluate such
representations in a large-scale task. We have adapted a
machine-vision method, the "passive-aggressive model for image
retrieval" (PAMIR), which efficiently learns a linear mapping from a
very large sparse feature space to a large query-term space. Using
this approach we compare different auditory front ends and different
ways of extracting sparse features from high-dimensional auditory
images. We tested auditory models that use an adaptive pole-zero filter
cascade (PZFC) auditory filterbank and sparse-code feature extraction
from stabilized auditory images via multiple vector quantizers. In
addition to auditory image models, we also compare a family of more
conventional Mel-Frequency Cepstral Coefficient (MFCC) front ends. The
experimental results show a significant advantage for the auditory
models over vector-quantized MFCCs. Ranking thousands of sound files
with a query vocabulary of thousands of words, the best precision at
top-1 was 73% and the average precision was 35%, reflecting an 18%
improvement over the best competing MFCC front end.
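At the core of this approach, PAMIR scores a sound for a text query as a
bilinear function q^T W a, where q is a bag-of-query-terms vector, a is a
sparse acoustic feature vector (e.g., vector-quantizer code counts), and W
is learned from triplets of a query, a relevant sound, and an irrelevant
sound using passive-aggressive updates. The following NumPy sketch
illustrates a passive-aggressive ranking update of this kind; the function
name, the aggressiveness parameter C, and the use of dense arrays are
illustrative assumptions rather than the paper's implementation, which
operates on very large sparse feature vectors.

    import numpy as np

    def pamir_train(triplets, n_query_terms, n_features, C=0.1, epochs=1):
        """Learn W so that score(q, a) = q @ W @ a ranks relevant sounds
        above irrelevant ones (hypothetical PAMIR-style sketch)."""
        W = np.zeros((n_query_terms, n_features))
        for _ in range(epochs):
            for q, a_pos, a_neg in triplets:
                diff = a_pos - a_neg
                # Hinge loss on the ranking margin for this triplet.
                loss = max(0.0, 1.0 - q @ W @ diff)
                # Squared norm of the rank-one update direction q (a+ - a-)^T.
                v_norm_sq = (q @ q) * (diff @ diff)
                if loss > 0.0 and v_norm_sq > 0.0:
                    # Smallest update that fixes the violation, capped by C.
                    tau = min(C, loss / v_norm_sq)
                    W += tau * np.outer(q, diff)
        return W

When the ranking margin is already satisfied the update is "passive" and W
is left unchanged; otherwise the correction is a minimal rank-one step,
which keeps training cheap even when both the feature space and the query
vocabulary are large.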
