Sound Ranking Using Auditory Sparse-Code Representations

Martin Rehn
Samy Bengio
Thomas C. Walters
Gal Chechik
ICML 2009 Workshop on Sparse Methods for Music Audio

Abstract

The task of ranking sounds from text queries is a
good test application for machine-hearing techniques, and particularly
for comparison and evaluation of alternative sound representations in
a large-scale setting. We have adapted a machine-vision system, the
"passive-aggressive model for image retrieval" (PAMIR), which
efficiently learns, using a ranking-based cost function, a linear
mapping from a very large sparse feature space to a large
query-term space.
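
As a concrete illustration, the sketch below shows a PAMIR-style
passive-aggressive ranking update in Python, using dense NumPy arrays
for readability (the real system exploits feature sparsity); the
function name and the aggressiveness parameter C are illustrative
assumptions, not details from the paper.

    import numpy as np

    def pamir_update(W, q, x_pos, x_neg, C=1.0):
        # One passive-aggressive step: push the relevant sound x_pos
        # above the irrelevant sound x_neg for query q by a margin of 1.
        # W maps the sparse audio-feature space to the query-term space;
        # q is a bag-of-words query vector.
        loss = max(0.0, 1.0 - q @ W @ x_pos + q @ W @ x_neg)
        if loss > 0.0:
            # Rank-one update direction, with the step size capped at C.
            V = np.outer(q, x_pos - x_neg)
            tau = min(C, loss / (np.linalg.norm(V) ** 2))
            W = W + tau * V
        return W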
Using this system lets us focus on comparing different
auditory front ends and different ways of extracting sparse features
from high-dimensional auditory images. In addition to the two main
auditory-image models, we also include and compare a family of more
conventional MFCC front ends. The experimental results show a
significant advantage for the auditory models over vector-quantized MFCCs.
The two auditory models tested use the adaptive pole-zero filter
cascade (PZFC) auditory filterbank and sparse-code feature extraction
from stabilized auditory images via multiple vector quantizers. The
models differ in their implementation of the strobed temporal
integration used to generate the stabilized image.
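
As a sketch of the multiple-quantizer coding step: each vector
quantizer covers one rectangular region ("box") of the stabilized
auditory image, and the per-box winner-take-all codes are concatenated
into one long, very sparse vector. The box geometry and codebook
shapes below are illustrative assumptions, not the paper's exact
configuration.

    import numpy as np

    def sparse_code(image, boxes, codebooks):
        # image: 2-D stabilized auditory image (frequency x lag).
        # boxes: list of (row_slice, col_slice) regions, one per quantizer.
        # codebooks: list of arrays; codebooks[i] has shape (k_i, patch_dim).
        codes = []
        for (rows, cols), codebook in zip(boxes, codebooks):
            patch = image[rows, cols].ravel()            # cut out one box
            dists = np.linalg.norm(codebook - patch, axis=1)
            one_hot = np.zeros(len(codebook))
            one_hot[np.argmin(dists)] = 1.0              # winner-take-all
            codes.append(one_hot)
        return np.concatenate(codes)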
Using ranking precision-at-top-k performance measures, the best
results are about 70% top-1 precision and 35% average precision, using
a test corpus of thousands of sound files and a query vocabulary of
hundreds of words.
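
For reference, the two measures quoted here can be computed as
follows; this is the standard formulation, with illustrative names.

    def precision_at_k(ranked_ids, relevant_ids, k=1):
        # Fraction of the top-k ranked sound files that are relevant.
        return sum(1 for d in ranked_ids[:k] if d in relevant_ids) / k

    def average_precision(ranked_ids, relevant_ids):
        # Mean of precision taken at the rank of each relevant file.
        hits, total = 0, 0.0
        for rank, d in enumerate(ranked_ids, start=1):
            if d in relevant_ids:
                hits += 1
                total += hits / rank
        return total / max(1, len(relevant_ids))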
