Richard F. Lyon

Dick Lyon, author of the 2017 book Human and Machine Hearing: Extracting Meaning from Sound, has a long history of research and invention, including the optical mouse, speech and handwriting recognition, computational models of hearing, and color photographic imaging. At Google he worked on Street View camera systems, and is now focused on machine hearing technology and applications.
Authored Publications
    The median of a standard gamma distribution, as a function of its shape parameter $k$, has no known representation in terms of elementary functions. In this work we prove the tightest upper and lower bounds of the form $2^{-1/k} (A + k)$: an upper bound with $A = e^{-\gamma}$ that is tight for low $k$ and a lower bound with $A = \log(2) - \frac{1}{3}$ that is tight for high $k$. These bounds are valid over the entire domain of $k > 0$, staying between the 48th and 55th percentiles. We derive and prove several other new tight bounds in support of the proofs.
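As a quick numerical illustration of the bounds in the abstract above (a sketch using SciPy's gamma distribution; the two constants are taken directly from the stated bounds):

```python
import numpy as np
from scipy.stats import gamma

# Constants A in the bounds 2**(-1/k) * (A + k), from the abstract:
A_UPPER = np.exp(-np.euler_gamma)  # e^{-gamma} ~ 0.5615, tight for small k
A_LOWER = np.log(2) - 1.0 / 3.0    # log(2) - 1/3 ~ 0.3598, tight for large k

for k in [0.25, 1.0, 4.0, 16.0]:
    median = gamma(k).median()
    lower = 2 ** (-1.0 / k) * (A_LOWER + k)
    upper = 2 ** (-1.0 / k) * (A_UPPER + k)
    assert lower <= median <= upper  # holds for all k > 0, per the paper
```

For $k = 1$ (the exponential distribution) the exact median is $\log 2 \approx 0.693$, which indeed sits between the lower bound $\tfrac{1}{2}(\log 2 - \tfrac{1}{3} + 1) \approx 0.680$ and the upper bound $\tfrac{1}{2}(e^{-\gamma} + 1) \approx 0.781$.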
    Understanding speech in the presence of noise with hearing aids can be challenging. Here we describe our entry, submission E003, to the 2021 Clarity Enhancement Challenge Round 1 (CEC1), a machine learning challenge for improving hearing aid processing. We apply and evaluate a deep neural network speech enhancement model with a low-latency recursive least squares (RLS) adaptive beamformer, and a linear equalizer, to improve speech intelligibility in the presence of speech or noise interferers. The enhancement network is trained only on the CEC1 data, and all processing obeys the 5 ms latency requirement. We quantify the improvement using the CEC1-provided hearing loss model and Modified Binaural Short-Time Objective Intelligibility (MBSTOI) score (ranging from 0 to 1, higher being better). On the CEC1 test set, we achieve a mean of 0.644 and median of 0.652, compared to the 0.310 mean and 0.314 median for the baseline. In the CEC1 subjective listener intelligibility assessment, for scenes with noise interferers, we achieve the second highest improvement in intelligibility, from 33.2% to 85.5%, but for speech interferers, we see more mixed results, potentially from listener confusion.
    VHP: Vibrotactile Haptics Platform for On-body Applications
    Dimitri Kanevsky
    Malcolm Slaney
    UIST, ACM, https://dl.acm.org/doi/10.1145/3472749.3474772 (2021)
    Wearable vibrotactile devices have many potential applications, including novel interfaces and sensory substitution for accessibility. Currently, vibrotactile experimentation is done using large lab setups. However, most practical applications require standalone on-body devices and integration into small form factors. Such integration is time-consuming and requires expertise. To democratize wearable haptics, we introduce VHP, a vibrotactile haptics platform. It comprises a low-power, miniature electronics board that can drive up to 12 independent channels of haptic signals with arbitrary waveforms at 2 kHz. The platform can drive vibrotactile actuators including LRAs and voice coils. Each vibrotactile channel has current-based load sensing, thus allowing for self-testing and auto-adjustment. The hardware is battery powered, programmable, has multiple input options, including serial and Bluetooth, as well as the ability to synthesize haptic signals internally. We conduct technical evaluations to determine the power consumption, latency, and the number of actuators that can run simultaneously. We demonstrate applications where we integrate the platform into a bracelet and a sleeve to provide an audio-to-tactile wearable interface. To facilitate more use of this platform, we open-source our design and partner with a distributor to make the hardware widely available. We hope this work will motivate the use and study of vibrotactile all-day wearable devices.
    Today’s wearable and mobile devices typically use separate hardware components for sensing and actuation. In this work, we introduce new opportunities for the Linear Resonant Actuator (LRA), which is ubiquitous in such devices due to its capability for providing rich haptic feedback. By leveraging strategies to enable active and passive sensing capabilities with LRAs, we demonstrate their benefits and potential as self-contained I/O devices. Specifically, we use the back-EMF voltage to classify whether the LRA is tapped or touched, as well as how much pressure is being applied. The back-EMF sensing is already integrated into many motor and LRA drivers. We developed a passive low-power tap sensing method that uses just 37.7 µA. Furthermore, we developed active touch and pressure sensing, which is low-power, quiet (2 dB), and minimizes vibration. The sensing method works with many types of LRAs. We show applications, such as pressure-sensing side-buttons on a mobile phone. We have also implemented our technique directly on an existing mobile phone’s LRA to detect if the phone is handheld or placed on a soft or hard surface. Finally, we show that this method can be used for haptic devices to determine if the LRA makes good contact with the skin. Our approach can add rich sensing capabilities to the ubiquitous LRA actuators without requiring additional sensors or hardware.
    A range of new technologies have the potential to help people, whether traditionally considered hearing impaired or not. These technologies include more sophisticated personal sound amplification products, as well as real-time speech enhancement and speech recognition. They can improve users’ communication abilities, but these new approaches require new ways to describe their success and allow engineers to optimize their properties. Speech recognition systems are often optimized using the word-error rate, but when the results are presented in real time, user interface issues become far more important than conventional measures of auditory performance. For example, there is a tradeoff between minimizing recognition time (latency) by quickly displaying results versus disturbing the user’s cognitive flow by rewriting the results on the screen when the recognizer later needs to change its decisions. This article describes current, new, and future directions for helping billions of people with their hearing. These new technologies bring auditory assistance to new users, especially to those in areas of the world without access to professional medical expertise. In the short term, audio enhancement technologies in inexpensive mobile forms, devices that are quickly becoming necessary to navigate all aspects of our lives, can bring better audio signals to many people. Alternatively, current speech recognition technology may obviate the need for audio amplification or enhancement at all and could be useful for listeners with normal hearing or with hearing loss. With new and dramatically better technology based on deep neural networks, speech enhancement improves the signal-to-noise ratio, and audio classifiers can recognize sounds in the user’s environment. Both use deep neural networks to improve a user’s experiences.
Longer term, auditory attention decoding is expected to allow our devices to understand where a user is directing their attention and thus allow our devices to respond better to their needs. In all these cases, the technologies turn the hearing assistance problem on its head, and thus require new ways to measure their performance.
    Quadratic distortion in a nonlinear cascade model of the human cochlea
    Amin Saremi
    Journal of the Acoustical Society of America, vol. 143 (2018), EL418
    The cascade of asymmetric resonators with fast-acting compression (CARFAC) is a cascade filterbank model that performed well in a comparative study of cochlear models, but exhibited two anomalies in its frequency response and excitation pattern. It is shown here that the underlying reason is CARFAC's inclusion of quadratic distortion, which generates DC and low-frequency components that in a real cochlea would be canceled by reflections at the helicotrema, but since cascade filterbanks lack the reflection mechanism, these low-frequency components cause the observed anomalies. The simulations demonstrate that the anomalies disappear when the model's quadratic distortion parameter is zeroed, while other successful features of the model remain intact.
    Jeremy Thorpe
    Michael Chinen
    Proceedings of the 16th International Workshop on Acoustic Signal Enhancement (2018)
    We explore a variety of configurations of neural networks for one- and two-channel spectrogram-mask-based speech enhancement. Our best model improves on state-of-the-art performance on the CHiME2 speech enhancement task. We examine trade-offs among non-causal lookahead, compute work, and parameter count versus enhancement performance and find that zero-lookahead models can achieve, on average, only 0.5 dB worse performance than our best bidirectional model. Further, we find that 200 milliseconds of lookahead is sufficient to achieve performance within about 0.2 dB from our best bidirectional model.
    Human and Machine Hearing is the first book to comprehensively describe how human hearing works and how to build machines to analyze sounds in the same way that people do. Drawing on over thirty-five years of experience in analyzing hearing and building systems, Richard F. Lyon explains how we can now build machines with close-to-human abilities in speech, music, and other sound-understanding domains. He explains human hearing in terms of engineering concepts, and describes how to incorporate those concepts into machines for a wide range of modern applications. The details of this approach are presented at an accessible level, to bring a diverse range of readers, from neuroscience to engineering, to a common technical understanding. The description of hearing as signal-processing algorithms is supported by corresponding open-source code, for which the book serves as motivating documentation.
    Robust and far-field speech recognition is critical to enable true hands-free communication. In far-field conditions, signals are attenuated due to distance. To improve robustness to loudness variation, we introduce a novel frontend called per-channel energy normalization (PCEN). The key ingredient of PCEN is the use of an automatic gain control based dynamic compression to replace the widely used static (such as log or root) compression. We evaluate PCEN on the keyword spotting task. On our large rerecorded noisy and far-field eval sets, we show that PCEN significantly improves recognition performance. Furthermore, we model PCEN as neural network layers and optimize high-dimensional PCEN parameters jointly with the keyword spotting acoustic model. The trained PCEN frontend demonstrates significant further improvements without increasing model complexity or inference-time cost.
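A minimal numpy sketch of the PCEN idea described above: a per-channel smoothed energy acts as an automatic gain control divisor, followed by a stabilized root compression. The function name and the default parameter values here are illustrative, not the paper's tuned settings.

```python
import numpy as np

def pcen(E, s=0.025, alpha=0.98, delta=2.0, r=0.5, eps=1e-6):
    """Per-channel energy normalization of a (time, channels) energy array.

    M is a per-channel first-order IIR smoother of E; dividing by
    (eps + M)**alpha implements the AGC-style dynamic compression, and the
    (x + delta)**r - delta**r step is a stabilized root compression.
    Parameter values are illustrative defaults, not tuned settings.
    """
    M = np.empty_like(E)
    M[0] = E[0]
    for t in range(1, len(E)):
        M[t] = (1.0 - s) * M[t - 1] + s * E[t]
    return (E / (eps + M) ** alpha + delta) ** r - delta ** r

# Example on a random nonnegative "filterbank energy" patch.
E = np.random.rand(100, 40)
out = pcen(E)
assert out.shape == E.shape and np.all(out >= 0.0)
```

Because the gain divisor adapts to each channel's recent level, a loud and a quiet rendering of the same signal produce similar normalized outputs, which is the loudness robustness the abstract refers to.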
    A 6 µW per Channel Analog Biomimetic Cochlear Implant Processor Filterbank Architecture With Across Channels AGC
    Guang Wang
    Emmanuel M. Drakakis
    IEEE Transactions on Biomedical Circuits and Systems, vol. 9 (2015), pp. 72-86
    A new analog cochlear implant processor filterbank architecture of increased biofidelity, enhanced across-channel contrast and very low power consumption has been designed and prototyped. Each channel implements a biomimetic, asymmetric bandpass-like One-Zero-Gammatone-Filter (OZGF) transfer function, using class-AB log-domain techniques. Each channel's quality factor and suppression are controlled by means of a new low power Automatic Gain Control (AGC) scheme which is coupled across the neighboring channels and emulates lateral inhibition (LI) phenomena in the auditory system. Detailed measurements from a five-channel silicon IC prototype fabricated in a 0.35 µm AMS technology confirm the operation of the coupled AGC scheme and its ability to enhance contrast among channel outputs. The prototype is characterized by an input dynamic range of 92 dB while consuming only 28 µW of power in total (~6 µW per channel) under a 1.8 V power supply. The architecture is well-suited for fully-implantable cochlear implants.
    The Optical Mouse: Early Biomimetic Embedded Vision
    Advances in Embedded Computer Vision, Springer (2014), pp. 3-22
    The 1980 Xerox optical mouse invention, and subsequent product, was a successful deployment of embedded vision, as well as of the Mead–Conway VLSI design methodology that we developed at Xerox PARC in the late 1970s. The design incorporated an interpretation of visual lateral inhibition, essentially mimicking biology to achieve a wide dynamic range, or light-level-independent operation. Conceived in the context of a research group developing VLSI design methodologies, the optical mouse chip represented an approach to self-timed semi-digital design, with the analog image-sensing nodes connecting directly to otherwise digital logic using a switch-network methodology. Using only a few hundred gates and pass transistors in 5-micron nMOS technology, the optical mouse chip tracked the motion of light dots in its field of view, and reported motion with a pair of 2-bit Gray codes for x and y relative position—just like the mechanical mice of the time. Besides the chip, the only other electronic components in the mouse were the LED illuminators.
    The Intervalgram: An Audio Feature for Large-Scale Cover-Song Recognition
    Thomas C. Walters
    From Sounds to Music and Emotions: 9th International Symposium, CMMR 2012, London, UK, June 19-22, 2012, Revised Selected Papers, Springer Berlin Heidelberg (2013), pp. 197-213
    We present a system for representing the musical content of short pieces of audio using a novel chroma-based representation known as the ‘intervalgram’, which is a summary of the local pattern of musical intervals in a segment of music. The intervalgram is based on a chroma representation derived from the temporal profile of the stabilized auditory image [10] and is made locally pitch invariant by means of a ‘soft’ pitch transposition to a local reference. Intervalgrams are generated for a piece of music using multiple overlapping windows. These sets of intervalgrams are used as the basis of a system for detection of identical melodic and harmonic progressions in a database of music. Using a dynamic-programming approach for comparisons between a reference and the song database, performance is evaluated on the ‘covers80’ dataset [4]. A first test of an intervalgram-based system on this dataset yields a precision at top-1 of 53.8%, with an ROC curve that shows very high precision up to moderate recall, suggesting that the intervalgram is adept at identifying the easier-to-match cover songs in the dataset with high robustness. The intervalgram is designed to support locality-sensitive hashing, such that an index lookup from each single intervalgram feature has a moderate probability of retrieving a match, with few false matches. With this indexing approach, a large reference database can be quickly pruned before more detailed matching, as in previous content-identification systems.
    Modelling the Distortion Produced by Cochlear Compression
    Roy D. Patterson
    Timothy Ives
    Thomas C. Walters
    Basic Aspects of Hearing, Springer (2013), pp. 81-88
    Automatically Discovering Talented Musicians with Acoustic Analysis of YouTube Videos
    Eric Nichols
    Charles DuHadway
    Proceedings of the 2012 IEEE 12th International Conference on Data Mining (ICDM), IEEE Computer Society, Washington, DC, USA, pp. 559-565
    Online video presents a great opportunity for up-and-coming singers and artists to be visible to a worldwide audience. However, the sheer quantity of video makes it difficult to discover promising musicians. We present a novel algorithm to automatically identify talented musicians using machine learning and acoustic analysis on a large set of "home singing" videos. We describe how candidate musician videos are identified and ranked by singing quality. To this end, we present new audio features specifically designed to directly capture singing quality. We evaluate these vis-a-vis a large set of generic audio features and demonstrate that the proposed features have good predictive performance. We also show that this algorithm performs well when videos are normalized for production quality.
    A cascade of two-pole–two-zero filters with level-dependent pole and zero dampings, with few parameters, can provide a good match to human psychophysical and physiological data. The model has been fitted to data on detection threshold for tones in notched-noise masking, including bandwidth and filter shape changes over a wide range of levels, and has been shown to provide better fits with fewer parameters compared to other auditory filter models such as gammachirps. Originally motivated as an efficient machine implementation of auditory filtering related to the WKB analysis method of cochlear wave propagation, such filter cascades also provide good fits to mechanical basilar membrane data, and to auditory nerve data, including linear low-frequency tail response, level-dependent peak gain, sharp tuning curves, nonlinear compression curves, level-independent zero-crossing times in the impulse response, realistic instantaneous frequency glides, and appropriate level-dependent group delay even with minimum-phase response. As part of exploring different level-dependent parameterizations of such filter cascades, we have identified a simple sufficient condition for stable zero-crossing times, based on the shifting property of the Laplace transform: simply move all the $s$-domain poles and zeros by equal amounts in the real-$s$ direction. Such pole-zero filter cascades are efficient front ends for machine hearing applications, such as music information retrieval, content identification, speech recognition, and sound indexing.
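The zero-crossing argument in the abstract above follows in one line from the Laplace shifting property: if $h(t)$ has transform $H(s)$, then

```latex
\mathcal{L}\left\{ e^{-at}\, h(t) \right\} = H(s + a)
```

so moving every pole and zero of $H(s)$ by $-a$ in the real-$s$ direction multiplies the impulse response by the strictly positive envelope $e^{-at}$, which rescales its amplitude but cannot move any of its zero crossings.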
    A key problem in using the output of an auditory model as the input to a machine-learning system in a machine-hearing application is to find a good feature-extraction layer. For systems such as PAMIR (passive-aggressive model for image retrieval) that work well with a large sparse feature vector, a conversion from auditory images to sparse features is needed. For audio-file ranking and retrieval from text queries, based on stabilized auditory images, we took a multi-scale approach, using vector quantization to choose one sparse feature in each of many overlapping regions of different scales, with the hope that in some regions the features for a sound would be stable even when other interfering sounds were present and affecting other regions. We recently extended our testing of this approach using sound mixtures, and found that the sparse-coded auditory-image features degrade less in interference than vector-quantized MFCC sparse features do. This initial success suggests that our hope of robustness in interference may indeed be realizable, via the general idea of sparse features that are localized in a domain where signal components tend to be localized or stable.
    Every day, machines process many thousands of hours of audio signals through a realistic cochlear model. They extract features, inform classifiers and recommenders, and identify copyrighted material. The machine-hearing approach to such tasks has taken root in recent years, because hearing-based approaches perform better than more conventional sound-analysis approaches. We use a bio-mimetic "cascade of asymmetric resonators with fast-acting compression" (CAR-FAC)—an efficient sound analyzer that incorporates the hearing research community's findings on nonlinear auditory filter models and cochlear wave mechanics. The CAR-FAC is based on a pole–zero filter cascade (PZFC) model of auditory filtering, in combination with a multi-time-scale coupled automatic-gain-control (AGC) network. It uses simple nonlinear extensions of conventional digital filter stages, and runs fast due to its low complexity. The PZFC plus AGC network, the CAR-FAC, mimics features of auditory physiology, such as masking, compressive traveling-wave response, and the stability of zero-crossing times with signal level. Its output "neural activity pattern" is converted to a "stabilized auditory image" to capture pitch, melody, and other temporal and spectral features of the sound.
    Auditory Sparse Coding
    Steven R. Ness
    Thomas Walters
    Music Data Mining, CRC Press/Chapman Hall (2011)
    The concept of sparsity has attracted considerable interest in the field of machine learning in the past few years. Sparse feature vectors contain mostly zeros, with only one or a few non-zero values. Although these feature vectors can be classified by traditional machine learning algorithms, such as SVM, there are various recently-developed algorithms that explicitly take advantage of the sparse nature of the data, leading to massive speedups in time, as well as improved performance. Some fields that have benefited from the use of sparse algorithms are finance, bioinformatics, text mining, and image classification. Because of their speed, these algorithms perform well on very large collections of data; large collections are becoming increasingly relevant given the huge amounts of data collected and warehoused by Internet businesses. We discuss the application of sparse feature vectors in the field of audio analysis, and specifically their use in conjunction with preprocessing systems that model the human auditory system. We present results that demonstrate the applicability of the combination of auditory-based processing and sparse coding to content-based audio analysis tasks: a search task in which ranked lists of sound effects are retrieved from text queries, and a music information retrieval (MIR) task dealing with the classification of music into genres.
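The kind of sparse code used in this line of work (one active codeword per region, concatenated into one long, mostly-zero vector) can be sketched in a few lines of scipy; the region count and codebook size below are invented for illustration:

```python
import numpy as np
from scipy import sparse

# Hypothetical sizes: 50 overlapping regions, each with a 256-entry codebook.
n_regions, codebook_size = 50, 256

# Each region's vector quantizer picks one active codeword index;
# concatenating the one-hot vectors gives one long sparse feature vector.
active = np.random.randint(0, codebook_size, size=n_regions)
cols = active + np.arange(n_regions) * codebook_size
feature = sparse.csr_matrix(
    (np.ones(n_regions), (np.zeros(n_regions, dtype=int), cols)),
    shape=(1, n_regions * codebook_size),
)

assert feature.nnz == n_regions  # only 50 of 12,800 entries are non-zero
```

Linear models like PAMIR exploit exactly this structure: a dot product with such a vector touches only the non-zero entries, which is what makes training and retrieval fast at scale.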
    A cascade of two-pole–two-zero filter stages is a good model of the auditory periphery in two distinct ways. First, in the form of the pole–zero filter cascade, it acts as an auditory filter model that provides an excellent fit to data on human detection of tones in masking noise, with fewer fitting parameters than previously reported filter models such as the roex and gammachirp models. Second, when extended to the form of the cascade of asymmetric resonators with fast-acting compression, it serves as an efficient front-end filterbank for machine-hearing applications, including dynamic nonlinear effects such as fast wide-dynamic-range compression. In their underlying linear approximations, these filters are described by their poles and zeros, that is, by rational transfer functions, which makes them simple to implement in analog or digital domains. Other advantages in these models derive from the close connection of the filter-cascade architecture to wave propagation in the cochlea. These models also reflect the automatic-gain-control function of the auditory system and can maintain approximately constant impulse-response zero-crossing times as the level-dependent parameters change. Copyright (2011) Acoustical Society of America. This article may be downloaded for personal use only. Any other use requires prior permission of the author and the Acoustical Society of America. The article appeared in J. Acoust. Soc. Am. vol. 130 and may be found via http://asadl.org/jasa/resource/1/jasman/v130/i6/p3893_s1.
    History and Future of Auditory Filter Models
    Andreas G. Katsiamis
    Emmanuel M. Drakakis
    Proc. ISCAS, IEEE (2010), pp. 3809-3812
    Auditory filter models have a history of over a hundred years, with explicit bio-mimetic inspiration at many stages along the way. From passive analogue electric delay line models, through digital filter models, active analogue VLSI models, and abstract filter shape models, these filters have both represented and driven the state of progress in auditory research. Today, we are able to represent a wide range of linear and nonlinear aspects of the psychophysics and physiology of hearing with a rather simple and elegant set of circuits or computations that have a clear connection to underlying hydrodynamics and with parameters calibrated to human performance data. A key part of the progress in getting to this stage has been the experimental clarification of the nature of cochlear nonlinearities, and the modelling work to map these experimental results into the domain of circuits and systems. No matter how these models are built into machine-hearing systems, their bio-mimetic roots will remain key to their performance. In this paper we review some of these models, explain their advantages and disadvantages and present possible ways of implementing them. As an example, a continuous-time analogue CMOS implementation of the One Zero Gammatone Filter (OZGF) is presented together with its automatic gain control that models its level-dependent nonlinear behaviour.
    Google Street View: Capturing the World at Street Level
    Dragomir Anguelov
    Carole Dulong
    Daniel Filip
    Christian Frueh
    Abhijit Ogale
    Luc Vincent
    Josh Weaver
    Computer, vol. 43 (2010)
    Street View serves millions of Google users daily with panoramic imagery captured in hundreds of cities in 20 countries across four continents. A team of Google researchers describes the technical challenges involved in capturing, processing, and serving street-level imagery on a global scale.
    Sound Retrieval and Ranking Using Sparse Auditory Representations
    Martin Rehn
    Samy Bengio
    Thomas C. Walters
    Gal Chechik
    Neural Computation, vol. 22 (2010), pp. 2390-2416
    To create systems that understand the sounds that humans are exposed to in everyday life, we need to represent sounds with features that can discriminate among many different sound classes. Here, we use a sound-ranking framework to quantitatively evaluate such representations in a large-scale task. We have adapted a machine-vision method, the "passive-aggressive model for image retrieval" (PAMIR), which efficiently learns a linear mapping from a very large sparse feature space to a large query-term space. Using this approach we compare different auditory front ends and different ways of extracting sparse features from high-dimensional auditory images. We tested auditory models that use an adaptive pole–zero filter cascade (PZFC) auditory filterbank and sparse-code feature extraction from stabilized auditory images via multiple vector quantizers. In addition to auditory image models, we also compare a family of more conventional Mel-Frequency Cepstral Coefficient (MFCC) front ends. The experimental results show a significant advantage for the auditory models over vector-quantized MFCCs. Ranking thousands of sound files with a query vocabulary of thousands of words, the best precision at top-1 was 73% and the average precision was 35%, reflecting an 18% improvement over the best competing MFCC.
    Machine Hearing: An Emerging Field
    IEEE Signal Processing Magazine, vol. 27 (2010), pp. 131-139
    (intro paragraph in lieu of abstract) If we had machines that could hear as humans do, we would expect them to be able to easily distinguish speech from music and background noises, to pull out the speech and music parts for special treatment, to know what direction sounds are coming from, to learn which noises are typical and which are noteworthy. Hearing machines should be able to organize what they hear; learn names for recognizable objects, actions, events, places, musical styles, instruments, and speakers; and retrieve sounds by reference to those names. These machines should be able to listen and react in real time, to take appropriate action on hearing noteworthy events, to participate in ongoing activities, whether in factories, in musical performances, or in phone conversations.
    A Biomimetic, 4.5 µW, 120+dB, Log-domain Cochlea Channel with AGC
    Andreas G. Katsiamis
    Emmanuel M. Drakakis
    IEEE JSSC (Journal of Solid-State Circuits), vol. 44 (2009), pp. 1006-1022
    This paper deals with the design and performance evaluation of a new analog CMOS cochlea channel of increased biorealism. The design implements a recently proposed transfer function, namely the One-Zero Gammatone filter (or OZGF), which provides a robust foundation for modeling a variety of auditory data such as realistic passband asymmetry, linear low-frequency tail and level-dependent gain. Moreover, the OZGF is attractive because it can be implemented efficiently in any technological medium, analog or digital, using standard building blocks. The channel was synthesized using novel, low-power, class-AB, log-domain, biquadratic filters employing MOS transistors operating in their weak inversion regime. Furthermore, the paper details the design of a new low-power automatic gain control circuit that adapts the gain of the channel according to the input signal strength, thereby significantly extending its input dynamic range. We evaluate the performance of a fourth-order OZGF channel (equivalent to an 8th-order cascaded filter structure) through both detailed simulations and measurements from a fabricated chip using the commercially available 0.35 µm AMS CMOS process. The whole system is tuned at 3 kHz, dissipates a mere 4.46 µW of static power, accommodates 124 dB (at < 5% THD) of input dynamic range at the center frequency and is set to provide up to 70 dB of amplification for small signals.
    Sound Ranking Using Auditory Sparse-Code Representations
    Martin Rehn
    Samy Bengio
    Thomas C. Walters
    Gal Chechik
    ICML 2009 Workshop on Sparse Method for Music Audio
    The task of ranking sounds from text queries is a good test application for machine-hearing techniques, and particularly for comparison and evaluation of alternative sound representations in a large-scale setting. We have adapted a machine-vision system, the "passive-aggressive model for image retrieval" (PAMIR), which efficiently learns, using a ranking-based cost function, a linear mapping from a very large sparse feature space to a large query-term space. Using this system allows us to focus on comparison of different auditory front ends and different ways of extracting sparse features from high-dimensional auditory images. In addition to two main auditory-image models, we also include and compare a family of more conventional MFCC front ends. The experimental results show a significant advantage for the auditory models over vector-quantized MFCCs. The two auditory models tested use the adaptive pole-zero filter cascade (PZFC) auditory filterbank and sparse-code feature extraction from stabilized auditory images via multiple vector quantizers. The models differ in their implementation of the strobed temporal integration used to generate the stabilized image. Using ranking precision-at-top-k performance measures, the best results are about 70% top-1 precision and 35% average precision, using a test corpus of thousands of sound files and a query vocabulary of hundreds of words.
    Large Scale Content-Based Audio Retrieval from Text Queries
    Gal Chechik
    Martin Rehn
    Samy Bengio
    ACM International Conference on Multimedia Information Retrieval (MIR), ACM (2008)
    In content-based audio retrieval, the goal is to find sound recordings (audio documents) based on their acoustic features. This content-based approach differs from retrieval approaches that index media files using metadata such as file names and user tags. In this paper, we propose a machine learning approach for retrieving sounds that is novel in that it (1) uses free-form text queries rather than sound-sample-based queries, (2) searches by audio content rather than via textual metadata, and (3) can scale to a very large number of audio documents and a very rich query vocabulary. We handle generic sounds, including a wide variety of sound effects, animal vocalizations and natural scenes. We test a scalable approach based on a passive-aggressive model for image retrieval (PAMIR), and compare it to two state-of-the-art approaches: Gaussian mixture models (GMM) and support vector machines (SVM). We test our approach on two large real-world datasets: a collection of short sound effects, and a noisier and larger collection of user-contributed, user-labeled recordings (25K files, 2000-term vocabulary). We find that all three methods achieved very good retrieval performance. For instance, a positive document is retrieved in the first position of the ranking more than half the time, and on average there are more than 4 positive documents in the first 10 retrieved, for both datasets. PAMIR completed both training and retrieval of all data in less than 6 hours for both datasets, on a single machine. It was one to three orders of magnitude faster than the competing approaches. This approach should therefore scale to much larger datasets in the future.
    Practical Gammatone-Like Filters for Auditory Modeling
    Andreas G. Katsiamis
    Emmanuel M. Drakakis
    EURASIP Journal on Audio, Speech, and Music Processing, vol. 2007 (2007), pp. 12
    This paper deals with continuous-time filter transfer functions that resemble tuning curves at a particular set of places on the basilar membrane of the biological cochlea and that are suitable for practical VLSI implementations. The resulting filters can be used in a filterbank architecture to realize cochlear implants or auditory processors of increased biorealism. To put the reader into context, the paper starts with a short review on the gammatone filter and then exposes two of its variants, namely, the differentiated all-pole gammatone filter (DAPGF) and one-zero gammatone filter (OZGF), filter responses that provide a robust foundation for modeling cochlear transfer functions. The DAPGF and OZGF responses are attractive because they exhibit certain characteristics suitable for modeling a variety of auditory data: level-dependent gain, linear tail for frequencies well below the center frequency, asymmetry, and so forth. In addition, their form suggests their implementation by means of cascades of N identical two-pole systems which render them as excellent candidates for efficient analog or digital VLSI realizations. We provide results that shed light on their characteristics and attributes and which can also serve as “design curves” for fitting these responses to frequency-domain physiological data. The DAPGF and OZGF responses are essentially a “missing link” between physiological, electrical, and mechanical models for auditory filtering.