Authored Publications
The Benefit of Temporally-Strong Labels in Audio Event Classification
Caroline Liu
Proceedings of ICASSP 2021 (2021)
To reveal the importance of temporal precision in ground truth audio event labels, we collected precise (∼0.1 sec resolution) “strong” labels for a portion of the AudioSet dataset. We devised a temporally strong evaluation set (including explicit negatives of varying difficulty) and a small strong-labeled training subset of 67k clips (compared to the original dataset’s 1.8M clips labeled at 10 sec resolution). We show that fine-tuning with a mix of weakly and strongly labeled data can substantially improve classifier performance, even when evaluated using only the original weak labels. For a ResNet50 architecture, d' on the strong evaluation data including explicit negatives improves from 1.13 to 1.41. The new labels are available as an update to AudioSet.
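For readers unfamiliar with d', the sketch below shows one common way to relate it to ROC AUC, assuming equal-variance Gaussian score distributions, so that d' = sqrt(2) * Φ⁻¹(AUC). The abstract does not state how d' was computed, so this formula and the function names `d_prime_from_auc` and `auc_from_d_prime` are illustrative assumptions rather than the authors' evaluation code.

```python
from math import sqrt
from statistics import NormalDist


def d_prime_from_auc(auc: float) -> float:
    """Convert ROC AUC to d' (assumes equal-variance Gaussian class scores)."""
    if not 0.0 < auc < 1.0:
        raise ValueError("auc must lie strictly between 0 and 1")
    # Inverse standard-normal CDF of the AUC, scaled by sqrt(2).
    return sqrt(2.0) * NormalDist().inv_cdf(auc)


def auc_from_d_prime(d_prime: float) -> float:
    """Inverse mapping: recover the AUC implied by a given d'."""
    return NormalDist().cdf(d_prime / sqrt(2.0))
```

Under this assumption, a chance-level classifier (AUC of 0.5) has d' of 0, and larger d' values correspond to better separation between positive and negative scores.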
Comparing Supervised Models And Learned Speech Representations For Classifying Intelligibility Of Disordered Speech On Selected Phrases
Joel Shor
Jordan R. Green
Interspeech 2021 (2021) (to appear)
Automatic classification of disordered speech can provide an objective tool for identifying the presence and severity of a speech impairment. Classification approaches can also help identify hard-to-recognize speech samples to teach ASR systems about the variable manifestations of impaired speech. Here, we develop and compare different deep learning techniques to classify the intelligibility of disordered speech on selected phrases. We collected samples from a diverse set of 661 speakers with a variety of self-reported disorders speaking 29 words or phrases, which were rated by speech-language pathologists for their overall intelligibility using a five-point Likert scale. We then evaluated classifiers developed using 3 approaches: (1) a convolutional neural network (CNN) trained for the task, (2) classifiers trained on non-semantic speech representations from CNNs that used an unsupervised objective [1], and (3) classifiers trained on the acoustic (encoder) embeddings from an ASR system trained on typical speech [2]. We find that the ASR encoder’s embeddings considerably outperform the other two on detecting and classifying disordered speech. Further analysis shows that the ASR embeddings cluster speech by the spoken phrase, while the non-semantic embeddings cluster speech by speaker. Also, longer phrases are more indicative of intelligibility deficits than single words.
Self-Supervised Learning from Automatically Separated Sound Scenes
Xavier Serra
WASPAA 2021 (2021)
Real-world sound scenes consist of time-varying collections of sound sources, each generating characteristic sound events that are mixed together in audio recordings. The association of these constituent sound events with their mixture and each other is semantically-constrained: the sound scene contains the union of source classes and not all classes naturally co-occur. With this motivation, this paper explores the use of unsupervised automatic sound separation to decompose unlabeled sound scenes into multiple semantically-linked views for use in self-supervised contrastive learning. We find that learning to associate input mixtures with their automatically separated outputs yields stronger representations than past approaches that use the mixtures alone. Further, we discover that optimal source separation is not required for successful contrastive learning by demonstrating that a range of separation system convergence states all lead to useful and often complementary example transformations. Our best system incorporates these unsupervised separation models into a single augmentation front-end and jointly optimizes similarity maximization and coincidence prediction objectives across the views. The result is an unsupervised audio representation that rivals state-of-the-art alternatives on the established shallow AudioSet classification benchmark.
Coincidence, Categorization, and Consolidation: Learning to Recognize Sounds with Minimal Supervision
Proceedings of ICASSP 2020 (2020) (to appear)
Humans do not acquire perceptual abilities the way we train machines. While machine learning algorithms typically operate on large collections of randomly-chosen, explicitly-labeled examples, human acquisition relies far more on multimodal unsupervised learning (as infants) and active learning (as children). With this motivation, we present a learning framework for sound representation and recognition that combines (i) a self-supervised objective based on a general notion of unimodal and cross-modal coincidence, (ii) a novel clustering objective that reflects our need to impose categorical structure on our experiences, and (iii) a cluster-based active learning procedure that solicits targeted weak supervision to consolidate hypothesized categories into relevant semantic classes. By jointly training a single sound embedding/clustering/classification network according to these criteria, we achieve a new state-of-the-art unsupervised audio representation and demonstrate up to a 20-fold reduction in the labels required to reach a desired classification performance.
Audio Tagging with Noisy Labels and Minimal Supervision
Frederic Font
Xavier Serra
Proceedings of the Detection and Classification of Acoustic Scenes and Events 2019 Workshop (DCASE2019) (to appear)
This paper introduces Task 2 of the DCASE2019 Challenge, titled "Audio tagging with noisy labels and minimal supervision". This task was hosted on the Kaggle platform as "Freesound Audio Tagging 2019". The task evaluates systems for multi-label audio tagging using a large set of noisy-labeled data, and a much smaller set of manually-labeled data, under a large vocabulary setting of 80 everyday sound classes. In addition, the proposed dataset poses an acoustic mismatch problem between the noisy train set and the test set because they come from different web audio sources. This reflects a realistic scenario, given the difficulty of gathering large amounts of manually labeled data. We present the task setup, the FSDKaggle2019 dataset prepared for this scientific evaluation, and a baseline system consisting of a convolutional neural network. All these resources are freely available.
Learning Sound Event Classifiers From Web Audio With Noisy Labels
Frederic Font
Xavier Favory
Xavier Serra
Proceedings of ICASSP 2019 (to appear)
As sound event classification moves towards larger datasets, issues of label noise become inevitable. Web sites can supply large volumes of user-contributed audio and metadata, but inferring labels from this metadata introduces errors due to unreliable inputs and limitations in the mapping. There is, however, little research into the impact of these errors. To foster the investigation of label noise in sound event classification, we present FSDnoisy18k, a dataset containing 44.2 hours of audio across 20 sound classes, including a small amount of manually-labeled data and a larger quantity of real-world noisy data. We characterize the label noise empirically, and provide a CNN baseline system. Experiments suggest that training with large amounts of noisy data can outperform training with smaller amounts of carefully-labeled data. We also show that noise-robust loss functions can be effective in improving performance in the presence of corrupted labels.
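As an illustration of what a noise-robust loss can look like, the sketch below implements the generalized cross-entropy (L_q) loss, (1 - p^q) / q, for a single example. The abstract does not name the specific losses evaluated, so the choice of L_q, the function name `lq_loss`, and the default `q=0.7` are assumptions made for this example only.

```python
from math import log


def lq_loss(prob_true_class: float, q: float = 0.7) -> float:
    """Generalized cross-entropy (L_q) loss for one example.

    prob_true_class: predicted probability of the (possibly noisy) label.
    q: robustness knob in [0, 1]; q = 0 is treated as ordinary
       cross-entropy and q = 1 gives 1 - p, which is more tolerant of
       confidently mislabeled examples.
    """
    if not 0.0 < prob_true_class <= 1.0:
        raise ValueError("prob_true_class must be in (0, 1]")
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be in [0, 1]")
    if q == 0.0:
        # Limit of (1 - p**q) / q as q approaches 0.
        return -log(prob_true_class)
    return (1.0 - prob_true_class ** q) / q
```

Unlike cross-entropy, this loss is bounded by 1/q when q > 0, which limits how much a single mislabeled clip can dominate training.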
Unsupervised Learning of Semantic Audio Representations
Ratheet Pandya
Jiayang Liu
Proceedings of ICASSP 2018 (to appear)
Even in the absence of any explicit semantic annotation, vast collections of audio recordings provide valuable information for learning the categorical structure of sounds. We consider several class-agnostic semantic constraints that apply to unlabeled nonspeech audio: (i) noise and translations in time do not change the underlying sound category, (ii) a mixture of two sound events inherits the categories of the constituents, and (iii) the categories of events in close temporal proximity are likely to be the same or related. Without labels to ground them, these constraints are incompatible with classification loss functions. However, they may still be leveraged to identify geometric inequalities needed for triplet loss-based training of convolutional neural networks. The result is low-dimensional embeddings of the input spectrograms that recover 41% and 84% of the performance of their fully-supervised counterparts when applied to downstream query-by-example sound retrieval and sound event classification tasks, respectively. Moreover, in limited-supervision settings, our unsupervised embeddings double the state-of-the-art classification performance.
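The triplet-based training mentioned above can be sketched as a hinge loss over (anchor, positive, negative) embeddings. This is a minimal illustration assuming squared Euclidean distance and a fixed margin; the function names and the margin value are hypothetical, and the abstract does not specify the distance or margin the authors used.

```python
from typing import Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length embeddings."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b))


def triplet_loss(
    anchor: Sequence[float],
    positive: Sequence[float],
    negative: Sequence[float],
    margin: float = 0.1,
) -> float:
    """Hinge triplet loss.

    The loss is zero once the anchor is closer to the positive than to the
    negative by at least `margin`. Under constraint (i), for example, the
    positive could be the embedding of a noisy or time-shifted copy of the
    anchor clip, and the negative the embedding of some other clip.
    """
    gap = squared_distance(anchor, positive) - squared_distance(anchor, negative)
    return max(0.0, gap + margin)
```

Each semantic constraint then amounts to a rule for choosing which clips play the positive and negative roles, with no class labels needed.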
General-purpose tagging of Freesound audio with AudioSet labels: Task Description, Dataset and Baseline
Eduardo Fonseca
Frederic Font
Xavier Favory
Jordi Pons
Xavier Serra
Proceedings of the Detection and Classification of Acoustic Scenes and Events 2018 Workshop (DCASE2018) (to appear)
This paper describes Task 2 of the DCASE 2018 Challenge, titled "General-purpose audio tagging of Freesound content with AudioSet labels". This task was hosted on the Kaggle platform as "Freesound General-Purpose Audio Tagging Challenge". The goal of the task is to build an audio tagging system that can recognize the category of an audio clip from a subset of 41 heterogeneous categories drawn from the AudioSet Ontology. We present the task, the dataset prepared for the competition, and a baseline system.
CNN Architectures for Large-Scale Audio Classification
Jort F. Gemmeke
Devin Platt
Malcolm Slaney
International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE (2017)
Convolutional Neural Networks (CNNs) have proven very effective in image classification and have shown promise for audio classification. We apply various CNN architectures to audio and investigate their ability to classify videos with a very large scale data set of 70M training videos (5.24 million hours) with 30,871 labels. We examine fully connected Deep Neural Networks (DNNs), AlexNet [1], VGG [2], Inception [3], and ResNet [4]. We explore the effects of training with different sized subsets of the 70M training videos. Additionally, we report the effect of training over different subsets of the 30,871 labels. While our dataset contains video-level labels, we are also interested in Acoustic Event Detection (AED) and train a classifier on embeddings learned from the video-level task on AudioSet [5]. We find that derivatives of image classification networks do well on our audio classification task, that increasing the number of labels we train on provides some improved performance over subsets of labels, that performance of models improves as we increase training set size, and that a model using embeddings learned from the video-level task does much better than a baseline on the AudioSet classification task.
Audio Set: An ontology and human-labeled dataset for audio events
Jort F. Gemmeke
Dylan Freedman
Wade Lawrence
Proc. IEEE ICASSP 2017, New Orleans, LA (to appear)
Audio event recognition, the human-like ability to identify and relate sounds from audio, is a nascent problem in machine perception. Comparable problems such as object detection in images have reaped enormous benefits from comprehensive datasets, principally ImageNet. This paper describes the creation of Audio Set, a large-scale dataset of manually-annotated audio events that endeavors to bridge the gap in data availability between image and audio research. Using a carefully structured hierarchical ontology of 635 audio classes guided by the literature and manual curation, we collect data from human labelers to probe the presence of specific audio classes in 10 second segments of YouTube videos. Segments are proposed for labeling using searches based on metadata, context (e.g., links), and content analysis. The result is a dataset of unprecedented breadth and size that will, we hope, substantially stimulate the development of high-performance audio event recognizers.