Channing Moore
Authored Publications
Self-Supervised Learning from Automatically Separated Sound Scenes
Xavier Serra
WASPAA 2021 (2021)
Real-world sound scenes consist of time-varying collections of sound sources, each generating characteristic sound events that are mixed together in audio recordings. The association of these constituent sound events with their mixture and each other is semantically-constrained: the sound scene contains the union of source classes and not all classes naturally co-occur. With this motivation, this paper explores the use of unsupervised automatic sound separation to decompose unlabeled sound scenes into multiple semantically-linked views for use in self-supervised contrastive learning. We find that learning to associate input mixtures with their automatically separated outputs yields stronger representations than past approaches that use the mixtures alone. Further, we discover that optimal source separation is not required for successful contrastive learning by demonstrating that a range of separation system convergence states all lead to useful and often complementary example transformations. Our best system incorporates these unsupervised separation models into a single augmentation front-end and jointly optimizes similarity maximization and coincidence prediction objectives across the views. The result is an unsupervised audio representation that rivals state-of-the-art alternatives on the established shallow AudioSet classification benchmark.
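The view-pairing idea above can be illustrated with a minimal contrastive sketch: embed a mixture and one of its automatically separated channels, then maximize their similarity against in-batch negatives. Everything here (the placeholder encoder, shapes, temperature) is an illustrative assumption, not the paper's implementation.

```python
# Minimal NumPy sketch of a contrastive objective over a mixture and one of its
# automatically separated channels. Placeholder encoder and shapes only.
import numpy as np

def embed(batch_audio):
    """Placeholder encoder: returns L2-normalized embeddings of shape (B, D)."""
    rng = np.random.default_rng(0)
    z = rng.normal(size=(batch_audio.shape[0], 128))
    return z / np.linalg.norm(z, axis=1, keepdims=True)

def nt_xent(z_mix, z_sep, temperature=0.1):
    """Similarity maximization between each mixture and its own separated
    channel, with the other examples in the batch acting as negatives."""
    logits = z_mix @ z_sep.T / temperature           # (B, B) cosine similarities
    idx = np.arange(len(z_mix))                      # positives sit on the diagonal
    log_probs = logits - np.log(np.sum(np.exp(logits), axis=1, keepdims=True))
    return -np.mean(log_probs[idx, idx])

# Usage: a batch of mixtures (B, T) and one separated channel per mixture (B, T).
mixtures = np.random.randn(8, 16000)
separated = np.random.randn(8, 16000)
loss = nt_xent(embed(mixtures), embed(separated))
```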
The Benefit of Temporally-Strong Labels in Audio Event Classification
Caroline Liu
Proceedings of ICASSP 2021 (2021)
To reveal the importance of temporal precision in ground truth audio event labels, we collected precise (∼0.1 sec resolution) “strong” labels for a portion of the AudioSet dataset. We devised a temporally strong evaluation set (including explicit negatives of varying difficulty) and a small strong-labeled training subset of 67k clips (compared to the original dataset’s 1.8M clips labeled at 10 sec resolution). We show that fine-tuning with a mix of weak- and strongly-labeled data can substantially improve classifier performance, even when evaluated using only the original weak labels. For a ResNet50 architecture, d' on the strong evaluation data including explicit negatives improves from 1.13 to 1.41. The new labels are available as an update to AudioSet.
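The d' figures quoted above can be related to ROC AUC under the usual equal-variance Gaussian assumption, d' = sqrt(2) * Phi^{-1}(AUC). A small sketch with synthetic scores (not the paper's data):

```python
# Sketch: converting ROC AUC to d-prime under the equal-variance Gaussian
# assumption, d' = sqrt(2) * Phi^{-1}(AUC). Scores below are synthetic.
import numpy as np
from scipy.stats import norm
from sklearn.metrics import roc_auc_score

def d_prime(auc):
    return np.sqrt(2.0) * norm.ppf(auc)

rng = np.random.default_rng(0)
scores = np.concatenate([rng.normal(1.0, 1.0, 500),   # positive-clip scores
                         rng.normal(0.0, 1.0, 500)])  # explicit-negative scores
labels = np.concatenate([np.ones(500), np.zeros(500)])
print(d_prime(roc_auc_score(labels, scores)))  # ~1.0 for unit class separation
```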
Coincidence, Categorization, and Consolidation: Learning to Recognize Sounds with Minimal Supervision
Proceedings of ICASSP 2020 (2020) (to appear)
Humans do not acquire perceptual abilities the way we train machines. While machine learning algorithms typically operate on large collections of randomly-chosen, explicitly-labeled examples, human acquisition relies far more heavily on multimodal unsupervised learning (as infants) and active learning (as children). With this motivation, we present a learning framework for sound representation and recognition that combines (i) a self-supervised objective based on a general notion of unimodal and cross-modal coincidence, (ii) a novel clustering objective that reflects our need to impose categorical structure on our experiences, and (iii) a cluster-based active learning procedure that solicits targeted weak supervision to consolidate hypothesized categories into relevant semantic classes. By jointly training a single sound embedding/clustering/classification network according to these criteria, we achieve a new state-of-the-art unsupervised audio representation and demonstrate up to a 20-fold reduction in the labels required to reach a desired classification performance.
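One way to picture the consolidation step: cluster the unlabeled embeddings, query a few examples near each centroid, and assign the majority-voted class to the whole cluster. The clustering routine, oracle interface, and label budget below are assumptions for illustration, not the paper's procedure.

```python
# Sketch of cluster-based consolidation: solicit a handful of labels near each
# cluster centroid and map whole clusters to the majority class. Illustrative only.
import numpy as np
from sklearn.cluster import KMeans

def consolidate(embeddings, n_clusters, labels_per_cluster, oracle):
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(embeddings)
    cluster_to_class = {}
    for c in range(n_clusters):
        members = np.where(km.labels_ == c)[0]
        # Query the examples closest to the centroid (targeted weak supervision).
        dists = np.linalg.norm(embeddings[members] - km.cluster_centers_[c], axis=1)
        queried = members[np.argsort(dists)[:labels_per_cluster]]
        votes = [oracle(i) for i in queried]           # oracle(i) -> class label
        cluster_to_class[c] = max(set(votes), key=votes.count)  # majority vote
    # Every example inherits its cluster's consolidated class.
    return np.array([cluster_to_class[c] for c in km.labels_])
```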
Unsupervised Learning of Semantic Audio Representations
Ratheet Pandya
Jiayang Liu
Proceedings of ICASSP 2018 (to appear)
Even in the absence of any explicit semantic annotation, vast collections of audio recordings provide valuable information for learning the categorical structure of sounds. We consider several class-agnostic semantic constraints that apply to unlabeled nonspeech audio: (i) noise and translations in time do not change the underlying sound category, (ii) a mixture of two sound events inherits the categories of the constituents, and (iii) the categories of events in close temporal proximity are likely to be the same or related. Without labels to ground them, these constraints are incompatible with classification loss functions. However, they may still be leveraged to identify geometric inequalities needed for triplet loss-based training of convolutional neural networks. The result is low-dimensional embeddings of the input spectrograms that recover 41% and 84% of the performance of their fully-supervised counterparts when applied to downstream query-by-example sound retrieval and sound event classification tasks, respectively. Moreover, in limited-supervision settings, our unsupervised embeddings double the state-of-the-art classification performance.
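A rough sketch of how constraints (i)-(iii) can drive label-free triplet construction, paired with a standard hinge triplet loss; the augmentation parameters and the use of a random clip as the negative are simplifying assumptions, not the paper's sampling scheme.

```python
# Sketch: forming (anchor, positive, negative) triplets from the class-agnostic
# constraints, plus a hinge triplet loss on the resulting embeddings.
import numpy as np

def make_triplet(clips, i, j, rng, shift_max=4000, noise_level=0.1):
    anchor = clips[i]
    choice = rng.integers(3)
    if choice == 0:   # (i) additive noise and time shift keep the category
        positive = np.roll(anchor, rng.integers(-shift_max, shift_max))
        positive = positive + noise_level * rng.normal(size=anchor.shape)
    elif choice == 1: # (ii) a mixture inherits the anchor's category
        positive = anchor + clips[j]
    else:             # (iii) temporal proximity: neighboring segment (index order assumed)
        positive = clips[(i + 1) % len(clips)]
    negative = clips[rng.integers(len(clips))]   # random clip as a likely negative
    return anchor, positive, negative

def triplet_loss(ea, ep, en, margin=0.5):
    return max(0.0, np.linalg.norm(ea - ep) - np.linalg.norm(ea - en) + margin)
```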
Audio Set: An ontology and human-labeled dataset for audio events
Jort F. Gemmeke
Dylan Freedman
Wade Lawrence
Proc. IEEE ICASSP 2017, New Orleans, LA (to appear)
Audio event recognition, the human-like ability to identify and relate sounds from audio, is a nascent problem in machine perception. Comparable problems such as object detection in images have reaped enormous benefits from comprehensive datasets -- principally ImageNet. This paper describes the creation of Audio Set, a large-scale dataset of manually-annotated audio events that endeavors to bridge the gap in data availability between image and audio research. Using a carefully structured hierarchical ontology of 635 audio classes guided by the literature and manual curation, we collect data from human labelers to probe the presence of specific audio classes in 10-second segments of YouTube videos. Segments are proposed for labeling using searches based on metadata, context (e.g., links), and content analysis. The result is a dataset of unprecedented breadth and size that will, we hope, substantially stimulate the development of high-performance audio event recognizers.
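For readers who want to work with the released segment lists, a small parser sketch follows; the CSV layout (YTID, start_seconds, end_seconds, quoted comma-separated positive_labels, with '#'-prefixed header lines) is assumed from the public release rather than specified in the abstract.

```python
# Sketch: reading an AudioSet-style segments CSV into (ytid, start, end, labels) tuples.
# Field layout is an assumption based on the public release files.
import csv

def load_segments(path):
    segments = []
    with open(path) as f:
        for row in csv.reader(f, skipinitialspace=True):
            if not row or row[0].startswith('#'):     # skip header/comment lines
                continue
            ytid, start, end, labels = row[0], float(row[1]), float(row[2]), row[3]
            segments.append((ytid, start, end, labels.split(',')))
    return segments
```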
CNN Architectures for Large-Scale Audio Classification
Jort F. Gemmeke
Devin Platt
Malcolm Slaney
International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE (2017)
Convolutional Neural Networks (CNNs) have proven very effective in image classification and have shown promise for audio classification. We apply various CNN architectures to audio and investigate their ability to classify videos with a very large-scale data set of 70M training videos (5.24 million hours) with 30,871 labels. We examine fully connected Deep Neural Networks (DNNs), AlexNet [1], VGG [2], Inception [3], and ResNet [4]. We explore the effects of training with differently sized subsets of the 70M training videos. Additionally, we report the effect of training over different subsets of the 30,871 labels. While our dataset contains video-level labels, we are also interested in Acoustic Event Detection (AED) and train a classifier on embeddings learned from the video-level task on AudioSet [5]. We find that derivatives of image classification networks do well on our audio classification task, that increasing the number of labels we train on provides some improved performance over subsets of labels, that performance of models improves as we increase training set size, and that a model using embeddings learned from the video-level task does much better than a baseline on the AudioSet classification task.
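A minimal Keras sketch of the general recipe (log-mel spectrogram patches through a small convolutional stack, sigmoid outputs for multi-label video-level targets, clip scores by averaging patch scores); the 96x64 patch shape and the layer sizes are illustrative assumptions, not the architectures benchmarked in the paper.

```python
# Sketch: a small CNN over log-mel patches with multi-label sigmoid outputs.
# Shapes and layer sizes are illustrative, not the paper's architectures.
import numpy as np
import tensorflow as tf

NUM_CLASSES = 30871  # the abstract's video-level label count; purely illustrative here

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(96, 64, 1)),        # log-mel patch: frames x mel bands
    tf.keras.layers.Conv2D(64, 3, activation='relu', padding='same'),
    tf.keras.layers.MaxPool2D(2),
    tf.keras.layers.Conv2D(128, 3, activation='relu', padding='same'),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(NUM_CLASSES, activation='sigmoid'),  # multi-label output
])
model.compile(optimizer='adam', loss='binary_crossentropy')

# Clip-level prediction by averaging patch-level scores (video-level labels only).
patches = np.random.rand(10, 96, 64, 1).astype('float32')   # ten patches of one clip
clip_score = model.predict(patches).mean(axis=0)
```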
Towards Learning Semantic Audio Representations from Unlabeled Data
Ratheet Pandya
Jiayang Liu
NIPS Workshop on Machine Learning for Audio Signal Processing (ML4Audio) (2017) (to appear)
Our goal is to learn semantically structured audio representations without relying on categorically labeled data. We consider several class-agnostic semantic constraints that are inherent to non-speech audio: (i) sound categories are invariant to additive noise and translations in time, (ii) mixtures of two sound events inherit the categories of the constituents, and (iii) the categories of events in close temporal proximity in a single recording are likely to be the same or related. We apply these invariants in the service of sampling training data for triplet-loss embedding models using a large unlabeled dataset of YouTube soundtracks. The resulting low-dimensional representations provide both greatly improved query-by-example retrieval performance and reduced labeled data and model complexity requirements for supervised sound classification.
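The query-by-example use case reduces to nearest-neighbor search in the learned embedding space; a small cosine-similarity sketch with random stand-in embeddings:

```python
# Sketch of query-by-example retrieval on top of learned embeddings:
# rank a corpus by cosine similarity to a query clip's embedding.
import numpy as np

def retrieve(query_emb, corpus_embs, top_k=5):
    q = query_emb / np.linalg.norm(query_emb)
    c = corpus_embs / np.linalg.norm(corpus_embs, axis=1, keepdims=True)
    sims = c @ q                      # cosine similarity to every corpus clip
    return np.argsort(-sims)[:top_k]  # indices of the best-matching clips

rng = np.random.default_rng(0)
print(retrieve(rng.normal(size=128), rng.normal(size=(1000, 128))))
```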