Eduardo Fonseca
I am currently a Research Scientist at Google Research, working in the Sound Understanding Group on machine learning for audio processing. Before joining Google, I received my PhD from the Music Technology Group of Universitat Pompeu Fabra in Barcelona. My thesis focused on sound event classification using different types of supervision; highlights include the Best Audio Representation Learning Paper Award at WASPAA 2021 and the FSD50K paper and dataset. My research explores learning algorithms for audio processing with different types of supervision, including self-supervised learning, learning with noisy labels, and multimodal learning. I have also been involved in DCASE as a Challenge Task Organizer and Technical Program Co-Chair. See my personal website or my Google Scholar profile for a full list of publications.
Authored Publications
Self-Supervised Learning from Automatically Separated Sound Scenes
Xavier Serra
WASPAA 2021 (2021)
Real-world sound scenes consist of time-varying collections of sound sources, each generating characteristic sound events that are mixed together in audio recordings. The association of these constituent sound events with their mixture and each other is semantically constrained: the sound scene contains the union of source classes and not all classes naturally co-occur. With this motivation, this paper explores the use of unsupervised automatic sound separation to decompose unlabeled sound scenes into multiple semantically-linked views for use in self-supervised contrastive learning. We find that learning to associate input mixtures with their automatically separated outputs yields stronger representations than past approaches that use the mixtures alone. Further, we discover that optimal source separation is not required for successful contrastive learning by demonstrating that a range of separation system convergence states all lead to useful and often complementary example transformations. Our best system incorporates these unsupervised separation models into a single augmentation front-end and jointly optimizes similarity maximization and coincidence prediction objectives across the views. The result is an unsupervised audio representation that rivals state-of-the-art alternatives on the established shallow AudioSet classification benchmark.
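As a rough illustration of the similarity-maximization part of this setup, the sketch below treats a mixture embedding and the embedding of one of its automatically separated channels as a positive pair in an NT-Xent-style contrastive loss. The embedding dimensions, temperature, and random inputs are placeholders, not the paper's actual encoder or configuration.

```python
# Simplified sketch: contrastive similarity maximization between a sound
# mixture and one of its automatically separated channels. Positive pairs are
# (mixture_i, separated_i); all other pairings in the batch act as negatives.
import torch
import torch.nn.functional as F

def mixture_separation_contrastive_loss(mix_emb, sep_emb, temperature=0.1):
    """mix_emb, sep_emb: (batch, dim) embeddings; row i of each comes from the same clip."""
    mix = F.normalize(mix_emb, dim=1)
    sep = F.normalize(sep_emb, dim=1)
    logits = mix @ sep.t() / temperature          # (batch, batch) scaled cosine similarities
    targets = torch.arange(mix.size(0))           # positives lie on the diagonal
    # Symmetrized cross-entropy: mixture -> separation and separation -> mixture.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Example with random embeddings standing in for encoder outputs.
loss = mixture_separation_contrastive_loss(torch.randn(8, 128), torch.randn(8, 128))
```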
The Benefit of Temporally-Strong Labels in Audio Event Classification
Caroline Liu
Proceedings of ICASSP 2021 (2021)
To reveal the importance of temporal precision in ground-truth audio event labels, we collected precise (∼0.1 sec resolution) “strong” labels for a portion of the AudioSet dataset. We devised a temporally strong evaluation set (including explicit negatives of varying difficulty) and a small strongly-labeled training subset of 67k clips (compared to the original dataset’s 1.8M clips labeled at 10 sec resolution). We show that fine-tuning with a mix of weakly- and strongly-labeled data can substantially improve classifier performance, even when evaluated using only the original weak labels. For a ResNet50 architecture, d' on the strong evaluation data including explicit negatives improves from 1.13 to 1.41. The new labels are available as an update to AudioSet.
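The d' values quoted above are the AudioSet-style metric obtained from ROC AUC via d' = √2 · Φ⁻¹(AUC). The sketch below shows that conversion, assuming per-class scores and binary labels are already available; it is illustrative, not the paper's evaluation code.

```python
# Convert ROC AUC to d-prime: d' = sqrt(2) * inverse-normal-CDF(AUC).
import numpy as np
from scipy import stats
from sklearn.metrics import roc_auc_score

def d_prime(auc):
    return np.sqrt(2.0) * stats.norm.ppf(auc)

# Illustrative usage with random scores and labels for a single class.
labels = np.random.randint(0, 2, size=1000)
scores = np.random.rand(1000)
print(d_prime(roc_auc_score(labels, scores)))
```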
Audio Tagging with Noisy Labels and Minimal Supervision
Frederic Font
Xavier Serra
Proceedings of the Detection and Classification of Acoustic Scenes and Events 2019 Workshop (DCASE2019) (2019)
This paper introduces Task 2 of the DCASE2019 Challenge, titled "Audio tagging with noisy labels and minimal supervision". This task was hosted on the Kaggle platform as "Freesound Audio Tagging 2019". The task evaluates systems for multi-label audio tagging using a large set of noisy-labeled data and a much smaller set of manually-labeled data, under a large-vocabulary setting of 80 everyday sound classes. In addition, the proposed dataset poses an acoustic mismatch problem between the noisy train set and the test set, because they come from different web audio sources. This mismatch reflects a realistic scenario, given the difficulty of gathering large amounts of manually labeled data. We present the task setup, the FSDKaggle2019 dataset prepared for this scientific evaluation, and a baseline system consisting of a convolutional neural network. All these resources are freely available.
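For context, a multi-label tagging baseline of the kind mentioned above can be sketched as a small CNN over log-mel patches with one sigmoid output per class, trained with binary cross-entropy. The layer sizes and input shape below are illustrative placeholders, not the actual challenge baseline.

```python
# Minimal multi-label audio tagging sketch: a tiny CNN over log-mel patches
# with 80 independent (sigmoid) outputs trained with binary cross-entropy.
import torch
import torch.nn as nn

class TinyTagger(nn.Module):
    def __init__(self, n_classes=80):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(64, n_classes)

    def forward(self, logmel):                       # logmel: (batch, 1, time, mels)
        x = self.features(logmel).flatten(1)
        return self.head(x)                          # raw logits, one per class

model = TinyTagger()
logits = model(torch.randn(4, 1, 128, 96))           # random batch standing in for log-mel patches
loss = nn.BCEWithLogitsLoss()(logits, torch.randint(0, 2, (4, 80)).float())
```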
Learning Sound Event Classifiers From Web Audio With Noisy Labels
Frederic Font
Xavier Favory
Xavier Serra
Proceedings of ICASSP 2019 (2019)
As sound event classification moves towards larger datasets, issues of label noise become inevitable. Websites can supply large volumes of user-contributed audio and metadata, but inferring labels from this metadata introduces errors due to unreliable inputs and limitations in the metadata-to-label mapping. There is, however, little research into the impact of these errors. To foster the investigation of label noise in sound event classification, we present FSDnoisy18k, a dataset containing 44.2 hours of audio across 20 sound classes, including a small amount of manually-labeled data and a larger quantity of real-world noisy data. We characterize the label noise empirically and provide a CNN baseline system. Experiments suggest that training with large amounts of noisy data can outperform training with smaller amounts of carefully-labeled data. We also show that noise-robust loss functions can be effective in improving performance in the presence of corrupted labels.
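As an example of the kind of noise-robust loss referred to above, the generalized cross-entropy (Lq) loss of Zhang and Sabuncu interpolates between cross-entropy and mean absolute error, down-weighting examples whose labels the model fits poorly. The sketch below is a generic single-label implementation for illustration, not necessarily the exact variant used in the paper.

```python
# Generalized cross-entropy (Lq) loss: behaves like cross-entropy as q -> 0
# and like MAE at q = 1, which makes it less sensitive to mislabeled examples.
import torch
import torch.nn.functional as F

def lq_loss(logits, targets, q=0.7):
    """logits: (batch, n_classes); targets: (batch,) integer class labels."""
    probs = F.softmax(logits, dim=1)
    p_true = probs.gather(1, targets.unsqueeze(1)).squeeze(1)   # probability of the labeled class
    return ((1.0 - p_true.pow(q)) / q).mean()

# Illustrative usage with random logits over 20 classes.
loss = lq_loss(torch.randn(8, 20), torch.randint(0, 20, (8,)))
```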