Towards Learning Semantic Audio Representations from Unlabeled Data

Aren Jansen; Manoj Plakal; Ratheet Pandya; Dan Ellis; Shawn Hershey; Jiayang Liu; Channing Moore; Rif A. Saurous

NIPS Workshop on Machine Learning for Audio Signal Processing (ML4Audio) (2017) (to appear)

Abstract

Our goal is to learn semantically structured audio representations without relying on categorically labeled data. We consider several class-agnostic semantic constraints that are inherent to non-speech audio: (i) sound categories are invariant to additive noise and translations in time, (ii) mixtures of two sound events inherit the categories of the constituents, and (iii) the categories of events in close temporal proximity in a single recording are likely to be the same or related. We apply these invariants in the service of sampling training data for triplet-loss embedding models using a large unlabeled dataset of YouTube soundtracks. The resulting low-dimensional representations provide both greatly improved query-by-example retrieval performance and reduced labeled data and model complexity requirements for supervised sound classification.
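To make the three constraints concrete, the sketch below shows one way positives for an anchor clip could be sampled under each of them, together with a standard hinge-style triplet loss. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not give the loss form, distance, margin, noise model, or mixing rule, so the squared Euclidean distance, the Gaussian noise, the circular time shift, the weighted sum, and every function name and default value here are hypothetical. Clips are represented as plain lists of feature frames, and only the Python standard library is used.

```python
import random


def triplet_hinge_loss(anchor, positive, negative, margin=0.1):
    """Hinge triplet loss on squared Euclidean distances between embeddings.

    The loss is zero once the negative is farther from the anchor than the
    positive by at least `margin` (margin value is an assumption).
    """
    d_pos = sum((a - p) ** 2 for a, p in zip(anchor, positive))
    d_neg = sum((a - n) ** 2 for a, n in zip(anchor, negative))
    return max(0.0, d_pos - d_neg + margin)


def noisy_positive(frames, rng, noise_scale=0.1):
    """Constraint (i), noise: the anchor with additive Gaussian noise (assumed model)."""
    return [[x + rng.gauss(0.0, noise_scale) for x in frame] for frame in frames]


def shifted_positive(frames, rng, max_shift=10):
    """Constraint (i), time translation: circularly shift the frames in time."""
    k = rng.randint(-max_shift, max_shift) % len(frames)
    return list(frames) if k == 0 else frames[-k:] + frames[:-k]


def mixture_positive(frames, other, weight=0.5):
    """Constraint (ii): mix the anchor with another clip of the same shape.

    A plain weighted sum is an assumption; the mixture keeps the anchor's content.
    """
    return [[x + weight * y for x, y in zip(f, g)] for f, g in zip(frames, other)]


def nearby_positive(recording, anchor_start, length, rng, max_offset=50):
    """Constraint (iii): another segment of the same recording near the anchor.

    Assumes 0 <= anchor_start <= len(recording) - length.
    """
    lo = max(0, anchor_start - max_offset)
    hi = min(len(recording) - length, anchor_start + max_offset)
    start = rng.randint(lo, hi)
    return recording[start:start + length]
```

In each case the negative would be a clip drawn from elsewhere in the unlabeled collection (again an assumption), and the embedding model would be trained so that `triplet_hinge_loss` over the embedded anchor, positive, and negative is driven toward zero.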
