Audio Set: An ontology and human-labeled dataset for audio events
Abstract
Audio event recognition, the human-like ability to identify and relate sounds
from audio, is a nascent problem in machine perception. Comparable problems such
as object detection in images have reaped enormous benefits from comprehensive
datasets -- principally ImageNet. This paper describes the creation of
Audio Set, a large-scale dataset of manually annotated audio events that
endeavors to bridge the gap in data availability between image and audio
research. Using a carefully structured hierarchical ontology of 635 audio
classes guided by the literature and manual curation, we collect data from human
labelers to probe the presence of specific audio classes in 10-second segments
of YouTube videos. Segments are proposed for labeling using searches based on
metadata, context (e.g., links), and content analysis. The
result is a dataset of unprecedented breadth and size that will, we hope,
substantially stimulate the development of high-performance audio event
recognizers.