AVA-Speech: A Densely Labeled Dataset of Speech Activity in Movies

Sourish Chaudhuri; Joseph Roth; Dan Ellis; Andrew C. Gallagher; Liat Kaver; Radhika Marvin; Caroline Pantofaru; Nathan Christopher Reale; Loretta Guarino Reid; Kevin Wilson; Zhonghua Xi

Proceedings of Interspeech, 2018

Abstract

Speech activity detection (or endpointing) is an important processing step for applications such as speech recognition, language identification and speaker diarization. Both audio- and vision-based approaches have been used for this task in various settings and with multiple variations tailored toward applications. Unfortunately, much of the prior work reports results in synthetic settings, on task-specific datasets, or on datasets that are not openly available. This makes it difficult to compare approaches in similar settings and to understand their strengths and weaknesses. In this paper, we describe a new dataset of densely labeled speech activity in YouTube video clips, which has been designed to address these issues and will be released publicly. The dataset labels go beyond speech alone, annotating three specific speech activity situations: clean speech, speech and music co-occurring, and speech and noise co-occurring. These classes will enable further analysis of model performance in the presence of noise. We report benchmark performance numbers on this dataset using state-of-the-art audio and vision models.
