AVA-ActiveSpeaker: An Audio-Visual Dataset for Active Speaker Detection

Joseph Roth; Sourish Chaudhuri; Ondrej Klejch; Radhika Marvin; Andrew C. Gallagher; Liat Kaver; Sharadh Ramaswamy; Arkadiusz Stopczynski; Cordelia Schmid; Zhonghua Xi; Caroline Pantofaru

ICASSP, IEEE (2020)

Abstract

Active speaker detection is an important component of video analysis algorithms for applications such as speaker diarization, video re-targeting for meetings, speech enhancement, and human-robot interaction. The absence of a large, carefully labeled audio-visual active speaker dataset has limited evaluation in terms of data diversity, environments, and accuracy. In this paper, we present the AVA Active Speaker detection dataset (AVA-ActiveSpeaker), which has been publicly released to facilitate algorithm development and comparison. It contains temporally labeled face tracks in videos, where each face instance is labeled as speaking or not, along with whether the speech is audible. The dataset contains about 3.65 million human-labeled frames spanning 38.5 hours. We also introduce a state-of-the-art, jointly trained audio-visual model for real-time active speaker detection and compare several variants. The evaluation demonstrates a significant gain due to audio-visual modeling and temporal integration over multiple frames.
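
To make the labeling scheme in the abstract concrete, the following is a minimal sketch of how temporally labeled face tracks might be represented and collapsed into speaking segments. It is not the dataset's actual file format or the authors' tooling: the names `SpeechLabel`, `FaceInstance`, and `speaking_segments`, and the use of a timestamp in seconds, are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional, Tuple


class SpeechLabel(Enum):
    """Hypothetical label set: speaking or not, and whether speech is audible."""
    NOT_SPEAKING = "not_speaking"
    SPEAKING_AUDIBLE = "speaking_audible"
    SPEAKING_NOT_AUDIBLE = "speaking_not_audible"


@dataclass(frozen=True)
class FaceInstance:
    """One labeled face in one frame of a face track (assumed fields)."""
    timestamp: float  # seconds from the start of the video (assumption)
    label: SpeechLabel


def speaking_segments(track: List[FaceInstance]) -> List[Tuple[float, float]]:
    """Merge consecutive speaking instances into (start, end) time segments."""
    segments: List[Tuple[float, float]] = []
    start: Optional[float] = None  # start time of the currently open segment
    prev: Optional[float] = None   # timestamp of the previous instance
    for inst in sorted(track, key=lambda i: i.timestamp):
        speaking = inst.label is not SpeechLabel.NOT_SPEAKING
        if speaking and start is None:
            start = inst.timestamp
        elif not speaking and start is not None:
            # The segment ended at the last speaking instance.
            segments.append((start, prev))
            start = None
        prev = inst.timestamp
    if start is not None:
        # The track ended while the face was still speaking.
        segments.append((start, prev))
    return segments


example_track = [
    FaceInstance(0.0, SpeechLabel.NOT_SPEAKING),
    FaceInstance(0.1, SpeechLabel.NOT_SPEAKING),
    FaceInstance(0.2, SpeechLabel.SPEAKING_AUDIBLE),
    FaceInstance(0.3, SpeechLabel.SPEAKING_AUDIBLE),
    FaceInstance(0.4, SpeechLabel.NOT_SPEAKING),
    FaceInstance(0.5, SpeechLabel.SPEAKING_NOT_AUDIBLE),
]
example_segments = speaking_segments(example_track)
```

In this sketch both speaking labels count as "speaking"; a consumer that only cares about audible speech could instead test for `SpeechLabel.SPEAKING_AUDIBLE` alone.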

Research Areas

Machine perception
