Zhonghua Xi
Authored Publications
AVA-ActiveSpeaker: An Audio-Visual Dataset for Active Speaker Detection
Ondrej Klejch
Radhika Marvin
Liat Kaver
Sharadh Ramaswamy
Arkadiusz Stopczynski
ICASSP, IEEE (2020)
Active speaker detection is an important component in video analysis algorithms for applications such as speaker diarization, video re-targeting for meetings, speech enhancement, and human-robot interaction. The absence of a large, carefully labeled audio-visual active speaker dataset has limited evaluation in terms of data diversity, environments, and accuracy. In this paper, we present the AVA Active Speaker detection dataset (AVA-ActiveSpeaker) which has been publicly released to facilitate algorithm development and comparison. It contains temporally labeled face tracks in videos, where each face instance is labeled as speaking or not, and whether the speech is audible. The dataset contains about 3.65 million human-labeled frames spanning 38.5 hours. We also introduce a state-of-the-art, jointly trained audio-visual model for real-time active speaker detection and compare several variants. The evaluation clearly demonstrates a significant gain due to audio-visual modeling and temporal integration over multiple frames.
AVA-Speech: A Densely Labeled Dataset of Speech Activity in Movies
Liat Kaver
Radhika Marvin
Nathan Christopher Reale
Loretta Guarino Reid
Proceedings of Interspeech, 2018
Speech activity detection (or endpointing) is an important processing step for applications such as speech recognition, language identification and speaker diarization. Both audio- and vision-based approaches have been used for this task in various settings and with multiple variations tailored toward applications. Unfortunately, much of the prior work reports results in synthetic settings, on task-specific datasets, or on datasets that are not openly available. This makes it difficult to compare approaches in similar settings and to understand their strengths and weaknesses. In this paper, we describe a new dataset of densely labeled speech activity in YouTube video clips, which has been designed to address these issues and will be released publicly. The dataset labels go beyond speech alone, annotating three specific speech activity situations: clean speech, speech and music co-occurring, and speech and noise co-occurring. These classes will enable further analysis of model performance in the presence of noise. We report benchmark performance numbers on this dataset using state-of-the-art audio and vision models.