Large Scale Self-Supervised Pretraining for Active Speaker Detection

Alice Chuang
Keith Johnson
Tony (Tuấn) Nguyễn
Wei Xia
Yunfan Ye
ICASSP 2024 (to appear)

Abstract

In this work we investigate the impact of a large-scale self-supervised pretraining strategy for active speaker detection (ASD), using an unlabeled dataset of over 125k hours of YouTube videos. Compared to a baseline trained from scratch on much smaller in-domain labeled datasets, we show that pretraining not only yields more stable supervised training, owing to the better audio-visual features used for initialization, but also improves the ASD mean average precision by 23% on a challenging dataset collected with Google Nest Hub Max devices capturing real user interactions.
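
The abstract outlines a two-stage recipe (self-supervised audio-visual pretraining, then supervised ASD fine-tuning) but does not specify the pretraining objective or model architecture. As a rough, hedged illustration only, the PyTorch sketch below pairs a generic audio-visual contrastive (InfoNCE) loss with a supervised fine-tuning stage; the encoder shapes, the InfoNCE objective, and the classification head are all illustrative assumptions, not the authors' method.

# Hypothetical sketch (not the paper's actual method): self-supervised
# audio-visual pretraining followed by supervised ASD fine-tuning.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AudioEncoder(nn.Module):
    """Toy stand-in for an audio tower (the paper's architecture is unspecified)."""
    def __init__(self, in_dim=64, emb_dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                                 nn.Linear(256, emb_dim))

    def forward(self, x):  # x: (batch, in_dim) audio features
        return self.net(x)

class VideoEncoder(nn.Module):
    """Toy stand-in for a visual (face-track) tower."""
    def __init__(self, in_dim=512, emb_dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                                 nn.Linear(256, emb_dim))

    def forward(self, x):  # x: (batch, in_dim) frame features
        return self.net(x)

def contrastive_av_loss(a_emb, v_emb, temperature=0.07):
    """Symmetric InfoNCE: matching audio/video clips in a batch are positives,
    all other pairings are negatives (an assumed objective, not the paper's)."""
    a = F.normalize(a_emb, dim=-1)
    v = F.normalize(v_emb, dim=-1)
    logits = a @ v.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(a.size(0))         # diagonal entries are positives
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

# Stage 1: self-supervised pretraining on unlabeled clips.
audio_enc, video_enc = AudioEncoder(), VideoEncoder()
opt = torch.optim.Adam(list(audio_enc.parameters()) +
                       list(video_enc.parameters()), lr=1e-4)
for _ in range(10):                           # stand-in for the unlabeled corpus
    audio = torch.randn(32, 64)               # dummy audio features
    video = torch.randn(32, 512)              # dummy face-track features
    loss = contrastive_av_loss(audio_enc(audio), video_enc(video))
    opt.zero_grad()
    loss.backward()
    opt.step()

# Stage 2: supervised ASD fine-tuning. The pretrained towers initialize the
# model; a small head predicts, per face track, whether the face is speaking.
head = nn.Linear(256, 1)
ft_opt = torch.optim.Adam(list(audio_enc.parameters()) +
                          list(video_enc.parameters()) +
                          list(head.parameters()), lr=1e-5)
for _ in range(10):
    audio, video = torch.randn(32, 64), torch.randn(32, 512)
    labels = torch.randint(0, 2, (32, 1)).float()   # 1 = speaking, 0 = not
    fused = torch.cat([audio_enc(audio), video_enc(video)], dim=-1)
    loss = F.binary_cross_entropy_with_logits(head(fused), labels)
    ft_opt.zero_grad()
    loss.backward()
    ft_opt.step()

In the paper's setting, the first loop would run over the 125k-hour unlabeled YouTube corpus and the second over the much smaller labeled in-domain data; here both stages consume random tensors purely so the sketch is runnable end to end.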
