A CLOSER LOOK AT AUDIO-VISUAL MULTI-PERSON SPEECH RECOGNITION AND ACTIVE SPEAKER SELECTION
Abstract
Audio-visual automatic speech recognition is a promising approach to robust ASR under noisy conditions. However, until recently it had traditionally been studied in isolation, assuming that the video of a single speaking face matches the audio, while selecting the active speaker at inference time when multiple people are on screen was set aside as a separate problem. As an alternative, recent work has proposed to address the two problems simultaneously with an attention mechanism, baking the speaker selection problem directly into a fully differentiable model. One interesting finding was that the attention indirectly learns the association between the audio and the speaking face, even though this correspondence is never explicitly provided at training time. In the present work we further investigate this connection and examine the interplay between the two problems. With experiments carried out over 50 thousand hours of public YouTube videos as training data, we first evaluate the accuracy of the attention layer on an active speaker selection task. Secondly, we show under closer scrutiny that the end-to-end model performs at least as well as a considerably larger two-step system connected by a hard decision boundary, under various noise conditions and numbers of parallel face tracks.
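To make the attention-based soft speaker selection described above concrete, the following is a minimal illustrative sketch (not the paper's implementation): the audio embedding acts as the query and per-face-track visual embeddings act as keys/values, so the softmax weights serve as speaker-selection probabilities and the weighted visual feature is what would be fused with the audio for recognition. All function names, dimensions, and projections here are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend_to_speaking_face(audio_emb, face_embs, proj_q, proj_k):
    """audio_emb: (d_a,); face_embs: (num_tracks, d_v).

    Returns the attention weights over parallel face tracks and the
    soft-selected visual feature that would be fused with the audio.
    """
    query = proj_q @ audio_emb                    # project audio to shared space, (d,)
    keys = face_embs @ proj_k.T                   # project each face track, (num_tracks, d)
    scores = keys @ query / np.sqrt(len(query))   # scaled dot-product scores
    weights = softmax(scores)                     # soft active-speaker probabilities
    selected = weights @ face_embs                # weighted sum of face features, (d_v,)
    return weights, selected

# Toy usage: 3 parallel face tracks with random embeddings and projections.
rng = np.random.default_rng(0)
d_a, d_v, d, n_tracks = 64, 128, 32, 3
weights, visual_feat = attend_to_speaking_face(
    rng.normal(size=d_a),
    rng.normal(size=(n_tracks, d_v)),
    rng.normal(size=(d, d_a)),
    rng.normal(size=(d, d_v)),
)
print(weights)  # argmax gives the predicted active speaker track
```

Because the selection is a differentiable soft weighting rather than a hard argmax, the whole pipeline can be trained end to end from the recognition loss alone, which is how the association between audio and the speaking face can emerge without explicit speaker labels.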