RECURRENT NEURAL NETWORK TRANSDUCER FOR AUDIO-VISUAL SPEECH RECOGNITION

Basi Garcia
Brendan Shillingford
Yannis Assael
Proceedings of IEEE Automatic Speech Recognition and Understanding Workshop (2019)

Abstract

This work presents a large-scale audio-visual speech recognition system based on a recurrent neural network transducer (RNN-T) architecture. To support the development of such a system, we built a large audio-visual (AV) dataset of segmented utterances extracted from YouTube public videos, leading to 31k hours of audio-visual training content. The performance of an audio-only, visual-only, and audio-visual system are compared on two large-vocabulary test sets: an internal set of YouTube utterances (YouTube-AV-Dev-18) and the publicly available TED-LRS3 set. To highlight the contribution of the visual modality, we also evaluated the performance of our system on the YouTube-AV-Dev-18 set artificially corrupted with additive background noise and overlapping speech. To the best of our knowledge, our system significantly improves the state-of-the-art on the TED-LRS3 set.