- Basi Garcia
- Brendan Shillingford
- Hank Liao
- Olivier Siohan
- Otavio de Pinho Forin Braga
- Takaki Makino
- Yannis Assael
Abstract
This work presents a large-scale audio-visual speech recognition system based on a recurrent neural network transducer (RNN-T) architecture. To support the development of such a system, we built a large audio-visual (AV) dataset of segmented utterances extracted from public YouTube videos, leading to 31k hours of audio-visual training content. The performance of audio-only, visual-only, and audio-visual systems is compared on two large-vocabulary test sets: an internal set of YouTube utterances (YouTube-AV-Dev-18) and the publicly available TED-LRS3 set. To highlight the contribution of the visual modality, we also evaluate the performance of our system on the YouTube-AV-Dev-18 set artificially corrupted with additive background noise and overlapping speech. To the best of our knowledge, our system significantly improves the state of the art on the TED-LRS3 set.
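The sketch below illustrates, in broad strokes, what an audio-visual RNN-T looks like: per-frame audio and visual features are fused and fed to an encoder, while a prediction network conditions on previously emitted labels and a joint network scores every (time, label) pair. This is a minimal illustration under assumed dimensions and early feature fusion, not the authors' implementation; all layer sizes and the `AudioVisualRNNT` class are hypothetical.

```python
# Minimal sketch (not the paper's code) of an audio-visual RNN-T, assuming
# audio features (e.g., log-mel) and visual features (e.g., mouth-region
# embeddings) are already synchronized at the same frame rate.
import torch
import torch.nn as nn


class AudioVisualRNNT(nn.Module):
    def __init__(self, audio_dim=80, visual_dim=512, hidden=640, vocab=1000):
        super().__init__()
        # Encoder ("transcription network"): consumes fused audio+visual frames.
        self.encoder = nn.LSTM(audio_dim + visual_dim, hidden, num_layers=2,
                               batch_first=True)
        # Prediction network: autoregressive over previously emitted labels.
        self.embed = nn.Embedding(vocab, hidden)
        self.predictor = nn.LSTM(hidden, hidden, batch_first=True)
        # Joint network: combines encoder and predictor states per (t, u) pair.
        self.joint = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.Tanh(),
                                   nn.Linear(hidden, vocab))

    def forward(self, audio, visual, labels):
        # audio: (B, T, audio_dim), visual: (B, T, visual_dim), labels: (B, U)
        fused = torch.cat([audio, visual], dim=-1)    # early (feature) fusion
        enc, _ = self.encoder(fused)                  # (B, T, hidden)
        pred, _ = self.predictor(self.embed(labels))  # (B, U, hidden)
        # Broadcast encoder/predictor states over the (T, U) lattice that an
        # RNN-T loss would marginalize over (a real system also handles the
        # blank symbol and prepends a start token to the labels).
        t, u = enc.unsqueeze(2), pred.unsqueeze(1)    # (B,T,1,H), (B,1,U,H)
        joint = torch.cat([t.expand(-1, -1, u.size(2), -1),
                           u.expand(-1, t.size(1), -1, -1)], dim=-1)
        return self.joint(joint)                      # (B, T, U, vocab) logits


model = AudioVisualRNNT()
logits = model(torch.randn(2, 50, 80), torch.randn(2, 50, 512),
               torch.randint(0, 1000, (2, 12)))
print(logits.shape)  # torch.Size([2, 50, 12, 1000])
```

In a complete system the logits over the (T, U) lattice would be trained with an RNN-T loss, and the visual stream could be dropped (visual-only or audio-only variants) to reproduce the comparisons described in the abstract.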