Otavio Braga

Authored Publications
    Large Scale Self-Supervised Pretraining for Active Speaker Detection
    Alice Chuang
    Keith Johnson
    Tony (Tuấn) Nguyễn
    Wei Xia
    Yunfan Ye
    ICASSP 2024 (2024) (to appear)
    In this work we investigate the impact of a large-scale self-supervised pretraining strategy for active speaker detection (ASD) on an unlabeled dataset consisting of over 125k hours of YouTube videos. Compared to a baseline trained from scratch on much smaller in-domain labeled datasets, we show that pretraining not only makes supervised training more stable, thanks to the better audio-visual features used for initialization, but also improves the ASD mean average precision by 23% on a challenging dataset collected with Google Nest Hub Max devices capturing real user interactions.
    Under noisy conditions, automatic speech recognition (ASR) can greatly benefit from the addition of visual signals coming from a video of the speaker's face. However, when multiple candidate speakers are visible, this traditionally requires solving a separate problem, namely active speaker detection (ASD), which entails selecting at each moment in time which of the visible faces corresponds to the audio. Recent work has shown that we can solve both problems simultaneously by employing an attention mechanism over the competing video tracks of the speakers' faces, at the cost of sacrificing some accuracy on active speaker detection. This work closes this gap in active speaker detection accuracy by presenting a single model that can be jointly trained with a multi-task loss. By combining the two tasks during training we improve the ASD classification accuracy by approximately 25%, while simultaneously improving the ASR performance when compared to the multi-person baseline trained exclusively for ASR.
    Audio-visual automatic speech recognition (AV-ASR) extends speech recognition by introducing the video modality. In particular, the information contained in the motion of the speaker's mouth is used to augment the audio features. The video modality is traditionally processed with a 3D convolutional neural network (e.g. a 3D version of VGG). Recently, image transformer networks~\cite{Dosovitskiy2020-nh} demonstrated the ability to extract rich visual features for the image classification task. In this work, we propose to replace the 3D convolution with a video transformer feature extractor. We train our baselines and the proposed model on a large-scale corpus of YouTube videos. We then evaluate performance on a labeled subset of YouTube as well as on the public corpus LRS3-TED. Our best video-only model achieves 34.9% WER on YTDEV18 and 19.3% on LRS3-TED, which are 10% and 9% relative improvements over the convolutional baseline. After fine-tuning our model, we achieve state-of-the-art audio-visual recognition performance on LRS3-TED (1.6% WER).
    It has been shown that learning audiovisual features can lead to improved speech recognition performance over audio-only features, especially for noisy speech. However, in many common applications, the visual features are partially or entirely missing, e.g. the speaker might move off screen. Multi-modal models need to be robust: missing video frames should not degrade the performance of an audiovisual model to be worse than that of a single-modality audio-only model. While there have been many attempts at building robust models, there is little consensus on how robustness should be evaluated. To address this, we introduce a framework that allows claims about robustness to be evaluated in a precise and testable way. We also conduct a systematic empirical study of the robustness of common audiovisual speech recognition architectures on a range of acoustic noise conditions and test suites. Finally, we show that an architecture-agnostic solution based on cascades can consistently achieve robustness to missing video, even in settings where existing techniques for robustness like dropout fall short.
    Audio-visual automatic speech recognition (AV-ASR) introduces the video modality into the speech recognition process, often relying in particular on information conveyed by the motion of the speaker's mouth. The use of the visual signal requires extracting visual features, which are then combined with the acoustic features to build an AV-ASR system~\cite{Makino2019-zd}. This is traditionally done with some form of 3D convolutional network (e.g. VGG) as widely used in the computer vision community. Recently, video transformers~\cite{Dosovitskiy2020-nh} have been introduced to extract visual features useful for image classification tasks. In this work, we propose to replace the 3D convolutional visual frontend typically used for AV-ASR and lip-reading tasks with a video transformer frontend. We train our systems on a large-scale dataset composed of YouTube videos and evaluate performance on the publicly available LRS3-TED set, as well as on a large set of YouTube videos. On a lip-reading task, the transformer-based frontend shows superior performance compared to a strong convolutional baseline. On an AV-ASR task, the transformer frontend performs as well as a VGG frontend for clean audio, but outperforms the VGG frontend when the audio is corrupted by noise.
    This paper investigates an end-to-end modeling approach for ASR that explicitly deals with scenarios where there are overlapping speech utterances from multiple talkers. The approach assumes the availability of both audio signals and video signals in the form of continuous mouth-tracks aligned with speech for overlapping speakers. This work extends previous work on audio-only multi-talker ASR applied to two-party conversations in a call center application. It also extends work on end-to-end audio-visual (A/V) ASR applied to A/V YouTube (YT) Confidence Island utterances. It is shown that incorporating an attention-weighted combination of visual features in A/V multi-talker RNNT models significantly improves speaker disambiguation in ASR on overlapping speech. A 17% reduction in WER was observed for A/V multi-talker models relative to audio-only multi-talker models on a simulated A/V overlapped speech corpus.
    Audio-visual automatic speech recognition is a promising approach to robust ASR under noisy conditions. However, up until recently it had been traditionally studied in isolation, assuming the video of a single speaking face matches the audio, and selecting the active speaker at inference time when multiple people are on screen was put aside as a separate problem. As an alternative, recent work has proposed to address the two problems simultaneously with an attention mechanism, baking the speaker selection problem directly into a fully differentiable model. One interesting finding was that the attention indirectly learns the association between the audio and the speaking face even though this correspondence is never explicitly provided at training time. In the present work we further investigate this connection and examine the interplay between the two problems. With experiments carried out over 50 thousand hours of public YouTube videos as training data, we first evaluate the accuracy of the attention layer on an active speaker selection task. Secondly, we show under closer scrutiny that the end-to-end model performs at least as well as a considerably larger two-step system connected with a hard decision boundary under various noise conditions and numbers of parallel face tracks.
    Traditionally, audio-visual automatic speech recognition has been studied under the assumption that the speaking face in the visual signal is the face matching the audio. However, in a more realistic setting, when multiple faces are potentially on screen, one needs to decide which face to feed to the A/V ASR system. The present work takes the recent progress of A/V ASR one step further and considers the scenario where multiple people are simultaneously on screen (multi-person A/V ASR). We propose a fully differentiable A/V ASR model that is able to handle multiple face tracks in a video. Instead of relying on two separate models for speaker face selection and audio-visual ASR on a single face track, we introduce an attention layer to the ASR encoder that is able to soft-select the appropriate face video track. Experiments carried out on an A/V system trained on over 30k hours of YouTube videos illustrate that the proposed approach can automatically select the proper face tracks with minor WER degradation compared to an oracle selection of the speaking face, while still showing benefits of employing the visual signal instead of the audio alone.
    Recurrent Neural Network Transducer for Audio-Visual Speech Recognition
    Basi Garcia
    Brendan Shillingford
    Yannis Assael
    Proceedings of IEEE Automatic Speech Recognition and Understanding Workshop (2019)
    This work presents a large-scale audio-visual speech recognition system based on a recurrent neural network transducer (RNN-T) architecture. To support the development of such a system, we built a large audio-visual (AV) dataset of segmented utterances extracted from YouTube public videos, leading to 31k hours of audio-visual training content. The performance of audio-only, visual-only, and audio-visual systems is compared on two large-vocabulary test sets: an internal set of YouTube utterances (YouTube-AV-Dev-18) and the publicly available LRS3-TED set. To highlight the contribution of the visual modality, we also evaluated the performance of our system on the YouTube-AV-Dev-18 set artificially corrupted with additive background noise and overlapping speech. To the best of our knowledge, our system significantly improves the state-of-the-art on the LRS3-TED set.