Olivier Siohan


Research Areas

Speech Processing

Authored Publications
    Large Scale Self-Supervised Pretraining for Active Speaker Detection
    Alice Chuang
    Keith Johnson
    Tony (Tuấn) Nguyễn
    Wei Xia
    Yunfan Ye
    ICASSP 2024 (to appear)
    In this work we investigate the impact of a large-scale self-supervised pretraining strategy for active speaker detection (ASD) on an unlabeled dataset consisting of over 125k hours of YouTube videos. When compared to a baseline trained from scratch on much smaller in-domain labeled datasets, we show that with pretraining we not only obtain more stable supervised training, due to better audio-visual features used for initialization, but also improve the ASD mean average precision by 23% on a challenging dataset collected with Google Nest Hub Max devices capturing real user interactions.
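
The abstract above does not state which self-supervised objective is used for pretraining. A common choice for unlabeled audio-visual data is a contrastive loss that ties together audio and face-crop embeddings from the same clip; the sketch below illustrates that idea under this assumption, with audio_encoder and video_encoder as hypothetical stand-ins for the actual model components.

import torch
import torch.nn.functional as F

def av_contrastive_loss(audio_emb: torch.Tensor,
                        video_emb: torch.Tensor,
                        temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over clip-level audio/video embeddings of shape (batch, dim)."""
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(video_emb, dim=-1)
    logits = a @ v.t() / temperature           # (batch, batch) similarity matrix
    targets = torch.arange(a.size(0))          # matching audio/video pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Hypothetical usage on an unlabeled batch:
#   loss = av_contrastive_loss(audio_encoder(audio), video_encoder(face_crops))
# The pretrained encoders would then initialize the supervised ASD model.
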
    In streaming settings, speech recognition models have to map sub-sequences of speech to text before the full audio stream becomes available. However, since alignment information between speech and text is rarely available during training, models need to learn it in a completely self-supervised way. In practice, the exponential number of possible alignments makes this extremely challenging, with models often learning peaky or sub-optimal alignments. Prima facie, the exponential nature of the alignment space makes it difficult to even quantify the uncertainty of a model's alignment distribution. Fortunately, it has been known for decades that the entropy of a probabilistic finite-state transducer can be computed in time linear in the size of the transducer via a dynamic programming reduction based on semirings. In this work, we revisit the entropy semiring for neural speech recognition models and show how alignment entropy can be used to supervise models through regularization or distillation. We also contribute an open-source implementation of CTC and RNN-T in the semiring framework that includes numerically stable and highly parallel variants of the entropy semiring. Empirically, we observe that the addition of alignment distillation improves the accuracy and latency of an already well-optimized teacher-student distillation model, achieving state-of-the-art performance on the LibriSpeech dataset in the streaming scenario.
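
The entropy-semiring computation mentioned above can be illustrated with a small sketch: each arc weight w is lifted to the pair (w, w*log w); semiring addition adds pairs componentwise and semiring multiplication is (p1*p2, p1*r2 + p2*r1). A single forward pass over a lattice then yields (Z, sum over paths of w(path)*log w(path)), from which the entropy of the normalized path distribution is log Z - second/Z. The toy lattice below is purely illustrative; it is not the CTC/RNN-T implementation the paper contributes.

import math
from collections import defaultdict

def sr_plus(x, y):
    return (x[0] + y[0], x[1] + y[1])

def sr_times(x, y):
    return (x[0] * y[0], x[0] * y[1] + x[1] * y[0])

def lift(w):
    # Map an arc weight to its entropy-semiring value.
    return (w, w * math.log(w))

def path_entropy(arcs, start, final):
    """arcs: list of (src, dst, weight) over a DAG with topologically numbered states."""
    zero, one = (0.0, 0.0), (1.0, 0.0)
    alpha = defaultdict(lambda: zero)
    alpha[start] = one
    for src, dst, w in sorted(arcs):           # process arcs in source-state order
        alpha[dst] = sr_plus(alpha[dst], sr_times(alpha[src], lift(w)))
    Z, S = alpha[final]
    return math.log(Z) - S / Z                 # entropy of P(path) = w(path) / Z

# Two paths with unnormalized weights 0.2 and 0.6, i.e. probabilities 0.25 and 0.75.
arcs = [(0, 1, 0.5), (1, 2, 0.4), (0, 2, 0.6)]
print(path_entropy(arcs, start=0, final=2))               # ~0.5623
print(-(0.25 * math.log(0.25) + 0.75 * math.log(0.75)))   # brute-force check
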
    Under noisy conditions, automatic speech recognition (ASR) can greatly benefit from the addition of visual signals coming from a video of the speaker's face. However, when multiple candidate speakers are visible, this traditionally requires solving a separate problem, namely active speaker detection (ASD), which entails selecting at each moment in time which of the visible faces corresponds to the audio. Recent work has shown that we can solve both problems simultaneously by employing an attention mechanism over the competing video tracks of the speakers' faces, at the cost of sacrificing some accuracy on active speaker detection. This work closes the gap between speech recognition and active speaker detection accuracy by presenting a single model that can be jointly trained with a multi-task loss. By combining the two tasks during training, we reduce the gap in ASD classification accuracy by approximately 25%, while simultaneously improving the ASR performance when compared to the multi-person baseline trained exclusively for ASR.
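
A minimal sketch of the multi-task training objective described above, assuming the joint model exposes an ASR loss (e.g. RNN-T) together with per-face-track ASD logits; the function and argument names, and the mixing weight, are illustrative rather than taken from the paper.

import torch
import torch.nn.functional as F

def joint_asr_asd_loss(asr_loss: torch.Tensor,
                       asd_logits: torch.Tensor,      # (batch, num_face_tracks)
                       speaking_track: torch.Tensor,  # (batch,) index of the active face
                       asd_weight: float = 0.1) -> torch.Tensor:
    """Combine the ASR loss with an auxiliary active-speaker classification loss."""
    asd_loss = F.cross_entropy(asd_logits, speaking_track)
    return asr_loss + asd_weight * asd_loss
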
    It has been shown that learning audiovisual features can lead to improved speech recognition performance over audio-only features, especially for noisy speech. However, in many common applications the visual features are partially or entirely missing, e.g., when the speaker moves off screen. Multi-modal models need to be robust: missing video frames should not degrade the performance of an audiovisual model below that of a single-modality audio-only model. While there have been many attempts at building robust models, there is little consensus on how robustness should be evaluated. To address this, we introduce a framework that allows claims about robustness to be evaluated in a precise and testable way. We also conduct a systematic empirical study of the robustness of common audiovisual speech recognition architectures on a range of acoustic noise conditions and test suites. Finally, we show that an architecture-agnostic solution based on cascades can consistently achieve robustness to missing video, even in settings where existing techniques for robustness such as dropout fall short.
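
The architecture-agnostic cascade mentioned above can be sketched as a simple routing rule: fall back to an audio-only model whenever the video track is missing or too sparse, so the audio-visual model never has to extrapolate over absent frames. The model handles, the frame-validity mask, and the coverage threshold are assumptions for illustration.

def cascade_transcribe(audio, video_frames, frame_valid_mask,
                       av_model, audio_model, min_video_coverage=0.5):
    """Route an utterance to the AV model only if enough valid video is present."""
    if video_frames is None or not frame_valid_mask:
        return audio_model(audio)             # no video at all: audio-only fallback
    coverage = sum(frame_valid_mask) / len(frame_valid_mask)
    if coverage < min_video_coverage:
        return audio_model(audio)             # mostly missing video: audio-only fallback
    return av_model(audio, video_frames)      # audio-visual path
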
    Audio-visual automatic speech recognition (AV-ASR) extends speech recognition by introducing the video modality. In particular, the information contained in the motion of the speaker's mouth is used to augment the audio features. The video modality is traditionally processed with a 3D convolutional neural network (e.g., a 3D version of VGG). Recently, image transformer networks [Dosovitskiy et al., 2020] demonstrated the ability to extract rich visual features for image classification tasks. In this work, we propose to replace the 3D convolution with a video transformer as the video feature extractor. We train our baselines and the proposed model on a large-scale corpus of YouTube videos. We then evaluate performance on a labeled subset of YouTube as well as on the public LRS3-TED corpus. Our best video-only model achieves 34.9% WER on YTDEV18 and 19.3% on LRS3-TED, a 10% and 9% relative improvement over the convolutional baseline, respectively. After fine-tuning, our audio-visual model achieves state-of-the-art performance on LRS3-TED (1.6% WER).
    Audio-visual automatic speech recognition (AV-ASR) introduces the video modality into the speech recognition process, in particular often relying on information conveyed by the motion of the speaker's mouth. The use of the visual signal requires extracting visual features, which are then combined with the acoustic features to build an AV-ASR system [Makino et al., 2019]. This is traditionally done with some form of 3D convolutional network (e.g., VGG), as widely used in the computer vision community. Recently, video transformers [Dosovitskiy et al., 2020] have been introduced to extract visual features useful for image classification tasks. In this work, we propose to replace the 3D convolutional visual frontend typically used for AV-ASR and lip-reading tasks by a video transformer frontend. We train our systems on a large-scale dataset composed of YouTube videos and evaluate performance on the publicly available LRS3-TED set, as well as on a large set of YouTube videos. On a lip-reading task, the transformer-based frontend shows superior performance compared to a strong convolutional baseline. On an AV-ASR task, the transformer frontend performs as well as a VGG frontend for clean audio, but outperforms the VGG frontend when the audio is corrupted by noise.
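
The two abstracts above describe replacing a 3D-convolutional visual frontend with a video transformer operating on mouth-region crops. The sketch below shows one way such a frontend could look; the patch size, model width, and the use of per-frame (rather than spatio-temporal tubelet) tokens are illustrative assumptions, not the papers' exact configuration.

import torch
import torch.nn as nn

class VideoTransformerFrontend(nn.Module):
    def __init__(self, patch=16, img=64, dim=256, layers=6, heads=4):
        super().__init__()
        self.patch = patch
        self.proj = nn.Linear(3 * patch * patch, dim)               # patch embedding
        self.pos = nn.Parameter(torch.zeros(1, (img // patch) ** 2, dim))
        enc_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                               batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)

    def forward(self, video):                  # video: (B, T, 3, H, W) mouth crops
        b, t, c, h, w = video.shape
        x = video.reshape(b * t, c, h // self.patch, self.patch,
                          w // self.patch, self.patch)
        x = x.permute(0, 2, 4, 1, 3, 5).reshape(b * t, -1, c * self.patch ** 2)
        x = self.proj(x) + self.pos            # (B*T, num_patches, dim)
        x = self.encoder(x).mean(dim=1)        # pool patches into one vector per frame
        return x.reshape(b, t, -1)             # per-frame visual features for the AV-ASR encoder
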
    Streaming end-to-end automatic speech recognition (ASR) systems are widely used in everyday applications that require transcribing speech to text in real time. Their small size and minimal latency make them suitable for such tasks. Unlike their non-streaming counterparts, streaming models are constrained to be causal, with no future context. Nevertheless, non-streaming models can be used as teacher models to improve streaming ASR systems: an arbitrarily large set of unsupervised utterances is transcribed by such teacher models so that streaming models can be trained on the generated labels. However, the gap between teacher and student word error rates (WER) remains high. In this paper, we propose to reduce this gap by using a diversified set of non-streaming teacher models and combining them using Recognizer Output Voting Error Reduction (ROVER). Fusing RNN-T and CTC models yields stronger teachers, which in turn improve the performance of streaming student models. With this approach, we outperform a baseline streaming RNN-T trained from non-streaming RNN-T teachers by 27% to 42%, depending on the language.
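
ROVER combines the teachers' hypotheses by aligning them into a single word transition network and voting slot by slot. Below is a simplified sketch of that idea: each hypothesis is merged into the network via edit-distance dynamic programming, and each slot is then decided by majority vote, with "" standing for a null (deletion). Confidence scores and the exact NIST ROVER tie-breaking rules are omitted.

def align_into_wtn(wtn, hyp, num_prev):
    """wtn: list of slots, each a list of words (one per previously merged hypothesis)."""
    n, m = len(wtn), len(hyp)
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0], back[i][0] = i, "del"
    for j in range(1, m + 1):
        cost[0][j], back[0][j] = j, "ins"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = cost[i - 1][j - 1] + (0 if hyp[j - 1] in wtn[i - 1] else 1)
            delete, insert = cost[i - 1][j] + 1, cost[i][j - 1] + 1
            cost[i][j], back[i][j] = min((sub, "sub"), (delete, "del"), (insert, "ins"))
    merged, i, j = [], n, m
    while i > 0 or j > 0:
        op = back[i][j]
        if op == "sub":        # pair an existing slot with the new word
            merged.append(wtn[i - 1] + [hyp[j - 1]]); i, j = i - 1, j - 1
        elif op == "del":      # the new hypothesis skips this slot
            merged.append(wtn[i - 1] + [""]); i -= 1
        else:                  # "ins": new slot; earlier hypotheses contribute nulls
            merged.append([""] * num_prev + [hyp[j - 1]]); j -= 1
    return merged[::-1]

def rover(hypotheses):
    wtn = [[w] for w in hypotheses[0]]
    for k, hyp in enumerate(hypotheses[1:], start=1):
        wtn = align_into_wtn(wtn, hyp, num_prev=k)
    voted = [max(set(slot), key=slot.count) for slot in wtn]   # majority vote per slot
    return [w for w in voted if w]                             # drop slots won by the null

print(rover([["turn", "on", "the", "lights"],
             ["turn", "on", "lights"],
             ["turn", "in", "the", "lights"]]))                # ['turn', 'on', 'the', 'lights']
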
    This paper investigates an end-to-end modeling approach for ASR that explicitly deals with scenarios where there are overlapping speech utterances from multiple talkers. The approach assumes the availability of both audio signals and video signals in the form of continuous mouth-tracks aligned with speech for the overlapping speakers. This work extends previous work on audio-only multi-talker ASR applied to two-party conversations in a call center application. It also extends work on end-to-end audio-visual (A/V) ASR applied to A/V YouTube (YT) Confidence Island utterances. It is shown that incorporating an attention-weighted combination of visual features in A/V multi-talker RNN-T models significantly improves speaker disambiguation in ASR on overlapping speech. A 17% reduction in WER was observed for A/V multi-talker models relative to audio-only multi-talker models on a simulated A/V overlapped speech corpus.
    Audio-visual automatic speech recognition is a promising approach to robust ASR under noisy conditions. However, until recently it had traditionally been studied in isolation, assuming the video of a single speaking face matches the audio, and selecting the active speaker at inference time when multiple people are on screen was put aside as a separate problem. As an alternative, recent work has proposed to address the two problems simultaneously with an attention mechanism, baking the speaker selection problem directly into a fully differentiable model. One interesting finding was that the attention indirectly learns the association between the audio and the speaking face even though this correspondence is never explicitly provided at training time. In the present work we further investigate this connection and examine the interplay between the two problems. With experiments carried out on over 50 thousand hours of public YouTube videos as training data, we first evaluate the accuracy of the attention layer on an active speaker selection task. Secondly, we show under closer scrutiny that the end-to-end model performs at least as well as a considerably larger two-step system connected with a hard decision boundary, under various noise conditions and numbers of parallel face tracks.
    Traditionally, audio-visual automatic speech recognition has been studied under the assumption that the speaking face in the visual signal is the face matching the audio. However, in a more realistic setting, when multiple faces are potentially on screen, one needs to decide which face to feed to the A/V ASR system. The present work takes the recent progress of A/V ASR one step further and considers the scenario where multiple people are simultaneously on screen (multi-person A/V ASR). We propose a fully differentiable A/V ASR model that is able to handle multiple face tracks in a video. Instead of relying on two separate models for speaker face selection and audio-visual ASR on a single face track, we introduce an attention layer to the ASR encoder that is able to soft-select the appropriate face video track. Experiments carried out on an A/V system trained on over 30k hours of YouTube videos illustrate that the proposed approach can automatically select the proper face tracks with minor WER degradation compared to an oracle selection of the speaking face, while still showing the benefits of employing the visual signal instead of the audio alone.
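
The last three abstracts describe an attention layer that soft-selects among competing face tracks so that speaker selection and recognition live in a single differentiable model. The sketch below illustrates one plausible form of that layer, with per-frame audio features querying the visual features of each track; the dimensions and dot-product scoring are illustrative assumptions rather than the published architecture.

import torch
import torch.nn as nn

class FaceTrackAttention(nn.Module):
    def __init__(self, audio_dim=512, video_dim=256):
        super().__init__()
        self.query = nn.Linear(audio_dim, video_dim)

    def forward(self, audio_feats, track_feats):
        # audio_feats: (B, T, audio_dim); track_feats: (B, N_tracks, T, video_dim)
        q = self.query(audio_feats)                            # (B, T, video_dim)
        scores = torch.einsum('btd,bntd->bnt', q, track_feats)
        weights = scores.softmax(dim=1)                        # soft selection over the N tracks
        selected = torch.einsum('bnt,bntd->btd', weights, track_feats)
        # 'selected' is combined with the audio features downstream; 'weights'
        # doubles as a per-frame active-speaker score for the ASD task.
        return selected, weights
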