Olivier Siohan
Speech Processing
Research Areas
Authored Publications
Sort By
Large Scale Self-Supervised Pretraining for Active Speaker Detection
Alice Chuang
Keith Johnson
Tony (Tuấn) Nguyễn
Wei Xia
Yunfan Ye
ICASSP 2024 (2024) (to appear)
Preview abstract
In this work we investigate the impact of a large-scale self-supervised pretraining strategy for active speaker detection (ASD) on an unlabeled dataset consisting of over 125k hours of YouTube videos. When compared to a baseline trained from scratch on much smaller in-domain labeled datasets we show that with pretraining we not only have a more stable supervised training due to better audio-visual features used for initialization, but also improve the ASD mean average precision by 23\% on a challenging dataset collected with Google Nest Hub Max devices capturing real user interactions.
View details
Revisiting the Entropy Semiring for Neural Speech Recognition
International Conference on Learning Representations (2023) (to appear)
Preview abstract
In streaming settings, speech recognition models have to map sub-sequences of speech to text before the full audio stream becomes available. However, since alignment information between speech and text is rarely available during training, models need to learn it in a completely self-supervised way. In practice, the exponential number of possible alignments makes this extremely challenging, with models often learning peaky or sub-optimal alignments. Prima facie, the exponential nature of the alignment space makes it difficult to even quantify the uncertainty of a model's alignment distribution. Fortunately, it has been known for decades that the entropy of a probabilistic finite state transducer can be computed in time linear to the size of the transducer via a dynamic programming reduction based on semirings. In this work, we revisit the entropy semiring for neural speech recognition models, and show how alignment entropy can be used to supervise models through regularization or distillation. We also contribute an open-source implementation of CTC and RNN-T in the semiring framework that includes numerically stable and highly parallel variants of the entropy semiring. Empirically, we observe that the addition of alignment distillation improves the accuracy and latency of an already well-optimized teacher-student distillation model, achieving state-of-the-art performance on the Librispeech dataset in the streaming scenario.
View details
On Robustness to Missing Video For Audiovisual Speech Recognition
Dmitriy (Dima) Serdyuk
Transactions on Machine Learning Research (2022)
Preview abstract
It has been shown that learning audiovisual features can lead to improved speech recognition performance over audio-only features, especially for noisy speech. However, in many common applications, the visual features are partially or entirely missing, e.g.~the speaker might move off screen. Multi-modal models need to be robust: missing video frames should not degrade the performance of an audiovisual model to be worse than that of a single-modality audio-only model. While there have been many attempts at building robust models, there is little consensus on how robustness should be evaluated. To address this, we introduce a framework that allows claims about robustness to be evaluated in a precise and testable way. We also conduct a systematic empirical study of the robustness of common audiovisual speech recognition architectures on a range of acoustic noise conditions and test suites. Finally, we show that an architecture-agnostic solution based on cascades can consistently achieve robustness to missing video, even in settings where existing techniques for robustness like dropout fall short.
View details
Preview abstract
Under noisy conditions, automatic speech recognition (ASR) can greatly benefit from the addition of visual signals coming from a video of the speaker's face. However, when multiple candidate speakers are visible this traditionally requires solving a separate problem, namely active speaker detection (ASD), which entails selecting at each moment in time which of the visible faces corresponds to the audio. Recent work has shown that we can solve both problems simultaneously by employing an attention mechanism over the competing video tracks of the speakers' faces, at the cost of sacrificing some accuracy on active speaker detection. This work closes this gap between speech recognition and active speaker detection accuracy by presenting a single model that can be jointly trained with a multi-task loss. By combining the two tasks during training we reduce the ASD classification accuracy by approximately 25%, while simultaneously improving the ASR performance when compared to the multi-person baseline trained exclusively for ASR.
View details
Preview abstract
Audio-visual automatic speech recognition (AV-ASR) extends the speech recognition by introducing the video modality.
In particular, the information contained in the motion of the speaker's mouth is used to augment the audio features.
The video modality is traditionally processed with a 3D convolutional neural network (e.g. 3D version of VGG).
Recently, image transformer networks~\cite{Dosovitskiy2020-nh} demonstrated the ability to extract rich visual features for the image classification task.
In this work, we propose to replace the 3D convolution with a video transformer video feature extractor.
We train our baselines and the proposed model on a large scale corpus of the YouTube videos.
Then we evaluate the performance on a labeled subset of YouTube as well as on the public corpus LRS3-TED.
Our best model video-only model achieves the performance of 34.9\% WER on YTDEV18 and 19.3\% on LRS3-TED which is a 10\% and 9\% relative improvements over the convolutional baseline.
We achieve the state of the art performance of the audio-visual recognition on the LRS3-TED after fine-tuning our model (1.6\% WER).
View details
Preview abstract
Audio-visual automatic speech recognition (AV-ASR) introduces the video modality into the speech recognition
process, in particular often relying on information conveyed by the motion
of the speaker's mouth.
The use of the visual signal requires extracting visual features,
which are then combined with the acoustic features to build an AV-ASR
system~\cite{Makino2019-zd}. This is traditionally done with some form of 3D
convolution network (e.g. VGG) as widely used in the computer vision community. Recently,
video transformers~\cite{Dosovitskiy2020-nh} have been introduced to
extract visual features useful for image classification tasks.
In this work, we propose to replace the 3D convolution visual frontend
typically used for AV-ASR and lip-reading tasks by a video transformer
frontend. We train our systems on a large-scale dataset composed of
YouTube videos and evaluate performance on the publicly available
LRS3-TED set, as well as on a large set of YouTube videos. On a
lip-reading task, the transformer-based frontend shows superior
performance compared to a strong convolutional baseline. On an AV-ASR
task, the transformer frontend performs as well as a VGG frontend for
clean audio, but outperforms the VGG frontend when the audio is
corrupted by noise.
View details
Bridging the gap between streaming and non-streaming automatic speechrecognition systems through distillation of an ensemble of models
Chung-Cheng Chiu
Liangliang Cao
Ruoming Pang
Thibault Doutre
Wei Han
Interspeech'2021
Preview abstract
Streaming end-to-end automatic speech recognition (ASR) systems are widely used in everyday applications that require transcribing speech to text in real-time. Their small size and minimal latency make them suitable for such tasks. Unlike their non-streaming counterparts, streaming models are constrained to be causal with no future context. Nevertheless, non-streaming models can be used as teacher models to improve streaming ASR systems. An arbitrarily large set of unsupervised utterances is distilled from such teacher models so that streaming models can be trained using these generated labels. However, the performance gap between teacher and student world error rates (WER) remains high. In this paper, we propose to reduce this gap by using a diversified set of non-streaming teacher models and combining them using Recognizer Output Voting Error Reduction (ROVER). Fusing RNN-T and CTC models makes stronger teachers as they improve the performance of streaming student models. In this paper, we outperform a baseline streaming RNN-T trained from non-streaming RNN-T teachers by 27\% to 42\% depending on the language.
View details
End-to-end audio-visual speech recognition for overlapping speech
INTERSPEECH 2021: Conference of the International Speech Communication Association
Preview abstract
This paper investigates an end-to-end modeling approach for ASR that explicitly deals with scenarios where there are overlapping speech utterances from multiple talkers.
The approach assumes the availability of both audio signals and video signals in the form of continuous mouth-tracks aligned with speech for overlapping speakers.
This work extends previous work on audio-only multi-talker ASR applied to two party conversations in a call center application. It also extends work on end-to-end audio-visual (A/V) ASR applied to A/V YouTube (YT) Confidence Island utterances. It is shown that incorporating attention weighted combination of visual features in A/V multi-talker RNNT models significantly improves speaker disambiguation in ASR on overlapping speech. A 17% reduction in WER was observed for A/V multi-talker models relative to audio-only multi-talker models on a simulated A/V overlapped speech corpus.
View details
Preview abstract
Audio-visual automatic speech recognition is a promising ap-proach to robust ASR under noisy conditions. However, up untilrecently it had been traditionally studied in isolation assuming thevideo of a single speaking face matches the audio, and selecting theactive speaker at inference time when multiple people are on screenwas put aside as a separate problem. As an alternative, recent workhas proposed to address the two problems simultaneously with anattention mechanism, baking the speaker selection problem directlyinto a fully differentiable model. One interesting finding was thatthe attention indirectly learns the association between the audio andthe speaking face even though this correspondence is never explicitlyprovided at training time. On the present work we further investigatethis connection and examine the interplay between the two problems.With experiments carried over 50 thousand hours of public YouTubevideos as training data, we first evaluate the accuracy of the attentionlayer on an active speaker selection task. Secondly, we show undercloser scrutiny that the end-to-end model performs at least as wellas a considerably larger two-step system connected with a hard deci-sion boundary under various noise conditions and number of parallel face tracks.
View details
Preview abstract
Traditionally, audio-visual automatic speech recognition has been studied under the assumption that the speaking face on the visual signal is the face matching the audio. However, in a more realistic setting, when multiple faces are potentially on screen one needs to decide which face to feed to the A/V ASR system. The present work takes the recent progress of A/V ASR one step further and considers the scenario where multiple people are simultaneously on screen (multi-person A/V ASR). We propose a fully differentiable A/V ASR model that is able to handle multiple face tracks in a video. Instead of relying on two separate models for speaker face selection and audio-visual ASR on a single face track, we introduce an attention layer to the ASR encoder that is able to soft-select the appropriate face video track. Experiments carried out on an A/V system trained on over 30k hours of YouTube videos illustrate that the proposed approach can automatically select the proper face tracks with minor WER degradation compared to an oracle selection of the speaking face while still showing benefits of employing the visual signal instead of the audio alone.
View details