Olivier Siohan
Research Areas
Speech Processing
Authored Publications
Large Scale Self-Supervised Pretraining for Active Speaker Detection
Alice Chuang
Keith Johnson
Tony (Tuấn) Nguyễn
Wei Xia
Yunfan Ye
ICASSP 2024 (2024) (to appear)
Abstract
In this work we investigate the impact of a large-scale self-supervised pretraining strategy for active speaker detection (ASD) on an unlabeled dataset consisting of over 125k hours of YouTube videos. When compared to a baseline trained from scratch on much smaller in-domain labeled datasets, we show that with pretraining we not only obtain more stable supervised training, thanks to better audio-visual features used for initialization, but also improve the ASD mean average precision by 23% on a challenging dataset collected with Google Nest Hub Max devices capturing real user interactions.
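To make the two-stage recipe in this abstract concrete, here is a minimal NumPy sketch of pretraining an audio-visual encoder on unlabeled clips and then fine-tuning an active-speaker head on a small labeled set. The encoder, the agreement-style pretext objective, and all shapes are illustrative assumptions made for this listing, not the paper's model or training code.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(clip, W):
    """Toy audio-visual encoder: mean-pool the clip frames and project (illustrative only)."""
    return np.tanh(clip.mean(axis=0) @ W)

# --- Stage 1: self-supervised pretraining on unlabeled clips ---
# Stand-in pretext task: two halves of the same clip should map to similar embeddings.
# (The real system's pretraining objective and features are not reproduced here.)
W = rng.normal(scale=0.1, size=(32, 16))
for _ in range(200):
    clip = rng.normal(size=(50, 32))                  # 50 frames of 32-dim A/V features
    ma, mb = clip[:25].mean(axis=0), clip[25:].mean(axis=0)
    za, zb = np.tanh(ma @ W), np.tanh(mb @ W)
    diff = za - zb                                    # minimize ||za - zb||^2 w.r.t. W
    grad = np.outer(ma, diff * (1 - za**2)) - np.outer(mb, diff * (1 - zb**2))
    W -= 0.05 * grad

# --- Stage 2: supervised fine-tuning of an ASD head on a small labeled set ---
v = np.zeros(16)                                      # linear active-speaker-detection head
for _ in range(200):
    clip, label = rng.normal(size=(50, 32)), rng.integers(0, 2)  # 1 = visible face is speaking
    z = encode(clip, W)                               # pretrained encoder as initialization
    p = 1.0 / (1.0 + np.exp(-z @ v))
    v -= 0.1 * (p - label) * z                        # logistic-regression update on the head

print("fine-tuned ASD head norm:", round(float(np.linalg.norm(v)), 3))
```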
Revisiting the Entropy Semiring for Neural Speech Recognition
International Conference on Learning Representations (2023) (to appear)
Abstract
In streaming settings, speech recognition models have to map sub-sequences of speech to text before the full audio stream becomes available. However, since alignment information between speech and text is rarely available during training, models need to learn it in a completely self-supervised way. In practice, the exponential number of possible alignments makes this extremely challenging, with models often learning peaky or sub-optimal alignments. Prima facie, the exponential nature of the alignment space makes it difficult to even quantify the uncertainty of a model's alignment distribution. Fortunately, it has been known for decades that the entropy of a probabilistic finite state transducer can be computed in time linear to the size of the transducer via a dynamic programming reduction based on semirings. In this work, we revisit the entropy semiring for neural speech recognition models, and show how alignment entropy can be used to supervise models through regularization or distillation. We also contribute an open-source implementation of CTC and RNN-T in the semiring framework that includes numerically stable and highly parallel variants of the entropy semiring. Empirically, we observe that the addition of alignment distillation improves the accuracy and latency of an already well-optimized teacher-student distillation model, achieving state-of-the-art performance on the Librispeech dataset in the streaming scenario.
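For readers unfamiliar with the entropy semiring mentioned here, the toy below shows the core computation on a tiny acyclic lattice: each arc weight w is lifted to the pair <w, -w log w>, pairs are combined with the semiring's plus and times during a forward pass, and the entropy of the normalized path distribution is recovered as log Z + x/Z. This is a hand-rolled illustration, not the paper's open-source CTC/RNN-T implementation.

```python
import math

# Entropy semiring elements are pairs <w, x>. With each arc weight w mapped to
# <w, -w*log(w)>, the forward (shortest-distance) computation over the lattice
# yields <Z, sum over paths of -p(path)*log p(path)> for unnormalized path
# weights p, from which the entropy of the normalized distribution is log Z + x/Z.

def splus(a, b):   # semiring addition
    return (a[0] + b[0], a[1] + b[1])

def stimes(a, b):  # semiring multiplication
    return (a[0] * b[0], a[0] * b[1] + a[1] * b[0])

def lift(w):       # map a plain arc weight into the entropy semiring
    return (w, -w * math.log(w))

# A tiny acyclic lattice: arcs as (src, dst, weight); node 0 is initial, node 3 is final.
arcs = [(0, 1, 0.6), (0, 2, 0.4), (1, 3, 0.5), (1, 3, 0.5), (2, 3, 1.0)]

ZERO, ONE = (0.0, 0.0), (1.0, 0.0)
alpha = {0: ONE, 1: ZERO, 2: ZERO, 3: ZERO}   # forward values, nodes in topological order
for src, dst, w in arcs:
    alpha[dst] = splus(alpha[dst], stimes(alpha[src], lift(w)))

Z, x = alpha[3]
entropy = math.log(Z) + x / Z
print(f"partition function Z = {Z:.3f}, alignment entropy = {entropy:.3f} nats")
```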
Abstract
Under noisy conditions, automatic speech recognition (ASR) can greatly benefit from the addition of visual signals coming from a video of the speaker's face. However, when multiple candidate speakers are visible this traditionally requires solving a separate problem, namely active speaker detection (ASD), which entails selecting at each moment in time which of the visible faces corresponds to the audio. Recent work has shown that we can solve both problems simultaneously by employing an attention mechanism over the competing video tracks of the speakers' faces, at the cost of sacrificing some accuracy on active speaker detection. This work closes this gap in active speaker detection accuracy by presenting a single model that can be jointly trained with a multi-task loss. By combining the two tasks during training, we reduce the gap in ASD classification accuracy by approximately 25%, while simultaneously improving the ASR performance when compared to the multi-person baseline trained exclusively for ASR.
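As a rough picture of the multi-task objective described above, the sketch below combines a stand-in ASR loss with an auxiliary cross-entropy on the attention weights over candidate face tracks. The loss values, the weighting factor, and the use of a plain cross-entropy for the ASD term are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Toy quantities standing in for real model outputs (all values are illustrative).
num_faces = 4
attention_logits = rng.normal(size=num_faces)   # scores over candidate face tracks
speaker_id = 2                                  # which track is actually speaking
asr_loss = 1.7                                  # stand-in for the ASR (e.g. transducer) loss

# Auxiliary ASD term: cross-entropy between the attention distribution over face
# tracks and the ground-truth speaking face, one plausible multi-task formulation.
att = softmax(attention_logits)
asd_loss = -np.log(att[speaker_id])

# Joint objective: lambda balances recognition against speaker-detection accuracy.
lam = 0.1
total_loss = asr_loss + lam * asd_loss
print(f"ASD loss = {asd_loss:.3f}, total multi-task loss = {total_loss:.3f}")
```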
On Robustness to Missing Video For Audiovisual Speech Recognition
Dmitriy (Dima) Serdyuk
Transactions on Machine Learning Research (2022)
Abstract
It has been shown that learning audiovisual features can lead to improved speech recognition performance over audio-only features, especially for noisy speech. However, in many common applications, the visual features are partially or entirely missing, e.g., the speaker might move off screen. Multi-modal models need to be robust: missing video frames should not degrade the performance of an audiovisual model to be worse than that of a single-modality audio-only model. While there have been many attempts at building robust models, there is little consensus on how robustness should be evaluated. To address this, we introduce a framework that allows claims about robustness to be evaluated in a precise and testable way. We also conduct a systematic empirical study of the robustness of common audiovisual speech recognition architectures on a range of acoustic noise conditions and test suites. Finally, we show that an architecture-agnostic solution based on cascades can consistently achieve robustness to missing video, even in settings where existing techniques for robustness like dropout fall short.
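The cascade idea can be pictured as routing each segment to an audio-visual model when video is present and falling back to an audio-only model otherwise, so missing video cannot drag the system below the audio-only baseline. The sketch below is an assumed minimal form of such a cascade; the stand-in recognizers simply return dummy scores.

```python
import numpy as np

def audio_only_asr(audio):
    """Stand-in audio-only recognizer (returns a dummy score here)."""
    return float(np.mean(audio))

def audio_visual_asr(audio, video):
    """Stand-in audio-visual recognizer."""
    return float(np.mean(audio) + 0.1 * np.mean(video))

def cascade(audio, video_frames):
    """Route to the AV model only when video frames are actually present, so
    missing video never makes the result worse than the audio-only path."""
    have_video = video_frames is not None and len(video_frames) > 0
    if have_video:
        return audio_visual_asr(audio, video_frames)
    return audio_only_asr(audio)

audio = np.random.randn(16000)                        # 1 s of fake audio
print(cascade(audio, np.random.randn(25, 64, 64)))    # video present -> AV model
print(cascade(audio, None))                           # speaker off screen -> audio-only model
```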
Abstract
Audio-visual automatic speech recognition (AV-ASR) extends speech recognition by introducing the video modality. In particular, the information contained in the motion of the speaker's mouth is used to augment the audio features. The video modality is traditionally processed with a 3D convolutional neural network (e.g., a 3D version of VGG). Recently, image transformer networks (Dosovitskiy et al., 2020) demonstrated the ability to extract rich visual features for the image classification task. In this work, we propose to replace the 3D convolution with a video transformer as the video feature extractor. We train our baselines and the proposed model on a large-scale corpus of YouTube videos, and then evaluate performance on a labeled subset of YouTube as well as on the public LRS3-TED corpus. Our best video-only model achieves 34.9% WER on YTDEV18 and 19.3% WER on LRS3-TED, a 10% and 9% relative improvement over the convolutional baseline. After fine-tuning our model, we achieve state-of-the-art audio-visual recognition performance on LRS3-TED (1.6% WER).
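To illustrate the architectural swap, the sketch below contrasts a 3D-convolution style frontend with a ViT-style frontend that flattens spatio-temporal patches into tokens and mixes them with a single self-attention layer. Frame counts, patch sizes, and projection dimensions are made up for the example and do not reflect the actual models.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv3d_frontend(video, kernel):
    """Sketch of a 3D-convolution frontend: one valid-mode 3-D correlation."""
    T, H, W = video.shape
    t, h, w = kernel.shape
    out = np.zeros((T - t + 1, H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for k in range(out.shape[2]):
                out[i, j, k] = np.sum(video[i:i+t, j:j+h, k:k+w] * kernel)
    return out

def transformer_frontend(video, patch=8, dim=32):
    """ViT-style frontend: split each frame into patches, project them to tokens,
    and mix the tokens with one self-attention layer (single head, no positions)."""
    T, H, W = video.shape
    tokens = video.reshape(T, H // patch, patch, W // patch, patch)
    tokens = tokens.transpose(0, 1, 3, 2, 4).reshape(-1, patch * patch)
    proj = rng.normal(scale=0.02, size=(patch * patch, dim))
    x = tokens @ proj                                  # (num_tokens, dim)
    att = np.exp(x @ x.T / np.sqrt(dim))
    att /= att.sum(axis=1, keepdims=True)
    return att @ x                                     # contextualized visual tokens

video = rng.normal(size=(16, 64, 64))                  # 16 frames of a mouth-region crop
print(conv3d_frontend(video, rng.normal(size=(3, 3, 3))).shape)
print(transformer_frontend(video).shape)
```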
Abstract
Audio-visual automatic speech recognition (AV-ASR) introduces the video modality into the speech recognition process, in particular often relying on information conveyed by the motion of the speaker's mouth. The use of the visual signal requires extracting visual features, which are then combined with the acoustic features to build an AV-ASR system (Makino et al., 2019). This is traditionally done with some form of 3D convolutional network (e.g., VGG), as widely used in the computer vision community. Recently, video transformers (Dosovitskiy et al., 2020) have been introduced to extract visual features useful for image classification tasks. In this work, we propose to replace the 3D convolutional visual frontend typically used for AV-ASR and lip-reading tasks with a video transformer frontend. We train our systems on a large-scale dataset composed of YouTube videos and evaluate performance on the publicly available LRS3-TED set, as well as on a large set of YouTube videos. On a lip-reading task, the transformer-based frontend shows superior performance compared to a strong convolutional baseline. On an AV-ASR task, the transformer frontend performs as well as a VGG frontend for clean audio, but outperforms the VGG frontend when the audio is corrupted by noise.
End-to-end audio-visual speech recognition for overlapping speech
INTERSPEECH 2021: Conference of the International Speech Communication Association
Abstract
This paper investigates an end-to-end modeling approach for ASR that explicitly deals with scenarios where there are overlapping speech utterances from multiple talkers. The approach assumes the availability of both audio signals and video signals in the form of continuous mouth-tracks aligned with speech for overlapping speakers. This work extends previous work on audio-only multi-talker ASR applied to two-party conversations in a call center application. It also extends work on end-to-end audio-visual (A/V) ASR applied to A/V YouTube (YT) Confidence Island utterances. It is shown that incorporating an attention-weighted combination of visual features in A/V multi-talker RNN-T models significantly improves speaker disambiguation in ASR on overlapping speech. A 17% reduction in WER was observed for A/V multi-talker models relative to audio-only multi-talker models on a simulated A/V overlapped speech corpus.
Abstract
Audio-visual automatic speech recognition is a promising approach to robust ASR under noisy conditions. However, up until recently it had been traditionally studied in isolation, assuming the video of a single speaking face matches the audio, and selecting the active speaker at inference time when multiple people are on screen was put aside as a separate problem. As an alternative, recent work has proposed to address the two problems simultaneously with an attention mechanism, baking the speaker selection problem directly into a fully differentiable model. One interesting finding was that the attention indirectly learns the association between the audio and the speaking face even though this correspondence is never explicitly provided at training time. In the present work we further investigate this connection and examine the interplay between the two problems. With experiments carried out over 50 thousand hours of public YouTube videos as training data, we first evaluate the accuracy of the attention layer on an active speaker selection task. Secondly, we show under closer scrutiny that the end-to-end model performs at least as well as a considerably larger two-step system connected with a hard decision boundary under various noise conditions and numbers of parallel face tracks.
Bridging the gap between streaming and non-streaming automatic speech recognition systems through distillation of an ensemble of models
Chung-Cheng Chiu
Liangliang Cao
Ruoming Pang
Thibault Doutre
Wei Han
Interspeech (2021)
Abstract
Streaming end-to-end automatic speech recognition (ASR) systems are widely used in everyday applications that require transcribing speech to text in real time. Their small size and minimal latency make them suitable for such tasks. Unlike their non-streaming counterparts, streaming models are constrained to be causal with no future context. Nevertheless, non-streaming models can be used as teacher models to improve streaming ASR systems. An arbitrarily large set of unsupervised utterances is transcribed by such teacher models so that streaming models can be trained using these generated labels. However, the performance gap between teacher and student word error rates (WER) remains high. In this paper, we propose to reduce this gap by using a diversified set of non-streaming teacher models and combining them using Recognizer Output Voting Error Reduction (ROVER). Fusing RNN-T and CTC models makes stronger teachers, as they improve the performance of streaming student models. With this ensemble, we outperform a baseline streaming RNN-T trained from non-streaming RNN-T teachers by 27% to 42% depending on the language.
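The ROVER step can be pictured as word-level voting among aligned teacher hypotheses, with the winning transcript serving as the pseudo-label for the streaming student. The toy below votes slot-by-slot over transcripts that are assumed to be pre-aligned; real ROVER first builds a word transition network via dynamic-programming alignment.

```python
from collections import Counter

def rover_vote(aligned_hyps):
    """Majority vote per slot over hypotheses that are already word-aligned
    ('' marks a deletion). Real ROVER aligns the hypotheses first; that step
    is skipped here for brevity."""
    combined = []
    for slot in zip(*aligned_hyps):
        word, _ = Counter(slot).most_common(1)[0]
        if word:                        # drop slots where the vote is a deletion
            combined.append(word)
    return combined

# Hypotheses from three non-streaming teacher models (e.g. RNN-T and CTC variants).
teachers = [
    ["play", "some", "jazz", "music"],
    ["play", "sum",  "jazz", "music"],
    ["play", "some", "jazz", ""],
]
pseudo_label = rover_vote(teachers)
print(pseudo_label)   # ['play', 'some', 'jazz', 'music'] -> label for the streaming student
```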
Abstract
Traditionally, audio-visual automatic speech recognition has been studied under the assumption that the speaking face on the visual signal is the face matching the audio. However, in a more realistic setting, when multiple faces are potentially on screen one needs to decide which face to feed to the A/V ASR system. The present work takes the recent progress of A/V ASR one step further and considers the scenario where multiple people are simultaneously on screen (multi-person A/V ASR). We propose a fully differentiable A/V ASR model that is able to handle multiple face tracks in a video. Instead of relying on two separate models for speaker face selection and audio-visual ASR on a single face track, we introduce an attention layer to the ASR encoder that is able to soft-select the appropriate face video track. Experiments carried out on an A/V system trained on over 30k hours of YouTube videos illustrate that the proposed approach can automatically select the proper face tracks with minor WER degradation compared to an oracle selection of the speaking face while still showing benefits of employing the visual signal instead of the audio alone.
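One plausible form of the attention layer described here is audio-conditioned attention over the candidate face tracks, whose per-frame weights blend the visual features fed to the AV-ASR encoder. The projection sizes and scoring function below are assumptions for illustration, not the published model.

```python
import numpy as np

rng = np.random.default_rng(0)

def soft_select_faces(audio_feats, face_tracks):
    """Soft-select among competing face tracks with audio-conditioned attention.
    audio_feats: (T, Da) per-frame audio features.
    face_tracks: (N, T, Dv) visual features for N candidate faces.
    Returns per-frame attention over faces and the blended visual features."""
    N, T, Dv = face_tracks.shape
    Da = audio_feats.shape[1]
    Wq = rng.normal(scale=0.1, size=(Da, 16))    # illustrative projection weights
    Wk = rng.normal(scale=0.1, size=(Dv, 16))
    q = audio_feats @ Wq                         # (T, 16)
    k = face_tracks @ Wk                         # (N, T, 16)
    scores = np.einsum("td,ntd->nt", q, k) / np.sqrt(16)
    att = np.exp(scores - scores.max(axis=0))
    att /= att.sum(axis=0)                       # per-frame distribution over the N faces
    blended = np.einsum("nt,ntd->td", att, face_tracks)
    return att, blended                          # blended video feeds the AV-ASR encoder

audio = rng.normal(size=(100, 40))               # 100 frames of audio features
faces = rng.normal(size=(3, 100, 64))            # 3 candidate face tracks
att, blended = soft_select_faces(audio, faces)
print(att.shape, blended.shape)                  # (3, 100) (100, 64)
```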
Recurrent Neural Network Transducer for Audio-Visual Speech Recognition
Basi Garcia
Brendan Shillingford
Yannis Assael
Proceedings of IEEE Automatic Speech Recognition and Understanding Workshop (2019)
Abstract
This work presents a large-scale audio-visual speech recognition system based on a recurrent neural network transducer (RNN-T) architecture. To support the development of such a system, we built a large audio-visual (AV) dataset of segmented utterances extracted from public YouTube videos, leading to 31k hours of audio-visual training content. The performance of audio-only, visual-only, and audio-visual systems is compared on two large-vocabulary test sets: an internal set of YouTube utterances (YouTube-AV-Dev-18) and the publicly available LRS3-TED set. To highlight the contribution of the visual modality, we also evaluated the performance of our system on the YouTube-AV-Dev-18 set artificially corrupted with additive background noise and overlapping speech. To the best of our knowledge, our system significantly improves the state of the art on the LRS3-TED set.
Acoustic Modeling for Google Home
Joe Caroselli
Kean Chin
Chanwoo Kim
Mitchel Weintraub
Erik McDermott
INTERSPEECH 2017 (2017)
Abstract
This paper describes the technical and system-building advances made to the Google Home multichannel speech recognition system, which was launched in November 2016. Technical advances include an adaptive dereverberation frontend, the use of neural network models that perform multichannel processing jointly with acoustic modeling, and grid LSTMs to model frequency variations. On the system level, improvements include adapting the model using Google Home specific data. We present results on a variety of multichannel test sets. The combination of technical and system advances results in a WER reduction of over 18% relative compared to the current production system.
Automatic Optimization of Data Perturbation Distributions for Multi-Style Training in Speech Recognition
Mortaza Doulaty
Proceedings of the IEEE 2016 Workshop on Spoken Language Technology (SLT2016)
Abstract
Speech recognition performance using deep neural network based acoustic models is known to degrade when the acoustic environment and the speaker population in the target utterances are significantly different from the conditions represented in the training data. To address these mismatched scenarios, multi-style training (MTR) has been used to perturb utterances in an existing uncorrupted and potentially mismatched training speech corpus to better match target domain utterances. This paper addresses the problem of determining the distribution of perturbation levels for a given set of perturbation types that best matches the target speech utterances. An approach is presented that, given a small set of utterances from a target domain, automatically identifies an empirical distribution of perturbation levels that can be applied to utterances in an existing training set.
Distributions are estimated for perturbation types that include acoustic background environments, reverberant room configurations, and speaker related variation like frequency and temporal warping.
The end goal is for the resulting perturbed training set to characterize the variability in the target domain and thereby optimize ASR performance. An experimental study is performed to evaluate the impact of this approach on ASR performance when the target utterances are taken from a simulated far-field acoustic environment.
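One way to picture the outcome of the method: once an empirical distribution over perturbation levels has been estimated from the small target-domain sample, each training utterance is corrupted at a level drawn from that distribution. The SNR grid, probabilities, and noise model below are made-up placeholders, not estimates from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Empirical distribution over SNR perturbation levels, as it might be estimated
# from a small set of target-domain utterances (values are illustrative only).
snr_levels_db = np.array([0, 5, 10, 15, 20])
level_probs   = np.array([0.05, 0.15, 0.30, 0.30, 0.20])

def perturb(utterance, noise):
    """Mix in noise at an SNR sampled from the estimated level distribution."""
    snr_db = rng.choice(snr_levels_db, p=level_probs)
    speech_pow = np.mean(utterance ** 2)
    noise_pow = np.mean(noise ** 2)
    scale = np.sqrt(speech_pow / (noise_pow * 10 ** (snr_db / 10)))
    return utterance + scale * noise[: len(utterance)], snr_db

clean = rng.normal(size=16000)     # stand-in for a clean training utterance
noise = rng.normal(size=16000)     # stand-in for a background-noise sample
noisy, snr = perturb(clean, noise)
print(f"perturbed at {snr} dB SNR, rms = {np.sqrt(np.mean(noisy**2)):.2f}")
```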
Abstract
While research has often shown that building dialect-specific automatic speech recognizers is the optimal approach to dealing with dialectal variations of the same language, we have observed that dialect-specific recognizers do not always output the best recognitions. Often enough, another dialectal recognizer outputs a better recognition than the dialect-specific one. In this paper, we present two methods to select and combine the best decoded hypothesis from a pool of dialectal recognizers. We follow a machine learning approach, extracting features from the speech recognition output along with word embeddings, and use shallow neural networks for classification. Our experiments using Dictation and Voice Search data from the four main Arabic dialects show good WER improvements for the hypothesis selection scheme, reducing the WER by 2.1% to 12.1% depending on the test set, and promising results for the hypothesis combination scheme.
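A minimal sketch of the hypothesis-selection idea: extract per-recognizer features from the ASR outputs (for example confidence, hypothesis length, and an embedding-based score) and train a shallow network to pick which dialectal recognizer to trust. The synthetic features, the feature set, and the network size below are illustrative assumptions, not the paper's exact setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Each example: features from the outputs of 4 dialectal recognizers (e.g. confidence,
# hypothesis length, a pooled word-embedding score), flattened, plus a label saying
# which recognizer produced the best hypothesis. Here the data is purely synthetic.
num_recognizers, feats_per_hyp = 4, 3
X = rng.normal(size=(500, num_recognizers * feats_per_hyp))
y = rng.integers(0, num_recognizers, size=500)

# Shallow network: one hidden layer, softmax over the 4 recognizers.
W1 = rng.normal(scale=0.1, size=(X.shape[1], 16))
W2 = rng.normal(scale=0.1, size=(16, num_recognizers))
for _ in range(300):
    h = np.tanh(X @ W1)
    p = softmax(h @ W2)
    grad_logits = p.copy()
    grad_logits[np.arange(len(y)), y] -= 1.0           # d(cross-entropy)/d(logits)
    grad_h = (grad_logits @ W2.T) * (1 - h ** 2)
    W2 -= 0.05 * h.T @ grad_logits / len(y)
    W1 -= 0.05 * X.T @ grad_h / len(y)

picked = softmax(np.tanh(X @ W1) @ W2).argmax(axis=1)
print("selection accuracy on the toy training set:", float(np.mean(picked == y)))
```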
Large Vocabulary Automatic Speech Recognition for Children
Melissa Carroll
Noah Coccaro
Qi-Ming Jiang
Interspeech (2015)
Abstract
Recently, Google launched YouTube Kids, a mobile application for children that uses a speech recognizer built specifically for recognizing children's speech. In this paper we present techniques we explored to build such a system. We describe the use of a neural network classifier to identify matched acoustic training data, and the filtering of language modeling data to reduce the chance of producing offensive results. We also compare long short-term memory (LSTM) recurrent networks to convolutional, LSTM, deep neural networks (CLDNNs). We found that a CLDNN acoustic model outperforms an LSTM across a variety of different conditions, but does not model child speech better than adult speech specifically. Overall, these findings allow us to build a successful, state-of-the-art large vocabulary speech recognizer for both children and adults.
Multitask learning and system combination for automatic speech recognition
2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU)
A big data approach to acoustic model training corpus selection
John Alex
Conference of the International Speech Communication Association (Interspeech) (2014)
Abstract
Deep neural networks (DNNs) have recently become the state-of-the-art technology in speech recognition systems. In this paper we propose a new approach to constructing large, high-quality unsupervised sets to train DNN models for large vocabulary speech recognition. The core of our technique consists of two steps. We first redecode speech logged by our production recognizer with a very accurate (and hence too slow for real-time usage) set of speech models to improve the quality of ground truth transcripts used for training alignments. Using confidence scores, transcript length and transcript flattening heuristics designed to cull salient utterances from three decades of speech per language, we then carefully select training data sets consisting of up to 15K hours of speech to be used to train acoustic models without any reliance on manual transcription. We show that this approach yields models with approximately 18K context-dependent states that achieve a 10% relative improvement in large vocabulary dictation and voice-search systems for Brazilian Portuguese, French, Italian and Russian languages.
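The selection step can be imagined as a filter over redecoded utterances using confidence and transcript-length thresholds, followed by "flattening" that caps how often the same transcript is kept so frequent queries do not dominate. The field names and thresholds below are illustrative, not the production criteria.

```python
from collections import Counter

def select_training_utterances(utterances, min_conf=0.9, min_words=2, max_per_transcript=100):
    """Filter redecoded utterances for semi-supervised acoustic model training.
    Each utterance is a dict with 'confidence' and 'transcript' fields (assumed schema).
    Flattening caps how many copies of any one transcript survive."""
    kept, counts = [], Counter()
    for utt in utterances:
        words = utt["transcript"].split()
        if utt["confidence"] < min_conf or len(words) < min_words:
            continue
        if counts[utt["transcript"]] >= max_per_transcript:
            continue            # flattening: this transcript is already represented enough
        counts[utt["transcript"]] += 1
        kept.append(utt)
    return kept

pool = [
    {"transcript": "weather tomorrow", "confidence": 0.97},
    {"transcript": "weather tomorrow", "confidence": 0.95},
    {"transcript": "uh", "confidence": 0.99},                  # too short
    {"transcript": "call mom", "confidence": 0.42},            # low confidence
]
print(len(select_training_utterances(pool, max_per_transcript=1)))   # -> 1
```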
Training Data Selection Based On Context-Dependent State Matching
Proceedings of ICASSP 2014
Abstract
In this paper we construct a data set for semi-supervised acoustic model training by selecting spoken utterances from a massive collection of anonymized Google Voice Search utterances. Semi-supervised training usually retains high-confidence utterances which are presumed to have an accurate hypothesized transcript, a necessary condition for successful training. Selecting high confidence utterances can however restrict the diversity of the resulting data set. We propose to introduce a constraint enforcing that the distribution of the context-dependent state symbols obtained by running forced alignment of the hypothesized transcript matches a reference distribution estimated from a curated development set. The quality of the obtained training set is illustrated on large scale Voice Search recognition experiments and outperforms random selection of high-confidence utterances.
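One way to realize the distribution-matching constraint is greedy selection: repeatedly add the utterance whose context-dependent state counts move the pooled distribution closest to the reference, measured here with KL divergence. The toy below operates on synthetic state histograms rather than real forced alignments, and the greedy criterion is an assumption made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def kl(p, q, eps=1e-9):
    """KL divergence between two (unnormalized) count vectors after normalization."""
    p, q = p + eps, q + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def greedy_select(candidate_counts, reference_dist, budget):
    """Pick `budget` utterances whose pooled CD-state histogram best matches the reference."""
    selected, pooled = [], np.zeros_like(reference_dist)
    remaining = list(range(len(candidate_counts)))
    for _ in range(budget):
        best = min(remaining, key=lambda i: kl(reference_dist, pooled + candidate_counts[i]))
        pooled += candidate_counts[best]
        selected.append(best)
        remaining.remove(best)
    return selected

num_states = 20
reference = rng.dirichlet(np.ones(num_states))                        # from the curated dev set
candidates = rng.poisson(5.0, size=(200, num_states)).astype(float)   # per-utterance state counts
chosen = greedy_select(candidates, reference, budget=10)
print("selected utterance indices:", chosen)
```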
Abstract
In large vocabulary continuous speech recognition, decision trees are widely used to cluster triphone states. In addition to commonly used phonetically based questions, others have proposed additional questions such as phone position within the word or syllable. This paper examines using the word or syllable context itself as a feature in the decision tree, providing an elegant way of introducing word- or syllable-specific models into the system. Positive results are reported on two state-of-the-art systems: voicemail transcription and search-by-voice tasks, across a variety of acoustic model and training set sizes.
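The idea of word or syllable context questions can be pictured as extending the question set used when splitting triphone states: alongside phonetic questions, the tree may ask whether a state's frames come from a particular word. The split criterion below is a simple single-Gaussian log-likelihood gain on 1-D toy data, standing in for the real clustering objective.

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_loglik(x):
    """Log-likelihood of samples under a single Gaussian fit to them (ML estimate)."""
    var = np.var(x) + 1e-6
    return -0.5 * len(x) * (np.log(2 * np.pi * var) + 1)

def split_gain(frames, mask):
    """Likelihood gain from splitting the state's frames by a yes/no question."""
    yes, no = frames[mask], frames[~mask]
    if len(yes) == 0 or len(no) == 0:
        return -np.inf
    return gaussian_loglik(yes) + gaussian_loglik(no) - gaussian_loglik(frames)

# Toy frames pooled for one triphone state, with the word each frame came from.
frames = np.concatenate([rng.normal(0.0, 1.0, 300), rng.normal(2.0, 1.0, 100)])
words = np.array(["okay"] * 300 + ["corner"] * 100)

questions = {
    "left context is a vowel?": rng.random(400) < 0.5,   # stand-in phonetic question
    "word is 'okay'?": words == "okay",                  # word-context question
}
for name, mask in questions.items():
    print(f"{name:30s} gain = {split_gain(frames, mask):.1f}")
```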
An Audio Indexing System for Election Video Material
Christopher Alberti
Ari Bezman
Anastassia Drofa
Ted Power
Arnaud Sahuguet
Maria Shugrina
Proceedings of ICASSP (2009), pp. 4873-4876
Abstract
In the 2008 presidential election race in the United States, the prospective candidates made extensive use of YouTube to post video material. We developed a scalable system that transcribes this material and makes the content searchable (by indexing the meta-data and transcripts of the videos) and allows the user to navigate through the video material based on content. The system is available as an iGoogle gadget as well as a Labs product. Given the large exposure, special emphasis was put on the scalability and reliability of the system. This paper describes the design and implementation of this system.
Vocabulary independent spoken term detection
The IBM 2007 speech transcription system for European parliamentary speeches
Comments on Vocal Tract Length Normalization Equals Linear Transformation in Cepstral Space
Mohamed Afify
IEEE Transactions on Audio, Speech & Language Processing, vol. 15 (2007), pp. 1731-1732
Automated Quality Monitoring for Call Centers using Speech and NLP Technologies
Geoffrey Zweig
George Saon
Bhuvana Ramabhadran
Daniel Povey
Lidia Mangu
Brian Kingsbury
HLT-NAACL (2006)
The IBM Rich Transcription Spring 2006 Speech-to-Text System for Lecture Meetings
Jing Huang
Martin Westphal
Stanley F. Chen
Daniel Povey
Vit Libal
Alvaro Soneiro
Henrik Schulz
Thomas Ross
Gerasimos Potamianos
MLMI (2006), pp. 432-443
The IBM 2006 speech transcription system for European parliamentary speeches
Bhuvana Ramabhadran
Lidia Mangu
Geoffrey Zweig
Martin Westphal
Henrik Schulz
Alvaro Soneiro
INTERSPEECH (2006)
Fast vocabulary-independent audio search using path-based graph indexing
A new verification-based fast-match for large vocabulary continuous speech recognition
Mohamed Afify
Feng Liu
Hui Jiang 0001
IEEE Transactions on Speech and Audio Processing, vol. 13 (2005), pp. 546-553
Sequential estimation with optimal forgetting for robust speech recognition
Mohamed Afify
IEEE Transactions on Speech and Audio Processing, vol. 12 (2004), pp. 19-26
Speech recognition error analysis on the English MALACH corpus
Use of metadata to improve recognition of spontaneous speech and named entities
Advances in natural language call routing
Hong-Kwang Jeff Kuo
Joseph P. Olive
Bell Labs Technical Journal, vol. 7 (2003), pp. 155-170
Hierarchical class n-gram language models: towards better estimation of unseen events in speech recognition
Backoff hierarchical class n-gram language modelling for automatic speech recognition systems
Bell Labs approach to Aurora evaluation on connected digit recognition
Jingdong Chen
Dimitris Dimitriadis
Hui Jiang 0001
Qi Li
Tor André Myrvoll
Frank K. Soong
INTERSPEECH (2002)
A dynamic in-search discriminative training approach for large vocabulary speech recognition
A discriminative training criterion and an associated EM learning algorithm
Structural maximum a posteriori linear regression for fast HMM adaptation
Towards knowledge-based features for HMM based large vocabulary automatic speech recognition
Upper and lower bounds on the mean of noisy speech: application to minimax classification
Mohamed Afify
Chin-Hui Lee
IEEE Transactions on Speech and Audio Processing, vol. 10 (2002), pp. 79-88
A real-time Japanese broadcast news closed-captioning system
Akio Ando
Mohamed Afify
Hui Jiang 0001
Chin-Hui Lee
Qi Li
Feng Liu
Kazuo Onoe
Frank K. Soong
Qiru Zhou
INTERSPEECH (2001), pp. 495-498
Joint maximum a posteriori adaptation of transformation and HMM parameters
Cristina Chesta
Chin-Hui Lee
IEEE Transactions on Speech and Audio Processing, vol. 9 (2001), pp. 417-428
Minimax classification with parametric neighborhoods for noisy speech recognition
An auditory system-based feature for robust speech recognition
A new verification-based fast match approach to large vocabulary speech recognition
Evaluating the Aurora connected digit recognition task - a Bell Labs approach
Mohamed Afify
Hui Jiang 0001
Filipp Korkmazskiy
Chin-Hui Lee
Qi Li
Frank K. Soong
Arun C. Surendran
INTERSPEECH (2001), pp. 633-636
Small group speaker identification with common password phrases
Aaron E. Rosenberg
S. Parthasarathy
Speech Communication, vol. 31 (2000), pp. 131-140
Constrained maximum likelihood linear regression for speaker adaptation
Extended maximum a posterior linear regression (EMAPLR) model adaptation for speech recognition
A high-performance auditory feature for robust speech recognition
Structural maximum a-posteriori linear regression for unsupervised speaker adaptation
Maximum a posteriori linear regression for hidden Markov model adaptation
Comparative experiments of several adaptation approaches to noisy speech recognition using stochastic trajectory models
Noise adaptation using linear regression for continuous noisy speech recognition
A comparison of three noisy speech recognition approaches
A Bayesian approach to phone duration adaptation for Lombard speech recognition
Minimization of speech alignment error by iterative transformation for speaker adaptation