Richard Rose
Rick Rose has been a research scientist at Google in New York City since October 2014. At Google, he has contributed to efforts in far-field speech recognition, including multi-style training, data perturbation, and multi-channel microphone dereverberation. More recently, he has been working on acoustic modeling for ASR in YouTube videos. Before coming to Google, he was a Professor of Electrical and Computer Engineering at McGill University in Montreal from 2004, a member of the research staff at AT&T Labs / Bell Labs, and a member of staff at MIT Lincoln Laboratory. He received his PhD in Electrical Engineering from the Georgia Institute of Technology. He has been active in the IEEE Signal Processing Society, serving twice as an Associate Editor of IEEE SPS Transactions, twice as a member of the Speech Technical Committee, as a member of the SPS Board of Directors, and several times on organizing committees of IEEE workshops. He is a Fellow of the IEEE.
Authored Publications
End-to-end audio-visual speech recognition for overlapping speech
INTERSPEECH 2021: Conference of the International Speech Communication Association
Abstract
This paper investigates an end-to-end modeling approach for ASR that explicitly deals with scenarios where there are overlapping speech utterances from multiple talkers.
The approach assumes the availability of both audio signals and video signals in the form of continuous mouth-tracks aligned with speech for overlapping speakers.
This work extends previous work on audio-only multi-talker ASR applied to two-party conversations in a call center application, as well as work on end-to-end audio-visual (A/V) ASR applied to A/V YouTube (YT) Confidence Island utterances. It is shown that incorporating an attention-weighted combination of visual features in A/V multi-talker RNNT models significantly improves speaker disambiguation in ASR on overlapping speech. A 17% reduction in WER was observed for A/V multi-talker models relative to audio-only multi-talker models on a simulated A/V overlapped speech corpus.
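To make the fusion mechanism concrete, here is a minimal numpy sketch of an attention-weighted combination of per-speaker visual feature tracks, conditioned on the audio encoder output. The function name, feature shapes, and dot-product scoring are illustrative assumptions; the paper's actual model embeds this combination inside an A/V multi-talker RNNT, which is not reproduced here.

```python
import numpy as np

def attention_weighted_fusion(audio_feats, visual_tracks):
    """Hypothetical sketch of attention-weighted A/V feature fusion.

    audio_feats:   (T, D) audio encoder outputs
    visual_tracks: (S, T, D) one visual feature track per candidate speaker
    returns:       (T, 2*D) audio features concatenated with the
                   attention-weighted combination of the visual tracks
    """
    T, D = audio_feats.shape
    # Scaled dot-product score of each visual track against each audio frame.
    scores = np.einsum('td,std->st', audio_feats, visual_tracks) / np.sqrt(D)
    # Softmax over the speaker axis gives per-frame attention weights.
    weights = np.exp(scores - scores.max(axis=0, keepdims=True))
    weights /= weights.sum(axis=0, keepdims=True)
    # Weighted combination of the visual tracks, one fused vector per frame.
    fused_visual = np.einsum('st,std->td', weights, visual_tracks)
    return np.concatenate([audio_feats, fused_visual], axis=-1)

# Toy usage: 2 overlapping speakers, 50 frames, 64-dim features.
audio = np.random.randn(50, 64).astype(np.float32)
video = np.random.randn(2, 50, 64).astype(np.float32)
print(attention_weighted_fusion(audio, video).shape)  # (50, 128)
```

The intuition is that the attention weights let the model softly select, frame by frame, which speaker's mouth-track best explains the audio, which is what drives the speaker disambiguation reported above.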
Acoustic Modeling for Google Home
Joe Caroselli
Kean Chin
Chanwoo Kim
Mitchel Weintraub
Erik McDermott
INTERSPEECH 2017 (2017)
Abstract
This paper describes the technical and system-building advances made to the Google Home multichannel speech recognition system, which was launched in November 2016. Technical advances include an adaptive dereverberation frontend, the use of neural network models that perform multichannel processing jointly with acoustic modeling, and grid LSTMs to model frequency variations. On the system level, improvements include adapting the model using Google Home-specific data. We present results on a variety of multichannel test sets. The combination of technical and system advances results in a relative WER reduction of over 18% compared to the current production system.
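As a loose illustration of the multichannel frontend idea, the sketch below implements a simple filter-and-sum enhancement stage in numpy: each microphone channel is convolved with its own FIR filter and the results are summed. In the actual system such filtering is learned by neural network layers trained jointly with the acoustic model; here the filters are random placeholders, and the adaptive dereverberation and grid-LSTM components are omitted entirely.

```python
import numpy as np

def filter_and_sum(waveforms, filters):
    """Hypothetical filter-and-sum frontend.

    waveforms: (C, N) multichannel audio, one row per microphone
    filters:   (C, K) one FIR filter per channel (learned in the real system)
    returns:   (N,) single enhanced waveform
    """
    C, N = waveforms.shape
    out = np.zeros(N)
    for c in range(C):
        # Filter each channel, then accumulate into the enhanced output.
        out += np.convolve(waveforms[c], filters[c], mode='same')
    return out

# Toy usage: 4 microphone channels, 1 s of 16 kHz audio, 64-tap filters.
rng = np.random.default_rng(0)
mics = rng.standard_normal((4, 16000))
taps = rng.standard_normal((4, 64)) * 0.1
enhanced = filter_and_sum(mics, taps)
print(enhanced.shape)  # (16000,)
```

In the jointly trained setting, the enhanced output would feed directly into the acoustic model so that the filter parameters are optimized for recognition accuracy rather than for a separate signal-level objective.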
Automatic Optimization of Data Perturbation Distributions for Multi-Style Training in Speech Recognition
Mortaza Doulaty
Proceedings of the IEEE 2016 Workshop on Spoken Language Technology (SLT2016)
Abstract
Speech recognition performance using deep neural network based acoustic models is known to degrade when the acoustic environment and the speaker population in the target utterances are significantly different from the conditions represented in the training data. To address these mismatched scenarios, multi-style training (MTR) has been used to perturb utterances in an existing uncorrupted and potentially mismatched training speech corpus to better match target domain utterances. This paper addresses the problem of determining the distribution of perturbation levels for a given set of perturbation types that best matches the target speech utterances. An approach is presented that, given a small set of utterances from a target domain, automatically identifies an empirical distribution of perturbation levels that can be applied to utterances in an existing training set.
Distributions are estimated for perturbation types that include acoustic background environments, reverberant room configurations, and speaker-related variation such as frequency and temporal warping.
The end goal is for the resulting perturbed training set to characterize the variability in the target domain and thereby optimize ASR performance. An experimental study is performed to evaluate the impact of this approach on ASR performance when the target utterances are taken from a simulated far-field acoustic environment.
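A minimal sketch of how a perturbed training set might be generated once such an empirical distribution is in hand: perturbation levels are drawn from per-type categorical distributions and applied to clean utterances. The specific levels, probabilities, and helper functions below are invented for illustration, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical empirical distributions over perturbation levels, standing in
# for distributions estimated from a small set of target-domain utterances.
snr_levels_db = np.array([0, 5, 10, 15, 20])
snr_probs     = np.array([0.05, 0.15, 0.30, 0.30, 0.20])
rt60_levels_s = np.array([0.2, 0.4, 0.6, 0.8])
rt60_probs    = np.array([0.10, 0.40, 0.35, 0.15])

def sample_perturbation():
    """Draw one (noise SNR, room RT60) setting for a training utterance."""
    snr_db = rng.choice(snr_levels_db, p=snr_probs)
    rt60_s = rng.choice(rt60_levels_s, p=rt60_probs)
    return snr_db, rt60_s

def add_noise_at_snr(speech, noise, snr_db):
    """Mix background noise into the utterance at the sampled SNR."""
    speech_pow = np.mean(speech ** 2)
    noise_pow = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(speech_pow / (noise_pow * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise

# Toy usage: perturb one random "utterance" at a sampled noise level.
# (Applying rt60_s would require convolving with a simulated room impulse
# response, which is omitted from this sketch.)
speech = rng.standard_normal(16000)
noise = rng.standard_normal(16000)
snr_db, rt60_s = sample_perturbation()
noisy = add_noise_at_snr(speech, noise, snr_db)
```

The key design point is that the sampling probabilities, rather than being fixed by hand, would be fit so that the perturbed training corpus statistically resembles the target-domain utterances.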