Speech Processing

The research goal for speech at Google aligns with our company mission: to organize the world’s information and make it universally accessible and useful. Our pioneering research work in speech processing has enabled us to build automatic speech recognition (ASR) and text-to-speech (TTS) systems that are used across Google products, with support for more than a hundred language varieties spoken across the globe. From Gboard dictation to transcriptions of voice notes, from YouTube captions to team meetings without language barriers, and from Google Maps speaking directions aloud to Google Assistant reading the news, Google’s speech research has unparalleled reach and impact. We aim to solve speech for everyone, everywhere – and work to further improve quality, speed and versatility across all kinds of speech. We're also committed to expanding our language coverage, and have set a moonshot goal to build speech technologies for 1,000 languages.

Google's speech research efforts push the state of the art in architectures and algorithms used across areas such as speech recognition, text-to-speech synthesis, keyword spotting, speaker recognition, and language identification. The systems we build are deployed on servers in Google’s data centers, but increasingly also on-device. The team has a passion for research that leads to product advances for the billions of users who use speech in Google products today. We also release academic publications and open-source projects for the broader research community to leverage.

Our speech technologies are embedded in products like the Assistant, Search, Gboard, Translate, Maps, YouTube, Cloud, and many more. Thanks to close collaborations with product teams, we are in a unique position to deliver user-centric research. Our researchers can conduct live experiments to test and benchmark new algorithms directly in a realistic, controlled environment. Whether the work involves algorithmic improvements or user-experience and human-computer interaction studies, we focus on solving real problems with real impact on users.

We value our user diversity, and have made it a priority to deliver the best performance for every language and language variety. Today, our speech systems operate in more than 130 language varieties, and we continue to expand our reach. The challenges of internationalizing at scale are immense and rewarding. We are breaking new ground by deploying speech technologies that help people communicate, access information online, and share their knowledge – all in their language. Combined with the unprecedented translation capabilities of Google Translate, we are also at the forefront of research in speech-to-speech translation, bringing us one step closer to a universal translator.

Recent Publications

StreamVC
We present StreamVC, a streaming voice conversion solution that preserves the content and prosody of any source speech while matching the voice timbre from any target speech. Unlike previous approaches, StreamVC produces the resulting waveform at low latency from the input signal, even on a mobile platform, making it applicable to real-time communication scenarios like calls and video conferencing, and addressing use cases such as voice anonymization in these scenarios. Our design leverages the architecture and training strategy of the SoundStream neural audio codec for lightweight, high-quality speech synthesis. We demonstrate the feasibility of learning soft speech units causally, as well as the effectiveness of supplying whitened fundamental frequency information to improve pitch stability without leaking the source timbre information.
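The abstract mentions supplying "whitened" fundamental frequency (F0) so the pitch contour is conveyed without the source speaker's absolute pitch statistics. A common way to realize this is per-utterance normalization of log-F0 over voiced frames; the sketch below illustrates that idea (the function name and details are our own assumptions, not taken from the paper):

```python
import numpy as np

def whiten_f0(f0_hz, eps=1e-8):
    """Normalize per-utterance log-F0 to zero mean and unit variance.

    Removes the speaker's absolute pitch level and range (statistics that
    correlate with timbre/identity) while keeping the relative pitch
    contour. Unvoiced frames (f0 == 0) are excluded from the statistics
    and mapped to 0 in the output.
    """
    f0_hz = np.asarray(f0_hz, dtype=np.float64)
    voiced = f0_hz > 0
    log_f0 = np.where(voiced, np.log(np.where(voiced, f0_hz, 1.0)), 0.0)
    mu = log_f0[voiced].mean()
    sigma = log_f0[voiced].std()
    out = np.zeros_like(f0_hz)
    out[voiced] = (log_f0[voiced] - mu) / (sigma + eps)
    return out
```

After whitening, two speakers uttering the same sentence at very different pitch registers produce nearly identical conditioning signals, which is the point of the "no timbre leakage" property.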
Multimodal Language Identification
Shikhar Bharadwaj
Sriram (Sri) Ganapathy
Sid Dalmia
Wei Han
Yu Zhang
Proceedings of the 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2024)
Spoken language identification refers to the task of automatically predicting the spoken language in a given utterance. Conventionally, it is modeled as a speech-based language identification task. Prior techniques have been constrained to a single modality; however, in the case of video data there is a wealth of other metadata that may be beneficial for this task. In this work, we propose MuSeLI, a Multimodal Spoken Language Identification method, which delves into the use of various metadata sources to enhance language identification. Our study reveals that metadata such as video title, description, and geographic location provide substantial information to identify the spoken language of the multimedia recording. We conduct experiments using two diverse public datasets of YouTube videos, and obtain state-of-the-art results on the language identification task. We additionally conduct an ablation study that describes the distinct contribution of each modality for language recognition.
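To make the multimodal idea concrete, here is a toy late-fusion sketch: each modality (audio, title text, upload location) produces its own language posterior, and the posteriors are combined log-linearly. The language set, probabilities, weights, and fusion scheme are all illustrative assumptions, not the MuSeLI architecture:

```python
import numpy as np

LANGS = ["en", "hi", "es"]  # toy label set

def fuse_logprobs(modality_logprobs, weights):
    """Weighted log-linear fusion of per-modality language log-probabilities."""
    fused = np.zeros_like(modality_logprobs[0])
    for w, lp in zip(weights, modality_logprobs):
        fused = fused + w * lp
    fused = fused - np.logaddexp.reduce(fused)  # renormalize in log space
    return fused

audio = np.log([0.5, 0.3, 0.2])  # acoustic model is uncertain
title = np.log([0.2, 0.7, 0.1])  # title text suggests Hindi
geo   = np.log([0.3, 0.6, 0.1])  # upload-location prior

fused = fuse_logprobs([audio, title, geo], weights=[1.0, 0.5, 0.25])
print(LANGS[int(np.argmax(fused))])  # prints: hi
```

Here the metadata overrides an uncertain acoustic prediction, which mirrors the abstract's finding that title, description, and location carry substantial language signal.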
In streaming settings, speech recognition models have to map sub-sequences of speech to text before the full audio stream becomes available. However, since alignment information between speech and text is rarely available during training, models need to learn it in a completely self-supervised way. In practice, the exponential number of possible alignments makes this extremely challenging, with models often learning peaky or sub-optimal alignments. Prima facie, the exponential nature of the alignment space makes it difficult to even quantify the uncertainty of a model's alignment distribution. Fortunately, it has been known for decades that the entropy of a probabilistic finite state transducer can be computed in time linear to the size of the transducer via a dynamic programming reduction based on semirings. In this work, we revisit the entropy semiring for neural speech recognition models, and show how alignment entropy can be used to supervise models through regularization or distillation. We also contribute an open-source implementation of CTC and RNN-T in the semiring framework that includes numerically stable and highly parallel variants of the entropy semiring. Empirically, we observe that the addition of alignment distillation improves the accuracy and latency of an already well-optimized teacher-student distillation model, achieving state-of-the-art performance on the Librispeech dataset in the streaming scenario.
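The linear-time entropy computation the abstract refers to can be demonstrated on a toy lattice. In the entropy semiring, each arc with probability p carries the pair (p, -p·log p); the semiring product preserves that form along a path, and the semiring sum accumulates it across paths, so one forward pass yields the exact entropy of the path distribution. This minimal sketch (our own illustration, not the paper's open-source implementation) verifies the identity on a two-step lattice:

```python
import math

# Entropy semiring element: (p, h), where for a single path of
# probability p we have h = -p * log(p).
def oplus(a, b):
    return (a[0] + b[0], a[1] + b[1])

def otimes(a, b):
    # (p1*p2, p1*h2 + p2*h1) keeps the -p*log(p) form along a path.
    return (a[0] * b[0], a[0] * b[1] + a[1] * b[0])

def arc(p):
    return (p, -p * math.log(p))

# Tiny 3-state lattice (a stand-in for a CTC/RNN-T alignment lattice):
# two parallel arcs from state 0 to 1, and two from state 1 to 2.
arcs = {
    (0, 1): [arc(0.6), arc(0.4)],
    (1, 2): [arc(0.9), arc(0.1)],
}

# Forward dynamic programming in topological order:
# one pass, linear in the number of arcs.
ZERO, ONE = (0.0, 0.0), (1.0, 0.0)
alpha = {0: ONE, 1: ZERO, 2: ZERO}
for (src, dst), weights in arcs.items():
    for w in weights:
        alpha[dst] = oplus(alpha[dst], otimes(alpha[src], w))

total_p, entropy = alpha[2]
# total_p is 1.0; entropy equals the entropy over all four paths,
# computed without ever enumerating them.
```

Enumerating the four paths (probabilities 0.54, 0.36, 0.06, 0.04) and summing -p·log p gives the same entropy, confirming that the dynamic program sidesteps the exponential path enumeration.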
Machine Learning for Audition
Malcolm Slaney
A talk for the Virtual Conference on Computational Audition describing where ML is helping audio accessibility today and what ML can do in the future.
The quality of synthetic speech is typically evaluated using subjective listening tests. An underlying assumption is that these tests are reliable, i.e., running the test multiple times gives consistent results. A common approach to study reliability is a replication study. Existing studies focus primarily on Mean Opinion Score (MOS), and few consider the error bounds from the original test. In contrast, we present a replication study of both MOS and AB preference tests to answer two questions: (1) which of the two test types is more reliable for system comparison, and (2) for both test types, how reliable are the results with respect to their estimated standard error? We find that while AB tests are more reliable for system comparison, standard errors are underestimated for both test types. We show that these underestimates are partially due to broken independence assumptions, and suggest alternate methods of standard error estimation that account for dependencies among ratings.
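One standard way to account for dependencies among ratings is a cluster bootstrap: resample whole dependency groups (e.g. all ratings from one listener, or all ratings of one utterance) rather than individual ratings. The sketch below shows the idea; it is a generic illustration of dependence-aware standard error estimation, not necessarily the estimator proposed in the paper:

```python
import random
import statistics

def clustered_bootstrap_se(ratings_by_group, n_boot=2000, seed=0):
    """Bootstrap standard error of a mean score, resampling whole groups.

    ratings_by_group: list of rating lists, one per dependency group.
    Resampling groups instead of individual ratings preserves the
    within-group correlation that a naive i.i.d. standard error ignores
    (and therefore underestimates).
    """
    rng = random.Random(seed)
    boot_means = []
    for _ in range(n_boot):
        sample = [rng.choice(ratings_by_group) for _ in ratings_by_group]
        flat = [r for group in sample for r in group]
        boot_means.append(sum(flat) / len(flat))
    return statistics.stdev(boot_means)
```

With perfectly correlated within-group ratings, the clustered estimate is noticeably larger than the naive standard error of the flattened ratings, illustrating exactly the underestimation the abstract describes.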
Hard and soft distillation are two popular approaches for knowledge distillation from a teacher to a student ASR model. Although soft distillation outperforms hard distillation, it has several limitations. First, training convergence depends on the match between the teacher and student alignments. Second, soft distillation suffers quality regressions when the teacher and student models have different architectures. Third, in the case of non-causal teacher models, soft distillation requires tuning the rightward shift of the teacher alignments. Finally, soft distillation requires the teacher and student models to have the same temporal sampling rates. In this work, we propose a novel knowledge distillation method for RNN-T models that tackles the limitations of both hard and soft distillation. We call our method Full-sum distillation; it simply distills the sequence posterior probability of the teacher model to the student model. Thus, this method does not depend directly on the noisy labels to distill knowledge, nor does it depend on the time dimension. We also propose a variant of Full-sum distillation that distills the sequence discriminative knowledge of the teacher model to the student model to further improve performance. Using Full-sum distillation, we achieve significant improvements when training with strong and weak teacher models on public data as well as on in-house production data.
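Since the method distills sequence posteriors rather than per-frame alignments, the loss can be expressed purely in terms of alignment-marginalized log-probabilities log P(y | x). The sketch below shows a simplified n-best formulation of that idea: match the student's renormalized sequence distribution to the teacher's via cross-entropy. The n-best restriction and function name are our own illustrative assumptions, not the paper's exact objective:

```python
import numpy as np

def full_sum_distill_loss(teacher_logp, student_logp):
    """Sequence-level distillation over an n-best hypothesis list.

    teacher_logp / student_logp: full-sum (alignment-marginalized)
    log-probabilities log P(y | x) for the same n hypotheses, as an
    RNN-T forward pass would produce. Minimizing the cross-entropy of
    the renormalized teacher distribution under the student transfers
    the teacher's sequence posterior without using frame alignments,
    so teacher and student may differ in architecture and frame rate.
    """
    t = np.exp(teacher_logp - np.logaddexp.reduce(teacher_logp))
    s = student_logp - np.logaddexp.reduce(student_logp)
    return float(-(t * s).sum())
```

When the student matches the teacher exactly, the loss reduces to the teacher's sequence entropy; any mismatch increases it, by Gibbs' inequality.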