
Tom Bagby

Research Areas

Natural language processing

Authored Publications

Massive Sound Embedding Benchmark (MSEB)

Cyril Allauzen

Georg Heigold

Ji Ma

Ehsan Variani

Michael Riley

Tom Bagby

2025

Although sound information extraction tasks appear distinct across the spectrum of sound classes and technologies, all inherently involve creating some form of "embedding" (be it discrete, as in textual tokens, or continuous vectors) to encapsulate relevant information from the audio signal for downstream utilization. This unifying framework allows us to re-evaluate sound information extraction by researching the optimality of current task-specific representations, the quality headroom, and the potential for a single, robust sound embedding to generalize across diverse applications and sound types. To expedite research in these directions, a standardized evaluation benchmark is indispensable, mirroring the established benchmarks in text and image domains. We present the Massive Sound Embedding Benchmark (MSEB) to serve this purpose. MSEB encompasses realistic tasks and datasets that reflect practical applications across diverse technologies and sound categories. Initial experimental findings indicate substantial headroom for enhancing prevalent information extraction methodologies. We encourage the sound processing community to contribute data and tasks to MSEB and employ it to assess their algorithms for improved overall sound encoding.

Generative semi-supervised learning with a neural seq2seq noisy channel

Soroosh Mariooryad

Matt Shannon

Siyuan Ma

Tom Bagby

David Kao

Daisy Stanton

Eric Battenberg

RJ Skerry-Ryan

ICML Workshop on Structured Probabilistic Inference (2023)

We present a noisy channel generative model of two sequences, for example text and speech, which enables uncovering the associations between the two modalities when limited paired data is available. To address the intractability of the exact model under a realistic data set-up, we propose a variational inference approximation. To train this variational model with categorical data, we propose a KL encoder loss approach which has connections to the wake-sleep algorithm. Identifying the joint or conditional distributions by only observing unpaired samples from the marginals is only possible under certain structure in the data distribution, and we discuss under what type of conditional independence assumptions that might be achieved, which guides the architecture designs. Experimental results show that even a tiny amount of paired data is sufficient to learn to relate the two modalities (graphemes and phonemes here) when large amounts of unpaired data are available, paving the path to adopting this principled approach for ASR and TTS models in low-resource data regimes.

Speaker Generation

Daisy Stanton

David Teh-Hwa Kao

Eric Battenberg

Matt Shannon

RJ Skerry-Ryan

Soroosh Mariooryad

Tom Bagby

ICASSP (2022)

This work explores the task of synthesizing speech in human-sounding voices unseen in any training set. We call this task "speaker generation", and present TacoSpawn, a system that performs competitively at this task. TacoSpawn is a deep generative text-to-speech model that learns a distribution over a speaker embedding space, which enables sampling of novel and diverse speakers. Our method is easy to implement, and does not require transfer learning from speaker ID systems. We present objective and subjective metrics for evaluating performance on this task, and demonstrate that our proposed objective metrics correlate with human perception of speaker similarity.

Location-Relative Attention Mechanisms For Robust Long-Form Speech Synthesis

Eric Battenberg

RJ Skerry-Ryan

Soroosh Mariooryad

Daisy Stanton

David Kao

Matt Shannon

Tom Bagby

ICASSP (2020)

Despite the ability to produce human-level speech for in-domain text, attention-based end-to-end text-to-speech (TTS) systems suffer from text alignment failures that increase in frequency for out-of-domain text. We show that these failures can be addressed using simple location-relative attention mechanisms that do away with content-based query/key comparisons. We compare two families of attention mechanisms: location-relative GMM-based mechanisms and additive energy-based mechanisms. We suggest simple modifications to GMM-based attention that allow it to align quickly and consistently during training, and introduce a new location-relative attention mechanism to the additive energy-based family, called Dynamic Convolution Attention (DCA). We compare the various mechanisms in terms of alignment speed and consistency during training, naturalness, and ability to generalize to long utterances, and conclude that GMM attention and DCA can generalize to very long utterances, while preserving naturalness for shorter, in-domain utterances.
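To make the idea of location-relative attention concrete, here is a minimal sketch of one decoder step of a generic GMM-style attention in plain Python. It is not the paper's implementation: the function name `gmm_attention_step`, the softmax/softplus parametrization, and the small scale floor are assumptions chosen for illustration. The point it shows is that the alignment depends only on encoder positions and on parameters produced by the decoder, never on a comparison with encoder content, and that the mixture means can only move forward.

```python
import math


def softplus(x):
    """Numerically stable softplus, log(1 + exp(x)); always positive."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def gmm_attention_step(prev_means, raw_weights, raw_deltas, raw_scales, num_positions):
    """One step of an illustrative location-relative GMM attention.

    The raw_* lists stand in for decoder outputs (one value per mixture
    component); the parametrization below is an assumption, not the paper's.
    Returns (alignment over encoder positions, updated means).
    """
    # Mixture weights: softmax over the raw weights.
    peak = max(raw_weights)
    exp_w = [math.exp(w - peak) for w in raw_weights]
    total = sum(exp_w)
    weights = [w / total for w in exp_w]

    # Means advance by a strictly positive offset, so attention moves forward.
    means = [mu + softplus(d) for mu, d in zip(prev_means, raw_deltas)]
    # Scales are kept positive; the 1e-3 floor is an arbitrary safeguard.
    scales = [softplus(s) + 1e-3 for s in raw_scales]

    # Alignment at position j is the mixture density evaluated at j.
    alignment = []
    for j in range(num_positions):
        density = 0.0
        for w, mu, sigma in zip(weights, means, scales):
            z = (j - mu) / sigma
            density += w * math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))
        alignment.append(density)
    return alignment, means


# Two steps with constant (made-up) decoder outputs over 10 encoder positions.
means = [0.0, 0.0]
for _ in range(2):
    alignment, means = gmm_attention_step(means, [0.0, 0.0], [0.0, 1.0], [0.0, 0.0], 10)
```

Because no query/key comparison is involved, nothing in this computation changes when the input text is longer or out of domain, which is the property the abstract relies on.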

Effective Use of Variational Embedding Capacity in Expressive End-to-End Speech Synthesis

Eric Battenberg

Soroosh Mariooryad

Daisy Stanton

RJ Skerry-Ryan

Matt Shannon

David Kao

Tom Bagby

arXiv (2019)

Recent work has explored sequence-to-sequence latent variable models for expressive speech synthesis (supporting control and transfer of prosody and style), but has not presented a coherent framework for understanding the trade-offs between the competing methods. In this paper, we propose embedding capacity (the amount of information the embedding contains about the data) as a unified method of analyzing the behavior of latent variable models of speech, comparing existing heuristic (non-variational) methods to variational methods that are able to explicitly constrain capacity using an upper bound on representational mutual information. In our proposed model (Capacitron), we show that by adding conditional dependencies to the variational posterior such that it matches the form of the true posterior, the same model can be used for high-precision prosody transfer, text-agnostic style transfer, and generation of natural-sounding prior samples. For multi-speaker models, Capacitron is able to preserve target speaker identity during inter-speaker prosody transfer and when drawing samples from the latent prior. Lastly, we introduce a method for decomposing embedding capacity hierarchically across two sets of latents, allowing a portion of the latent variability to be specified and the remaining variability sampled from a learned prior. Audio examples are available on the web.

Semi-Supervised Generative Modeling for Controllable Speech Synthesis

Raza Habib

Soroosh Mariooryad

Matt Shannon

Eric Battenberg

RJ Skerry-Ryan

Daisy Stanton

David Kao

Tom Bagby

ICLR (2019)

We present a novel generative model that combines state-of-the-art neural text-to-speech (TTS) with semi-supervised probabilistic latent variable models. By providing partial supervision to some of the latent variables, we are able to force them to take on consistent and interpretable purposes, which previously hasn't been possible with purely unsupervised methods. We demonstrate that our model is able to reliably discover and control important but rarely labelled attributes of speech, such as affect and speaking rate, with as little as 0.5% (15 minutes) supervision. Even at such low supervision levels we do not observe a degradation of synthesis quality compared to a state-of-the-art baseline.

Streaming End-to-End Speech Recognition for Mobile Devices

Yanzhang He

Tara Sainath

Rohit Prabhavalkar

Ian McGraw

Raziel Alvarez

Ding Zhao

David Rybach

Anjuli Kannan

Yonghui Wu

Ruoming Pang

Qiao Liang

Deepti Bhatia

Yuan Shangguan

Bo Li

Golan Pundak

Khe Chai Sim

Tom Bagby

Shuo-yiin Chang

Kanishka Rao

Alex Gruenstein

ICASSP (2019)

End-to-end (E2E) models, which directly predict output character sequences given input speech, are good candidates for on-device speech recognition. E2E models, however, present numerous challenges: In order to be truly useful, such models must decode speech utterances in a streaming fashion, in real time; they must be robust to the long tail of use cases; they must be able to leverage user-specific context (e.g., contact lists); and above all, they must be extremely accurate. In this work, we describe our efforts at building an E2E speech recognizer using a recurrent neural network transducer. In experimental evaluations, we find that the proposed approach can outperform a conventional CTC-based model in terms of both latency and accuracy in a number of evaluation categories.

Sampled Connectionist Temporal Classification

Ehsan Variani

Erik McDermott

Kamel Lahouel

Michiel Bacchiani

Tom Bagby

ICASSP 2018 (2018)

This article introduces and evaluates Sampled Connectionist Temporal Classification (CTC), which connects the CTC criterion to the Cross Entropy (CE) objective through sampling. Instead of computing the logarithm of the sum of the alignment path likelihoods, at each training step the sampled CTC only computes the CE loss between the sampled alignment path and model posteriors. It is shown that the sampled CTC objective is an unbiased estimator of an upper bound for the CTC loss, thus minimization of the sampled CTC is equivalent to the minimization of the upper bound of the CTC objective. The definition of the sampled CTC objective has the advantage that it is computationally scalable to massive datasets using accelerated computation machines. The sampled CTC is compared with CTC in two large-scale speech recognition tasks, and it is shown that sampled CTC can achieve WER performance similar to that of the best CTC baseline in about one fourth of the training time of the CTC baseline.
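The relationship between the two objectives can be illustrated on a toy example. The sketch below is not the paper's implementation: it enumerates alignment paths by brute force (practical only for a few frames), and it assumes the alignment is sampled in proportion to its likelihood under the model, which the abstract does not specify. All function names and the example posteriors are hypothetical. Under that assumption, the CE of any single valid path is at least the exact CTC loss, so the expected sampled loss is an upper bound on it.

```python
import itertools
import math
import random

BLANK = 0  # index of the CTC blank symbol (a convention chosen for this sketch)


def collapse(path):
    """Apply the CTC collapsing rule: merge repeated symbols, then drop blanks."""
    out = []
    prev = None
    for symbol in path:
        if symbol != prev and symbol != BLANK:
            out.append(symbol)
        prev = symbol
    return out


def valid_paths(posteriors, labels):
    """Brute-force list of (path, likelihood) for paths that collapse to `labels`."""
    num_symbols = len(posteriors[0])
    paths = []
    for path in itertools.product(range(num_symbols), repeat=len(posteriors)):
        if collapse(path) == list(labels):
            likelihood = math.prod(frame[s] for frame, s in zip(posteriors, path))
            paths.append((path, likelihood))
    return paths


def ctc_loss(posteriors, labels):
    """Exact CTC loss: negative log of the summed path likelihoods."""
    return -math.log(sum(p for _, p in valid_paths(posteriors, labels)))


def path_cross_entropy(posteriors, path):
    """Frame-level CE between a fixed alignment path and the model posteriors."""
    return -sum(math.log(frame[s]) for frame, s in zip(posteriors, path))


def sampled_ctc_loss(posteriors, labels, rng):
    """CE against one path sampled in proportion to its likelihood (assumption)."""
    paths = valid_paths(posteriors, labels)
    path = rng.choices([p for p, _ in paths], weights=[w for _, w in paths])[0]
    return path_cross_entropy(posteriors, path)


def expected_sampled_ctc_loss(posteriors, labels):
    """Expectation of the sampled loss under the same sampling distribution."""
    paths = valid_paths(posteriors, labels)
    total = sum(w for _, w in paths)
    return sum((w / total) * path_cross_entropy(posteriors, p) for p, w in paths)


# Made-up posteriors: 3 frames over the symbols {blank, 1, 2}.
posteriors = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.7, 0.2, 0.1]]
labels = [1]
exact = ctc_loss(posteriors, labels)
upper = expected_sampled_ctc_loss(posteriors, labels)
one_sample = sampled_ctc_loss(posteriors, labels, random.Random(0))
```

The per-step cost of the sampled objective is a plain frame-level CE, which is what makes it attractive on accelerators; the enumeration here only serves to make the bound checkable by hand.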

Complex Evolution Recurrent Neural Networks (ceRNNs)

Izhak Shafran

RJ Skerry-Ryan

Tom Bagby

IEEE ICASSP 2018

Unitary Evolution Recurrent Neural Networks (uRNNs) have three attractive properties: (a) the unitary property, (b) the complex-valued nature, and (c) their efficient linear operators [1]. The literature so far does not address how critical the unitary property of the model is. Furthermore, uRNNs have not been evaluated on large tasks. To study these shortcomings, we propose the complex evolution Recurrent Neural Networks (ceRNNs), which are similar to uRNNs but drop the unitary property selectively. On a simple multivariate linear regression task, we illustrate that dropping the constraints improves the learning trajectory. In the copy memory task, ceRNNs and uRNNs perform identically, demonstrating that their superior performance over LSTMs is due to their complex-valued nature and their linear operators. In a large-scale real-world speech recognition task, we find that pre-pending a uRNN degrades the performance of our baseline LSTM acoustic models, while pre-pending a ceRNN improves the performance over the baseline by 0.8% absolute WER.

End-to-End Training of Acoustic Models for Large Vocabulary Continuous Speech Recognition with TensorFlow

Ehsan Variani

Tom Bagby

Erik McDermott

Michiel Bacchiani

Interspeech 2017 (2017)

This article discusses strategies for end-to-end training of state-of-the-art acoustic models for Large Vocabulary Continuous Speech Recognition (LVCSR), with the goal of leveraging TensorFlow components so as to make efficient use of large-scale training sets, large model sizes, and high-speed computation units such as Graphical Processing Units (GPUs). Benchmarks are presented that evaluate the efficiency of different approaches to batching of training data, unrolling of recurrent acoustic models, and device placement of TensorFlow variables and operations. An overall training architecture developed in light of those findings is then described. The approach makes it possible to take advantage of both data parallelism and high-speed computation on GPU for state-of-the-art sequence training of acoustic models. The effectiveness of the design is evaluated for different training schemes and model sizes, on a 20,000-hour Voice Search task.
