Tom Bagby

Authored Publications
    We present a noisy channel generative model of two sequences, for example text and speech, which enables uncovering the associations between the two modalities when limited paired data is available. To address the intractability of the exact model under a realistic data set-up, we propose a variational inference approximation. To train this variational model with categorical data, we propose a KL encoder loss approach which has connections to the wake-sleep algorithm. Identifying the joint or conditional distributions by only observing unpaired samples from the marginals is only possible under certain structure in the data distribution, and we discuss under what type of conditional independence assumptions that might be achieved, which guides the architecture designs. Experimental results show that even a tiny amount of paired data is sufficient to learn to relate the two modalities (graphemes and phonemes here) when loads of unpaired data is available, paving the path to adopting this principled approach for ASR and TTS models in low resource data regimes.
    This work explores the task of synthesizing speech in human-sounding voices unseen in any training set. We call this task "speaker generation", and present TacoSpawn, a system that performs competitively at this task. TacoSpawn is a deep generative text-to-speech model that learns a distribution over a speaker embedding space, which enables sampling of novel and diverse speakers. Our method is easy to implement, and does not require transfer learning from speaker ID systems. We present objective and subjective metrics for evaluating performance on this task, and demonstrate that our proposed objective metrics correlate with human perception of speaker similarity.
    Despite the ability to produce human-level speech for in-domain text, attention-based end-to-end text-to-speech (TTS) systems suffer from text alignment failures that increase in frequency for out-of-domain text. We show that these failures can be addressed using simple location-relative attention mechanisms that do away with content-based query/key comparisons. We compare two families of attention mechanisms: location-relative GMM-based mechanisms and additive energy-based mechanisms. We suggest simple modifications to GMM-based attention that allow it to align quickly and consistently during training, and introduce a new location-relative attention mechanism to the additive energy-based family, called Dynamic Convolution Attention (DCA). We compare the various mechanisms in terms of alignment speed and consistency during training, naturalness, and ability to generalize to long utterances, and conclude that GMM attention and DCA can generalize to very long utterances, while preserving naturalness for shorter, in-domain utterances.
    Recent work has explored sequence-to-sequence latent variable models for expressive speech synthesis (supporting control and transfer of prosody and style), but has not presented a coherent framework for understanding the trade-offs between the competing methods. In this paper, we propose embedding capacity (the amount of information the embedding contains about the data) as a unified method of analyzing the behavior of latent variable models of speech, comparing existing heuristic (non-variational) methods to variational methods that are able to explicitly constrain capacity using an upper bound on representational mutual information. In our proposed model (Capacitron), we show that by adding conditional dependencies to the variational posterior such that it matches the form of the true posterior, the same model can be used for high-precision prosody transfer, text-agnostic style transfer, and generation of natural-sounding prior samples. For multi-speaker models, Capacitron is able to preserve target speaker identity during inter-speaker prosody transfer and when drawing samples from the latent prior. Lastly, we introduce a method for decomposing embedding capacity hierarchically across two sets of latents, allowing a portion of the latent variability to be specified and the remaining variability sampled from a learned prior. Audio examples are available on the web.
    End-to-end (E2E) models, which directly predict output character sequences given input speech, are good candidates for on-device speech recognition. E2E models, however, present numerous challenges: In order to be truly useful, such models must decode speech utterances in a streaming fashion, in real time; they must be robust to the long tail of use cases; they must be able to leverage user-specific context (e.g., contact lists); and above all, they must be extremely accurate. In this work, we describe our efforts at building an E2E speech recognizer using a recurrent neural network transducer. In experimental evaluations, we find that the proposed approach can outperform a conventional CTC-based model in terms of both latency and accuracy in a number of evaluation categories.
    We present a novel generative model that combines state-of-the-art neural text-to-speech (TTS) with semi-supervised probabilistic latent variable models. By providing partial supervision to some of the latent variables, we are able to force them to take on consistent and interpretable purposes, which previously hasn't been possible with purely unsupervised methods. We demonstrate that our model is able to reliably discover and control important but rarely labelled attributes of speech, such as affect and speaking rate, with as little as 0.5% (15 minutes) supervision. Even at such low supervision levels we do not observe a degradation of synthesis quality compared to a state-of-the-art baseline.
    This article introduces and evaluates Sampled Connectionist Temporal Classification (CTC) which connects the CTC criterion to the Cross Entropy (CE) objective through sampling. Instead of computing the logarithm of the sum of the alignment path likelihoods, at each training step the sampled CTC only computes the CE loss between the sampled alignment path and model posteriors. It is shown that the sampled CTC objective is an unbiased estimator of an upper bound for the CTC loss, thus minimization of the sampled CTC is equivalent to the minimization of the upper bound of the CTC objective. The definition of the sampled CTC objective has the advantage that it is scalable computationally to the massive datasets using accelerated computation machines. The sampled CTC is compared with CTC in two large-scale speech recognition tasks and it is shown that sampled CTC can achieve similar WER performance of the best CTC baseline in about one fourth of the training time of the CTC baseline.
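    As a rough illustration of the sampled CTC idea described in the abstract above, the following toy sketch compares the exact CTC loss (computed by brute-force enumeration of alignment paths) with the cross-entropy loss of a single sampled path. It is not the paper's implementation. The function names, the blank index, the toy posteriors, and in particular the choice to sample paths in proportion to their likelihood under the model are all assumptions made for this example; the paper may sample differently. Because any single valid path has likelihood no larger than the sum over all valid paths, each sampled loss here is at least the exact CTC loss, which is consistent with the upper-bound claim.

```python
import itertools
import math
import random

BLANK = 0  # index of the CTC blank symbol (an assumption of this sketch)


def collapse(path):
    """Standard CTC collapse: merge repeated symbols, then drop blanks."""
    out, prev = [], None
    for symbol in path:
        if symbol != prev and symbol != BLANK:
            out.append(symbol)
        prev = symbol
    return tuple(out)


def valid_paths(target, posteriors):
    """Brute-force list of frame-level paths that collapse to `target`."""
    num_frames, num_symbols = len(posteriors), len(posteriors[0])
    every_path = itertools.product(range(num_symbols), repeat=num_frames)
    return [p for p in every_path if collapse(p) == tuple(target)]


def path_log_prob(path, posteriors):
    """Log-likelihood of one path under per-frame model posteriors."""
    return sum(math.log(posteriors[t][s]) for t, s in enumerate(path))


def ctc_loss(target, posteriors):
    """Exact CTC loss: -log of the summed likelihood of all valid paths."""
    paths = valid_paths(target, posteriors)
    return -math.log(sum(math.exp(path_log_prob(p, posteriors)) for p in paths))


def sampled_ctc_loss(target, posteriors, rng):
    """CE loss of one valid path, sampled in proportion to its likelihood.

    The sampling distribution is an assumption made for illustration.
    """
    paths = valid_paths(target, posteriors)
    weights = [math.exp(path_log_prob(p, posteriors)) for p in paths]
    sampled = rng.choices(paths, weights=weights, k=1)[0]
    return -path_log_prob(sampled, posteriors)


# Toy example: 3 frames, 3 symbols (blank, 1, 2); each row sums to 1.
POSTERIORS = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.5, 0.1, 0.4]]
TARGET = (1, 2)

exact = ctc_loss(TARGET, POSTERIORS)
samples = [
    sampled_ctc_loss(TARGET, POSTERIORS, random.Random(seed))
    for seed in range(200)
]
mean_sampled = sum(samples) / len(samples)
print(f"exact CTC loss: {exact:.4f}, mean sampled loss: {mean_sampled:.4f}")
```

    The brute-force enumeration only works for tiny inputs. A practical system would need an efficient way to draw alignment paths, which this sketch does not attempt to show.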
    Unitary Evolution Recurrent Neural Networks (uRNNs) have three attractive properties: (a) the unitary property, (b) the complex-valued nature, and (c) their efficient linear operators [1]. The literature so far does not address how critical the unitary property of the model is. Furthermore, uRNNs have not been evaluated on large tasks. To study these shortcomings, we propose the complex evolution Recurrent Neural Networks (ceRNNs), which are similar to uRNNs but drop the unitary property selectively. On a simple multivariate linear regression task, we illustrate that dropping the constraints improves the learning trajectory. In a copy memory task, ceRNNs and uRNNs perform identically, demonstrating that their superior performance over LSTMs is due to their complex-valued nature and their linear operators. In a large-scale real-world speech recognition task, we find that pre-pending a uRNN degrades the performance of our baseline LSTM acoustic models, while pre-pending a ceRNN improves the performance over the baseline by 0.8% absolute WER.
    This article discusses strategies for end-to-end training of state-of-the-art acoustic models for Large Vocabulary Continuous Speech Recognition (LVCSR), with the goal of leveraging TensorFlow components so as to make efficient use of large-scale training sets, large model sizes, and high-speed computation units such as Graphical Processing Units (GPUs). Benchmarks are presented that evaluate the efficiency of different approaches to batching of training data, unrolling of recurrent acoustic models, and device placement of TensorFlow variables and operations. An overall training architecture developed in light of those findings is then described. The approach makes it possible to take advantage of both data parallelism and high speed computation on GPU for state-of-the-art sequence training of acoustic models. The effectiveness of the design is evaluated for different training schemes and model sizes, on a 20,000 hour Voice Search task.