Jump to Content
Yonghui Wu

Yonghui Wu

Yonghui Wu joined Google in Sep 2008 first as a ranking engineer improving Google's core web search ranking algorithm. Since Jan 2015, he has been with the Google Brain team focus on deep learning and its applications. His research interests are in Information Retrieval, Learning to Rank, Machine Learning, Machine Translation, Natural Language Processing and etc.
Authored Publications
Google Publications
Other Publications
Sort By
  • Title
  • Title, desc
  • Year
  • Year, desc
    VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners
    Tao Zhu
    Zirui Wang
    Mi Zhang
    Soham Ghosh
    Jiahui Yu
    arxiv.org, Cornell University (2023)
    Preview abstract We explore an efficient approach to establish a foundational video-text model. We present VideoCoCa that maximally reuses a pretrained image-text contrastive captioner (CoCa) model and adapt it to video-text tasks with minimal extra training. While previous works adapt image-text models with various cross-frame fusion modules, we find that the generative attentional pooling and contrastive attentional pooling layers in CoCa are instantly adaptable to flattened frame embeddings, yielding state-of-the-art results on zero-shot video classification and zero-shot text-to-video retrieval. Furthermore, we explore lightweight finetuning on top of VideoCoCa, and achieve strong results on video question-answering and video captioning. View details
    Preview abstract Building universal dialogue systems that operate across multiple domains/APIs and generalize to new ones with minimal overhead is a critical challenge. Recent works have leveraged natural language descriptions of schema elements to enable such systems; however, descriptions only indirectly convey schema semantics. In this work, we propose Show, Don't Tell, which prompts seq2seq models with a labeled example dialogue to show the semantics of schema elements rather than tell the model through descriptions. While requiring similar effort from service developers as generating descriptions, we show that using short examples as schema representations with large language models results in state-of-the-art performance on two popular dialogue state tracking benchmarks designed to measure zero-shot generalization - the Schema-Guided Dialogue dataset and the MultiWOZ leave-one-out benchmark. View details
    Preview abstract We present the design of a new large scale orchestration layer for accelerators. Our system, Pathways, is explicitly designed to enable exploration of new systems and ML research ideas, while retaining state of the art performance for current models. Pathways uses a sharded dataflow graph of asynchronous operators that consume and produce futures, and efficiently gang-schedules heterogeneous parallel computations on thousands of accelerators while coordinating data transfers over their dedicated interconnects. Pathways makes use of a novel asynchronous distributed dataflow design that lets the control plane execute in parallel despite dependencies in the data plane. This design, with careful engineering, allows Pathways to adopt a single-controller model that makes it easier to express complex new parallelism patterns. We demonstrate that Pathways can achieve performance parity (~100% accelerator utilization) with state-of-the-art systems when running SPMD computations over 2048 TPUs, while also delivering throughput comparable to the SPMD case for Transformer models that are pipelined across 16 stages, or sharded across two islands of accelerators connected over a data center network. View details
    SGD-X: A Benchmark for Robust Generalization in Schema-Guided Dialogue Systems
    Bin Zhang
    AAAI Conference on Artificial Intelligence, Association for the Advancement of Artificial Intelligence (2022)
    Preview abstract Zero/few-shot transfer to unseen services is a critical challenge in task-oriented dialogue research. The Schema-Guided Dialogue (SGD) dataset introduced a paradigm for enabling models to support any service in zero-shot through schemas, which describe service APIs to models in natural language. We explore the robustness of dialogue systems to linguistic variations in schemas by designing SGD-X - a benchmark extending SGD with semantically similar yet stylistically diverse variants for every schema. We observe that two top state tracking models fail to generalize well across schema variants, measured by joint goal accuracy and a novel metric for measuring schema sensitivity. Additionally, we present a simple model-agnostic data augmentation method to improve schema robustness. View details
    Preview abstract In this paper we share findings from our effort towards building practical machine translation (MT) systems capable of translating across over one thousand languages. We describe results across three research domains: (i) Building clean, web-mined datasets by leveraging semi-supervised pre-training for language-id and developing data-driven filtering techniques; (ii) Leveraging massively multilingual MT models trained with supervised parallel data for over $100$ languages and small monolingual datasets for over $1000$ languages to enable translation for several previously under-studied languages; and (iii) Studying the limitations of evaluation metrics for long tail languages and conducting qualitative analysis of the outputs from our MT models. We hope that our work provides useful insights to practitioners working towards building MT systems for long tail languages, and highlights research directions that can complement the weaknesses of massively multilingual pre-trained models in data-sparse settings. View details
    Sparsely Activated Language Models are Efficient In-Context Learners
    Barret Richard Zoph
    Dmitry (Dima) Lepikhin
    Emma Wang
    Kun Zhang
    Liam B. Fedus
    Maarten Paul Bosma
    Marie Pellat
    Maxim Krikun
    Nan Du
    Simon Tong
    Tao Wang
    Toju Duke
    Yuanzhong Xu
    Zongwei Zhou
    Preview abstract Scaling language models with more data, compute and parameters has driven significant progress in natural language processing. For example, thanks to scaling, GPT-3 was able to achieve strong performance on few-shot learning. However, training these large dense models require significant amounts of computing resources. In this paper, we develop a family of sparsely activated mixture-of-expert language models named \glam (\textbf{G}eneralist \textbf{La}nguage \textbf{M}odel), which can have many more parameters but require significant less training cost than dense models. The largest \glam has 1.2 trillion parameters, which is approximately 7x larger than GPT-3 but can be trained more efficiently. With only 1/3 of energy consumption to train GPT-3, \glam achieves better overall performance on 29 zero-shot and one-shot NLP tasks. For example, \glam gets 75.0\% one-shot exact match accuracy on the TriviaQA test server, a significant improvement over 68.0\% obtained by GPT-3. View details
    Preview abstract Task-oriented dialogue (TOD) systems are required to identify key information from conversations for the completion of given tasks. Such information is conventionally specified in terms of intents and slots contained in task-specific ontology or schemata. Since these schemata are designed by system developers, the naming convention for slots and intents is not uniform across tasks, and may not convey their semantics effectively. This can lead to models memorizing arbitrary patterns in data, resulting in suboptimal performance and generalization. In this paper, we propose that schemata should be modified by replacing names or notations entirely with natural language descriptions. We show that a language description-driven system exhibits better understanding of task specifications, higher performance on state tracking, improved data efficiency, and effective zero-shot transfer to unseen tasks. Following this paradigm, we present a simple yet effective Description-Driven Dialog State Tracking (D3ST) model, which relies purely on schema descriptions and an "index-picking" mechanism. We demonstrate the superiority in quality, data efficiency and robustness of our approach as measured on the MultiWOZ (Budzianowski et al.,2018), SGD (Rastogi et al., 2020), and the recent SGD-X (Lee et al., 2021) benchmarks. View details
    CoCa: Contrastive Captioners are Image-Text Foundation Models
    Jiahui Yu
    Zirui Wang
    Vijay Vasudevan
    Transactions on Machine Learning Research, vol. Aug 2022 (2022)
    Preview abstract Exploring large-scale pretrained foundation models is of significant interest in computer vision because these models can be quickly transferred to many downstream tasks. This paper presents Contrastive Captioner (CoCa), a minimalist design to pretrain an image-text encoder-decoder foundation model jointly with contrastive loss and captioning loss, thereby subsuming model capabilities from contrastive approaches like CLIP and generative methods like SimVLM. In contrast to standard encoder-decoder transformers where all decoder layers attend to encoder outputs, CoCa omits cross-attention in the first half of decoder layers to encode unimodal text representations, and cascades the remaining decoder layers which cross-attend to the image encoder for multimodal image-text representations. We apply a contrastive loss between unimodal image and text embeddings, in addition to a captioning loss on the multimodal decoder outputs which predicts text tokens autoregressively. By sharing the same computational graph, the two training objectives are computed efficiently with minimal overhead. CoCa is pretrained end-to-end and from scratch on both web-scale alt-text data and annotated images by treating all labels simply as text, seamlessly unifying natural language supervision for representation learning. Empirically, CoCa achieves state-of-the-art performance with zero-shot transfer or minimal task-specific adaptation on a broad range of downstream tasks, spanning visual recognition (ImageNet, Kinetics-400/600/700, Moments-in-Time), crossmodal retrieval (MSCOCO, Flickr30K, MSR-VTT), multimodal understanding (VQA, SNLI-VE, NLVR2), and image captioning (MSCOCO, NoCaps). Notably on ImageNet classification, CoCa obtains 86.3% zero-shot top-1 accuracy, 90.6% with a frozen encoder and learned classification head, and new state-of-the-art 91.0% top-1 accuracy on ImageNet with a finetuned encoder. View details
    Preview abstract Transfer tasks in text-to-speech (TTS) synthesis — where one or more aspects of the speech of one set of speakers is transferred to another set of speakers that do not feature these aspects originally — remains a challenging task. One of the challenges is that models that have high-quality transfer capabilities can have issues in stability, making them impractical for user-facing critical tasks. This paper demonstrates that transfer can be obtained by training an robust TTS system on data generated by a less robust TTS system designed for a high-quality transfer task; In particular, a CHiVE-BERT monolingual TTS system is trained on the output of a Tacotron model designed for accent transfer. While some quality loss is inevitable with this approach, experimental results show that the models trained on synthetic data this way can produce high quality audio displaying accent transfer, while preserving speaker characteristics such as speaking style. View details
    Preview abstract Streaming automatic speech recognition (ASR) aims to output each hypothesized word as quickly and accurately as possible. However, reducing latency while retaining accuracy is highly challenging. Existing approaches including Early and Late Penalties~\cite{li2020towards} and Constrained Alignment~\cite{sainath2020emitting} penalize emission delay by manipulating per-token or per-frame RNN-T output logits. While being successful in reducing latency, these approaches lead to significant accuracy degradation. In this work, we propose a sequence-level emission regularization technique, named FastEmit, that applies emission latency regularization directly on the transducer forward-backward probabilities. We demonstrate that FastEmit is more suitable to the sequence-level transducer~\cite{Graves12} training objective for streaming ASR networks. We apply FastEmit on various end-to-end (E2E) ASR networks including RNN-Transducer~\cite{Ryan19}, Transformer-Transducer~\cite{zhang2020transformer}, ConvNet-Transducer~\cite{han2020contextnet} and Conformer-Transducer~\cite{gulati2020conformer}, and achieve 150-300ms latency reduction over previous art without accuracy degradation on a Voice Search test set. FastEmit also improves streaming ASR accuracy from 4.4%/8.9% to 3.1%/7.5% WER, meanwhile reduces 90th percentile latency from 210 ms to only 30 ms on LibriSpeech. View details
    Preview abstract Streaming automatic speech recognition (ASR) aims at emitting each recognized word shortly as they are spoken, while full-context ASR encodes an entire speech sequence before decoding texts. In this work, we propose a unified framework, Universal ASR, to train a single end-to-end ASR model with shared weights for both streaming and full-context speech recognition. More importantly, we show that the latency and accuracy of streaming ASR significantly benefit from weight sharing and joint training of full-context ASR, especially with inplace knowledge distillation. Universal ASR framework is network-agnostic, and can be applied to recent state-of-the-art convolution-based and transformer-based end-to-end ASR networks. We present extensive experiments on both research dataset LibriSpeech and mega-scale internal dataset MultiDomain with two state-of-the-art ASR networks ContextNet and Conformer. Experiments and ablation studies demonstrate that Universal ASR not only simplifies the workflow of training and deploying streaming and full-context ASR models, but also significantly improves both emission latency and recognition accuracy of streaming ASR. View details
    Preview abstract On-device end-to-end (E2E) models have shown improvementsover a conventional model on Search test sets in both quality, as measured by Word Error Rate (WER), and latency, measured by the time the result is finalized after the user stops speaking. However, the E2E model is trained on a small fraction of audio-text pairs compared to the 100 billion text utterances that a conventional language model (LM) is trained with. Thus E2E models perform poorly on rare words and phrases. In this paper, building upon the two-pass streaming Cascaded Encoder E2E model, we explore using a Hybrid Autoregressive Transducer (HAT) factorization to better integrate an on-device neural LM trained on text-only data. Furthermore, to further improve decoder latency we introduce a non-recurrent embedding decoder, in place of the typical LSTM decoder, into the Cascaded Encoder model. Overall, we present a streaming on-device model that incorporates an external neural LM and outperforms the conventional model in both search and rare-word quality, as well as latency, and is 318X smaller. View details
    Interpretable Ranking with Generalized Additive Models
    Alexander Grushetsky
    Petr Mitrichev
    Ethan Sterling
    Nathan Bell
    Walker Ravina
    Hai Qian
    Proceedings of the 14th ACM International Conference on Web Search and Data Mining (WSDM) (2021)
    Preview abstract Interpretability of ranking models is a crucial yet relatively under-examined research area. Recent progress on this area largely focuses on generating post-hoc explanations for existing black-box ranking models. Though promising, such post-hoc methods cannot provide sufficiently accurate explanations in general, which makes them infeasible in many high-stakes scenarios, especially the ones with legal or policy constraints. Thus, building an intrinsically interpretable ranking model with transparent, self-explainable structure becomes necessary, but this remains less explored in the learning-to-rank setting. In this paper, we lay the groundwork for intrinsically interpretable learning-to-rank by introducing generalized additive models (GAMs) into ranking tasks. Generalized additive models (GAMs) are intrinsically interpretable machine learning models and have been extensively studied on regression and classification tasks. We study how to extend GAMs into ranking models which can handle both item-level and list-level features and propose a novel formulation of ranking GAMs. To instantiate ranking GAMs, we employ neural networks instead of traditional splines or regression trees. We also show that our neural ranking GAMs can be distilled into a set of simple and compact piece-wise linear functions that are much more efficient to evaluate with little accuracy loss. We conduct experiments on three data sets and show that our proposed neural ranking GAMs can outperform other traditional GAM baselines while maintaining similar interpretability. View details
    Preview abstract This paper introduces a new encoder model for neural TTS. The proposed model, called PnG BERT, is augmented from the original BERT model, but taking both phoneme and grapheme representation of a text, as well as the word-level alignment between them, as its input. It can be pre-trained on a large text corpus in a self-supervised manner then fine-tuned in a TTS task. The experimental results suggest that PnG BERT can significantly further improve the performance of a state-of-the-art neural TTS model, by producing more appropriate prosody and more accurate pronunciation. Subjective side-by-side preference evaluation showed that raters had no statistically significant preference between the synthesized speech and the ground truth recordings from professional speakers. View details
    Preview abstract Although neural end-to-end text-to-speech models can synthesizehighly natural speech, there is still a room for improvements in itsefficiency during inference. This paper proposes a non-autoregressiveneural text-to-speech model augmented with a variational autoencoder-based residual encoder. This model, calledParallel Tacotron, is highlyparallelizable during both training and inference, allowing efficientsynthesis on modern parallel hardware. The use of the variationalautoencoder helps to relax the one-to-many mapping nature of thetext-to-speech problem. To further improve the naturalness, weintroduce an iterative spectrogram loss, which is inspired by iterativerefinement, and lightweight convolution, which can efficiently capturelocal contexts. Experimental results show that Parallel Tacotronmatches a strong autoregressive baseline in subjective naturalnesswith significantly decreased inference time. View details
    Preview abstract We present a state-of-the-art non-autoregressive Text-To-Speech model. The model called Parallel Tacotron 2 learns to synthesize speech with good quality without supervised duration signals and other assumptions about the token-to-frame mapping. Specifically, we introduce a novel learned attention mechanism and an iterative reconstruction loss based on Soft Dynamic Time Warping. We show that this new unsupervised model outperforms the baselines in naturalness in several diverse multi speaker evaluations. Further, we show that the explicit duration model that the model has learned can be used to control the synthesized speech. View details
    Preview abstract Thus far, end-to-end (E2E) models have not shown to outperform state-of-the-art conventional models with respect to both quality, i.e., word error rate (WER), and latency, i.e., the time the hypothesis is finalized after the user stops speaking. In this paper, we develop a first-pass Recurrent Neural Network Transducer (RNN-T) model and a second-pass Listen, Attend, Spell (LAS) rescorer that surpasses a conventional model in both quality and latency. On the quality side, we incorporate a large number of utterances across varied domains to increase acoustic diversity and the vocabulary seen by the model. We also train with accented English speech to make the model more robust to different pronunciations. In addition, given the increased amount of training data, we explore a varied learning rate schedule. On the latency front, we explore using the end-of-sentence decision emitted by the RNN-T model to close the microphone, and also introduce various optimizations to improve the speed of LAS rescoring. Overall, we find that RNN-T+LAS offers a better WER and latency tradeoff compared to a conventional model. For example, for the same latency, RNN-T+LAS obtains a 8% relative improvement in WER, while being more than 400-times smaller in model size. View details
    Preview abstract We propose a hierarchical, fine-grained and interpretable latent model for prosody based on the Tacotron~2. This model achieves multi-resolution modeling by conditioning finer level prosody representations on coarser level ones. In addition, the hierarchical conditioning is also imposed across all latent dimensions using a conditional VAE structure which exploits an auto-regressive structure. Reconstruction performance is evaluated with the $F_0$ frame error (FFE) and the mel-cepstral distortion (MCD) which illustrates the new structure does not degrade the model. Interpretations of prosody attributes are provided together with the comparison between word-level and phone-level prosody representations. Moreover, both qualitative and quantitative evaluations are used to demonstrate the improvement in the disentanglement of the latent dimensions. View details
    Conformer: Convolution-augmented Transformer for Speech Recognition
    Anmol Gulati
    James Qin
    Jiahui Yu
    Niki Parmar
    Ruoming Pang
    Shibo Wang
    Wei Han
    Yu Zhang
    Zhengdong Zhang
    (2020) (to appear)
    Preview abstract Recently end-to-end transformers and convolution neural networks have shown promising results in Automatic Speech Recognition (ASR), outperforming recurrent neural networks (RNNs). In this work, we study how to combine convolutions and transformers to model both global interactions and the local patterns of an audio sequence in a parameter-efficient way. We propose the convolution-augmented transformer for speech recognition, named \textit{Conformer}. \textit{Conformer} achieves state-of-the-art accuracies while being parameter-efficient, outperforming all previous models in ASR. On the widely used Librispeech benchmark, our model achieves WER of 2.1%/4.3% and 1.9%/3.9% with external language model. Our small sized model with 10M parameters achieves 2.7%/6.3%. View details
    Preview abstract Speech synthesis has advanced to the point of being close to indistinguishable from human speech. However, efforts to train speech recognition systems on synthesized utterances have not been able to show that synthesized data can be effectively used to augment or replace human speech. In this work, we demonstrate that promoting consistent predictions in response to real and synthesized speech enables significantly improved speech recognition performance. We also find that training on 460 hours of LibriSpeech augmented with 500 hours of transcripts (without audio) performance is within 0.2\% WER of a system trained on 960 hours of transcribed audio. This suggests that with this approach, when there is sufficient text available, reliance on transcribed audio can be cut nearly in half. View details
    Preview abstract This paper presents Non-Attentive Tacotron based on the Tacotron 2 text-to-speech model, where the attention mechanism is replaced with an explicit duration predictor. This improves robustness significantly as measured by unaligned duration ratio and word deletion rate, two new metrics introduced in this paper for large-scale robustness evaluation using a pre-trained speech recognition model. With the use of Gaussian upsampling, Non-Attentive Tacotron achieves a 5-scale mean opinion score in naturalness of 4.41, slightly outperforming Tacotron 2. The duration predictor enables both utterance-wide and per-phoneme control of duration at inference time. If accurate target duration are scarce or unavailable, it is still possible to train a duration predictor in a semi-supervised or unsupervised manner, with results almost as good as supervised training. View details
    ContextNet: Improving Convolutional Neural Networks for Automatic Speech Recognition with Global Context
    Wei Han
    Zhengdong Zhang
    Yu Zhang
    Jiahui Yu
    James Qin
    Anmol Gulati
    Ruoming Pang
    INTERSPEECH (2020) (to appear)
    Preview abstract Convolutional neural networks (CNN) have shown promising results for end-to-end speech recognition, albeit still behind other state-of-the-art methods in performance. In this paper, we study how to bridge this gap and go beyond with a novel CNN-RNN-transducer architecture, which we call ContextNet. ContextNet features a fully convolutional encoder that incorporates global context information into convolution layers by adding squeeze-and-excitation modules. In addition, we propose a simple scaling method that scales the widths of ContextNet that achieves good trade-off between computation and accuracy. We demonstrate that on the widely used LibriSpeech benchmark, ContextNet achieves a word error rate (WER) of 2.1%/4.6% without external language model (LM), 1.9%/4.1% with LM and 2.9%/7.0% with only 10M parameters on the clean/noisy LibriSpeech test sets. This compares to the previous best published system of 2.0%/4.6% with LM and 3.9%/11.3% with 20M parameters. The superiority of the proposed ContextNet model is also verified on a much larger internal dataset. View details
    Preview abstract Over the last few years two promising research directions in low-resource neural machine translation (NMT) have emerged. The first focuses on utilizing high-resource languages to improve the quality of low-resource languages via multilingual NMT. The second direction employs monolingual data with self-supervision to pre-train translation models, followed by fine-tuning on small amounts of supervised data. In this work, we join these two lines of research and demonstrate the efficacy of monolingual data with self-supervision in multilingual NMT. We offer three major results: (i) Using monolingual data significantly boosts the translation quality of low-resource languages in multilingual models. (ii) Self-supervision improves zero-shot translation quality in multilingual models. (iii) Leveraging monolingual data with self-supervision provides a viable path towards adding new languages to multilingual models, getting up to 28 BLEU on ro-en translation without any parallel data or back-translation. View details
    Preview abstract Recently proposed approaches for fine-grained prosody control of end-to-end text-to-speech samples enable precise control of the prosody of synthesized speech. Such models incorporate a fine-grained variational autoencoder (VAE) structure into a sequence-to-sequence model, extracting latent prosody features for each input token (e.g.\ phonemes). Generating samples using the standard VAE prior, an independent Gaussian at each time step, results in very unnatural and discontinuous speech, with dramatic variation between phonemes. In this paper we propose a sequential prior in the discrete latent space which can be used to generate more natural samples. This is accomplished by discretizing the latent prosody features using vector quantization, and training an autoregressive (AR) prior model over the result. The AR prior is learned separately from the training of the posterior. We evaluate the approach using subjective listening tests, objective metrics of automatic speech recognition (ASR) performance, as well as measurements of prosody attributes including volume, pitch, and phoneme duration. Compared to the fine-grained VAE baseline, the proposed model achieves equally good copy synthesis reconstruction performance, but significantly improves naturalness in sample generation. The diversity of the prosody in random samples better matches that of the real speech. Furthermore, initial experiments demonstrate that samples generated from the quantized latent sapce can be used as an effective data augmentation strategy to improve ASR performance. View details
    Preview abstract End-to-end (E2E) models fold the acoustic, pronunciation and language models of a conventional speech recognition model into one neural network with a much smaller number of parameters than a conventional ASR system, thus making it suitable for on-device applications. For example, Recurrent neural network transducer (RNN-T) as a streaming E2E model that has shown promising potential for on-device ASR. For such applications, quality and latency are two critical factors. We propose to reduce E2E model's latency by extending the RNN-T endpointer (RNN-T EP) model with additional early and late penalties. By further applying the minimum word error rate (MWER) training technique, we achieved 8.0% relative word error rate (WER) reduction and 130ms 90-percentile latency reduction on a Voice search test set. We also experimented with a second pass Listen, Attend and Spell (LAS) rescorer for the RNN-T EP model. Although it cannot directly improve the first pass latency, the large WER reduction actually give us more room to trade WER for latency. RNN-T+LAS, together with EMBR training brings in 17.3% relative WER reduction while maintaining similar 120ms 90-percentile latency reductions. View details
    Preview abstract End-to-end (E2E) models, which directly predict output character sequences given input speech, are good candidates for on-device speech recognition. E2E models, however, present numerous challenges: In order to be truly useful, such models must decode speech utterances in a streaming fashion, in real time; they must be robust to the long tail of use cases; they must be able to leverage user-specific context (e.g., contact lists); and above all, they must be extremely accurate. In this work, we describe our efforts at building an E2E speech recognizer using a recurrent neural network transducer. In experimental evaluations, we find that the proposed approach can outperform a conventional CTC-based model in terms of both latency and accuracy in a number of evaluation categories. View details
    Preview abstract End-to-end Speech Translation (ST) models have many potential advantages when compared to the cascade of Automatic Speech Recognition (ASR) and text Machine Translation (MT) models, including lowered inference latency and the avoidance of error compounding. However, the quality of end-to-end ST is often limited by a paucity of training data, since it is difficult to collect large parallel corpora of speech and translated transcript pairs. Previous studies have proposed the use of pre-trained components and multi-task learning in order to benefit from weakly supervised training data, such as speech-totranscript or text-to-foreign-text pairs. In this paper, we demonstrate that using pre-trained MT or text-to-speech (TTS) synthesis models to convert weakly supervised data into speech-to-translation pairs for ST training can be more effective than multi-task learning. Furthermore, we demonstrate that a high quality end-to-end ST model can be trained using only weakly supervised datasets, and that synthetic data sourced from unlabeled monolingual text or speech can be used to improve performance. Finally, we discuss methods for avoiding overfitting to synthetic speech with a quantitative ablation study. View details
    Preview abstract This paper introduces a new speech corpus called ``LibriTTS'' designed for text-to-speech use. It is derived from the original audio and text materials of the LibriSpeech corpus, which has been used for training and evaluating automatic speech recognition systems. The new corpus inherits desired properties of the LibriSpeech corpus while addressing a number of issues which make LibriSpeech less than ideal for text-to-speech work. The released corpus consists of 585 hours of speech data at 24kHz sampling rate from 2,456 speakers and the corresponding texts. Experimental results show that neural end-to-end TTS models trained from the LibriTTS corpus achieved above 4.0 in mean opinion scores in naturalness in five out of six evaluation speakers. The corpus is freely available for download from http://www.openslr.org/60/. View details
    Preview abstract Multilingual end-to-end (E2E) models have shown great promise as a means to expand coverage of the world’s lan- guages by automatic speech recognition systems. They im- prove over monolingual E2E systems, especially on low re- source languages, and simplify training and serving by elimi- nating language-specific acoustic, pronunciation, and language models. This work aims to develop an E2E multilingual system which is equipped to operate in low-latency interactive applica- tions as well as handle the challenges of real world imbalanced data. First, we present a streaming E2E multilingual model. Second, we compare techniques to deal with imbalance across languages. We find that a combination of conditioning on a language vector and training language-specific adapter layers produces the best model. The resulting E2E multilingual model system achieves lower word error rate (WER) than state-of-the- art conventional monolingual models by at least 10% relative on every language. View details
    Preview abstract We present two end-to-end models: Audio-to-Byte (A2B) and Byte-to-Audio (B2A), for multilingual speech recognition and synthesis. Prior work has predominantly used characters, sub-words or words as the unit of choice to model text. These units are difficult to scale to languages with large vocabularies, particularly the case for multilingual processing. In this work, we model text via a sequence of unicode bytes. Bytes allow us to avoid large softmaxes in languages with large vocabularies, and share representations in multilingual models. We show that bytes are superior to grapheme characters over a wide variety of languages in end-to-end speech recognition. We also present an end-to-end multilingual model using unicode byte representations, which outperforms each respective single language baseline by 4~5\% relatively. Finally, we present an end-to-end multilingual speech synthesis model using unicode byte representations which also achieves state-of-the-art performance. View details
    Preview abstract We present an attention-based sequence-to-sequence neural network which can directly translate speech from one language into speech in another language, without relying on an intermediate text representation. The network is trained end-to-end, learning to map speech spectrograms into target spectrograms in another language, corresponding to the translated content (in a different canonical voice). We further demonstrate the ability to synthesize translated speech using the voice of the source speaker. We conduct experiments on two Spanish-to-English speech translation datasets, and find that the proposed model slightly underperforms a baseline cascade of a direct speech-to-text translation model and a text-to-speech synthesis model, demonstrating the feasibility of the approach on this very challenging task. View details
    Preview abstract The requirements for many applications of state-of-the-art speech recognition systems include not only low word error rate (WER) but also low latency. Specifically, for many use-cases, the system must be able to decode utterances in a streaming fashion and faster than real-time. Recently, a streaming recurrent neural network transducer (RNN-T) end-to-end (E2E) model has shown to be a good candidate for on-device speech recognition, with improved WER and latency metrics compared to conventional on-device models. However, this model still lags behind a large state-of-the-art conventional model in quality. On the other hand, a non-streaming E2E Listen, Attend and Spell (LAS) model has shown comparable quality to large conventional models. This work aims to bring the quality of an E2E streaming model closer to that of a conventional system by incorporating a LAS network as a second-pass component, while still abiding by latency constraints. Our proposed two-pass model achieves a 17%-22% relative reduction in WER compared to RNN-T alone and increases latency by a small fraction over RNN-T. View details
    Preview abstract Scaling up deep neural network capacity has been known as an effective approach to improving model quality for several different machine learning tasks. In many cases, increasing model capacity beyond the memory limit of a single accelerator has required developing special algorithms or infrastructure. These solutions are often architecture-specific and do not transfer to other tasks. To address the need for efficient and task-independent model parallelism, we introduce GPipe, a pipeline parallelism library that allows scaling any network that can be expressed as a sequence of layers. By pipelining different sub-sequences of layers on separate accelerators, GPipe provides the flexibility of scaling a variety of different networks to gigantic sizes efficiently. Moreover, GPipe utilizes a novel batch-splitting pipelining algorithm, resulting in almost linear speedup when a model is partitioned across multiple accelerators. We demonstrate the advantages of GPipe by training large-scale neural networks on two different tasks with distinct network architectures: (i) Image Classification: We train a 557-million-parameter AmoebaNet model and attain a top-1 accuracy of 84.4% on ImageNet-2012, (ii) Multilingual Neural Machine Translation: We train a single 6-billion-parameter, 128-layer Transformer model on a corpus spanning over 100 languages and achieve better quality than all bilingual models. View details
    Preview abstract We present a multispeaker, multilingual text-to-speech (TTS) synthesis model based on Tacotron that is able to produce high quality speech in multiple languages. Moreover, the model is able to transfer voices across languages, e.g. synthesize fluent Spanish speech using an English speaker's voice, without training on any bilingual or parallel examples. Such transfer works across distantly related languages, e.g. English and Mandarin. Critical to achieving this result are: 1. using a phonemic input representation to encourage sharing of model capacity across languages, and 2. incorporating an adversarial loss term to encourage the model to disentangle its representation of speaker identity (which is perfectly correlated with language in the training data) from the speech content. Further scaling up the model by training on multiple speakers of each language, and incorporating an autoencoding input to help stabilize attention during training, results in a model which can be used to consistently synthesize intelligible speech for training speakers in all languages seen during training, and in native or foreign accents. View details
    Preview abstract To leverage crowd-sourced data to train multi-speaker text-to-speech (TTS) models that can synthesize clean speech for all speakers, it is essential to learn disentangled representations which can independently control the speaker identity and background noise in generated signals. However, learning such representations can be challenging, due to the lack of labels describing the recording conditions of each training example, and the fact that speakers and recording conditions are often correlated, e.g. since users often make many recordings using the same equipment. This paper proposes three components to address this problem by: (1) formulating a conditional generative model with factorized latent variables, (2) using data augmentation to add noise that is not correlated with speaker identity and whose label is known during training, and (3) using adversarial factorization to improve disentanglement. Experimental results demonstrate that the proposed method can disentangle speaker and noise attributes even if they are correlated in the training data, and can be used to consistently synthesize clean speech for all speakers. Ablation studies verify the importance of each proposed component. View details
    Preview abstract We introduce our efforts towards building a universal neural machine translation (NMT) system capable of translating between any language pair. We set a milestone towards this goal by building a single massively multilingual NMT model handling 103 languages trained over 25 billion examples. Our system demonstrates effective transfer learning ability, significantly improving translation quality of low-resource languages, while keeping high-resource language translation quality on-par with competitive bilingual baselines. We provide in-depth analysis of various aspects of model building that are crucial to the quality and practicality towards universal NMT. While we prototype a high-quality universal translation system, our extensive empirical analysis exposes issues that need to be further addressed, and we suggest directions for future research. View details
    Hierarchical Generative Modeling for Controllable Speech Synthesis
    Wei-Ning Hsu
    Yu Zhang
    Yuxuan Wang
    Ye Jia
    Jonathan Shen
    Patrick Nguyen
    Ruoming Pang
    International Conference on Learning Representations (2019)
    Preview abstract This paper proposes a neural end-to-end text-to-speech model which can control latent attributes in the generation of speech, that are rarely annotated in the training data (e.g. speaking styles, accents, background noise level, and recording conditions). The model is formulated as a conditional generative model with two levels of hierarchical latent variables. The first level is a categorical variable, which represents attribute groups (e.g. clean/noisy) and provides interpretability. The second level, conditioned on the first, is a multivariate Gaussian variable, which characterizes specific attribute configurations (e.g. noise level, speaking rate) and enables disentangled fine-grained control over these attributes. This amounts to using a Gaussian mixture model (GMM) for the latent distribution. Extensive evaluation of the proposed model demonstrates its ability to control the aforementioned attributes. In particular, it is capable of consistently synthesizing high-quality clean speech regardless of the quality of the training data for the target speaker. View details
    Preview abstract Recent success of the Tacotron speech synthesis architecture and its variants in producing natural sounding multi-speaker synthesized speech has raised the exciting possibility of replacing expensive, manually transcribed, domain-specific, human speech that is used to train speech recognizers. The multi-speaker speech synthesis architecture can learn latent embedding spaces of prosody, speaker and style variations derived from input acoustic representations thereby allowing for manipulation of the synthesized speech. In this paper, we evaluate the feasibility of enhancing speech recognition performance using speech synthesis using two corpora from different domains. We explore algorithms to provide the necessary acoustic and lexical diversity needed for robust speech recognition. Finally, we demonstrate the feasibility of this approach as a data augmentation strategy for domain-transfer. View details
    Preview abstract In this paper, we present Smart Compose, a novel system for generating interactive, real-time suggestions in Gmail that assists users in writing mails by reducing repetitive typing. In the design and deployment of such a large-scale and complicated system, we faced several challenges including model selection, performance evaluation, serving and other practical issues. At the core of Smart Compose is a large-scale neural language model. We leveraged state-of-the-art machine learning techniques for language model training which enabled high-quality suggestion prediction, and constructed novel serving infrastructure for high-throughput and real-time inference. Experimental results show the effectiveness of our proposed system design and deployment approach. This system is currently being served in Gmail. View details
    Speech recognition for medical conversations
    Kat Chou
    Chris Co
    Navdeep Jaitly
    Diana Jaunzeikare
    Patrick Nguyen
    Ananth Sankar
    Justin Jesada Tansuwan
    Nathan Wan
    Frank Zhang
    Interspeech 2018 (2018)
    Preview abstract In this paper we document our experiences with developing speech recognition for Medical Transcription -- a system that automatically transcribes notes from doctor-patient conversations. Towards this goal, we built a system along two different methodological lines -- a Connectionist Temporal Classification (CTC) phoneme based model and a Listen Attend and Spell (LAS) model. To train these models we used a corpus of anonymized conversations representing approximately 14,000 hours of speech . Because of noisy transcripts and alignments in the corpus, a significant amount of effort was invested in data cleaning issues. We describe a two-stage strategy we followed for segmenting the data. The data cleanup and development of a matched language model was essential to the success of the CTC based models. The LAS based models, however were found to be resilient to alignment and transcript noise and did not require the use of language models. CTC models were able to achieve a word error rate of 20.1%, and the LAS models were able to achieve 18.5%. View details
    Preview abstract End-to-end models which are trained to directly output grapheme or word-piece targets have been demonstrated to be competitive with conventional speech recognition models. Such models do not require additional resources for decoding, and are typically much smaller than conventional models while makes them particularly attractive in the context of on-device speech recognition where both small memory footprint and low power consumption are critical. With these constraints in mind, in this work, we consider the problem of compressing end-to-end models with the goal of minimizing the number of model parameters without sacrificing model accuracy. We explore matrix factorization, knowledge distillation and parameter sparsity to determine the most effect method given a fixed parameter budget. View details
    Preview abstract While current state-of-the-art NMT models, both LSTM based and Transformers, are much deeper compared to their early counterparts, they are still shallow in comparison to convolutional models used for both text and vision applications. In this work we attempt to train significantly (2-3x) deeper transformer and BiLSTM encoders for machine translation. We propose a simple modification to the attention mechanism that eases the optimization of deeper models, and results in significant improvements on the benchmark WMT'14 English-German and WMT'15 Czech-English tasks for both architectures. View details
    Preview abstract Attention-based encoder-decoder architectures such as Listen, Attend, and Spell (LAS), subsume the acoustic, pronunciation and language model components of a traditional automatic speech recognition (ASR) system into a single neural network. In our previous work, we have shown that such architectures are comparable to state-of-the-art ASR systems on dictation tasks, but it was not clear if such architectures would be practical for more challenging tasks such as voice search. In this work, we explore a variety of structural and optimization improvements to our LAS model which significantly improve performance. On the structural side, we show that word piece models can be used instead of graphemes. We introduce a multi-head attention architecture, which offers improvements over the commonly-used single-head attention. On the optimization side, we explore techniques such as synchronous training, scheduled sampling, label smoothing, and minimum word error rate optimization, which are all shown to improve accuracy. We present results with a unidirectional LSTM encoder for streaming recognition. On a 12,500 hour voice search task, we find that the proposed changes improve the WER of the LAS system from 9.2% to 5.6%, while the best conventional system achieve 6.7% WER. We also test both models on a dictation dataset, and our model provide 4.1% WER while the conventional system provides 5% WER. View details
    Preview abstract Sequence-to-sequence models, such as attention-based models in automatic speech recognition (ASR), are typically trained to optimize the cross-entropy criterion which corresponds to improving the log-likelihood of the data. However, system performance is usually measured in terms of word error rate (WER), not log-likelihood. Traditional ASR systems benefit from discriminative sequence training which optimizes criteria such as the state-level minimum Bayes risk (sMBR) which are more closely related to WER. In the present work, we explore techniques to train attention-based models to directly minimize expected word error rate. We consider two loss functions which approximate the expected number of word errors: either by sampling from the model, or by using N-best lists of decoded hypotheses, which we find to be more effective than the sampling-based method. In experimental evaluations, we find that the proposed training procedure improves performance by up to 8.2% relative to the baseline system. This allows us to train grapheme-based, uni-directional attention-based models which match the performance of a traditional, state-of-the-art, discriminative sequence-trained system on a mobile voice-search task. View details
    Preview abstract Attention-based recurrent neural encoder-decoder models present an elegant solution to the automatic speech recognition problem. This approach folds the acoustic model, pronunciation model, and language model into a single network and requires only a parallel corpus of speech and text for training. However, unlike in conventional approaches that combine separate acoustic and language models, it is not clear how to use additional (unpaired) text. While there has been previous work on methods addressing this problem, a thorough comparison among methods is still lacking. In this paper, we compare a suite of past methods and some of our own proposed methods for using unpaired text data to improve encoder-decoder models. For evaluation, we use the medium-sized Switchboard data set and the large-scale Google voice search and dictation data sets. Our results confirm the benefits of using unpaired text across a range of methods and data sets. Surprisingly, for first-pass decoding, the rather simple approach of shallow fusion performs best across data sets. However, for Google data sets we find that cold fusion has a lower oracle error rate and outperforms other approaches after second-pass rescoring on the Google voice search data set. View details
    Preview abstract The past year has witnessed rapid advances in sequence-to-sequence (seq2seq) modeling for Machine Translation (MT). The classic RNN-based approaches to MT were first out-performed by the convolutional seq2seq model, which was then out-performed by the more recent Transformer model. Each of these new approaches consists of a fundamental architecture accompanied by a set of modeling and training techniques that are in principle applicable to other seq2seq architectures. In this paper, we tease apart the new architectures and their accompanying techniques in two ways. First, we identify several key modeling and training techniques, and apply them to the RNN architecture, yielding a new RNMT+ model that outperforms all of the three fundamental architectures on the benchmark WMT'14 English to French and English to German tasks. Second, we analyze the properties of each fundamental seq2seq architecture and devise new hybrid architectures intended to combine their strengths. Our hybrid models obtain further improvements, outperforming the RNMT+ model on both benchmark datasets. View details
    Natural TTS Synthesis By Conditioning WaveNet On Mel Spectrogram Predictions
    Jonathan Shen
    Ruoming Pang
    Mike Schuster
    Navdeep Jaitly
    Zongheng Yang
    Yu Zhang
    Yuxuan Wang
    Yannis Agiomyrgiannakis
    ICASSP (2018)
    Preview abstract This paper describes Tacotron 2, a neural network architecture for speech synthesis directly from text. The system is composed of a recurrent sequence-to-sequence feature prediction network that maps character embeddings to mel-scale spectrograms, followed by a modified WaveNet model acting as a vocoder to synthesize timedomain waveforms from those spectrograms. Our model achieves a mean opinion score (MOS) of 4.53 comparable to a MOS of 4.58 for professionally recorded speech. To validate our design choices, we present ablation studies of key components of our system and evaluate the impact of using mel spectrograms as the input to WaveNet instead of linguistic, duration, and F0 features. We further demonstrate that using a compact acoustic intermediate representation enables significant simplification of the WaveNet architecture. View details
    Preview abstract For decades, context-dependent phonemes have been the dominant sub-word unit for conventional acoustic modeling systems. This status quo has begun to be challenged recently by end-to-end models which seek to combine acoustic, pronunciation, and language model components into a single neural network. Such systems, which typically predict graphemes or words, simplify the recognition process since they remove the need for a separate expert-curated pronunciation lexicon to map from phoneme-based units to words. However, there has been little previous work comparing phoneme-based versus grapheme-based sub-word units in the end-to-end modeling framework, to determine whether the gains from such approaches are primarily due to the new probabilistic model, or from the joint learning of the various components with grapheme-based units. In this work, we conduct detailed experiments which are aimed at quantifying the value of phoneme-based pronunciation lexica in the context of end-to-end models. We examine phoneme-based end-to-end models, which are contrasted against grapheme-based ones on a large vocabulary English Voice-search task, where we find that graphemes do indeed outperform phoneme-based models. We also compare grapheme and phoneme-based end-to-end approaches on a multi-dialect English task, which once again confirm the superiority of graphemes, greatly simplifying the system for recognizing multiple dialects. View details
    Preview abstract Having an sequence-to-sequence model which can operate in an online fashion is important for streaming applications such as Voice Search. Neural transducer is a streaming sequence-to-sequence model, but has shown to degrade significantly in performance compared to non-streaming models such as Listen, Attend and Spell (LAS). In this paper, we present various improvements to NT. Specifically, we look at increasing the window over which NT computes attention, mainly by looking backwards in time so the model still remains online. In addition, we explore initializing a NT model from a LAS-trained model so that it is guided with a better alignment. Finally. we explore including stronger language models such as using wordpiece models, and applying an external LM during the beam search. On a Voice Search task, we find with these improvements we can get NT to match the performance of LAS. View details
    Preview abstract We describe a neural network-based system for text-to-speech (TTS) synthesis that is able to generate speech audio in the voice of many different speakers, including those unseen during training. Our system consists of three independently trained components: (1) a speaker encoder network, trained on a speaker verification task using an independent dataset of noisy speech from thousands of speakers without transcripts, to generate a fixed-dimensional embedding vector from seconds of reference speech from a target speaker; (2) a sequence-to-sequence synthesis network based on Tacotron 2, which generates a mel spectrogram from text, conditioned on the speaker embedding; (3) an auto-regressive WaveNet-based vocoder that converts the mel spectrogram into a sequence of time domain waveform samples. We demonstrate that the proposed model is able to transfer the knowledge of speaker variability learned by the discriminatively-trained speaker encoder to the new task, and is able to synthesize natural speech from speakers that were not seen during training. We quantify the importance of training the speaker encoder on a large and diverse speaker set in order to obtain the best generalization performance. Finally, we show that randomly sampled speaker embeddings can be used to synthesize speech in the voice of novel speakers dissimilar from those used in training, indicating that the model has learned a high quality speaker representation. View details
    Preview abstract Attention-based sequence-to-sequence models for automatic speech recognition jointly train an acoustic model, language model, and alignment mechanism. Thus, the language model component is only trained on transcribed audio-text pairs. This leads to the use of shallow fusion with an external language model at inference time. Shallow fusion refers to log-linear interpolation with a separately trained language model at each step of the beam search. In this work, we investigate the behavior of shallow fusion across a range of conditions: different types of language models, different decoding units, and different tasks. On Google Voice Search, we demonstrate that the use of shallow fusion with an neural LM with wordpieces yields a 9.1% relative word error rate reduction (WERR) over our competitive attention-based sequence-to-sequence model, obviating the need for second-pass rescoring. View details
    Preview abstract We present a recurrent encoder-decoder deep neural network architecture that directly translates speech in one language into text in another. The model does not explicitly transcribe the speech into text in the source language, nor does it require supervision from the ground truth source language transcription during training. We apply a slightly modified sequence-to-sequence with attention architecture that has previously been used for speech recognition and show that it can be repurposed for this more complex task, illustrating the power of attention-based models. A single model trained end-to-end obtains state-of-the-art performance on the Fisher Callhome Spanish-English speech translation task, outperforming a cascade of independently trained sequence-to-sequence speech recognition and machine translation models by 1.8 BLEU points on the Fisher test set. In addition, we find that making use of the training data in both languages by multi-task training sequence-to-sequence speech translation and recognition models with a shared encoder network can improve performance by a further 1.4 BLEU points. View details
    Preview abstract A text-to-speech synthesis system typically consists of multiple stages, such as a text analysis frontend, an acoustic model and an audio synthesis module. Building these components often requires extensive domain expertise and may contain brittle design choices. In this paper, we present Tacotron, an end-to-end generative text-to-speech model that synthesizes speech directly from characters. Given (text, audio) pairs, the model can be trained completely from scratch with random initialization. We present several key techniques to make the sequence-to-sequence framework perform well for this challenging task. Tacotron achieves a 3.82 subjective 5-scale mean opinion score on US English, outperforming a production parametric system in terms of naturalness. In addition, since Tacotron generates speech at the frame level, it's substantially faster than sample-level autoregressive methods. View details
    Preview abstract We propose a simple, elegant solution to use a single Neural Machine Translation (NMT) model to translate between multiple languages. Our solution requires no change in the model architecture from our base system but instead introduces an artificial token at the beginning of the input sentence to specify the required target language. The rest of the model, which includes encoder, decoder and attention, remains unchanged and is shared across all languages. Using a shared wordpiece vocabulary, our approach enables Multilingual NMT using a single model without any increase in parameters, which is significantly simpler than previous proposals for Multilingual NMT. Our method often improves the translation quality of all involved language pairs, even while keeping the total number of model parameters constant. On the WMT'14 benchmarks, a single multilingual model achieves comparable performance for English->French and surpasses state-of-the-art results for English->German. Similarly, a single multilingual model surpasses state-of-the-art results for French->English and German->English on WMT'14 and WMT'15 benchmarks respectively. On production corpora, multilingual models of up to twelve language pairs allow for better translation of many individual pairs. In addition to improving the translation quality of language pairs that the model was trained with, our models can also learn to perform implicit bridging between language pairs never seen explicitly during training, showing that transfer learning and zero-shot translation is possible for neural translation. Finally, we show analyses that hints at a universal interlingua representation in our models and show some interesting examples when mixing languages. View details
    Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation
    Mike Schuster
    Mohammad Norouzi
    Maxim Krikun
    Qin Gao
    Apurva Shah
    Xiaobing Liu
    Łukasz Kaiser
    Stephan Gouws
    Taku Kudo
    Keith Stevens
    George Kurian
    Nishant Patil
    Wei Wang
    Jason Smith
    Alex Rudnick
    Macduff Hughes
    CoRR, vol. abs/1609.08144 (2016)
    Preview abstract Neural Machine Translation (NMT) is an end-to-end learning approach for automated translation, with the potential to overcome many of the weaknesses of conventional phrase-based translation systems. Unfortunately, NMT systems are known to be computationally expensive both in training and in translation inference. Also, most NMT systems have difficulty with rare words. These issues have hindered NMT's use in practical deployments and services, where both accuracy and speed are essential. In this work, we present GNMT, Google's Neural Machine Translation system, which attempts to address many of these issues. Our model consists of a deep LSTM network with 8 encoder and 8 decoder layers using attention and residual connections. To improve parallelism and therefore decrease training time, our attention mechanism connects the bottom layer of the decoder to the top layer of the encoder. To accelerate the final translation speed, we employ low-precision arithmetic during inference computations. To improve handling of rare words, we divide words into a limited set of common sub-word units ("wordpieces") for both input and output. This method provides a good balance between the flexibility of "character"-delimited models and the efficiency of "word"-delimited models, naturally handles translation of rare words, and ultimately improves the overall accuracy of the system. Our beam search technique employs a length-normalization procedure and uses a coverage penalty, which encourages generation of an output sentence that is most likely to cover all the words in the source sentence. On the WMT'14 English-to-French and English-to-German benchmarks, GNMT achieves competitive results to state-of-the-art. Using a human side-by-side evaluation on a set of isolated simple sentences, it reduces translation errors by an average of 60% compared to Google's phrase-based production system. View details
    Exploring the limits of language modeling
    Rafal Jozefowicz
    Mike Schuster
    Noam Shazeer
    Google Inc. (2016)
    Preview abstract This paper shows recent advances for large scale neural language modeling, a task central to language understanding. Our goal is to show how well large neural language models can perform on a large LM benchmark corpus, for which we chose the One Billion Word Benchmark. Using various techniques, our best single model significantly improves state-of-the-art perplexity from 51.3 to 30.0, while an ensemble of models sets a new record by improving perplexity from 41.0 to 23.7. View details
    Preview abstract A key problem in structured output prediction is direct optimization of the task reward function that matters for test evaluation. This paper presents a simple and computationally efficient approach to incorporate task reward into a maximum likelihood framework. We establish a connection between the log-likelihood and regularized expected reward objectives, showing that at a zero temperature, they are approximately equivalent in the vicinity of the optimal solution. We show that optimal regularized expected reward is achieved when the conditional distribution of the outputs given the inputs is proportional to their exponentiated (temperature adjusted) rewards. Based on this observation, we optimize conditional log-probability of edited outputs that are sampled proportionally to their scaled exponentiated reward. We apply this framework to optimize edit distance in the output label space. Experiments on speech recognition and machine translation for neural sequence to sequence models show notable improvements over a maximum likelihood baseline by using edit distance augmented maximum likelihood. View details
    No Results Found