Yuan Cao
Authored Publications
VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners
Tao Zhu
Zirui Wang
Mi Zhang
Soham Ghosh
Jiahui Yu
arXiv (2023)
We explore an efficient approach to establish a foundational video-text model. We present VideoCoCa, which maximally reuses a pretrained image-text contrastive captioner (CoCa) model and adapts it to video-text tasks with minimal extra training. While previous works adapt image-text models with various cross-frame fusion modules, we find that the generative attentional pooling and contrastive attentional pooling layers in CoCa are instantly adaptable to flattened frame embeddings, yielding state-of-the-art results on zero-shot video classification and zero-shot text-to-video retrieval. Furthermore, we explore lightweight finetuning on top of VideoCoCa, and achieve strong results on video question-answering and video captioning.
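As a concrete illustration of the pooling reuse described above, here is a minimal PyTorch sketch: per-frame token embeddings are flattened along time and fed to an unchanged attentional pooler. The pooler class, query counts, and tensor sizes are illustrative assumptions, not CoCa's actual implementation.

import torch
import torch.nn as nn

class AttentionalPooler(nn.Module):
    # CoCa-style pooling: a set of learned queries attends over encoder tokens.
    def __init__(self, dim: int, n_queries: int, n_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, dim))
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, tokens):  # tokens: (batch, n_tokens, dim)
        q = self.queries.unsqueeze(0).expand(tokens.shape[0], -1, -1)
        pooled, _ = self.attn(q, tokens, tokens)
        return pooled

# The key move: frame embeddings are simply flattened along time, with no
# cross-frame fusion module, and the pretrained poolers are applied as-is.
batch, frames, tokens_per_frame, dim = 2, 8, 16, 512
frame_tokens = torch.randn(batch, frames, tokens_per_frame, dim)  # from the image encoder
flat = frame_tokens.reshape(batch, frames * tokens_per_frame, dim)

contrastive_pooler = AttentionalPooler(dim, n_queries=1)    # one query -> video embedding
generative_pooler = AttentionalPooler(dim, n_queries=256)   # token inputs for the captioner
video_embedding = contrastive_pooler(flat)  # (2, 1, 512), used for retrieval/classification
decoder_inputs = generative_pooler(flat)    # (2, 256, 512), used for captioning/QA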
Description-Driven Task-Oriented Dialog Modeling
Dian Yu
Mingqiu Wang
ACL (2022)
Task-oriented dialogue (TOD) systems are required to identify key information from conversations for the completion of given tasks. Such information is conventionally specified in terms of intents and slots contained in task-specific ontology or schemata. Since these schemata are designed by system developers, the naming convention for slots and intents is not uniform across tasks, and may not convey their semantics effectively. This can lead to models memorizing arbitrary patterns in data, resulting in suboptimal performance and generalization. In this paper, we propose that schemata should be modified by replacing names or notations entirely with natural language descriptions. We show that a language description-driven system exhibits better understanding of task specifications, higher performance on state tracking, improved data efficiency, and effective zero-shot transfer to unseen tasks. Following this paradigm, we present a simple yet effective Description-Driven Dialog State Tracking (D3ST) model, which relies purely on schema descriptions and an "index-picking" mechanism. We demonstrate the superior quality, data efficiency and robustness of our approach as measured on the MultiWOZ (Budzianowski et al., 2018), SGD (Rastogi et al., 2020), and the recent SGD-X (Lee et al., 2021) benchmarks.
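To make the proposal concrete, here is a hedged sketch of a description-driven input with index picking; the serialization format and example are illustrative assumptions, not the paper's verbatim format.

def build_d3st_prompt(slots, dialogue):
    # slots: {natural-language description: list of categorical values (empty if free-form)}
    parts = []
    for i, (desc, values) in enumerate(slots.items()):
        if values:  # categorical slot: index candidate values as 0a), 0b), ...
            opts = " ".join(f"{i}{chr(ord('a') + j)}) {v}" for j, v in enumerate(values))
            parts.append(f"{i}: {desc} {opts}")
        else:       # free-form slot: description only, never a slot name
            parts.append(f"{i}: {desc}")
    return " ".join(parts) + " [dialogue] " + dialogue

slots = {
    "the area of the hotel": ["north", "south", "east", "west"],
    "the name of the hotel": [],
}
print(build_d3st_prompt(slots, "user: find me a hotel in the north"))
# A seq2seq model trained on such inputs emits index-picked states such as
# "0: 0a", so tracking never depends on arbitrary slot naming conventions.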
With recent progress in joint modeling of visual and textual representations, Vision-Language Pretraining (VLP) has achieved impressive performance on many multimodal downstream tasks. However, the requirement for expensive annotations including clean image captions and regional labels limits the scalability of existing approaches, and complicates the pretraining procedure with the introduction of multiple dataset-specific objectives. In this work, we relax these constraints and present a minimalist pretraining framework, named Simple Visual Language Model (SimVLM). Unlike prior work, SimVLM reduces the training complexity by exploiting large-scale weak supervision, and is trained end-to-end with a single prefix language modeling objective. Without utilizing extra data or task-specific customization, the resulting model significantly outperforms previous pretraining methods and achieves new state-of-the-art results on a wide range of discriminative and generative vision-language benchmarks, including VQA (+3.74% vqa-score), NLVR2 (+1.17% accuracy), SNLI-VE (+1.37% accuracy) and image captioning tasks (+10.1% average CIDEr score). Furthermore, we demonstrate that SimVLM acquires strong generalization and transfer ability, enabling zero-shot behavior including open-ended visual question answering and cross-modality transfer.
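The single prefix language modeling objective can be sketched as a masked cross-entropy in which image patches and a text prefix receive no loss; the shapes and the helper itself are illustrative assumptions.

import torch
import torch.nn.functional as F

def prefix_lm_loss(logits, targets, prefix_len):
    # logits: (batch, seq, vocab); targets: (batch, seq).
    # Tokens before prefix_len (image patches plus the text prefix) are attended
    # to bidirectionally and excluded from the loss; only the suffix is predicted.
    per_token = F.cross_entropy(logits.transpose(1, 2), targets, reduction="none")
    positions = torch.arange(targets.shape[1])
    suffix_mask = (positions >= prefix_len).float()
    return (per_token * suffix_mask).sum() / suffix_mask.sum()

logits = torch.randn(4, 32, 1000)
targets = torch.randint(0, 1000, (4, 32))
print(prefix_lm_loss(logits, targets, prefix_len=12))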
SGD-X: A Benchmark for Robust Generalization in Schema-Guided Dialogue Systems
Bin Zhang
AAAI Conference on Artificial Intelligence, Association for the Advancement of Artificial Intelligence (2022)
Zero/few-shot transfer to unseen services is a critical challenge in task-oriented dialogue research. The Schema-Guided Dialogue (SGD) dataset introduced a paradigm for enabling models to support any service in zero-shot through schemas, which describe service APIs to models in natural language. We explore the robustness of dialogue systems to linguistic variations in schemas by designing SGD-X - a benchmark extending SGD with semantically similar yet stylistically diverse variants for every schema. We observe that two top state tracking models fail to generalize well across schema variants, measured by joint goal accuracy and a novel metric for measuring schema sensitivity. Additionally, we present a simple model-agnostic data augmentation method to improve schema robustness.
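The paper defines its schema-sensitivity metric at the turn level; as a rough, hedged approximation, one can read it as how much accuracy fluctuates across paraphrased variants of the same schema:

import numpy as np

def schema_sensitivity(jga_per_variant):
    # Coefficient of variation of joint goal accuracy across schema variants.
    # This simplified form is an assumption, not the paper's exact definition.
    jga = np.asarray(jga_per_variant, dtype=float)
    return jga.std() / jga.mean()

print(schema_sensitivity([0.86, 0.71, 0.64, 0.70, 0.59]))  # brittle to re-wordings
print(schema_sensitivity([0.74, 0.73, 0.75, 0.72, 0.74]))  # robust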
Building universal dialogue systems that operate across multiple domains/APIs and generalize to new ones with minimal overhead is a critical challenge. Recent works have leveraged natural language descriptions of schema elements to enable such systems; however, descriptions only indirectly convey schema semantics. In this work, we propose Show, Don't Tell, which prompts seq2seq models with a labeled example dialogue to show the semantics of schema elements rather than tell the model through descriptions. While requiring similar effort from service developers as generating descriptions, we show that using short examples as schema representations with large language models results in state-of-the-art performance on two popular dialogue state tracking benchmarks designed to measure zero-shot generalization - the Schema-Guided Dialogue dataset and the MultiWOZ leave-one-out benchmark.
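A hedged sketch of the "show" format follows: the schema is conveyed by a single labeled example dialogue rather than by descriptions. The serialization is an assumption for illustration.

def build_sdt_prompt(example_turn, example_state, dialogue):
    shown = "; ".join(f"{slot}={value}" for slot, value in example_state.items())
    return f"[example] {example_turn} [state] {shown} [dialogue] {dialogue}"

print(build_sdt_prompt(
    example_turn="user: book me a table for two at 7pm",
    example_state={"party_size": "two", "time": "7pm"},
    dialogue="user: we'll be four people, around noon",
))
# The seq2seq model imitates the demonstrated labeling, e.g. emitting
# "party_size=four; time=noon" for the new dialogue.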
Multilingual neural machine translation (NMT) typically learns to maximize the likelihood of training examples drawn from the combined set of multiple language pairs. However, this mechanical combination relies only on basic parameter sharing to learn the inductive bias, which undermines the generalization and transferability of multilingual NMT models. In this paper, we introduce a multilingual crossover encoder-decoder (mXEnDec) to fuse language pairs at the instance level and exploit cross-lingual signals. For better fusion of multilingual data, we propose several techniques to deal with language interpolation, dissimilar-language fusion, and heavy data imbalance. Experimental results on a large-scale WMT multilingual data set show that our approach significantly improves model performance on general multilingual test sets and model transferability on zero-shot test sets (up to +5.53 BLEU). Results on noisy inputs demonstrate the capability of our approach to improve model robustness against code-switching noise. We also conduct qualitative and quantitative representation comparisons to analyze the advantages of our approach at the representation level.
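A minimal sketch of instance-level fusion in the spirit of mXEnDec: two source sequences from different language pairs are combined into one training input. A hard token-level crossover on embedding sequences is shown; the paper's soft/hard fusion schemes and its treatment of length mismatch and data imbalance are more involved.

import numpy as np

def crossover_fuse(src_a, src_b, alpha=0.5, seed=0):
    # Randomly choose, per position, which parent sequence contributes.
    rng = np.random.default_rng(seed)
    length = min(len(src_a), len(src_b))
    mask = rng.random(length) < alpha
    return np.where(mask[:, None], src_b[:length], src_a[:length])

emb_a = np.random.randn(10, 512)  # e.g. embeddings of a French source sentence
emb_b = np.random.randn(12, 512)  # e.g. embeddings of a German source sentence
print(crossover_fuse(emb_a, emb_b).shape)  # (10, 512) crossover encoder input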
Building Machine Translation Systems for the Next Thousand Languages
Julia Kreutzer
Mengmeng Niu
Pallavi Nikhil Baljekar
Xavier Garcia
Maxim Krikun
Pidong Wang
Apu Shah
Macduff Richard Hughes
Google Research (2022)
Massively multilingual models subsuming tens or even hundreds of languages pose great challenges to multi-task optimization. While it is common practice to apply a language-agnostic procedure optimizing a joint multilingual task objective, how to properly characterize and take advantage of its underlying problem structure for improving optimization efficiency remains under-explored. In this paper, we attempt to peek into the black box of multilingual optimization through the lens of loss function geometry. We find that gradient similarity measured along the optimization trajectory is an important signal, which correlates well not only with language proximity but also with overall model performance. This observation helps us identify a critical limitation of existing gradient-based multi-task learning methods, and we thus derive a simple and scalable optimization procedure, named Gradient Vaccine, which encourages more geometrically aligned parameter updates for close tasks. Empirically, our method obtains significant performance gains on multilingual machine translation and XTREME benchmark tasks for multilingual language models. Our work reveals the importance of properly measuring and utilizing language proximity in multilingual optimization, and has broader implications for multi-task learning beyond multilingual modeling.
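The Gradient Vaccine update can be sketched as follows: when two tasks' gradients are less aligned than a target cosine similarity (in the paper, an exponential moving average of their observed similarity), one gradient is nudged toward the other just enough to reach the target. The closed-form coefficient follows the paper's derivation as best reconstructed here; treat it as illustrative.

import numpy as np

def gradient_vaccine_step(g_i, g_j, target_cos):
    cos = g_i @ g_j / (np.linalg.norm(g_i) * np.linalg.norm(g_j))
    if cos >= target_cos:
        return g_i  # already sufficiently aligned; leave untouched
    coef = (np.linalg.norm(g_i)
            * (target_cos * np.sqrt(1 - cos ** 2) - cos * np.sqrt(1 - target_cos ** 2))
            / (np.linalg.norm(g_j) * np.sqrt(1 - target_cos ** 2)))
    return g_i + coef * g_j  # nudged so that cos(g_i', g_j) hits the target

g_fr, g_de = np.random.randn(1000), np.random.randn(1000)  # per-language gradients
adjusted = gradient_vaccine_step(g_fr, g_de, target_cos=0.2)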
Encoder-decoder networks with attention have proven to be a powerful way to solve many sequence-to-sequence tasks. In these networks, attention aligns encoder and decoder states and is often used for visualizing network behavior. However, the mechanisms used by networks to generate appropriate attention matrices are still mysterious. Moreover, how these mechanisms vary depending on the particular architecture used for the encoder and decoder (recurrent, feed-forward, etc.) is also not well understood. In this work, we investigate how encoder-decoder networks solve different sequence-to-sequence tasks. We introduce a way of decomposing hidden states over a sequence into temporal (independent of input) and input-driven (independent of sequence position) components. This reveals how attention matrices are formed: depending on the task requirements, networks rely more heavily on either the temporal or input-driven components. These findings hold across both recurrent and feed-forward architectures despite their differences in forming the temporal components. Overall, our results provide new insight into the inner workings of attention-based encoder-decoder networks.
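The decomposition itself is simple to state in code. A minimal sketch with illustrative shapes: the temporal component is the average hidden state at each position across inputs, and the input-driven component is what remains after removing it.

import numpy as np

n_seqs, seq_len, dim = 128, 20, 64
h = np.random.randn(n_seqs, seq_len, dim)   # hidden states from many input sequences

temporal = h.mean(axis=0, keepdims=True)    # (1, seq_len, dim): position-only code
input_driven = (h - temporal).mean(axis=1)  # (n_seqs, dim): content-only code
residual = h - temporal - input_driven[:, None, :]

# Attention logits computed from such states then factor into a position-position
# term (temporal) and a content-content term (input-driven); which term dominates
# depends on the task, per the findings above.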
Federated learning is used for decentralized training of machine learning models on millions of edge mobile devices. This is challenging because these devices often have limited communication bandwidth and local computation resources. We exploit partially trainable neural networks, which freeze a portion of the model parameters during the entire training process, to reduce the communication cost with little impact on model performance. Through extensive experiments, we empirically show that Federated learning of Partially Trainable neural networks (FedPT) can result in good communication-accuracy trade-offs, with up to a 46x reduction in communication cost at a small accuracy cost. The proposed FedPT can be particularly interesting for pushing the limits of overparameterization in on-device learning.
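A minimal sketch of the partially trainable setup, under assumed module choices: freeze a subset of parameters at initialization, and communicate only the trainable ones each round.

import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))
for name, p in model.named_parameters():
    if name.startswith("0."):   # freeze the first layer for the entire run
        p.requires_grad = False

trainable = {n: p for n, p in model.named_parameters() if p.requires_grad}
frozen = sum(p.numel() for p in model.parameters() if not p.requires_grad)
sent = sum(p.numel() for p in trainable.values())
print(f"per-round upload: {sent} parameters instead of {sent + frozen}")
# Clients run local steps on `trainable` only; the server aggregates just these
# tensors, since all clients share the same frozen random weights.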
We present neural machine translation (NMT) models inspired by echo state networks (ESNs), named Echo State NMT (ESNMT), in which the encoder and decoder layer weights are randomly generated and then fixed throughout training. We show that even with this extremely simple model construction and training procedure, ESNMT can already reach 70-80% of the performance of fully trainable baselines. We examine how the spectral radius of the reservoir, a key quantity that characterizes the model, determines the model behavior. Our findings indicate that randomized networks can work well even for complicated sequence-to-sequence prediction NLP tasks.
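The construction reduces to scaling a fixed random matrix to a chosen spectral radius. A sketch with illustrative sizes:

import numpy as np

def reservoir_weights(dim, spectral_radius, seed=0):
    # Draw a random recurrent matrix, then rescale so its largest absolute
    # eigenvalue equals the desired spectral radius. In ESNMT such weights are
    # generated once and never updated by training.
    rng = np.random.default_rng(seed)
    w = rng.standard_normal((dim, dim)) / np.sqrt(dim)
    return w * (spectral_radius / np.max(np.abs(np.linalg.eigvals(w))))

w_rec = reservoir_weights(256, spectral_radius=0.9)  # radius < 1 keeps dynamics stable
print(np.max(np.abs(np.linalg.eigvals(w_rec))))      # ~0.9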
Despite the widespread application of recurrent neural networks (RNNs) across a variety of tasks, a unified understanding of how RNNs solve these tasks remains elusive. In particular, it is unclear what dynamical patterns arise in trained RNNs, and how those patterns depend on the training dataset or task. This work addresses these questions in the context of a specific natural language processing task: text classification. Using tools from dynamical systems analysis, we study recurrent networks trained on a battery of both natural and synthetic text classification tasks. We find the dynamics of these trained RNNs to be both interpretable and low-dimensional. Specifically, across architectures and datasets, RNNs accumulate evidence for each class as they process the text, using a low-dimensional attractor manifold as the underlying mechanism. Moreover, the dimensionality and geometry of the attractor manifold are determined by the structure of the training dataset; in particular, we describe how simple word-count statistics computed on the training dataset can be used to predict these properties. Our observations span multiple architectures and datasets, reflecting a common mechanism RNNs employ to perform text classification. To the degree that integration of evidence towards a decision is a common computational primitive, this work lays the foundation for using dynamical systems techniques to study the inner workings of RNNs.
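The low-dimensionality claim is straightforward to test. A sketch of the recipe, using random states purely as a stand-in for trajectories gathered from a trained classifier:

import numpy as np

states = np.random.randn(5000, 256)   # h_t collected across many texts (stand-in)
states = states - states.mean(axis=0)
_, s, _ = np.linalg.svd(states, full_matrices=False)
var_explained = np.cumsum(s ** 2) / np.sum(s ** 2)
print(var_explained[:3])
# For trained classifiers, a handful of components suffice: evidence for each
# class accumulates along a low-dimensional attractor manifold.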
We propose a hierarchical, fine-grained and interpretable latent model for prosody based on Tacotron 2. This model achieves multi-resolution modeling by conditioning finer-level prosody representations on coarser-level ones. In addition, hierarchical conditioning is also imposed across all latent dimensions using a conditional VAE structure that exploits an autoregressive decomposition. Reconstruction performance is evaluated with the F0 frame error (FFE) and the mel-cepstral distortion (MCD), which illustrate that the new structure does not degrade the model. Interpretations of prosody attributes are provided, together with a comparison between word-level and phone-level prosody representations. Moreover, both qualitative and quantitative evaluations are used to demonstrate the improvement in the disentanglement of the latent dimensions.
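A compact sketch of the coarse-to-fine conditioning, with assumed dimensions and a deliberately simplified parameterization (the paper's conditional VAE is richer):

import torch
import torch.nn as nn

class CoarseToFinePrior(nn.Module):
    # The prior over a fine (phone-level) prosody latent is predicted from the
    # parent coarse (word-level) latent, giving the multi-resolution hierarchy.
    def __init__(self, word_dim=16, phone_dim=8):
        super().__init__()
        self.to_stats = nn.Linear(word_dim, 2 * phone_dim)

    def forward(self, z_word):
        mean, log_var = self.to_stats(z_word).chunk(2, dim=-1)
        return mean + torch.randn_like(mean) * (0.5 * log_var).exp()

z_word = torch.randn(1, 16)             # coarse word-level prosody latent
z_phone = CoarseToFinePrior()(z_word)   # finer latent conditioned on it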
Generating diverse and natural text-to-speech samples using quantized fine-grained VAE and autoregressive prosody prior
Guangzhi Sun
Yu Zhang
ICASSP (2020)
Recently proposed approaches for fine-grained prosody control in end-to-end text-to-speech enable precise control of the prosody of synthesized speech. Such models incorporate a fine-grained variational autoencoder (VAE) structure into a sequence-to-sequence model, extracting latent prosody features for each input token (e.g., phonemes). Generating samples using the standard VAE prior, an independent Gaussian at each time step, results in very unnatural and discontinuous speech, with dramatic variation between phonemes. In this paper we propose a sequential prior in a discrete latent space which can be used to generate more natural samples. This is accomplished by discretizing the latent prosody features using vector quantization, and training an autoregressive (AR) prior model over the result. The AR prior is learned separately from the training of the posterior. We evaluate the approach using subjective listening tests, objective metrics of automatic speech recognition (ASR) performance, and measurements of prosody attributes including volume, pitch, and phoneme duration. Compared to the fine-grained VAE baseline, the proposed model achieves equally good copy-synthesis reconstruction performance, but significantly improves naturalness in sample generation. The diversity of the prosody in random samples better matches that of real speech. Furthermore, initial experiments demonstrate that samples generated from the quantized latent space can be used as an effective data augmentation strategy to improve ASR performance.
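A minimal sketch of the two ingredients, with assumed codebook size and dimensions: vector-quantize per-token prosody latents, then fit a separate autoregressive prior over the resulting code sequence.

import torch
import torch.nn as nn

codebook = nn.Embedding(256, 8)  # 256 discrete codes for 8-dim prosody latents

def quantize(z):  # z: (batch, steps, 8) posterior latents
    dists = ((z.unsqueeze(2) - codebook.weight) ** 2).sum(-1)
    return dists.argmin(dim=-1)  # one code index per time step

ar_prior = nn.LSTM(input_size=8, hidden_size=64, batch_first=True)
to_logits = nn.Linear(64, 256)

z = torch.randn(2, 40, 8)                      # latents from the (fixed) posterior
codes = quantize(z)
hidden, _ = ar_prior(codebook(codes)[:, :-1])  # teacher-forced prefix
loss = nn.functional.cross_entropy(
    to_logits(hidden).reshape(-1, 256), codes[:, 1:].reshape(-1))
# At synthesis time, sampling codes step by step from ar_prior replaces the
# independent per-step Gaussian prior that produced discontinuous speech.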
Leveraging Monolingual Data with Self-Supervision for Multilingual Neural Machine Translation
Naveen Ari
ACL 2020 (2020)
Over the last few years, two promising research directions in low-resource neural machine translation (NMT) have emerged. The first focuses on utilizing high-resource languages to improve the quality of low-resource languages via multilingual NMT. The second direction employs monolingual data with self-supervision to pre-train translation models, followed by fine-tuning on small amounts of supervised data. In this work, we join these two lines of research and demonstrate the efficacy of monolingual data with self-supervision in multilingual NMT. We offer three major results: (i) Using monolingual data significantly boosts the translation quality of low-resource languages in multilingual models. (ii) Self-supervision improves zero-shot translation quality in multilingual models. (iii) Leveraging monolingual data with self-supervision provides a viable path towards adding new languages to multilingual models, getting up to 28 BLEU on ro-en translation without any parallel data or back-translation.
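A hedged sketch of how monolingual text enters training, assuming a MASS-style span reconstruction (the span choice and fraction below are illustrative):

import random

def make_self_supervised_pair(tokens, mask_token="<mask>", span_frac=0.5):
    # Mask a contiguous span of a monolingual sentence; the model must
    # reconstruct the span, treated like a translation example.
    n = len(tokens)
    span = max(1, int(n * span_frac))
    start = random.randrange(0, n - span + 1)
    source = tokens[:start] + [mask_token] * span + tokens[start + span:]
    return source, tokens[start:start + span]

src, tgt = make_self_supervised_pair("casa frumoasa este pe deal".split())
print(src, "->", tgt)
# Training batches interleave such language-tagged monolingual examples with
# genuine parallel data, which is what allows adding a new language that has
# no parallel text at all.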
Bayesian inference in the parameter space of deep neural networks can be approximated by Gaussian processes (GPs). While the exact kernels of these GPs are known for a class of models, computing them for competitive architectures is often expensive or intractable. One can obtain approximations of these kernels through Monte Carlo estimation using finite networks at initialization. Monte Carlo neural network Gaussian process (NNGP) training and inference are orders of magnitude cheaper in FLOPs than their gradient-based counterparts when the dataset size is small. Since NNGP inference provides a cheap measure of performance of a network, we investigate its potential as a signal for neural architecture search (NAS). We compute the NNGP performance of approximately 423k networks in the NAS-Bench-101 dataset on CIFAR-10 and compare its utility against conventional performance measures obtained by shortened gradient-based training. We carry out a similar analysis on 10k randomly sampled networks in the mobile neural architecture search (MNAS) space for ImageNet. We discover comparative advantages of NNGP-based metrics and discuss potential applications. In particular, we propose that NNGP performance is an inexpensive signal, independent of metrics obtained from training, that can be used either for reducing big search spaces or for improving training-based performance measures.
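The Monte Carlo estimate is cheap to sketch: average the output-feature Gram matrix over several random initializations of a finite network, then do GP regression with the result. The one-layer ReLU features below stand in for a real candidate architecture; everything is illustrative.

import numpy as np

def mc_nngp_kernel(xs, n_samples=16, width=256, seed=0):
    rng = np.random.default_rng(seed)
    k = np.zeros((len(xs), len(xs)))
    for _ in range(n_samples):
        w = rng.standard_normal((xs.shape[1], width)) / np.sqrt(xs.shape[1])
        feats = np.maximum(xs @ w, 0.0)      # features at random initialization
        k += feats @ feats.T / width
    return k / n_samples

def nngp_predict(k_train, y_train, k_cross, ridge=1e-3):
    # GP regression with the estimated kernel: no gradient training anywhere.
    alpha = np.linalg.solve(k_train + ridge * np.eye(len(k_train)), y_train)
    return k_cross @ alpha

x, y = np.random.randn(64, 32), np.random.randn(64)
k = mc_nngp_kernel(x)
preds = nngp_predict(k[:48, :48], y[:48], k[48:, :48])
print(preds.shape)  # (16,) cheap performance proxy used to rank architectures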
Deciphering Undersegmented Ancient Scripts Using Phonetic Prior
Jiaming Luo
Frederik Hartmann
Enrico Santus
Regina Barzilay
Transactions of the Association for Computational Linguistics (2020) (to appear)
Most undeciphered lost languages exhibit two characteristics that pose significant decipherment challenges: (1) the scripts are not fully segmented into words; (2) the closest known language is not determined. We propose a neural decipherment model that handles both of these challenges by building on rich linguistic constraints reflecting consistent patterns in historical sound change. We capture the natural phonetic geometry by learning character embeddings based on the International Phonetic Alphabet (IPA). The resulting generative framework jointly models word segmentation and cognate alignment, informed by phonological constraints. We evaluate the model on both deciphered languages (Gothic, Ugaritic) and an undeciphered one (Iberian). The experiments show that incorporating phonetic geometry leads to clear and consistent gains. Additionally, we propose a measure for language closeness which correctly identifies related languages for Gothic and Ugaritic. For Iberian, the method does not show strong evidence supporting Basque as a related language, concurring with the position favored by current scholarship.
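The phonetic prior can be illustrated with a toy feature inventory: characters embedded through IPA articulatory features stay close when the sounds are close, which is what makes some historical correspondences cheap and others expensive. The inventory and the Jaccard similarity below are illustrative assumptions, not the model's learned embeddings.

IPA_FEATURES = {
    "p": {"labial", "stop", "voiceless"},
    "b": {"labial", "stop", "voiced"},
    "s": {"alveolar", "fricative", "voiceless"},
}

def phonetic_similarity(a, b):
    fa, fb = IPA_FEATURES[a], IPA_FEATURES[b]
    return len(fa & fb) / len(fa | fb)  # overlap of articulatory features

print(phonetic_similarity("p", "b"))  # high: a plausible sound change
print(phonetic_similarity("p", "s"))  # low: an unlikely correspondence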
Gmail Smart Compose: Real-Time Assisted Writing
Jackie Tsay
Justin Lu
Shuyuan Zhang
Tim Sohn
Yinan Wang
KDD 2019 (2019)
In this paper, we present Smart Compose, a novel system for generating interactive, real-time suggestions in Gmail that assists users in writing emails by reducing repetitive typing. In the design and deployment of such a large-scale and complicated system, we faced several challenges including model selection, performance evaluation, serving, and other practical issues. At the core of Smart Compose is a large-scale neural language model. We leveraged state-of-the-art machine learning techniques for language model training which enabled high-quality suggestion prediction, and constructed a novel serving infrastructure for high-throughput, real-time inference. Experimental results show the effectiveness of our proposed system design and deployment approach. This system is currently being served in Gmail.
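A toy sketch of the serving-time decision, with an assumed scoring interface: decode a completion conditioned on the mail typed so far, and surface it only above a confidence threshold. Production Smart Compose uses beam search over a large neural LM; everything below is illustrative.

def suggest_completion(lm_score, prefix_tokens, candidates, threshold=-1.0):
    # lm_score is assumed to return an average per-token log-probability.
    best, best_score = None, float("-inf")
    for cand in candidates:  # stand-in for beam search over the vocabulary
        score = lm_score(prefix_tokens + cand)
        if score > best_score:
            best, best_score = cand, score
    return best if best_score >= threshold else None  # suppress weak suggestions

toy_lm = lambda tokens: -0.1 * len(tokens)
print(suggest_completion(toy_lm, ["thanks", "for"],
                         [["your", "help"], ["the", "update"]]))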
Neural Decipherment via Minimum-Cost Flow: from Ugaritic to Linear B
Jiaming Luo
Regina Barzilay
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics (2019), 3146–3155
Hierarchical Generative Modeling for Controllable Speech Synthesis
Wei-Ning Hsu
Yu Zhang
Yuxuan Wang
Ye Jia
Jonathan Shen
Patrick Nguyen
Ruoming Pang
International Conference on Learning Representations (2019)
This paper proposes a neural end-to-end text-to-speech model that can control latent attributes of the generated speech which are rarely annotated in the training data (e.g., speaking styles, accents, background noise level, and recording conditions). The model is formulated as a conditional generative model with two levels of hierarchical latent variables. The first level is a categorical variable, which represents attribute groups (e.g., clean/noisy) and provides interpretability. The second level, conditioned on the first, is a multivariate Gaussian variable, which characterizes specific attribute configurations (e.g., noise level, speaking rate) and enables disentangled fine-grained control over these attributes. This amounts to using a Gaussian mixture model (GMM) for the latent distribution. Extensive evaluation of the proposed model demonstrates its ability to control the aforementioned attributes. In particular, it is capable of consistently synthesizing high-quality clean speech regardless of the quality of the training data for the target speaker.
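The two-level latent amounts to sampling from a Gaussian mixture. A minimal sketch with assumed sizes:

import torch
import torch.distributions as D

n_groups, latent_dim = 2, 16
mix = D.Categorical(logits=torch.zeros(n_groups))   # attribute group, e.g. clean/noisy
means = torch.randn(n_groups, latent_dim)           # per-group means (learned in practice)
scales = 0.5 * torch.ones(n_groups, latent_dim)     # per-group scales (learned in practice)

group = mix.sample()                                # interpretable categorical level
z = D.Normal(means[group], scales[group]).sample()  # fine-grained continuous level
# Conditioning the TTS decoder on (group, z) allows, e.g., fixing group=clean
# and sweeping individual dimensions of z to vary noise level or speaking rate.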
Massively Multilingual Neural Machine Translation in the Wild: Findings and Challenges
Dmitry (Dima) Lepikhin
George Foster
Maxim Krikun
Naveen Ari
(2019)
We introduce our efforts towards building a universal neural machine translation (NMT) system capable of translating between any language pair. We set a milestone towards this goal by building a single massively multilingual NMT model handling 103 languages, trained on over 25 billion examples. Our system demonstrates effective transfer learning, significantly improving the translation quality of low-resource languages while keeping high-resource language translation quality on par with competitive bilingual baselines. We provide in-depth analysis of various aspects of model building that are crucial to the quality and practicality of universal NMT. While we prototype a high-quality universal translation system, our extensive empirical analysis exposes issues that need to be further addressed, and we suggest directions for future research.
Leveraging Weakly Supervised Data to Improve End-to-End Speech-to-Text Translation
Ye Jia
Chung-Cheng Chiu
Naveen Ari
Stella Marie Laurenzo
ICASSP (2019)
End-to-end Speech Translation (ST) models have many potential advantages over the cascade of Automatic Speech Recognition (ASR) and text Machine Translation (MT) models, including lower inference latency and the avoidance of error compounding. However, the quality of end-to-end ST is often limited by a paucity of training data, since it is difficult to collect large parallel corpora of speech and translated transcript pairs. Previous studies have proposed the use of pre-trained components and multi-task learning in order to benefit from weakly supervised training data, such as speech-to-transcript or text-to-foreign-text pairs. In this paper, we demonstrate that using pre-trained MT or text-to-speech (TTS) synthesis models to convert weakly supervised data into speech-to-translation pairs for ST training can be more effective than multi-task learning. Furthermore, we demonstrate that a high-quality end-to-end ST model can be trained using only weakly supervised datasets, and that synthetic data sourced from unlabeled monolingual text or speech can be used to improve performance. Finally, we discuss methods for avoiding overfitting to synthetic speech with a quantitative ablation study.
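The data conversion is easy to express. A sketch with assumed stand-ins for the pre-trained systems (mt_translate and tts_synthesize are hypothetical callables, not real APIs):

def expand_weak_supervision(asr_pairs, mt_pairs, mt_translate, tts_synthesize):
    st_examples = []
    for speech, transcript in asr_pairs:       # speech-to-transcript data
        st_examples.append((speech, mt_translate(transcript)))
    for source_text, translation in mt_pairs:  # text-to-foreign-text data
        st_examples.append((tts_synthesize(source_text), translation))
    return st_examples  # direct (speech, translation) pairs for end-to-end ST

examples = expand_weak_supervision(
    asr_pairs=[("<speech_1>", "how are you")],
    mt_pairs=[("good morning", "guten Morgen")],
    mt_translate=lambda text: f"<mt:{text}>",
    tts_synthesize=lambda text: f"<tts:{text}>",
)
print(examples)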
While current state-of-the-art NMT models, both LSTM-based and Transformer-based, are much deeper than their early counterparts, they are still shallow in comparison to the convolutional models used for both text and vision applications. In this work we attempt to train significantly (2-3x) deeper Transformer and BiLSTM encoders for machine translation. We propose a simple modification to the attention mechanism that eases the optimization of deeper models, and results in significant improvements on the benchmark WMT'14 English-German and WMT'15 Czech-English tasks for both architectures.
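This line of work calls the modification transparent attention: the decoder attends to a learned softmax-weighted combination of all encoder layer outputs (one weight vector per decoder layer), giving gradients a direct path into every encoder layer. A sketch with illustrative layer counts:

import torch
import torch.nn as nn

class TransparentAttention(nn.Module):
    def __init__(self, n_enc_layers, n_dec_layers):
        super().__init__()
        # One mixing weight per (decoder layer, encoder layer incl. embeddings).
        self.logits = nn.Parameter(torch.zeros(n_dec_layers, n_enc_layers + 1))

    def forward(self, enc_states):  # list of (batch, src_len, dim) activations
        stacked = torch.stack(enc_states, dim=0)       # (L+1, B, S, D)
        weights = torch.softmax(self.logits, dim=-1)   # (dec_layers, L+1)
        return torch.einsum("dl,lbsd->dbsd", weights, stacked)

enc_states = [torch.randn(2, 7, 512) for _ in range(13)]  # embeddings + 12 layers
memories = TransparentAttention(12, 6)(enc_states)
print(memories.shape)  # (6, 2, 7, 512): each decoder layer gets its own mixture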
Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation
Mike Schuster
Mohammad Norouzi
Maxim Krikun
Qin Gao
Apurva Shah
Xiaobing Liu
Łukasz Kaiser
Stephan Gouws
Taku Kudo
Keith Stevens
George Kurian
Nishant Patil
Wei Wang
Jason Smith
Alex Rudnick
Macduff Hughes
CoRR, vol. abs/1609.08144 (2016)
Neural Machine Translation (NMT) is an end-to-end learning approach for automated translation, with the potential to overcome many of the weaknesses of conventional phrase-based translation systems. Unfortunately, NMT systems are known to be computationally expensive both in training and in translation inference. Also, most NMT systems have difficulty with rare words. These issues have hindered NMT's use in practical deployments and services, where both accuracy and speed are essential. In this work, we present GNMT, Google's Neural Machine Translation system, which attempts to address many of these issues. Our model consists of a deep LSTM network with 8 encoder and 8 decoder layers using attention and residual connections. To improve parallelism and therefore decrease training time, our attention mechanism connects the bottom layer of the decoder to the top layer of the encoder. To accelerate the final translation speed, we employ low-precision arithmetic during inference computations. To improve handling of rare words, we divide words into a limited set of common sub-word units ("wordpieces") for both input and output. This method provides a good balance between the flexibility of "character"-delimited models and the efficiency of "word"-delimited models, naturally handles translation of rare words, and ultimately improves the overall accuracy of the system. Our beam search technique employs a length-normalization procedure and uses a coverage penalty, which encourages generation of an output sentence that is most likely to cover all the words in the source sentence. On the WMT'14 English-to-French and English-to-German benchmarks, GNMT achieves results competitive with the state of the art. Using a human side-by-side evaluation on a set of isolated simple sentences, it reduces translation errors by an average of 60% compared to Google's phrase-based production system.
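The beam-scoring rule described above is given in the paper as s(Y, X) = log P(Y|X) / lp(Y) + cp(X; Y), with length normalization lp(Y) = (5 + |Y|)^a / (5 + 1)^a and a coverage penalty that rewards hypotheses whose attention collectively covers every source word. A direct transcription, with illustrative settings for the two knobs:

import math

def gnmt_score(log_prob, hyp_len, attention_sums, alpha=0.6, beta=0.2):
    # attention_sums[j]: total attention mass on source word j across all
    # target steps; alpha and beta here are illustrative settings.
    lp = ((5.0 + hyp_len) ** alpha) / ((5.0 + 1.0) ** alpha)
    cp = beta * sum(math.log(min(total, 1.0)) for total in attention_sums)
    return log_prob / lp + cp

print(gnmt_score(-7.5, hyp_len=9, attention_sums=[1.1, 0.9, 1.0, 0.4]))
print(gnmt_score(-7.5, hyp_len=9, attention_sums=[1.0, 1.0, 1.0, 1.0]))  # better coverage wins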