Yuan Zhang
I'm a Staff Research Scientist at Google, where I develop new machine learning methods that enable new capabilities in products. My areas of interest include UI control with LLMs, multimodality, retrieval-based and retrieval-augmented models, and semantic parsing.
I have led collaborations with the Assistant and Bard teams. Our work on retrieval-based models won the 2022 Assistant Tech Impact Awards.
Before coming to Google, I completed my Ph.D. at the Massachusetts Institute of Technology, where my advisor was Regina Barzilay. Most of my Ph.D. work focused on algorithms for syntactic parsing and transfer learning. Before that, I earned my Bachelor's degree in Computer Science from Tsinghua University.
Authored Publications
Pre-trained seq2seq models are prevalent in semantic parsing, but have been found to struggle at out-of-distribution compositional generalization. In contrast, specialized model architectures have been proposed to address this issue, often at the cost of generality and in-distribution performance. In this paper, we propose a simple strategy to unlock compositionality of pre-trained seq2seq models through intermediate representations, without changing the model architectures at all. We identify several effective strategies for designing reversible and lossy intermediate representations that reduce the structural mismatch between inputs and outputs. We then apply either deterministic transformations or a second seq2seq model to map the intermediate form to the original executable form. We find that the combination of our proposed transformations and pre-trained models is surprisingly effective, obtaining a new state-of-the-art on CFQ (+11.9 accuracy points) and on the template splits of three text-to-SQL datasets (+15.0 to +19.4 accuracy points). This work highlights that intermediate representations provide an important (and potentially overlooked) degree of freedom for improving the compositional generalization abilities of pre-trained seq2seq models.
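To make the idea concrete, here is a minimal sketch of a reversible intermediate representation for a toy SQL target; the specific rewrite, function names, and example are illustrative assumptions rather than the transformations used in the paper.

# Hypothetical reversible transformation: move the WHERE clause to the front so
# the target's structure better mirrors questions that state the condition early,
# then map predictions back to executable SQL deterministically.

def to_intermediate(sql: str) -> str:
    """Rewrite 'SELECT ... WHERE ...' as 'WHERE ... THEN SELECT ...' (reversible)."""
    select_part, _, where_part = sql.partition(" WHERE ")
    if not where_part:
        return sql
    return f"WHERE {where_part} THEN {select_part}"

def from_intermediate(ir: str) -> str:
    """Deterministically undo the rewrite to recover the original executable form."""
    if not ir.startswith("WHERE "):
        return ir
    where_part, _, select_part = ir[len("WHERE "):].partition(" THEN ")
    return f"{select_part} WHERE {where_part}"

sql = "SELECT name FROM users WHERE age > 30"
assert from_intermediate(to_intermediate(sql)) == sql  # round-trip check

A lossy variant would instead drop detail from the intermediate form and rely on a second seq2seq model to reconstruct it.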
In practical applications of semantic parsing, we occasionally want to control the behavior of the parser, such as making it output meaning representations in a new domain, or influencing the prediction on some queries toward certain patterns. While it is possible to fine-tune the parser on examples exhibiting the target behavior, a method that does not consume as much time or computational resources would be preferable. To this end, we propose the retrieval-augmented generative semantic parser (RAG-SP): given the input query, the parser retrieves relevant information from the retrieval index, augments the query with it, and then applies a generative model to produce an output. The augmented information acts as a soft influence on the generative model, and by manipulating the retrieval index or how the augmented query is constructed, we can manipulate the behavior of the parser. On the MTOP dataset, in addition to achieving state-of-the-art results on the standard setup, we show that RAG-SP can parse queries in a new domain or adapt the prediction toward specified patterns without having to fine-tune the model. With some modifications, RAG-SP also performs well on the episodic few-shot setup of the SNIPS slot tagging dataset.
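The retrieve-augment-generate loop can be sketched as follows; the toy index, the overlap-based retriever, and the generate stub are assumptions for illustration, not the actual RAG-SP components.

# Illustrative retrieval-augmented parsing pipeline: retrieve similar exemplars,
# prepend them to the query as a soft hint, and let a generative model decode.

INDEX = [
    ("set an alarm for 7 am", "[IN:CREATE_ALARM [SL:DATE_TIME 7 am]]"),
    ("play jazz music", "[IN:PLAY_MUSIC [SL:MUSIC_GENRE jazz]]"),
]

def retrieve(query: str, k: int = 1):
    """Rank indexed (query, parse) exemplars by token overlap with the input."""
    q = set(query.lower().split())
    ranked = sorted(INDEX, key=lambda ex: -len(q & set(ex[0].lower().split())))
    return ranked[:k]

def augment(query: str, exemplars) -> str:
    """Build the augmented input that softly steers the generative model."""
    hints = " ; ".join(f"{q} => {p}" for q, p in exemplars)
    return f"{hints} ; parse: {query}"

def generate(augmented_query: str) -> str:
    """Stand-in for the seq2seq model that produces the meaning representation."""
    return "[IN:CREATE_ALARM [SL:DATE_TIME 6 pm]]"

exemplars = retrieve("set an alarm for 6 pm")
print(generate(augment("set an alarm for 6 pm", exemplars)))

Swapping the contents of INDEX (for example, adding exemplars from a new domain) changes the parser's behavior without retraining, which is the control knob described above.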
Slot filling is an essential component for building task-oriented dialog systems. In this work, we focus on the zero-shot slot-filling (ZSSF) problem, where the model needs to predict slots and their values given utterances from new domains with zero training data. Prior methods for ZSSF directly learn representations of slot descriptions and utterances for extracting slot fillers. However, there is ambiguity and loss of information in encoding the raw slot description, which can hurt the models' zero-shot capacity. To address this problem, we introduce QA-driven slot filling (QASF), which extracts slot-filler spans from utterances with a span-based QA model. We use a linguistically motivated questioning strategy to turn the descriptions into questions, allowing the model to generalize to unseen slot types. Furthermore, our QASF model better utilizes weak supervision signals from QA pairs synthetically generated from conversations.
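A small illustration of the QA-driven formulation follows; the question template and the stubbed span extractor are assumptions, whereas the real system relies on a trained span-based QA model.

# Turn a slot description into a question, then extract the answer span from
# the utterance. The extractor below is a hard-coded stand-in for a QA model.

def description_to_question(slot_description: str) -> str:
    """Linguistically motivated template, e.g. 'departure city' -> 'What is the departure city?'"""
    return f"What is the {slot_description}?"

def extract_span(question: str, utterance: str):
    """Stand-in for a span-based QA model; returns (start, end) token indices or None."""
    if "departure city" in question and "from" in utterance:
        tokens = utterance.split()
        start = tokens.index("from") + 1
        return (start, start + 1)
    return None

utterance = "book a flight from Boston to Denver"
span = extract_span(description_to_question("departure city"), utterance)
if span is not None:
    print(" ".join(utterance.split()[span[0]:span[1]]))  # -> Boston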
We present a new problem: grounding natural language instructions to mobile user interface actions, and create three new datasets for it. For full task evaluation, we create PixelHelp, a corpus that pairs English instructions with actions performed by people on a mobile UI emulator. To scale training, we decouple the language and action data by (a) annotating action phrase spans in How-To instructions and (b) synthesizing grounded descriptions of actions for mobile user interfaces. We use a Transformer to extract action phrase tuples from long-range natural language instructions. A grounding Transformer then contextually represents UI objects using both their content and screen position and connects them to object descriptions. Given a starting screen and instruction, our model achieves 70.59% accuracy on predicting complete ground-truth action sequences in PixelHelp.
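As a rough, hypothetical illustration of the grounding step, the snippet below scores candidate UI objects against an extracted object description using their text content plus a small screen-position prior; the features and weights are assumptions, not the grounding Transformer itself.

import numpy as np

def embed(text: str, dim: int = 32) -> np.ndarray:
    """Toy hash-based bag-of-words embedding (a real model would be learned)."""
    vec = np.zeros(dim)
    for tok in text.lower().split():
        vec[hash(tok) % dim] += 1.0
    return vec / (np.linalg.norm(vec) + 1e-8)

def score(description: str, ui_object: dict) -> float:
    """Combine content similarity with a small prior favoring higher-placed objects."""
    content_sim = float(embed(description) @ embed(ui_object["text"]))
    position_prior = 0.05 * (1.0 - ui_object["y"])  # y is a normalized screen position
    return content_sim + position_prior

ui_objects = [{"text": "Wi-Fi", "y": 0.2}, {"text": "Battery", "y": 0.4}]
print(max(ui_objects, key=lambda o: score("tap wi-fi settings", o))["text"])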
Few-shot Slot Filling and Intent Classification with Retrieved Examples
Dian Yu
Ice Pasupat
Qi Li
Xinya Du
Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL) (2021)
Few-shot learning is an important problem in natural language understanding tasks due to scenarios such as the inclusion of new domains and labels. In this paper, we explore retrieval-based methods for tackling the few-shot intent classification and slot filling tasks, due to their advantages of 1) better adaptation to new domains and 2) not requiring model retraining with new labels. However, structured prediction beyond intent classification is challenging for retrieval-based methods. In this work, we propose a span-level retrieval method that learns similar contextualized representations for spans with the same label. At inference time, we use the labels of the retrieved spans to construct the final structure. We show that our method outperforms previous systems in the few-shot setting on the CLINC and SNIPS benchmarks.
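A minimal sketch of span-level retrieval at inference time, assuming a toy span embedding and a hand-written support set in place of the trained retriever:

import numpy as np

def span_embedding(tokens, start, end, dim=16):
    """Toy context-free span embedding; the paper learns contextualized ones."""
    vec = np.zeros(dim)
    for tok in tokens[start:end]:
        vec[hash(tok.lower()) % dim] += 1.0
    return vec / (np.linalg.norm(vec) + 1e-8)

SUPPORT = [  # (tokens, span, label) triples from the few-shot support set
    (["play", "some", "jazz"], (2, 3), "genre"),
    (["wake", "me", "at", "7", "am"], (3, 5), "time"),
]

def label_span(tokens, start, end):
    """Copy the label of the most similar labeled span in the support set."""
    query = span_embedding(tokens, start, end)
    sims = [float(query @ span_embedding(t, s, e)) for t, (s, e), _ in SUPPORT]
    return SUPPORT[int(np.argmax(sims))][2]

print(label_span(["play", "jazz", "please"], 1, 2))  # -> genre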
Most existing work on adversarial data generation focuses only on English. For example, the PAWS (Paraphrase Adversaries from Word Scrambling) dataset consists of English examples for challenging paraphrase identification drawn from Wikipedia and Quora. We remedy this gap with PAWS-X, a new dataset of 23,659 human-translated PAWS evaluation pairs in six typologically distinct languages: French, Spanish, German, Chinese, Japanese, and Korean. We provide baseline numbers for three models with different capacities to capture non-local context and structural word interaction, using different multilingual training and evaluation regimes. The multilingual BERT model fine-tuned on PAWS English plus machine-translated data performs best, with a range of 83.1-90.8 accuracy across the non-English languages and an average accuracy gain of 23% absolute over the best competing model. As such, PAWS-X shows the effectiveness of deep, multilingual pre-training while also leaving considerable headroom as a new challenging benchmark to drive multilingual research that better captures structure and contextual information.
Existing paraphrase identification datasets lack sentence pairs that have high lexical overlap without being paraphrases. Models trained on such data fail to distinguish pairs like flights from New York to Florida and flights from Florida to New York. This paper introduces PAWS (Paraphrase Adversaries from Word Scrambling), a new dataset with 108,463 well-formed paraphrase and non-paraphrase pairs with high lexical overlap. Challenging pairs are generated by controlled word swapping and back translation, followed by fluency and paraphrase judgments by human raters. State-of-the-art models trained on existing datasets have dismal performance on PAWS (<40% accuracy); however, including PAWS training data for these models improves their accuracy to 85% while maintaining performance on existing tasks. In contrast, models that do not capture non-local contextual information fail even with PAWS training examples. As such, PAWS provides an effective instrument for driving further progress on models that better exploit structure, context, and pairwise comparisons.
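The controlled swapping step can be illustrated with a toy helper; the real generation pipeline also uses back translation and human judgments, so the snippet below is only a sketch of the word-scrambling idea.

def swap_spans(tokens, span_a, span_b):
    """Swap two non-overlapping token spans (span_a must come before span_b)."""
    (a0, a1), (b0, b1) = span_a, span_b
    return tokens[:a0] + tokens[b0:b1] + tokens[a1:b0] + tokens[a0:a1] + tokens[b1:]

tokens = "flights from New York to Florida".split()
print(" ".join(swap_spans(tokens, (2, 4), (5, 6))))
# -> "flights from Florida to New York": full lexical overlap, different meaning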
Large-scale representation learning from visually grounded untranscribed speech
Gabriel Ilharco Magalhaes
Proceedings of the Conference on Computational Natural Language Learning (CoNLL) (2019)
Systems that learn from associating images with their spoken audio captions are an important step towards visually grounded language acquisition. We describe a scalable method of automatically generating diverse audio data from image caption datasets. This supports pre-training deep networks for encoding both audio and images, by training a dual encoder that learns to align latent representations of both modalities. We fine-tune these models on the Flickr8k Audio Captions Corpus and obtain state-of-the-art retrieval results, improving retrieval in the top 10 from 29.6% to 49.5%. We additionally obtain human ratings on model outputs to better assess the impact of incidentally matching image-caption pairs that were not associated in the data, and find that strict corpus-based evaluation substantially underestimates the quality of the retrieved results.
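A hedged sketch of the dual-encoder alignment objective, with random arrays standing in for the audio and image encoders; the in-batch softmax loss shown is a generic formulation assumed for illustration, not necessarily the paper's exact objective.

import numpy as np

rng = np.random.default_rng(0)
audio_emb = rng.normal(size=(4, 64))  # stand-in for audio-encoder outputs
image_emb = rng.normal(size=(4, 64))  # stand-in for image-encoder outputs

def in_batch_softmax_loss(a: np.ndarray, b: np.ndarray) -> float:
    """Cross-entropy where the i-th audio clip should match the i-th image."""
    logits = a @ b.T                                     # pairwise similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

print(in_batch_softmax_loss(audio_emb, image_emb))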
We address the problem of fine-grained multilingual language identification: providing a language code for every token in a sentence, including codemixed text containing multiple languages. Such text is increasingly prevalent online, in documents, social media, and message boards. In this paper, we show that a feed-forward network with a simple globally constrained decoder can accurately and rapidly label both codemixed and monolingual text in 100 languages and 100 language pairs. This model outperforms previously published multilingual approaches in terms of both accuracy and speed, yielding an 800x speed-up and a 19.2% averaged absolute gain on three codemixed datasets.
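A toy version of per-token prediction with a simple global constraint, here limiting a sentence to a single language pair; the hand-written scores and brute-force search are illustrative assumptions, not the published model or decoder.

from itertools import combinations

# Per-token language scores (a real system computes these with a feed-forward network).
SCORES = [
    {"en": 0.9, "es": 0.1, "fr": 0.2},  # "I"
    {"en": 0.8, "es": 0.3, "fr": 0.1},  # "like"
    {"en": 0.2, "es": 0.9, "fr": 0.3},  # "tacos"
    {"en": 0.1, "es": 0.8, "fr": 0.4},  # "mucho"
]

def constrained_decode(scores):
    """Pick the best language pair globally, then label each token within that pair."""
    langs = sorted(scores[0])
    best_total, best_labels = float("-inf"), None
    for pair in combinations(langs, 2):
        labels = [max(pair, key=tok.get) for tok in scores]
        total = sum(tok[lang] for tok, lang in zip(scores, labels))
        if total > best_total:
            best_total, best_labels = total, labels
    return best_labels

print(constrained_decode(SCORES))  # -> ['en', 'en', 'es', 'es']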
Following Formulaic Map Instructions in a Street Simulation Environment
Volkan Cirik
Visually Grounded Interaction and Language Workshop (ViGIL) (2018)
We introduce a task and a learning environment for following navigational instructions in Google Street View. We sample ∼100k routes in 100 regions in 10 U.S. cities. For each route, we obtain navigation instructions, build a connected graph of locations and the real-world images available at each location, and extract visual features. Evaluation of existing models shows that this setting offers a challenging benchmark for agents navigating with the help of language cues in real-world outdoor locations. The results also highlight the need for start-of-path orientation descriptions and end-of-path goal descriptions in addition to route descriptions.