Yuan Zhang


I'm a Staff Research Scientist at Google, where I develop new machine learning methods that enable exciting new product capabilities. My areas of interest include UI control with LLMs, multimodality, retrieval-based and retrieval-augmented models, and semantic parsing.

I led collaborations with Assistant and Bard. Our work on retrieval-based models won the 2022 Assistant Tech Impact Awards.

Before coming to Google, I completed my Ph.D. at the Massachusetts Institute of Technology, where my advisor was Regina Barzilay. Most of my Ph.D. work focused on algorithms for syntactic parsing and transfer learning. Before that, I earned my Bachelor's degree in Computer Science from Tsinghua University.

Authored Publications
    In practical applications of semantic parsing, we occasionally want to control the behavior of the parser, such as making it output meaning representations in a new domain, or influencing the prediction on some queries toward certain patterns. While it is possible to fine-tune the parser on examples exhibiting the target behavior, a method that does not consume as much time or computational resources would be preferable. To this end, we propose a retrieval-augmented generative semantic parser (RAG-SP): given the input query, the parser retrieves relevant information from the retrieval index, augments the query with it, and then applies a generative model to produce an output. The augmented information acts as a soft influence on the generative model, and by manipulating the retrieval index or how the augmented query is constructed, we can manipulate the behavior of the parser. On the MTOP dataset, in addition to achieving state-of-the-art results on the standard setup, we show that RAG-SP can parse queries in a new domain or adapt the prediction toward the specified patterns without having to fine-tune the model. With some modifications, RAG-SP also performs well on the episodic few-shot setup on the SNIPS slot tagging dataset.
    Slot-filling is an essential component for building task-oriented dialog systems. In this work, we focus on the zero-shot slot-filling (ZSSF) problem, where the model needs to predict slots and their values given utterances from new domains with zero training data. Prior methods for ZSSF directly learn representations for slot descriptions and utterances for extracting slot fillers. However, there is ambiguity and loss of information in encoding the raw slot description, which can hurt the models' zero-shot capacity. To address this problem, we introduce QA-driven slot filling (QASF), which extracts slot-filler spans from utterances with a span-based QA model. We use a linguistically motivated questioning strategy for turning the descriptions into questions, allowing the model to generalize to unseen slot types. Furthermore, our QASF model better utilizes weak supervision signals from QA pairs synthetically generated from conversations.
    Pre-trained seq2seq models are prevalent in semantic parsing, but have been found to struggle at out-of-distribution compositional generalization. In contrast, specialized model architectures have been proposed to address this issue, often at the cost of generality and in-distribution performance. In this paper, we propose a simple strategy to unlock compositionality of pre-trained seq2seq models through intermediate representations, without changing the model architectures at all. We identify several effective strategies for designing reversible and lossy intermediate representations that reduce the structural mismatch between inputs and outputs. We then apply either deterministic transformations or a second seq2seq to map the intermediate form to the original executable form. We find that the combination of our proposed transformations and pre-trained models is surprisingly effective, obtaining a new state-of-the-art on CFQ (+11.9 accuracy points) and on the template-splits of three text-to-SQL datasets (+15.0 to +19.4 accuracy points). This work highlights that intermediate representations provide an important (and potentially overlooked) degree of freedom for improving the compositional generalization abilities of pre-trained seq2seq models.
    We present a new problem: grounding natural language instructions to mobile user interface actions, and create three new datasets for it. For full task evaluation, we create PixelHelp, a corpus that pairs English instructions with actions performed by people on a mobile UI emulator. To scale training, we decouple the language and action data by (a) annotating action phrase spans in How-To instructions and (b) synthesizing grounded descriptions of actions for mobile user interfaces. We use a Transformer to extract action phrase tuples from long-range natural language instructions. A grounding Transformer then contextually represents UI objects using both their content and screen position and connects them to object descriptions. Given a starting screen and instruction, our model achieves 70.59% accuracy on predicting complete ground-truth action sequences in PixelHelp.
    Few-shot Slot Filling and Intent Classification with Retrieved Examples
    Dian Yu
    Ice Pasupat
    Qi Li
    Xinya Du
    2021 Annual Conference of the North American Chapter of the Association for Computational Linguistics (2020)
    Few-shot learning is an important problem in natural language understanding tasks due to scenarios such as inclusion of new domains and labels. In this paper, we explore retrieval-based methods for tackling the few-shot intent classification and slot filling tasks because of two advantages: 1) better adaptation to new domains, and 2) no need for model retraining with new labels. However, structured prediction beyond intent classification is challenging for retrieval-based methods. In this work, we propose a span-level retrieval method by learning similar contextualized representations for spans with the same label. At inference time, we use the labels of the retrieved spans to construct the final structure. We show that our method outperforms previous systems in the few-shot setting on the CLINC and SNIPS benchmarks.
    Large-scale representation learning from visually grounded untranscribed speech
    Gabriel Ilharco Magalhaes
    Proceedings of the Conference on Natural Language Learning (2019)
    Systems that learn from associating images with their spoken audio captions are an important step towards visually grounded language acquisition. We describe a scalable method of automatically generating diverse audio data from image caption datasets. This supports pre-training deep networks for encoding both audio and images, by training a dual encoder that learns to align latent representations of both modalities. We fine-tune these models on the Flickr8k Audio Captions Corpus and obtain state-of-the-art retrieval results, improving retrieval in the top 10 from 29.6% to 49.5%. We additionally obtain human ratings on model outputs to better assess the impact of incidentally matching image-caption pairs that were not associated in the data, and find that strict corpus-based evaluation substantially underestimates the quality of the retrieved results.
    Most existing work on adversarial data generation focuses only on English. For example, the PAWS (Paraphrase Adversaries from Word Scrambling) dataset consists of English examples for challenging paraphrase identification from Wikipedia and Quora. We remedy this gap with PAWS-X, a new dataset of 23,659 human-translated PAWS evaluation pairs in six typologically distinct languages: French, Spanish, German, Chinese, Japanese, and Korean. We provide baseline numbers for three models with different capacity to capture non-local context and structural word interaction, and using different multilingual training and evaluation regimes. The multilingual BERT model fine-tuned on PAWS English plus machine-translated data performs the best, with a range of 83.1-90.8 accuracy across the non-English languages and an average accuracy gain of 23% absolute over the best competing model. As such, PAWS-X shows the effectiveness of deep, multilingual pre-training while also leaving considerable headroom as a new challenging benchmark to drive multilingual research that better captures structure and contextual information.
    Existing paraphrase identification datasets lack sentence pairs that have high lexical overlap without being paraphrases. Models trained on such data fail to distinguish pairs like "flights from New York to Florida" and "flights from Florida to New York". This paper introduces PAWS (Paraphrase Adversaries from Word Scrambling), a new dataset with 108,463 well-formed paraphrase and non-paraphrase pairs with high lexical overlap. Challenging pairs are generated by controlled word swapping and back translation, followed by fluency and paraphrase judgments by human raters. State-of-the-art models trained on existing datasets have dismal performance on PAWS (<40% accuracy); however, including PAWS training data for these models improves their accuracy to 85% while maintaining performance on existing tasks. In contrast, models that do not capture non-local contextual information fail even with PAWS training examples. As such, PAWS provides an effective instrument for driving further progress on models that better exploit structure, context, and pairwise comparisons.
    We address the problem of fine-grained multilingual language identification: providing a language code for every token in a sentence, including codemixed text containing multiple languages. Such text is increasingly prevalent online, in documents, social media, and message boards. In this paper, we show that a feed-forward network with a simple globally constrained decoder can accurately and rapidly label both codemixed and monolingual text in 100 languages and 100 language pairs. This model outperforms previously published multilingual approaches in terms of both accuracy and speed, yielding an 800x speed-up and a 19.2% averaged absolute gain on three codemixed datasets.
    Points, Paths, and Playscapes: Large-scale Spatial Language Understanding Tasks Set in the Real World
    Daphne Luong
    Proceedings of the First International Workshop on Spatial Language Understanding, Association for Computational Linguistics, New Orleans, Louisiana, USA (2018), pp. 46-52
    Spatial language understanding is important for practical applications and as a building block for better abstract language understanding. Much progress has been made through work on understanding spatial relations and values in images and texts as well as on giving and following navigation instructions in restricted domains. We argue that the next big advances in spatial language understanding can be best supported by creating large-scale datasets that focus on points and paths based in the real world, and then extending these to create online, persistent playscapes that mix human and bot players. The bot players can begin play having undergone a prior training regime, but then must learn, evolve, and survive according to their depth of understanding of scenes, navigation, and interactions.
    Following Formulaic Map Instructions in a Street Simulation Environment
    Volkan Cirik
    Visually Grounded Interaction and Language Workshop (ViGIL) (2018)
    We introduce a task and a learning environment for following navigational instructions in Google Street View. We sample ∼100k routes in 100 regions in 10 U.S. cities. For each route, we obtain navigation instructions, build a connected graph of locations and the real-world images available at each location, and extract visual features. Evaluation of existing models shows that this setting offers a challenging benchmark for agents navigating with the help of language cues in real-world outdoor locations. The results also highlight the need to have start-of-path orientation descriptions and end-of-path goal descriptions as well as route descriptions.