Yanqi Zhou
Yanqi Zhou is a research scientist at Google Brain, Mountain View, working with James Laudon. She received her Ph.D. from Princeton University, advised by David Wentzlaff. During her Ph.D. studies (2011-2017), she also collaborated extensively with Doug Burger and Karin Strauss at Microsoft Research. She obtained her bachelor's degree from the University of Michigan (2009-2011) and Shanghai Jiao Tong University (2007-2009). Her research interests lie in computer systems and machine learning. More specifically, Yanqi applies machine learning to design more efficient computer systems and builds large-scale deep learning models for speech and language tasks.
Authored Publications
Learning Large Graph Property Prediction via Graph Segment Training
Kaidi Cao
Mangpo Phothilimthana
Charith Mendis
Jure Leskovec
Advances in Neural Information Processing Systems (2023)
Abstract
Learning to predict properties of large graphs is challenging because each prediction requires the knowledge of an entire graph, while the amount of memory available during training is bounded. Here we propose Graph Segment Training (GST), a general framework that utilizes a divide-and-conquer approach to allow learning large graph property prediction with a constant memory footprint. GST first divides a large graph into segments and then backpropagates through only a few segments sampled per training iteration. We refine the GST paradigm by introducing a historical embedding table to efficiently obtain embeddings for segments not sampled for backpropagation. To mitigate the staleness of historical embeddings, we design two novel techniques. First, we finetune the prediction head to fix the input distribution shift. Second, we introduce Stale Embedding Dropout to drop some stale embeddings during training to reduce bias. We evaluate our complete method GST-EFD (with all the techniques together) on two large graph property prediction benchmarks: MalNet and TpuGraphs. Our experiments show that GST-EFD is both memory-efficient and fast, while offering a slight boost on test accuracy over a typical full graph training regime.
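To make the segment-sampling idea concrete, here is a minimal sketch (not the authors' implementation): graph segments are plain numpy arrays, a toy encoder stands in for the real model, only a few sampled segments are encoded with the current weights (the "backprop" path), the remaining segments reuse entries from a historical embedding table, and Stale Embedding Dropout randomly drops some of those stale entries. The encoder, prediction head, and hyperparameters are all made up.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(segment, w):
    """Toy segment encoder: mean of node features scaled by a weight vector."""
    return segment.mean(axis=0) * w

def gst_step(segments, w, history, k=2, stale_dropout=0.5, lr=0.1, target=1.0):
    """One schematic Graph Segment Training step.

    Only `k` sampled segments are encoded with the current weights; the rest
    reuse stale embeddings from the historical table, some of which are
    dropped (Stale Embedding Dropout) to reduce staleness bias.
    """
    n = len(segments)
    sampled = rng.choice(n, size=k, replace=False)

    embeddings = []
    for i in range(n):
        if i in sampled:
            emb = encode(segments[i], w)      # fresh embedding ("backprop" path)
            history[i] = emb                  # refresh the historical table
            embeddings.append(emb)
        elif rng.random() > stale_dropout:    # Stale Embedding Dropout
            embeddings.append(history[i])     # stale embedding, no gradient
    graph_emb = np.mean(embeddings, axis=0)

    # Toy prediction head and squared loss on a scalar graph property.
    pred = graph_emb.sum()
    loss = (pred - target) ** 2
    # Crude gradient estimate w.r.t. w, flowing only through the freshly
    # encoded segments (proportional to the exact gradient of this toy loss).
    grad = 2 * (pred - target) * np.mean(
        [segments[i].mean(axis=0) for i in sampled], axis=0)
    w -= lr * grad
    return loss, w

segments = [rng.normal(size=(5, 4)) for _ in range(6)]   # 6 segments, 5 nodes, 4 features
w = np.ones(4)
history = {i: np.zeros(4) for i in range(len(segments))}
for step in range(3):
    loss, w = gst_step(segments, w, history)
    print(f"step {step}: loss={loss:.3f}")
```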
Sparsely Activated Language Models are Efficient In-Context Learners
Barret Richard Zoph
Dmitry (Dima) Lepikhin
Emma Wang
Kathy Meier-Hellstern
Kun Zhang
Liam B. Fedus
Maarten Paul Bosma
Marie Pellat
Maxim Krikun
Nan Du
Simon Tong
Tao Wang
Toju Duke
Yuanzhong Xu
Zongwei Zhou
(2022)
Abstract
Scaling language models with more data, compute, and parameters has driven significant progress in natural language processing. For example, thanks to scaling, GPT-3 was able to achieve strong performance on few-shot learning. However, training these large dense models requires significant amounts of computing resources. In this paper, we develop a family of sparsely activated mixture-of-experts language models named GLaM (Generalist Language Model), which can have many more parameters but require significantly less training cost than dense models. The largest GLaM has 1.2 trillion parameters, approximately 7x larger than GPT-3, but can be trained more efficiently. With only 1/3 of the energy consumed to train GPT-3, GLaM achieves better overall performance on 29 zero-shot and one-shot NLP tasks. For example, GLaM achieves 75.0% one-shot exact-match accuracy on the TriviaQA test server, a significant improvement over the 68.0% obtained by GPT-3.
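A rough sketch of the sparse-activation mechanism such models build on: a top-2 gated mixture-of-experts feed-forward layer, where each token is processed by only two of the expert MLPs, so parameters can grow with the expert count while per-token compute stays roughly constant. This is a schematic numpy illustration, not GLaM's implementation; dimensions, gating details, and initialization are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

class Top2MoELayer:
    """Schematic top-2 gated mixture-of-experts feed-forward layer."""

    def __init__(self, d_model=8, d_hidden=16, num_experts=4):
        self.w_gate = rng.normal(scale=0.1, size=(d_model, num_experts))
        self.experts = [
            (rng.normal(scale=0.1, size=(d_model, d_hidden)),
             rng.normal(scale=0.1, size=(d_hidden, d_model)))
            for _ in range(num_experts)
        ]

    def __call__(self, x):                      # x: [num_tokens, d_model]
        gates = softmax(x @ self.w_gate)        # routing probabilities per token
        top2 = np.argsort(-gates, axis=-1)[:, :2]
        out = np.zeros_like(x)
        for t in range(x.shape[0]):             # route each token to its top-2 experts
            for e in top2[t]:
                w_in, w_out = self.experts[e]
                h = np.maximum(x[t] @ w_in, 0.0)        # expert MLP with ReLU
                out[t] += gates[t, e] * (h @ w_out)     # weight by (unnormalized) gate
        return out

layer = Top2MoELayer()
tokens = rng.normal(size=(3, 8))
print(layer(tokens).shape)   # (3, 8): same shape as a dense feed-forward layer
```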
Searching for Efficient Neural Architectures for On-Device ML on Edge TPUs
Anton Spiridonov
Hao Xu
Marie Charisse White
Ping Zhou
Suyog Gupta
Yun Long
Zhuo Wang
IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) (2022)
Abstract
On-device ML accelerators are becoming a standard in modern mobile systems-on-chip (SoCs). Neural architecture search (NAS) comes to the rescue for efficiently utilizing the high compute throughput offered by these accelerators. However, existing NAS frameworks have several practical limitations in scaling to multiple tasks and different target platforms. In this work, we provide a two-pronged approach to this challenge: (i) a NAS-enabling infrastructure that decouples model cost evaluation, search space design, and the NAS algorithm to rapidly target various on-device ML tasks, and (ii) search spaces crafted from group-convolution-based inverted bottleneck (IBN) variants that provide flexible quality/performance trade-offs on ML accelerators, complementing the existing full- and depthwise-convolution-based IBNs. Using this approach we target a state-of-the-art mobile platform, the Google Tensor SoC, and demonstrate neural architectures that improve the quality/performance Pareto frontier for various computer vision (classification, detection, segmentation) as well as natural language processing tasks.
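To illustrate the trade-off between the IBN variants mentioned above, here is a small parameter-count comparison of a depthwise IBN, a group-convolution IBN, and a full-convolution IBN. This is schematic only: channel counts, expansion factor, and kernel size are illustrative, and biases/batch norm are ignored.

```python
def ibn_params(c_in, c_out, expansion=6, kernel=3, groups=None):
    """Parameter count of an inverted bottleneck (IBN) block.

    groups=None -> classic depthwise IBN (1x1 expand, depthwise kxk, 1x1 project)
    groups=g    -> group-convolution IBN variant: the kxk conv uses g groups,
                   trading more parameters/compute for different accelerator behavior.
    """
    c_mid = c_in * expansion
    expand = c_in * c_mid                       # 1x1 expansion conv
    if groups is None:                          # depthwise: one kxk filter per channel
        spatial = c_mid * kernel * kernel
    else:                                       # grouped kxk conv over c_mid channels
        spatial = (c_mid // groups) * c_mid * kernel * kernel
    project = c_mid * c_out                     # 1x1 projection conv
    return expand + spatial + project

c_in, c_out = 64, 64
print("depthwise IBN params:     ", ibn_params(c_in, c_out))
print("group-conv IBN (g=4) params:", ibn_params(c_in, c_out, groups=4))
print("full-conv IBN (g=1) params: ", ibn_params(c_in, c_out, groups=1))
```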
LaMDA: Language Models for Dialog Applications
Aaron Daniel Cohen
Alena Butryna
Alicia Jin
Apoorv Kulshreshtha
Ben Zevenbergen
Chung-ching Chang
Cosmo Du
Daniel De Freitas Adiwardana
Dehao Chen
Dmitry (Dima) Lepikhin
Erin Hoffman-John
Igor Krivokon
James Qin
Jamie Hall
Joe Fenton
Johnny Soraker
Kathy Meier-Hellstern
Maarten Paul Bosma
Marc Joseph Pickett
Marcelo Amorim Menegali
Marian Croak
Maxim Krikun
Noam Shazeer
Rachel Bernstein
Ravi Rajakumar
Ray Kurzweil
Romal Thoppilan
Steven Zheng
Taylor Bos
Toju Duke
Tulsee Doshi
Vincent Y. Zhao
Will Rusch
Yuanzhong Xu
arXiv (2022)
Abstract
We present LaMDA: Language Models for Dialog Applications. LaMDA is a family of Transformer-based neural language models specialized for dialog, which have up to 137B parameters and are pre-trained on 1.56T words of public dialog data and web text. While model scaling alone can improve quality, it shows less improvement on safety and factual grounding. We demonstrate that fine-tuning with annotated data and enabling the model to consult external knowledge sources can lead to significant improvements towards the two key challenges of safety and factual grounding. The first challenge, safety, involves ensuring that the model's responses are consistent with a set of human values, such as preventing harmful suggestions and unfair bias. We quantify safety using a metric based on an illustrative set of values, and we find that filtering candidate responses using a LaMDA classifier fine-tuned with a small amount of crowdworker-annotated data offers a promising approach to improving model safety. The second challenge, factual grounding, involves enabling the model to consult external knowledge sources, such as an information retrieval system, a language translator, and a calculator. We quantify factuality using a groundedness metric, and we find that our approach enables the model to generate responses grounded in known sources, rather than responses that merely sound plausible. Finally, we explore the use of LaMDA in the domains of education and content recommendations, and analyze their helpfulness and role consistency.
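A toy sketch of the candidate-filtering step described above, assuming hypothetical safety and groundedness scorers in place of the fine-tuned LaMDA classifiers; this illustrates the recipe, not the actual system.

```python
def choose_response(candidates, safety_score, groundedness_score,
                    safety_threshold=0.8):
    """Schematic filter-then-rank step: drop candidates the safety scorer
    rates below a threshold, then prefer the most grounded remaining one."""
    safe = [c for c in candidates if safety_score(c) >= safety_threshold]
    if not safe:
        return "I'm not able to help with that."
    return max(safe, key=groundedness_score)

# Toy scorers standing in for fine-tuned classifiers.
candidates = ["Response A", "Response B citing a retrieved source", "Response C"]
safety = lambda c: 0.9                                  # all candidates pass safety here
grounded = lambda c: 1.0 if "source" in c else 0.2      # reward cited/retrieved content
print(choose_response(candidates, safety, grounded))
```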
A Flexible Approach to Autotuning Multi-Pass Machine Learning Compilers
Berkin Ilbeyi
Bjarke Roune
Blake Hechtman
Emma Wang
Karthik Srinivasa Murthy
Mangpo Phothilimthana
Mike Burrows
Nikhil Sarda
Rezsa Farahani
Samuel J. Kaufman
Shen Wang
Sudip Roy
Yuanzhong Xu
PACT (2021)
Abstract
Search-based techniques have been demonstrated effective in solving complex optimization problems that arise in domain-specific compilers for machine learning (ML). Unfortunately, deploying such techniques in production compilers is impeded by two limitations. First, prior works require factorization of a computation graph into smaller subgraphs over which search is applied. This decomposition is not only non-trivial but also significantly limits the scope of optimization. Second, prior works require search to be applied in a single stage in the compilation flow, which does not fit with the multi-stage layered architecture of most production ML compilers.
This paper presents XTAT, an autotuner for production ML compilers that can tune both graph-level and subgraph-level optimizations across multiple compilation stages. XTAT applies XTAT-M, a flexible search methodology that defines a search formulation for joint optimizations by accurately modeling the interactions between different compiler passes. XTAT tunes tensor layouts, operator fusion decisions, tile sizes, and code generation parameters in XLA, a production ML compiler, using various search strategies. In an evaluation across 150 ML training and inference models on Tensor Processing Units (TPUs) at Google, XTAT offers up to 2.4x and an average 5% execution time speedup over the heavily-optimized XLA compiler.
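As a simplified illustration of joint, multi-pass autotuning (not XTAT's actual search strategies, and not XLA's real options), the sketch below runs random search over a made-up configuration space spanning layout, fusion, and tile-size decisions, with a synthetic cost function standing in for compiling and measuring on hardware.

```python
import random

random.seed(0)

# Illustrative joint search space spanning several compiler passes;
# the names and values are invented for this example.
SEARCH_SPACE = {
    "layout":   ["row_major", "col_major"],
    "fuse_ops": [True, False],
    "tile_m":   [32, 64, 128],
    "tile_n":   [32, 64, 128],
}

def compile_and_measure(config):
    """Stand-in for compiling with `config` and timing the result on hardware.
    The synthetic runtime rewards one specific joint configuration, which is
    what makes tuning the passes jointly (not one at a time) pay off."""
    runtime = 10.0
    if config["layout"] == "col_major" and config["fuse_ops"]:
        runtime -= 3.0                               # layout and fusion interact
    if config["tile_m"] == 128 and config["tile_n"] == 64:
        runtime -= 2.0
    return runtime + random.uniform(0.0, 0.1)        # measurement noise

def random_search(trials=50):
    best_cfg, best_time = None, float("inf")
    for _ in range(trials):
        cfg = {k: random.choice(v) for k, v in SEARCH_SPACE.items()}
        t = compile_and_measure(cfg)
        if t < best_time:
            best_cfg, best_time = cfg, t
    return best_cfg, best_time

cfg, t = random_search()
print(f"best config {cfg} -> simulated runtime {t:.2f}")
```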
A Learned Performance Model for Tensor Processing Units
Charith Mendis
Mangpo Phothilimthana
Mike Burrows
Samuel J. Kaufman
Sudip Roy
MLSys (2021)
Abstract
Accurate hardware performance models are critical to efficient code generation. They can be used by compilers to make heuristic decisions, by superoptimizers as a minimization objective, or by autotuners to find an optimal configuration for a specific program. However, they are difficult to develop because contemporary processors are complex, and the recent proliferation of deep learning accelerators has increased the development burden. We demonstrate a method of learning performance models from a corpus of tensor computation graph programs for Tensor Processing Unit (TPU) instances. We show that our learned model outperforms a heavily-optimized analytical performance model on two tasks---tile-size selection and operator fusion---and that it helps an autotuner discover faster programs in a setting where access to TPUs is limited or expensive.
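A minimal sketch of the learned-cost-model idea under strong simplifications: hand-picked kernel features and a linear least-squares fit stand in for the paper's neural network over tensor computation graphs, and a synthetic runtime function stands in for TPU measurements. The fitted model is then used for tile-size selection without touching hardware.

```python
import numpy as np

rng = np.random.default_rng(0)

def featurize(kernel):
    """Toy hand-picked kernel features: FLOPs, bytes moved, 1/tile, bias."""
    return np.array([kernel["flops"], kernel["bytes"], 1.0 / kernel["tile"], 1.0])

def measured_runtime(k):
    """Synthetic stand-in for running the kernel on hardware and timing it."""
    return 1e-9 * k["flops"] + 5e-9 * k["bytes"] + 0.5 / k["tile"]

# Build a corpus of (kernel, measured runtime) pairs.
corpus = [{"flops": float(rng.integers(10**6, 10**8)),
           "bytes": float(rng.integers(10**5, 10**7)),
           "tile": float(rng.choice([16, 32, 64, 128]))}
          for _ in range(200)]
X = np.stack([featurize(k) for k in corpus])
y = np.array([measured_runtime(k) + rng.normal(scale=1e-3) for k in corpus])

# Fit a linear performance model by least squares.
theta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Use the learned model for tile-size selection on a new kernel.
kernel = {"flops": 5e7, "bytes": 2e6}
predictions = {t: float(featurize({**kernel, "tile": float(t)}) @ theta)
               for t in (16, 32, 64, 128)}
print("predicted runtimes:", {t: round(v, 4) for t, v in predictions.items()})
print("selected tile size:", min(predictions, key=predictions.get))
```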
Apollo: Transferable Architecture Exploration
Albin Jones
Ravi Narayanaswami
Sat Chatterjee
ML for Systems Workshop at NeurIPS 2020
Abstract
The looming end of Moore's Law and the ascending use of deep learning drive the design of custom accelerators that are optimized for specific neural architectures. Accelerator design forms a challenging constrained optimization problem over a complex, high-dimensional, and structured input space with a costly-to-evaluate objective function. Existing approaches for accelerator design are sample-inefficient and do not transfer knowledge between related optimization tasks with different design constraints (e.g., area budget) or neural architecture configurations. In this work, we propose a transferable architecture exploration framework, dubbed Apollo, that leverages recent advances in black-box function optimization for sample-efficient accelerator design. We use Apollo to optimize accelerator configurations of a diverse set of neural architectures with alternative design constraints. We show that Apollo finds optimal design configurations more sample-efficiently than baseline approaches. We further show that transferring knowledge between target architectures with different design constraints helps to find optimal configurations faster. This encouraging outcome portrays a promising path forward in shortening the timeline for accelerator design.
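A schematic sketch of constrained, transferable black-box design search. Random search is used here purely for brevity (Apollo itself relies on more sample-efficient optimizers), and the design space, area model, and latency simulator below are invented.

```python
import random

random.seed(0)

# Illustrative accelerator design space (not the real parameterization).
DESIGN_SPACE = {
    "pe_rows": [4, 8, 16, 32],
    "pe_cols": [4, 8, 16, 32],
    "sram_kb": [64, 128, 256, 512],
}

def area(cfg):
    """Toy area model in arbitrary units."""
    return cfg["pe_rows"] * cfg["pe_cols"] * 0.05 + cfg["sram_kb"] * 0.01

def simulate_latency(cfg, workload_flops):
    """Stand-in for a slow accelerator simulator: more PEs and SRAM help."""
    return workload_flops / (cfg["pe_rows"] * cfg["pe_cols"]) + 100.0 / cfg["sram_kb"]

def search(workload_flops, area_budget, trials=30, warm_start=None):
    """Constrained black-box search; `warm_start` transfers the best design
    found under a related task or constraint as the initial candidate."""
    best_cfg, best_lat = warm_start, float("inf")
    if warm_start is not None and area(warm_start) <= area_budget:
        best_lat = simulate_latency(warm_start, workload_flops)
    for _ in range(trials):
        cfg = {k: random.choice(v) for k, v in DESIGN_SPACE.items()}
        if area(cfg) > area_budget:               # reject designs over the area budget
            continue
        lat = simulate_latency(cfg, workload_flops)
        if lat < best_lat:
            best_cfg, best_lat = cfg, lat
    return best_cfg, best_lat

cfg_a, _ = search(workload_flops=1e6, area_budget=30.0)
cfg_b, lat_b = search(workload_flops=2e6, area_budget=20.0, warm_start=cfg_a)
print("transferred design:", cfg_b, "latency:", round(lat_b, 1))
```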
Graph Transformer: A Generalized Method for Computation Graph Optimizations
Amirali Abdolrashidi
Azalia Mirhoseini
Daniel Wong
Hanxiao Liu
Mangpo Phothilimthana
Qiumin Xu
Shen Wang
Sudip Roy
(2020)
Abstract
Runtime and scalability of neural networks can be significantly affected by computational graph optimization during compilation. Most existing automated graph optimizations are impractical for deployment due to the significant amount of compute required and their inability to generalize to new, previously held-out graphs. To address both limitations, we propose an end-to-end deep reinforcement learning method named Graph Transformer (GTf), based on a scalable sequential attention mechanism over an inductive graph neural network that is transferable to new, unseen graphs. GTf generates decisions on the entire graph in a single-shot fashion, rather than on each individual node progressively, drastically speeding up the search compared to prior methods. Moreover, we propose recurrent attention layers to jointly optimize dependent graph optimization tasks and demonstrate 33%-60% speedups on three graph optimization tasks compared to TensorFlow default optimizations. On a diverse set of representative graphs consisting of 1k-80k nodes, including Inception-v3, Transformer-XL, and WaveNet, GTf achieves an average 21% improvement over human experts and 18% improvement over the prior art with 15x faster convergence, on a device placement task evaluated in real systems.
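A toy numpy sketch of the single-shot idea for one task (device placement): one round of message passing produces node embeddings, and a shared head emits a device choice for every node in a single forward pass, rather than placing nodes one at a time. The architecture, sizes, and random weights are illustrative, not the paper's model.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def single_shot_placement(features, adjacency, w_msg, w_out):
    """Schematic single-shot placement policy: embed all nodes, then emit a
    device distribution for every node in one forward pass."""
    messages = adjacency @ features                       # aggregate neighbor features
    embeddings = np.tanh((features + messages) @ w_msg)   # node embeddings
    logits = embeddings @ w_out                           # [num_nodes, num_devices]
    probs = softmax(logits)
    return probs.argmax(axis=-1)                          # one device id per node

num_nodes, feat_dim, hidden, num_devices = 6, 4, 8, 2
features = rng.normal(size=(num_nodes, feat_dim))
adjacency = (rng.random((num_nodes, num_nodes)) < 0.3).astype(float)
w_msg = rng.normal(scale=0.5, size=(feat_dim, hidden))
w_out = rng.normal(scale=0.5, size=(hidden, num_devices))
print("placement:", single_shot_placement(features, adjacency, w_msg, w_out))
```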
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
Colin Raffel
Michael Matena
Noam Shazeer
Peter J. Liu
Sharan Narang
Wei Li
Google (2019)
Abstract
Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a lower-resource downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning for NLP by introducing a unified framework which casts every language problem as a text-to-text task. Our systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of text understanding tasks. By combining the insights gained in our exploration with scale and a new giant unlabeled text dataset, we achieve state-of-the-art results in most of the tasks we consider. To facilitate future work on text understanding, we release our dataset, pre-trained models, and code.
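A small illustration of the unified text-to-text framing: different tasks are mapped to (input text, target text) pairs via task prefixes. The prefixes follow the paper's convention; the example sentences are made up.

```python
def to_text_to_text(task, example):
    """Cast different NLP tasks into a single text-to-text format."""
    if task == "translation":
        return (f"translate English to German: {example['en']}", example["de"])
    if task == "sentiment":
        return (f"sst2 sentence: {example['sentence']}", example["label"])
    if task == "summarization":
        return (f"summarize: {example['document']}", example["summary"])
    raise ValueError(f"unknown task: {task}")

print(to_text_to_text("translation", {"en": "That is good.", "de": "Das ist gut."}))
print(to_text_to_text("sentiment", {"sentence": "great movie!", "label": "positive"}))
```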