Yanping Huang
Authored Publications
Sort By
SPAE: Semantic Pyramid AutoEncoder for Multimodal Generation with Frozen LLMs
Zhiruo Wang
Yonatan Bisk
Alex Hauptmann
Lu Jiang
NeurIPS (2023)
Preview abstract
In this work, we introduce Semantic Pyramid AutoEncoder (SPAE) for enabling frozen LLMs to perform both understanding and generation tasks involving non-linguistic modalities such as images or videos. SPAE converts between raw pixels and interpretable lexical tokens (or words) extracted from the LLM's vocabulary. The resulting tokens capture both the semantic meaning and the fine-grained details needed for visual reconstruction, effectively translating the visual content into a language comprehensible to the LLM, and empowering it to perform a wide array of multimodal tasks. Our approach is validated through in-context learning experiments with frozen PaLM 2 and GPT 3.5 on a diverse set of image understanding and generation tasks. Our method marks the first successful attempt to enable a frozen LLM to generate image content while surpassing state-of-the-art performance in image understanding tasks, under the same setting, by over 25%.
View details
Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning
Danyang Zhuo
Hao Zhang
Ion Stoica
Joseph E. Gonzalez
Lianmin Zheng
Yida Wang
Yonghao Zhuang
Yuanzhong Xu
Zhuohan Li
16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22), USENIX Association (2022), pp. 559-578
Preview abstract
Alpa automates model-parallel training of large deep learning (DL) models by generating execution plans that unify data, operator, and pipeline parallelism. Existing model-parallel training systems either require users to manually create a parallelization plan or automatically generate one from a limited space of model parallelism configurations. They do not suffice to scale out complex DL models on distributed compute devices. Alpa distributes the training of large DL models by viewing parallelisms as two hierarchical levels: inter-operator and intra-operator parallelisms. Based on it, Alpa constructs a new hierarchical space for massive model-parallel execution plans. Alpa designs a number of compilation passes to automatically derive efficient parallel execution plans at each parallelism level. Alpa implements an efficient runtime to orchestrate the two-level parallel execution on distributed compute devices. Our evaluation shows Alpa generates parallelization plans that match or outperform hand-tuned model-parallel training systems even on models they are designed for. Unlike specialized systems, Alpa also generalizes to models with heterogeneous architectures and models without manually-designed plans. Alpa's source code is publicly available at https://github.com/alpa-projects/alpa
View details
LaMDA: Language Models for Dialog Applications
Aaron Daniel Cohen
Alena Butryna
Alicia Jin
Apoorv Kulshreshtha
Ben Zevenbergen
Chung-ching Chang
Cosmo Du
Daniel De Freitas Adiwardana
Dehao Chen
Dmitry (Dima) Lepikhin
Erin Hoffman-John
Igor Krivokon
James Qin
Jamie Hall
Joe Fenton
Johnny Soraker
Maarten Paul Bosma
Marc Joseph Pickett
Marcelo Amorim Menegali
Marian Croak
Maxim Krikun
Noam Shazeer
Rachel Bernstein
Ravi Rajakumar
Ray Kurzweil
Romal Thoppilan
Steven Zheng
Taylor Bos
Toju Duke
Tulsee Doshi
Vincent Y. Zhao
Will Rusch
Yuanzhong Xu
arXiv (2022)
Preview abstract
We present LaMDA: Language Models for Dialog Applications. LaMDA is a family of Transformer-based neural language models specialized for dialog, which have up to 137B parameters and arepre-trained on 1.56T words of public dialog data and web text. While model scaling alone canimprove quality, it shows less improvements on safety and factual grounding. We demonstrate thatfine-tuning with annotated data and enabling the model to consult external knowledge sources canlead to significant improvements towards the two key challenges of safety and factual grounding.The first challenge, safety, involves ensuring that the model’s responses are consistent with a set ofhuman values, such as preventing harmful suggestions and unfair bias. We quantify safety using ametric based on an illustrative set of values, and we find that filtering candidate responses using aLaMDA classifier fine-tuned with a small amount of crowdworker-annotated data offers a promisingapproach to improving model safety. The second challenge, factual grounding, involves enabling themodel to consult external knowledge sources, such as an information retrieval system, a languagetranslator, and a calculator. We quantify factuality using a groundedness metric, and we find that ourapproach enables the model to generate responses grounded in known sources, rather than responsesthat merely sound plausible. Finally, we explore the use of LaMDA in the domains of education andcontent recommendations, and analyze their helpfulness and role consistency.
View details
Sparsely Activated Language Models are Efficient In-Context Learners
Barret Richard Zoph
Dmitry (Dima) Lepikhin
Emma Wang
Kun Zhang
Liam B. Fedus
Maarten Paul Bosma
Marie Pellat
Maxim Krikun
Nan Du
Simon Tong
Tao Wang
Toju Duke
Yuanzhong Xu
Zongwei Zhou
(2022)
Preview abstract
Scaling language models with more data, compute and parameters has driven significant progress in natural language processing. For example, thanks to scaling, GPT-3 was able to achieve strong performance on few-shot learning. However, training these large dense models require significant amounts of computing resources. In this paper, we develop a family of sparsely activated mixture-of-expert language models named \glam (\textbf{G}eneralist \textbf{La}nguage \textbf{M}odel), which can have many more parameters but require significant less training cost than dense models. The largest \glam has 1.2 trillion parameters, which is approximately 7x larger than GPT-3 but can be trained more efficiently. With only 1/3 of energy consumption to train GPT-3, \glam achieves better overall performance on 29 zero-shot and one-shot NLP tasks. For example, \glam gets 75.0\% one-shot exact match accuracy on the TriviaQA test server, a significant improvement over 68.0\% obtained by GPT-3.
View details
Building Machine Translation Systems for the Next Thousand Languages
Julia Kreutzer
Mengmeng Niu
Pallavi Nikhil Baljekar
Xavier Garcia
Maxim Krikun
Pidong Wang
Apu Shah
Macduff Richard Hughes
Google Research (2022)
Beyond Distillation: Task-level Mixture-of-Experts for Efficient Inference
Dmitry (Dima) Lepikhin
Maxim Krikun
Beyond Distillation: Task-level Mixture-of-Experts for Efficient Inference (2021)
Preview abstract
Sparse Mixture-of-Experts (MoE) has been a successful approach for scaling multilingual translation models to billions of parameters without a proportional increase in training computation. However, MoE models are prohibitively large and practitioners often resort to methods such as distillation for serving. In this work, we investigate routing strategies at different granularity (token, sentence, task) in MoE models to bypass distillation. Experiments on WMT and a web-scale dataset suggest that task-level routing (task-MoE) enables us to extract smaller, ready-to-deploy sub-networks from large sparse models. On WMT, our task-MoE with 32 experts (533M parameters) outperforms the best performing token-level MoE model (token-MoE) by +1.0 BLEU on average across 30 language pairs. The peak inference throughput is also improved by a factor of 1.9x when we route by tasks instead of tokens. While distilling a token-MoE to a smaller dense model preserves only 32% of the BLEU gains, our sub-network task-MoE, by design, preserves all the gains with the same inference cost as the distilled student model. Finally, when scaling up to 200 language pairs, our 128-expert task-MoE (13B parameters) performs competitively with a token-level counterpart, while improving the peak inference throughput by a factor of 2.6x.
View details
GShard: Scaling Giant Models With Conditional Computation and Automatic Sharding
Dehao Chen
Dmitry (Dima) Lepikhin
HyoukJoong Lee
Maxim Krikun
Noam Shazeer
Yuanzhong Xu
ICLR 2021 (2020) (to appear)
Preview abstract
Neural network scaling has been critical for improving the model quality in many real world machine learning applications with vast amounts of training data and compute. Although this trend of scaling is affirmed to be a sure-fire approach for better model quality, there are challenges on the path such as the computation cost, ease of programming, and efficient implementation on parallel devices. GShard is a module composed of a set of lightweight annotation APIs and an extension to the XLA compiler. It provides an elegant way to express a wide range of parallel computation patterns with minimal changes of existing model code. It enabled us to scale up multilingual machine translation Transformer model with Sparsely-Gated Mixture-of-Experts beyond 600 billion parameters using automatic sharding. We demonstrate that such a giant model can be easily trained on 2048 TPU v3 accelerators in 4 days to achieve far superior quality for translation from 100 languages to English compared to the prior art.
View details
Just Pick a Sign: Reducing Gradient Conflict in Deep Networks with Gradient Sign Dropout
Drago Anguelov
Henrik Kretzschmar
Jiquan Ngiam
Yuning Chai
Zhao Chen
NeurIPS 2020 Submission (2020) (to appear)
Preview abstract
The vast majority of modern deep neural networks produce multiple gradient signals which then attempt to update the same set of scalar weights. Such updates are often incompatible with each other, leading to gradient conflicts which impede optimal network training. We present Gradient Sign Dropout (GradDrop), a probabilistic masking procedure which encourages backpropagation only of gradients which are mutually consistent at a given deep activation layer. GradDrop is simple to implement as a modular layer within any deepnet and is synergistic with other gradient balancing approaches. We show that GradDrop performs better than other state-of-the-art methods for two very common contexts in which gradient conflicts pose a problem: multitask learning and transfer learning.
View details
GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism
Youlong Cheng
Dehao Chen
HyoukJoong Lee
Jiquan Ngiam
NeurIPS (2019)
Preview abstract
Scaling up deep neural network capacity has been known as an effective approach to improving model quality for several different machine learning tasks. In many cases, increasing model capacity beyond the memory limit of a single accelerator has required developing special algorithms or infrastructure. These solutions are often architecture-specific and do not transfer to other tasks. To address the need for efficient and task-independent model parallelism, we introduce GPipe, a pipeline parallelism library that allows scaling any network that can be expressed as a sequence of layers. By pipelining different sub-sequences of layers on separate accelerators, GPipe provides the flexibility of scaling a variety of different networks to gigantic sizes efficiently. Moreover, GPipe utilizes a novel batch-splitting pipelining algorithm, resulting in almost linear speedup when a model is partitioned across multiple accelerators. We demonstrate the advantages of GPipe by training large-scale neural networks on two different tasks with distinct network architectures: (i) Image Classification: We train a 557-million-parameter AmoebaNet model and attain a top-1 accuracy of 84.4% on ImageNet-2012, (ii) Multilingual Neural Machine Translation: We train a single 6-billion-parameter, 128-layer Transformer model on a corpus spanning over 100 languages and achieve better quality than all bilingual models.
View details
Preview abstract
The effort devoted to hand-crafting image classifiers has motivated the use of architecture search to discover them automatically. Although evolutionary algorithms have been repeatedly applied to architecture search, the architectures thus discovered have remained inferior to human-crafted ones. Here we show for the first time that artificially-evolved architectures can match or surpass human-crafted and RL-designed image classifiers. In particular, our models---named AmoebaNets---achieved a state-of-the-art accuracy of 97.87% on CIFAR-10 and top-1 accuracy of 83.1% on ImageNet. Among mobile-size models, an AmoebaNet with only 5.1M parameters also achieved a state-of-the-art top-1 accuracy of 75.1% on ImageNet. We also compared this method against strong baselines. Finally, we performed platform-aware architecture search with evolution to find a model that trains quickly on Google Cloud TPUs. This method produced an AmoebaNet that won the Stanford DAWNBench competition for lowest ImageNet training cost.
View details