 
Quoc V. Le
Authored Publications

          
Proving mathematical theorems at the olympiad level represents a notable milestone in human-level automated reasoning, owing to the reputed difficulty of these problems among the world's best talents in pre-university mathematics. Current machine-learning approaches, however, are not applicable to most mathematical domains owing to the high cost of translating human proofs into machine-verifiable format. The problem is even worse for geometry because of its unique translation challenges, resulting in severe scarcity of training data. We propose AlphaGeometry, a theorem prover for Euclidean plane geometry that sidesteps the need for human demonstrations by synthesizing millions of theorems and proofs across different levels of complexity. AlphaGeometry is a neuro-symbolic system that uses a neural language model, trained from scratch on our large-scale synthetic data, to guide a symbolic deduction engine through infinite branching points in challenging problems. On a test set of 30 recent olympiad-level problems, AlphaGeometry solves 25, outperforming the previous best method, which solves only ten, and approaching the performance of an average International Mathematical Olympiad (IMO) gold medallist. Notably, AlphaGeometry produces human-readable proofs, solves all geometry problems in IMO 2000 and 2015 under human expert evaluation, and discovers a generalized version of a translated IMO theorem from 2004.
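The interaction between the language model and the symbolic engine can be pictured as a loop: deduce as far as possible, and when deduction stalls, ask the model for an auxiliary construction. The toy sketch below illustrates only that control flow; the rule set, the fact strings, and the scripted propose_construction stand-in for the neural language model are invented for the example and are not AlphaGeometry's implementation.

```python
# Toy illustration of a neuro-symbolic proving loop: a symbolic engine
# forward-chains over known facts, and when it stalls, a "language model"
# (here just a scripted stand-in) proposes an auxiliary construction.
# Facts, rules, and constructions below are invented for illustration.

RULES = [
    # (premises, conclusion): if all premises are known, conclude.
    ({"midpoint(M,B,C)", "midpoint(N,A,C)"}, "parallel(MN,AB)"),
    ({"parallel(MN,AB)"}, "goal"),
]

def deduce_closure(facts):
    """Apply rules until no new fact can be derived (symbolic deduction)."""
    changed = True
    while changed:
        changed = False
        for premises, conclusion in RULES:
            if premises <= facts and conclusion not in facts:
                facts.add(conclusion)
                changed = True
    return facts

def propose_construction(facts):
    """Stand-in for the neural LM: suggest one new auxiliary construction."""
    candidates = ["midpoint(M,B,C)", "midpoint(N,A,C)"]
    for c in candidates:
        if c not in facts:
            return c
    return None

def solve(premises, goal="goal", budget=8):
    facts = set(premises)
    for _ in range(budget):
        facts = deduce_closure(facts)
        if goal in facts:
            return True                      # proof found
        construction = propose_construction(facts)
        if construction is None:
            return False                     # nothing left to try
        facts.add(construction)              # add the auxiliary construction
    return False

print(solve({"triangle(A,B,C)"}))  # True in this toy setup
```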
              
  
Self-Consistency Improves Chain of Thought Reasoning in Language Models
With Jason Wei, Sharan Narang, and Aakanksha Chowdhery
ICLR 2023 (to appear)

              Chain-of-thought prompting combined with pre-trained large language models has achieved encouraging results on complex reasoning tasks. In this paper, we propose a new decoding strategy, self-consistency, to replace the naive greedy decoding used in chain-of-thought prompting. It first samples a diverse set of reasoning paths instead of only taking the greedy one, and then selects the most consistent answer by marginalizing out the sampled reasoning paths. Self-consistency leverages the intuition that a complex reasoning problem typically admits multiple different ways of thinking leading to its unique correct answer. Our extensive empirical evaluation shows that self-consistency boosts the performance of chain-of-thought prompting with a striking margin on a range of popular arithmetic and commonsense reasoning benchmarks, including GSM8K (+17.9%), SVAMP (+11.0%), AQuA (+12.2%), StrategyQA (+6.4%) and ARC-challenge (+3.9%).
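The decoding strategy reduces to a few lines: sample several chain-of-thought completions, extract each final answer, and return the most frequent one. In the sketch below, sample_reasoning_path is a hypothetical stand-in for querying a language model with temperature sampling; only the aggregation step reflects the method described above.

```python
from collections import Counter
import random

def self_consistency(sample_reasoning_path, question, n_samples=40):
    """Aggregate sampled chain-of-thought completions by majority vote.

    `sample_reasoning_path(question)` is assumed to call a language model with
    a chain-of-thought prompt and non-zero temperature, returning
    (reasoning_text, final_answer). Marginalizing out the reasoning paths then
    reduces to counting the final answers.
    """
    answers = [sample_reasoning_path(question)[1] for _ in range(n_samples)]
    most_common_answer, _ = Counter(answers).most_common(1)[0]
    return most_common_answer

# Toy stand-in sampler: a real implementation would query a language model.
def fake_sampler(question):
    reasoning = "(sampled chain of thought)"
    answer = random.choice(["18", "18", "18", "20"])  # noisy but mostly right
    return reasoning, answer

print(self_consistency(fake_sampler, "How many eggs are left?"))
```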
              
  
The Flan Collection: Designing Data and Methods for Effective Instruction Tuning
With Shayne Longpre, Le Hou, Albert Webson, Hyung Won Chung, Yi Tay, Barret Zoph, and Jason Wei
Proceedings of the 40th International Conference on Machine Learning, PMLR (2023), pp. 22631-22648

              We study the design decisions of publicly available instruction tuning methods, and break down the development of Flan 2022 (Chung et al., 2022). Through careful ablation studies on the Flan Collection of tasks and methods, we tease apart the effect of design decisions which enable Flan-T5 to outperform prior work by 3-17%+ across evaluation settings. We find task balancing and enrichment techniques are overlooked but critical to effective instruction tuning, and in particular, training with mixed prompt settings (zero-shot, few-shot, and chain-of-thought) actually yields stronger (2%+) performance in all settings. In further experiments, we show Flan-T5 requires less finetuning to converge higher and faster than T5 on single downstream tasks, motivating instruction-tuned models as more computationally-efficient starting checkpoints for new tasks. Finally, to accelerate research on instruction tuning, we make the Flan 2022 collection of datasets, templates, and methods publicly available at https://github.com/google-research/FLAN/tree/main/flan/v2.
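The "mixed prompt settings" finding is easiest to picture with concrete renderings of one training example in zero-shot, few-shot, and chain-of-thought form. The templates below are invented for this sketch; the actual Flan 2022 templates are in the repository linked above.

```python
# One toy training example rendered under the three prompt settings that are
# mixed during instruction tuning. Template wording is invented for the sketch.

example = {
    "question": "A baker sells 3 trays of 12 muffins each. How many muffins is that?",
    "answer": "36",
}

exemplar = "Q: A box holds 4 rows of 5 apples. How many apples?\nA: 20\n\n"

def zero_shot(ex):
    return f"Answer the question.\nQ: {ex['question']}\nA:"

def few_shot(ex):
    return exemplar + zero_shot(ex)

def chain_of_thought(ex):
    return f"Q: {ex['question']}\nA: Let's think step by step."

# Mixing these formats in the finetuning data is the "mixed prompt settings"
# choice that the ablations above identify as important.
for render in (zero_shot, few_shot, chain_of_thought):
    print(render(example))
    print("---")
```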
              
  
Noise2Music: Text-conditioned Music Generation with Diffusion Models
With Qingqing Huang, Daniel S. Park, Tao Wang, Timo Denk, Nanxin Chen, Zhengdong Zhang, Zhishuai Zhang, Jiahui Yu, Christian Frank, William Chan, Zhifeng Chen, and Wei Han
2023

We introduce Noise2Music, where a series of diffusion models are trained to generate high-quality 30-second music clips from text prompts. Two types of diffusion models, a generator model, which generates an intermediate representation conditioned on text, and a cascader model, which generates high-fidelity audio conditioned on the intermediate representation and possibly the text, are trained and utilized in succession to generate high-fidelity music. We explore two options for the intermediate representation, one in which it is a spectrogram and the other in which it is audio with lower fidelity. We find that the generated audio is able to faithfully reflect key elements of the text prompt such as genre, mood, tempo and instruments. Language models play a key role in this story: they are used to generate paired text for the audio of the training set and to extract embeddings of the text prompts ingested by the diffusion models.
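The cascade amounts to composing two conditional models: a generator that maps a text embedding to an intermediate representation (a spectrogram or low-fidelity audio), and a cascader that maps that representation, and optionally the text, to high-fidelity audio. The sketch below shows only this plumbing, with placeholder model classes and a placeholder text encoder rather than the paper's diffusion models.

```python
from typing import Sequence

class DiffusionModelStub:
    """Placeholder for a trained diffusion model; returns dummy samples."""
    def __init__(self, name: str, output_len: int):
        self.name, self.output_len = name, output_len

    def sample(self, conditioning: Sequence) -> list:
        # A real model would run iterative denoising conditioned on the inputs.
        return [0.0] * self.output_len

def noise2music_pipeline(prompt: str,
                         generator: DiffusionModelStub,
                         cascader: DiffusionModelStub,
                         text_encoder=lambda t: [float(len(t))],
                         pass_text_to_cascader: bool = True):
    """Generator -> intermediate representation -> cascader -> waveform."""
    text_embedding = text_encoder(prompt)               # e.g. an LM-derived embedding
    intermediate = generator.sample([text_embedding])   # spectrogram or low-fi audio
    cascade_cond = [intermediate] + ([text_embedding] if pass_text_to_cascader else [])
    waveform = cascader.sample(cascade_cond)            # high-fidelity clip
    return waveform

audio = noise2music_pipeline("upbeat jazz with brushed drums",
                             DiffusionModelStub("generator", 256),
                             DiffusionModelStub("cascader", 1024))
print(len(audio))
```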
              
  
LaMDA: Language Models for Dialog Applications
With Aaron Daniel Cohen, Alena Butryna, Alicia Jin, Apoorv Kulshreshtha, Ben Zevenbergen, Chung-ching Chang, Cosmo Du, Daniel De Freitas Adiwardana, Dehao Chen, Dmitry (Dima) Lepikhin, Erin Hoffman-John, Igor Krivokon, James Qin, Jamie Hall, Joe Fenton, Johnny Soraker, Kathy Meier-Hellstern, Maarten Paul Bosma, Marc Joseph Pickett, Marcelo Amorim Menegali, Marian Croak, Maxim Krikun, Noam Shazeer, Rachel Bernstein, Ravi Rajakumar, Ray Kurzweil, Romal Thoppilan, Steven Zheng, Taylor Bos, Toju Duke, Tulsee Doshi, Vincent Y. Zhao, Will Rusch, Yanping Huang, Yuanzhong Xu, and Zhifeng Chen
arXiv (2022)

We present LaMDA: Language Models for Dialog Applications. LaMDA is a family of Transformer-based neural language models specialized for dialog, which have up to 137B parameters and are pre-trained on 1.56T words of public dialog data and web text. While model scaling alone can improve quality, it shows smaller improvements on safety and factual grounding. We demonstrate that fine-tuning with annotated data and enabling the model to consult external knowledge sources can lead to significant improvements on the two key challenges of safety and factual grounding. The first challenge, safety, involves ensuring that the model's responses are consistent with a set of human values, such as preventing harmful suggestions and unfair bias. We quantify safety using a metric based on an illustrative set of values, and we find that filtering candidate responses using a LaMDA classifier fine-tuned with a small amount of crowdworker-annotated data offers a promising approach to improving model safety. The second challenge, factual grounding, involves enabling the model to consult external knowledge sources, such as an information retrieval system, a language translator, and a calculator. We quantify factuality using a groundedness metric, and we find that our approach enables the model to generate responses grounded in known sources, rather than responses that merely sound plausible. Finally, we explore the use of LaMDA in the domains of education and content recommendations, and analyze their helpfulness and role consistency.
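The response-filtering idea for safety can be sketched independently of LaMDA itself: generate several candidate responses, discard those a safety classifier scores below a threshold, and return the best remaining candidate under a quality score. Both scoring functions below are toy stand-ins, not LaMDA's fine-tuned classifiers.

```python
def respond(candidates, safety_score, quality_score, safety_threshold=0.8):
    """Filter candidate responses with a safety classifier, then rank by quality.

    `safety_score` stands in for a classifier fine-tuned on crowdworker safety
    annotations; `quality_score` stands in for a response-quality ranker.
    Neither is the actual LaMDA model.
    """
    safe = [c for c in candidates if safety_score(c) >= safety_threshold]
    if not safe:
        return "I'd rather not answer that."   # fall back when nothing passes
    return max(safe, key=quality_score)

# Toy scoring functions for demonstration only.
toy_safety = lambda text: 0.1 if "rude" in text else 0.95
toy_quality = lambda text: len(text)

print(respond(["a rude reply", "a short reply", "a longer, helpful reply"],
              toy_safety, toy_quality))
```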
              
  
The Carbon Footprint of Machine Learning Training Will Level Out and Then Reduce
With Chen Liang, David Richard So, Lluis-Miquel Munguia, and Maud Texier
IEEE Computer (2022)

Many recent papers highlight the importance of thinking about carbon emissions (CO2e) in machine learning (ML) workloads. While elevating the discussion, some early work was also based on incomplete information. (Unfortunately, the most widely cited quantitative estimate that was the basis for many of these papers was off by 88X.) Inspired by these concerns, we looked for approaches that would make ML training considerably less carbon intensive. We identify four best practices that dramatically reduce carbon emissions and demonstrate two concrete examples of reducing CO2e by 650X over four years and 40X over one year by following them. Provided ML stakeholders follow best practices, we predict that the curve of carbon footprint increases from ML training runs will first flatten and then decline by 2030, without sacrificing the current rate of rapid advances in ML, contrary to prior dire warnings that ML CO2e will soar.
              
  
Sparsely Activated Language Models are Efficient In-Context Learners
With Andrew Dai, Barret Richard Zoph, Dmitry (Dima) Lepikhin, Emma Wang, Kathy Meier-Hellstern, Kun Zhang, Liam B. Fedus, Maarten Paul Bosma, Marie Pellat, Maxim Krikun, Nan Du, Simon Tong, Tao Wang, Toju Duke, Yanping Huang, Yonghui Wu, Yuanzhong Xu, Zhifeng Chen, and Zongwei Zhou
2022

Scaling language models with more data, compute and parameters has driven significant progress in natural language processing. For example, thanks to scaling, GPT-3 was able to achieve strong performance on few-shot learning. However, training these large dense models requires significant amounts of computing resources. In this paper, we develop a family of sparsely activated mixture-of-experts language models named GLaM (Generalist Language Model), which can have many more parameters but require significantly lower training cost than dense models. The largest GLaM has 1.2 trillion parameters, approximately 7x larger than GPT-3, but can be trained more efficiently. Using only 1/3 of the energy consumed to train GPT-3, GLaM achieves better overall performance on 29 zero-shot and one-shot NLP tasks. For example, GLaM reaches 75.0% one-shot exact-match accuracy on the TriviaQA test server, a significant improvement over the 68.0% obtained by GPT-3.
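A sparsely activated mixture-of-experts layer is what lets parameter count grow much faster than per-token compute: each token is routed to only a few experts. Below is a minimal NumPy sketch of a top-2 gated MoE feed-forward layer; the sizes, the linear "experts", and the absence of load balancing are simplifications for illustration, not GLaM's architecture.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def moe_layer(tokens, gate_w, expert_ws, top_k=2):
    """Top-k gated mixture-of-experts layer (toy, with linear experts).

    tokens:    [n_tokens, d_model]
    gate_w:    [d_model, n_experts]   router weights
    expert_ws: list of [d_model, d_model] expert weights
    Each token is processed by only `top_k` experts, weighted by its
    renormalized gate probabilities, so most parameters stay idle per token.
    """
    gate_probs = softmax(tokens @ gate_w)                # [n_tokens, n_experts]
    outputs = np.zeros_like(tokens)
    for t in range(tokens.shape[0]):
        top = np.argsort(gate_probs[t])[-top_k:]         # indices of top-k experts
        weights = gate_probs[t, top] / gate_probs[t, top].sum()
        for w, e in zip(weights, top):
            outputs[t] += w * (tokens[t] @ expert_ws[e]) # only k experts run
    return outputs

rng = np.random.default_rng(0)
d, n_experts, n_tokens = 8, 4, 3
out = moe_layer(rng.normal(size=(n_tokens, d)),
                rng.normal(size=(d, n_experts)),
                [rng.normal(size=(d, d)) for _ in range(n_experts)])
print(out.shape)  # (3, 8)
```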
              
  
TabNAS: Rejection Sampling for Neural Architecture Search on Tabular Datasets
With Gabriel M. Bender, Hanxiao Liu, Madeleine Udell, Yifeng Lu, and Da Huang
Neural Information Processing Systems (2022)

The best neural architecture for a given machine learning problem depends on many factors: not only on the complexity and structure of the dataset, but also on resource constraints, including latency, compute, and energy consumption. Neural architecture search (NAS) for tabular datasets is an important but under-explored problem. Previous NAS algorithms designed for image search spaces incorporate resource constraints directly into the reinforcement learning (RL) rewards. For NAS on tabular datasets, however, this protocol often discovers suboptimal architectures. This paper develops TabNAS, a new and more effective approach to handling resource constraints in tabular NAS, using an RL controller motivated by the idea of rejection sampling. TabNAS immediately discards any architecture that violates the resource constraints, without training or learning from that architecture, and uses a Monte-Carlo-based correction to the RL policy gradient update to account for this extra filtering step. Results on several tabular datasets demonstrate the superiority of TabNAS over previous reward-shaping methods: it finds better models that obey the constraints.
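The rejection-sampling idea can be sketched with a toy one-dimensional search space: sample architectures from the controller, discard infeasible ones without training them, and update the controller on the probability renormalized over the feasible set, estimating that normalizer by Monte Carlo. This is a simplified reading of the method with an invented search space and reward table, not the TabNAS code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy search space: a single choice of hidden width. The resource constraint
# forbids widths above a budget; rewards are invented for illustration.
widths   = np.array([32, 64, 128, 256, 512])
feasible = widths <= 256                        # toy latency/size constraint
reward   = np.array([0.60, 0.70, 0.74, 0.78, 0.90])

logits, lr = np.zeros(len(widths)), 0.3

def probs(x):
    e = np.exp(x - x.max())
    return e / e.sum()

for step in range(2000):
    p = probs(logits)
    i = rng.choice(len(widths), p=p)
    if not feasible[i]:
        continue                                # rejected: never trained or scored
    # Monte-Carlo estimate of P(sampled architecture is feasible), used to
    # renormalize the policy to the feasible set in the gradient.
    mc = rng.choice(len(widths), size=64, p=p)
    p_feasible = max(feasible[mc].mean(), 1e-3)
    # REINFORCE on the log of the feasibility-conditioned probability p(i)/P(feasible).
    grad = -p * feasible / p_feasible
    grad[i] += 1.0
    logits += lr * reward[i] * grad

# Mass shifts toward feasible, high-reward widths; infeasible widths are never updated.
print(dict(zip(widths.tolist(), np.round(probs(logits), 2))))
```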
              
  
Finetuned Language Models are Zero-Shot Learners
With Jason Wei, Maarten Paul Bosma, Vincent Zhao, Nan Du, and Andrew Mingbo Dai
International Conference on Learning Representations (2022)

This paper explores a simple method for improving the zero-shot learning abilities of language models. We show that instruction tuning (finetuning language models on a collection of tasks described via instructions) substantially boosts zero-shot performance on unseen tasks. We take a 137B parameter pretrained language model and instruction-tune it on over 60 NLP tasks verbalized via natural language instruction templates. We evaluate this instruction-tuned model, which we call FLAN, on unseen task types. FLAN substantially improves the performance of its unmodified counterpart and surpasses zero-shot 175B GPT-3 on 20 of the 25 tasks that we evaluate. FLAN even outperforms few-shot GPT-3 by a large margin on ANLI, RTE, BoolQ, AI2-ARC, OpenbookQA, and StoryCloze. Ablation studies reveal that the number of tasks and model scale are key components to the success of instruction tuning.
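What "verbalized via natural language instruction templates" means is easiest to see on a single example: the same labeled instance is rendered under several instruction phrasings, and each rendering becomes a finetuning pair. The templates below are illustrative, not FLAN's actual templates.

```python
# Illustrative instruction templates for a single NLI example. FLAN uses many
# templates per dataset; these particular strings are invented for the sketch.
templates = [
    "Premise: {premise}\nHypothesis: {hypothesis}\n"
    "Does the premise entail the hypothesis? Options: yes, no, maybe.",
    "{premise}\nBased on the paragraph above, can we conclude that "
    "\"{hypothesis}\"? Options: yes, no, maybe.",
    "Read the following and determine if the hypothesis can be inferred:\n"
    "Premise: {premise}\nHypothesis: {hypothesis}",
]

example = {"premise": "The cat sat on the mat.",
           "hypothesis": "There is a cat on the mat.",
           "answer": "yes"}

# Each (rendered instruction, target) pair becomes a finetuning example; doing
# this across 60+ datasets yields the instruction-tuning mixture.
finetuning_pairs = [(t.format(**example), example["answer"]) for t in templates]
for prompt, target in finetuning_pairs:
    print(prompt, "->", target, "\n")
```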
              
  
Searching for Fast Models on Datacenter Accelerators
With Ruoming Pang, Andrew Li, and Norm Jouppi
Conference on Computer Vision and Pattern Recognition (2021)

Neural Architecture Search (NAS), together with model scaling, has shown remarkable progress in designing high-accuracy and fast convolutional architecture families. However, as neither NAS nor model scaling considers sufficient hardware architecture details, they do not take full advantage of emerging datacenter (DC) accelerators. In this paper, we search for fast and accurate CNN model families for efficient inference on DC accelerators. We first analyze DC accelerators and find that existing CNNs suffer from insufficient operational intensity, parallelism, and execution efficiency, and exhibit FLOPs-latency nonproportionality. These insights let us create a DC-accelerator-optimized search space, with space-to-depth, space-to-batch, hybrid fused convolution structures with vanilla and depthwise convolutions, and block-wise activation functions. We further propose latency-aware compound scaling (LACS), the first multi-objective compound scaling method optimizing both accuracy and latency. LACS discovers that network depth should grow much faster than image size and network width, which is quite different from the observations of previous compound scaling. With the new search space and LACS, our search and scaling on datacenter accelerators result in a new model series named EfficientNet-X. EfficientNet-X is up to more than 2X faster than EfficientNet (a model series with a state-of-the-art trade-off between FLOPs and accuracy) on TPUv3 and GPUv100, with comparable accuracy, and is up to 7X faster than recent RegNet and ResNeSt on TPUv3 and GPUv100. Source code is at https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet/tpu
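Compound scaling grows depth, width, and resolution through per-dimension coefficients raised to a shared compound exponent; a latency-aware variant selects those coefficients against a latency model of the target accelerator rather than a FLOPs target. The grid search below is a toy illustration of that selection step, with an invented accuracy proxy and latency model rather than the paper's procedure.

```python
import itertools

# Toy latency and accuracy models of a scaled network (invented for the sketch):
# depth, width, and resolution multipliers are alpha**phi, beta**phi, gamma**phi.
def toy_latency(d, w, r):
    return d * (w ** 2) * (r ** 2)             # relative to the base model

def toy_accuracy_proxy(d, w, r):
    return d ** 0.4 * w ** 0.3 * r ** 0.3      # made-up diminishing returns

def search_scaling_coefficients(latency_budget=2.0, phi=1):
    """Pick (alpha, beta, gamma) maximizing the proxy under a latency budget.

    Classic compound scaling constrains FLOPs; a latency-aware variant scores
    candidates with a latency model of the target accelerator instead.
    """
    best, best_score = None, -1.0
    grid = [1.0 + 0.05 * k for k in range(21)]          # 1.00 .. 2.00
    for alpha, beta, gamma in itertools.product(grid, repeat=3):
        d, w, r = alpha ** phi, beta ** phi, gamma ** phi
        if toy_latency(d, w, r) > latency_budget:
            continue
        score = toy_accuracy_proxy(d, w, r)
        if score > best_score:
            best, best_score = (alpha, beta, gamma), score
    return best

print(search_scaling_coefficients())
```

With these made-up curves, most of the scaling budget goes to depth, loosely echoing the observation above that depth should grow fastest under a latency objective.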
              
  