Quoc V. Le

Authored Publications
    Solving olympiad geometry without human demonstrations
    Trieu Trinh
    Yuhuai Tony Wu
    He He
    Nature, 625 (2024), pp. 476-482
    Proving mathematical theorems at the olympiad level represents a notable milestone in human-level automated reasoning, owing to their reputed difficulty among the world’s best talents in pre-university mathematics. Current machine-learning approaches, however, are not applicable to most mathematical domains owing to the high cost of translating human proofs into machine-verifiable format. The problem is even worse for geometry because of its unique translation challenges, resulting in severe scarcity of training data. We propose AlphaGeometry, a theorem prover for Euclidean plane geometry that sidesteps the need for human demonstrations by synthesizing millions of theorems and proofs across different levels of complexity. AlphaGeometry is a neuro-symbolic system that uses a neural language model, trained from scratch on our large-scale synthetic data, to guide a symbolic deduction engine through infinite branching points in challenging problems. On a test set of 30 latest olympiad-level problems, AlphaGeometry solves 25, outperforming the previous best method that only solves ten problems and approaching the performance of an average International Mathematical Olympiad (IMO) gold medallist. Notably, AlphaGeometry produces human-readable proofs, solves all geometry problems in the IMO 2000 and 2015 under human expert evaluation and discovers a generalized version of a translated IMO theorem in 2004.
    Noise2Music: Text-conditioned Music Generation with Diffusion Models
    Qingqing Huang
    Daniel S. Park
    Tao Wang
    Nanxin Chen
    Zhengdong Zhang
    Zhishuai Zhang
    Jiahui Yu
    Christian Frank
    William Chan
    Wei Han
    (2023)
    We introduce Noise2Music, where a series of diffusion models are trained to generate high-quality 30-second music clips from text prompts. Two types of diffusion models, a generator model, which generates an intermediate representation conditioned on text, and a cascader model, which generates high-fidelity audio conditioned on the intermediate representation and possibly the text, are trained and utilized in succession to generate high-fidelity music. We explore two options for the intermediate representation, one in which it is a spectrogram and the other in which it is audio with lower fidelity. We find that the generated audio is able to faithfully reflect key elements of the text prompt such as genre, mood, tempo and instruments. Language models play a key role in this story: they are used to generate paired text for the audio of the training set and to extract embeddings of the text prompts ingested by the diffusion models.
    The Flan Collection: Designing Data and Methods for Effective Instruction Tuning
    Shayne Longpre
    Le Hou
    Tu Vu
    Albert Webson
    Hyung Won Chung
    Yi Tay
    Barret Zoph
    Jason Wei
    Proceedings of the 40th International Conference on Machine Learning, PMLR (2023), pp. 22631-22648
    We study the design decisions of publicly available instruction tuning methods, and break down the development of Flan 2022 (Chung et al., 2022). Through careful ablation studies on the Flan Collection of tasks and methods, we tease apart the effect of design decisions which enable Flan-T5 to outperform prior work by 3-17%+ across evaluation settings. We find task balancing and enrichment techniques are overlooked but critical to effective instruction tuning, and in particular, training with mixed prompt settings (zero-shot, few-shot, and chain-of-thought) actually yields stronger (2%+) performance in all settings. In further experiments, we show Flan-T5 requires less finetuning to converge higher and faster than T5 on single downstream tasks, motivating instruction-tuned models as more computationally-efficient starting checkpoints for new tasks. Finally, to accelerate research on instruction tuning, we make the Flan 2022 collection of datasets, templates, and methods publicly available at https://github.com/google-research/FLAN/tree/main/flan/v2.
    Chain-of-thought prompting combined with pre-trained large language models has achieved encouraging results on complex reasoning tasks. In this paper, we propose a new decoding strategy, self-consistency, to replace the naive greedy decoding used in chain-of-thought prompting. It first samples a diverse set of reasoning paths instead of only taking the greedy one, and then selects the most consistent answer by marginalizing out the sampled reasoning paths. Self-consistency leverages the intuition that a complex reasoning problem typically admits multiple different ways of thinking leading to its unique correct answer. Our extensive empirical evaluation shows that self-consistency boosts the performance of chain-of-thought prompting with a striking margin on a range of popular arithmetic and commonsense reasoning benchmarks, including GSM8K (+17.9%), SVAMP (+11.0%), AQuA (+12.2%), StrategyQA (+6.4%) and ARC-challenge (+3.9%).
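The selection step of self-consistency can be illustrated with a minimal sketch. It assumes the final answers have already been extracted from each sampled reasoning path and treats "most consistent" as a plain majority vote; the function name `self_consistent_answer` and the sample answers are hypothetical and are not taken from the paper.

```python
from collections import Counter


def self_consistent_answer(sampled_answers):
    """Return the most frequent final answer among sampled reasoning paths.

    Assumption: each element is the final answer parsed from one sampled
    chain-of-thought; the reasoning text itself is ignored here.
    """
    if not sampled_answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(sampled_answers)
    # most_common(1) returns a one-element list of (answer, count) pairs.
    answer, _ = counts.most_common(1)[0]
    return answer


# Hypothetical final answers from five sampled reasoning paths.
samples = ["18", "26", "18", "18", "26"]
result = self_consistent_answer(samples)
print(result)
```

Greedy decoding would commit to a single path; here a path that reaches an outlier answer is outvoted by the paths that agree.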
    LaMDA: Language Models for Dialog Applications
    Aaron Daniel Cohen
    Alena Butryna
    Alicia Jin
    Apoorv Kulshreshtha
    Ben Zevenbergen
    Chung-ching Chang
    Cosmo Du
    Daniel De Freitas Adiwardana
    Dehao Chen
    Dmitry (Dima) Lepikhin
    Erin Hoffman-John
    Igor Krivokon
    James Qin
    Jamie Hall
    Joe Fenton
    Johnny Soraker
    Kathy Meier-Hellstern
    Maarten Paul Bosma
    Marc Joseph Pickett
    Marcelo Amorim Menegali
    Marian Croak
    Maxim Krikun
    Noam Shazeer
    Rachel Bernstein
    Ravi Rajakumar
    Ray Kurzweil
    Romal Thoppilan
    Steven Zheng
    Taylor Bos
    Toju Duke
    Tulsee Doshi
    Vincent Y. Zhao
    Will Rusch
    Yuanzhong Xu
    arXiv (2022)
    We present LaMDA: Language Models for Dialog Applications. LaMDA is a family of Transformer-based neural language models specialized for dialog, which have up to 137B parameters and are pre-trained on 1.56T words of public dialog data and web text. While model scaling alone can improve quality, it shows smaller improvements on safety and factual grounding. We demonstrate that fine-tuning with annotated data and enabling the model to consult external knowledge sources can lead to significant improvements towards the two key challenges of safety and factual grounding. The first challenge, safety, involves ensuring that the model’s responses are consistent with a set of human values, such as preventing harmful suggestions and unfair bias. We quantify safety using a metric based on an illustrative set of values, and we find that filtering candidate responses using a LaMDA classifier fine-tuned with a small amount of crowdworker-annotated data offers a promising approach to improving model safety. The second challenge, factual grounding, involves enabling the model to consult external knowledge sources, such as an information retrieval system, a language translator, and a calculator. We quantify factuality using a groundedness metric, and we find that our approach enables the model to generate responses grounded in known sources, rather than responses that merely sound plausible. Finally, we explore the use of LaMDA in the domains of education and content recommendations, and analyze their helpfulness and role consistency.
    Many recent papers highlight the importance of thinking about carbon emissions (CO2e) in machine learning (ML) workloads. While elevating the discussion, some early work was also based on incomplete information. (Unfortunately, the most widely cited quantitative estimate that was the basis for many of these papers was off by 88X.) Inspired by these concerns, we looked for approaches that would make ML training considerably less carbon intensive. We identified four best practices that dramatically reduce carbon emissions, and demonstrate two concrete examples of reducing CO2e by 650X over four years and 40X over one year by following them. Provided ML stakeholders follow best practices, we predict that the field will bend the curve of carbon footprint increases from ML training runs to first flatten and then reduce it by 2030 without sacrificing the current rate of rapid advances in ML, contrary to prior dire warnings that ML CO2e will soar.
    Finetuned Language Models are Zero-Shot Learners
    Jason Wei
    Maarten Paul Bosma
    Vincent Zhao
    Nan Du
    International Conference on Learning Representations (2022)
    This paper explores a simple method for improving the zero-shot learning abilities of language models. We show that instruction tuning (finetuning language models on a collection of tasks described via instructions) substantially boosts zero-shot performance on unseen tasks. We take a 137B parameter pretrained language model and instruction-tune it on over 60 NLP tasks verbalized via natural language instruction templates. We evaluate this instruction-tuned model, which we call FLAN, on unseen task types. FLAN substantially improves the performance of its unmodified counterpart and surpasses zero-shot 175B GPT-3 on 20 of 25 tasks that we evaluate. FLAN even outperforms few-shot GPT-3 by a large margin on ANLI, RTE, BoolQ, AI2-ARC, OpenbookQA, and StoryCloze. Ablation studies reveal that number of tasks and model scale are key components to the success of instruction tuning.
    Sparsely Activated Language Models are Efficient In-Context Learners
    Barret Richard Zoph
    Dmitry (Dima) Lepikhin
    Emma Wang
    Kathy Meier-Hellstern
    Kun Zhang
    Liam B. Fedus
    Maarten Paul Bosma
    Marie Pellat
    Maxim Krikun
    Nan Du
    Simon Tong
    Tao Wang
    Toju Duke
    Yuanzhong Xu
    Zongwei Zhou
    (2022)
    Scaling language models with more data, compute and parameters has driven significant progress in natural language processing. For example, thanks to scaling, GPT-3 was able to achieve strong performance on few-shot learning. However, training these large dense models requires significant amounts of computing resources. In this paper, we develop a family of sparsely activated mixture-of-expert language models named GLaM (Generalist Language Model), which can have many more parameters but require significantly less training cost than dense models. The largest GLaM has 1.2 trillion parameters, which is approximately 7x larger than GPT-3 but can be trained more efficiently. With only 1/3 of the energy consumed to train GPT-3, GLaM achieves better overall performance on 29 zero-shot and one-shot NLP tasks. For example, GLaM gets 75.0% one-shot exact match accuracy on the TriviaQA test server, a significant improvement over the 68.0% obtained by GPT-3.
    TabNAS: Rejection Sampling for Neural Architecture Search on Tabular Datasets
    Gabriel M. Bender
    Hanxiao Liu
    Madeleine Udell
    Yifeng Lu
    Da Huang
    Neural Information Processing Systems (2022)
    The best neural architecture for a given machine learning problem depends on many factors: not only the complexity and structure of the dataset, but also resource constraints including latency, compute, energy consumption, etc. Neural architecture search (NAS) for tabular datasets is an important but under-explored problem. Previous NAS algorithms designed for image search spaces incorporate resource constraints directly into the reinforcement learning (RL) rewards. However, for NAS on tabular datasets, this protocol often discovers suboptimal architectures. This paper develops TabNAS, a new and more effective approach to handle resource constraints in tabular NAS using an RL controller motivated by the idea of rejection sampling. TabNAS immediately discards any architecture that violates the resource constraints without training or learning from that architecture. TabNAS uses a Monte-Carlo-based correction to the RL policy gradient update to account for this extra filtering step. Results on several tabular datasets demonstrate the superiority of TabNAS over previous reward-shaping methods: it finds better models that obey the constraints.
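The rejection step described above can be sketched on a toy problem. This is an illustration only, not the authors' implementation: the one-dimensional search space of layer widths, the fixed sampling probabilities, the budget, and the helper names `sample_feasible` and `estimate_feasible_mass` are all assumptions. The second helper shows a Monte Carlo estimate of the probability that a sampled candidate is feasible; how such an estimate would enter the policy gradient update is not shown here.

```python
import random


def sample_feasible(widths, probs, budget, rng, max_tries=1000):
    """Draw candidates until one fits the budget.

    Infeasible candidates are discarded immediately, without being
    trained or used as a learning signal.
    """
    for _ in range(max_tries):
        candidate = rng.choices(widths, weights=probs, k=1)[0]
        if candidate <= budget:
            return candidate
    raise RuntimeError("no feasible candidate found")


def estimate_feasible_mass(widths, probs, budget, rng, n=10000):
    """Monte Carlo estimate of P(sampled candidate is feasible)."""
    draws = rng.choices(widths, weights=probs, k=n)
    return sum(d <= budget for d in draws) / n


rng = random.Random(0)          # fixed seed for repeatability
widths = [32, 64, 128, 256]     # hypothetical search space
probs = [0.1, 0.2, 0.3, 0.4]    # hypothetical controller distribution
budget = 128                    # hypothetical resource limit

chosen = sample_feasible(widths, probs, budget, rng)
mass = estimate_feasible_mass(widths, probs, budget, rng)
print(chosen, round(mass, 2))
```

In this toy setup the exact feasible mass is 0.1 + 0.2 + 0.3 = 0.6, so the Monte Carlo estimate should land close to that value.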
    AutoHAS: Efficient Hyperparameter and Architecture Search
    Xuanyi Dong
    Daiyi Peng
    Bogdan Gabrys
    Workshop on Neural Architecture Search at International Conference on Learning Representations (NAS@ICLR) (2021)
    Efficient hyperparameter or architecture search methods have shown remarkable results, but each of them is only applicable to searching for either hyperparameters (HPs) or architectures. In this work, we propose a unified pipeline, AutoHAS, to efficiently search for both architectures and hyperparameters. AutoHAS learns to alternately update the shared network weights and a reinforcement learning (RL) controller, which learns the probability distribution for the architecture candidates and HP candidates. A temporary weight is introduced to store the updated weight from the selected HPs (by the controller), and a validation accuracy based on this temporary weight serves as a reward to update the controller. In experiments, we show AutoHAS is efficient and generalizable to different search spaces, baselines and datasets. In particular, AutoHAS can improve the accuracy over popular network architectures, such as ResNet and EfficientNet, on CIFAR-10/100, ImageNet, and four other datasets.