Jacob Austin

Jacob Austin is an AI Resident in Google Research on the Brain Team, working on program synthesis and probabilistic modeling. His work at Google has focused on program synthesis using large neural network language models in real-world programming languages and diffusion models for discrete state spaces.
Authored Publications
Sort By
  • Title
  • Title, descending
  • Year
  • Year, descending
    Resolving Code Review Comments with Machine Learning
    Alexander Frömmgen
    Peter Choy
    Elena Khrapko
    Marcus Revaj
    2024 IEEE/ACM 46th International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP) (to appear)
    Preview abstract Code reviews are a critical part of the software development process, taking a significant amount of the code authors’ and the code reviewers’ time. As part of this process, the reviewer inspects the proposed code and asks the author for code changes through comments written in natural language. At Google, we see millions of reviewer comments per year, and authors require an average of ∼60 minutes active shepherding time between sending changes for review and finally submitting the change. In our measurements, the required active work time that the code author must devote to address reviewer comments grows almost linearly with the number of comments. However, with machine learning (ML), we have an opportunity to automate and streamline the code-review process, e.g., by proposing code changes based on a comment’s text. We describe our application of recent advances in large sequence models in a real-world setting to automatically resolve code-review comments in the day-to-day development workflow at Google. We present the evolution of this feature from an asynchronous generation of suggested edits after the reviewer sends feedback, to an interactive experience that suggests code edits to the reviewer at review time. In deployment, code-change authors at Google address 7.5% of all reviewer comments by applying an ML-suggested edit. The impact of this will be to reduce the time spent on code reviews by hundreds of thousands of engineer hours annually at Google scale. Unsolicited, very positive feedback highlights that the impact of ML-suggested code edits increases Googlers’ productivity and allows them to focus on more creative and complex tasks. View details
    PaLM: Scaling Language Modeling with Pathways
    Aakanksha Chowdhery
    Sharan Narang
    Jacob Devlin
    Maarten Bosma
    Hyung Won Chung
    Sebastian Gehrmann
    Parker Schuh
    Sasha Tsvyashchenko
    Abhishek Rao
    Yi Tay
    Noam Shazeer
    Nan Du
    Reiner Pope
    James Bradbury
    Guy Gur-Ari
    Toju Duke
    Henryk Michalewski
    Xavier Garcia
    Liam Fedus
    David Luan
    Barret Zoph
    Ryan Sepassi
    David Dohan
    Shivani Agrawal
    Mark Omernick
    Marie Pellat
    Aitor Lewkowycz
    Erica Moreira
    Rewon Child
    Oleksandr Polozov
    Zongwei Zhou
    Brennan Saeta
    Michele Catasta
    Jason Wei
    arxiv:2204.02311 (2022)
    Preview abstract Large language models have been shown to achieve remarkable performance across a variety of natural language tasks using few-shot learning, which drastically reduces the number of task-specific training examples needed to adapt the model to a particular application. To further our understanding of the impact of scale on few-shot learning, we trained a 540-billion parameter, densely activated, Transformer language model, which we call Pathways Language Model PaLM. We trained PaLM on 6144 TPU v4 chips using Pathways, a new ML system which enables highly efficient training across multiple TPU Pods. We demonstrate continued benefits of scaling by achieving state-of-the-art few-shot learning results on hundreds of language understanding and generation benchmarks. On a number of these tasks, PaLM 540B achieves breakthrough performance, outperforming the finetuned state-of-the-art on a suite of multi-step reasoning tasks, and outperforming average human performance on the recently released BIG-bench benchmark. A significant number of BIG-bench tasks showed discontinuous improvements from model scale, meaning that performance steeply increased as we scaled to our largest model. PaLM also has strong capabilities in multilingual tasks and source code generation, which we demonstrate on a wide array of benchmarks. We additionally provide a comprehensive analysis on bias and toxicity, and study the extent of training data memorization with respect to model scale. Finally, we discuss the ethical considerations related to large language models and discuss potential mitigation strategies. View details
    Show Your Work: Scratchpads for Intermediate Computation with Language Models
    Maxwell Nye
    Guy Gur-Ari
    Henryk Witold Michalewski
    David Martin Dohan
    Aitor Lewkowycz
    Maarten Paul Bosma
    David Luan
    Augustus Odena
    (2021)
    Preview abstract Large pre-trained language models perform remarkably well on tasks that can be done “in one pass”, such as generating realistic text (Brown et al., 2020) or synthesizing computer programs (Chen et al., 2021; Austin et al., 2021). However, they struggle with tasks that require unbounded multi-step computation, such as adding integers (Brown et al., 2020) or executing programs (Austin et al., 2021). Surprisingly, we find that these same models are able to perform complex multistep computations—even in the few-shot regime—when asked to perform the operation “step by step”, showing the results of intermediate computations. In particular, we train Transformers to perform multi-step computations by asking them to emit intermediate computation steps into a “scratchpad”. On a series of increasingly complex tasks ranging from long addition to the execution of arbitrary programs, we show that scratchpads dramatically improve the ability of language models to perform multi-step computations. View details
    Beyond In-Place Corruption: Insertion and Deletion In Denoising Probabilistic Models
    Rianne van den Berg
    ICML Workshop on Invertible Neural Networks, Normalizing Flows, and Explicit Likelihood Models (2021)
    Preview abstract Denoising diffusion probabilistic models have shown impressive results for generation of sequences by iteratively corrupting each example and then learning to map corrupted versions back to the original. However, previous work has largely focused on in-place corruption, adding noise to each pixel or token individually while keeping their locations the same. In this work, we consider a broader class of corruption processes and denoising models over sequence data that can insert and delete elements, while still being efficient to train and sample from. We demonstrate that these models outperform standard in-place models on an arithmetic sequence task, and that when trained on the text8 dataset they can be used to fix spelling errors without any fine-tuning. View details
    Structured Denoising Diffusion Models in Discrete State-Spaces
    Jonathan Ho
    Rianne van den Berg
    Advances in Neural Information Processing Systems (2021) (to appear)
    Preview abstract Denoising diffusion probabilistic models (DDPMs) [Ho et al. 2021] have shown impressive results on image and waveform generation in continuous state spaces. Here, we introduce Discrete Denoising Diffusion Probabilistic Models (D3PMs), diffusion-like generative models of discrete data that generalize the multinomial diffusion model of Hoogeboom et al. [2021] by going beyond transition matrices with uniform transition probabilities. This includes discrete transition matrices that mimic Gaussian kernels in continuous space, kernels based on nearest neighbors in embedding space, and kernels that introduce masking. The third allows us to draw a connection between diffusion models and autoregressive and mask-based generative models. We show that the choice of transition kernel is an important design decision that leads to improved results in image and text domains. In addition, we show that expressing transition matrices as matrix exponentials leads to efficient implementations and controllable schedules. We also introduce a new loss function that combines the variational lower bound with an auxiliary cross entropy loss. For text, this model class achieves strong results on character-level text generation and scales to LM1B with a large subword-tokenized vocabulary. On the image dataset CIFAR-10, our models achieve FID and Inception scores rivaling those of the continuous DDPM model. View details
    Program Synthesis with Large Language Models
    Augustus Odena
    David Martin Dohan
    Ellen Jiang
    Henryk Michalewski
    Maarten Paul Bosma
    Maxwell Nye
    n/a, n/a, n/a (2021), n/a
    Preview abstract Program synthesis is one of the grand challenges of artificial intelligence, but to date practical successes have focused on narrow settings and restricted domains. Large language models trained on massive corpora of web texts which include open-source code, programming websites, and tutorials have the potential to break through this barrier.This paper explores the limits of the current generation of large language models for program synthesis in general purpose programming languages. We evaluate the performance of the language model LaMDA PT [Freitas et al.,2021] on several program synthesis tasks, at a variety of scales ranging from 244M to 137B parameters. First, we introduce a new benchmark, Mostly Basic Programming Problems (MBPP), to measure the ability of these models to synthesize short Python programs from natural language descriptions. The benchmark consists of around 1000 crowd-sourced Python programming problems, designed to be solvable by entry level programmers, covering programming fundamentals, standard library functionality, and so on. Each problem consists of a task description, code solution and automated test-cases. We also introduce a Python version of the MathQA benchmark, which evaluates the ability of the models to synthesize code from more complex text. On both datasets, we evaluate synthesis performance and find that synthesis performance scales log-linearly with model size. In contrast to some previous work, we find that LaMDAPT achieves non-negligible preformance in a few-shot setting, although fine-tuning still performs much better. Thel argest models we consider can synthesize solutions to 58% of the problems from MBPP using few-shot learning with a well-designed prompt; across model sizes, fine-tuning on a held-out portion of the dataset improves performance by about 10 percentage points. Finally, we conduct a thorough error analysis, shedding light on where these models fall short as program synthesizers, what types of programs are most difficult to generate, and how the models might be improved. As part of that analysis, we explore the semantic grounding of these models, finding that even our largest models are generally unable to predict the output of a program given a specific input. View details