Charles Sutton
I joined Google in January 2018. My research interests span deep learning, probabilistic machine learning, programming languages, data mining, and software engineering.
I'm especially excited about applying deep learning to huge code bases to find patterns in what makes for good code, leading to tools that help people write better software.
For older publications (back to 2002), please see my academic web site at the University of Edinburgh.
I maintain a blog with advice for researchers, reflections on academia, the research community, the expatriate lifestyle, and sillier matters.
Authored Publications
Identifying invariants in programs is an important program analysis task with applications in program understanding, vulnerability analysis, and formal verification. Existing tools for identifying invariants rely on dynamic analysis, requiring traces collected from multiple executions in order to produce reliable invariants. We study the application of large language models to invariant prediction, finding that models trained on source code and fine-tuned for invariant prediction can perform the task as a static rather than dynamic analysis. Using a scratchpad approach gives the best performance, statically finding invariants of quality comparable to those obtained by a dynamic analysis tool with access to five program traces.
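The scratchpad-style, purely static prediction described above can be pictured with a small prompt-construction sketch; the prompt format and helper function below are illustrative assumptions, not the paper's actual setup.

```python
def build_invariant_prompt(function_source):
    """Assemble a prompt asking a code model to reason step by step in a
    scratchpad before listing candidate invariants. This is a static
    setting: only the source text is given, no execution traces."""
    return (
        "Analyze the following function and infer likely invariants.\n"
        "Reason step by step in a scratchpad, then list the invariants.\n\n"
        + function_source + "\n\n"
        "Scratchpad:\n"
    )

example = (
    "def clamp(x, lo, hi):\n"
    "    assert lo <= hi\n"
    "    return max(lo, min(x, hi))"
)
prompt = build_invariant_prompt(example)
print(prompt)
```

The completion of such a prompt by a fine-tuned model would then be parsed for candidate invariants.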
Compositional Generalization and Decomposition in Neural Program Synthesis
Joey Hong
Deep Learning for Code (DL4C) Workshop at ICLR'22 (2022)
When writing programs, people have the ability to tackle a new complex task by decomposing it into smaller and more familiar subtasks. While it is difficult to measure whether neural program synthesis methods have similar capabilities, what we can measure is whether they compositionally generalize, that is, whether a model that has been trained on the simpler subtasks is subsequently able to solve more complex tasks. In this paper, we focus on measuring the ability of learned program synthesizers to compositionally generalize. We first characterize several different axes along which program synthesis methods would be desired to generalize, e.g., length generalization, or the ability to combine known subroutines in new ways that do not occur in the training data. Based on this characterization, we introduce a benchmark suite of tasks to assess these abilities based on two popular existing datasets, SCAN and RobustFill. Finally, we make first attempts to improve the compositional generalization ability of Transformer models along these axes through novel attention mechanisms that draw inspiration from a human-like decomposition strategy. Empirically, we find our modified Transformer models generally perform better than natural baselines, but the tasks remain challenging.
PaLM: Scaling Language Modeling with Pathways
Aakanksha Chowdhery
Sharan Narang
Jacob Devlin
Maarten Bosma
Hyung Won Chung
Sebastian Gehrmann
Parker Schuh
Sasha Tsvyashchenko
Abhishek Rao
Yi Tay
Noam Shazeer
Nan Du
Reiner Pope
James Bradbury
Guy Gur-Ari
Toju Duke
Henryk Michalewski
Xavier Garcia
Liam Fedus
David Luan
Barret Zoph
Ryan Sepassi
David Dohan
Shivani Agrawal
Mark Omernick
Marie Pellat
Aitor Lewkowycz
Erica Moreira
Rewon Child
Oleksandr Polozov
Zongwei Zhou
Michele Catasta
Jason Wei
arXiv:2204.02311 (2022)
Large language models have been shown to achieve remarkable performance across a variety of natural language tasks using few-shot learning, which drastically reduces the number of task-specific training examples needed to adapt the model to a particular application. To further our understanding of the impact of scale on few-shot learning, we trained a 540-billion parameter, densely activated Transformer language model, which we call the Pathways Language Model (PaLM). We trained PaLM on 6144 TPU v4 chips using Pathways, a new ML system which enables highly efficient training across multiple TPU Pods. We demonstrate continued benefits of scaling by achieving state-of-the-art few-shot learning results on hundreds of language understanding and generation benchmarks. On a number of these tasks, PaLM 540B achieves breakthrough performance, outperforming the finetuned state-of-the-art on a suite of multi-step reasoning tasks, and outperforming average human performance on the recently released BIG-bench benchmark. A significant number of BIG-bench tasks showed discontinuous improvements from model scale, meaning that performance steeply increased as we scaled to our largest model. PaLM also has strong capabilities in multilingual tasks and source code generation, which we demonstrate on a wide array of benchmarks. We additionally provide a comprehensive analysis on bias and toxicity, and study the extent of training data memorization with respect to model scale. Finally, we discuss the ethical considerations related to large language models and discuss potential mitigation strategies.
CrossBeam: Learning to Search in Bottom-Up Program Synthesis
Kevin Ellis
International Conference on Learning Representations (ICLR) (2022) (to appear)
Many approaches to program synthesis perform a search within an enormous space of programs to find one that satisfies a given specification. Prior works have used neural models to guide combinatorial search algorithms, but such approaches still explore a huge portion of the search space and quickly become intractable as the size of the desired program increases. To tame the search space blowup, we propose training a neural model to learn a hands-on search policy for bottom-up synthesis, instead of relying on a combinatorial search algorithm. Our approach, called CrossBeam, uses the neural model to choose how to combine previously-explored programs into new programs, taking into account the search history and partial program executions. Motivated by work in structured prediction on learning to search, CrossBeam is trained on-policy using data extracted from its own bottom-up searches on training tasks. We evaluate CrossBeam in two very different domains, string manipulation and logic programming. We observe that CrossBeam learns to search efficiently, exploring much smaller portions of the program space compared to the state-of-the-art.
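The flavor of policy-guided bottom-up search can be sketched on a toy string-concatenation domain; the prefix-scoring "policy" below is a hand-written stand-in for CrossBeam's trained neural model, and the whole sketch is illustrative rather than the paper's algorithm.

```python
from itertools import product

def synthesize_concat(inputs, target, max_iters=4):
    """Search for a concatenation of the input strings equal to `target`
    by repeatedly combining previously-explored programs."""
    explored = {s: s for s in inputs}  # concrete value -> expression
    for _ in range(max_iters):
        candidates = []
        for (a, ea), (b, eb) in product(explored.items(), repeat=2):
            value = a + b
            if value in explored or len(value) > len(target):
                continue
            # Stand-in "policy": prefer candidates that are prefixes of
            # the target (the real system scores with a learned model
            # conditioned on search history and partial executions).
            score = len(value) if target.startswith(value) else -1
            candidates.append((score, value, f"concat({ea}, {eb})"))
        for score, value, expr in sorted(candidates, reverse=True)[:10]:
            if value not in explored:
                explored[value] = expr
                if value == target:
                    return expr
    return None

result = synthesize_concat(["ab", "c"], "abcab")
print(result)
```

Because every intermediate program is executed, the search can deduplicate by concrete value and keep only a small frontier per iteration.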
Graph Representations of Python Programs via Source-level Static Analysis
Vincent Josua Hellendoorn
arXiv (2022)
Graph representations of programs are commonly a central element of machine learning for code research. We introduce an open source Python library python_graphs that applies static analysis to construct graph representations of Python programs suitable for training machine learning models. Our library admits the construction of control-flow graphs, data-flow graphs, and composite "program graphs" that combine control-flow, data-flow, syntactic, and lexical information about a program. We present the capabilities and limitations of the library, perform a case-study applying the library to millions of competitive programming submissions, and showcase the library's utility for machine learning research.
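The library itself is not reproduced here; as a rough illustration of source-level graph construction, the sketch below uses only Python's standard `ast` module to extract syntax edges of the kind such program graphs build on.

```python
import ast

def syntax_edges(source):
    """Return (parent, child) AST-node-type edges for a program,
    a minimal syntactic layer of a program graph."""
    tree = ast.parse(source)
    edges = []
    for parent in ast.walk(tree):
        for child in ast.iter_child_nodes(parent):
            edges.append((type(parent).__name__, type(child).__name__))
    return edges

program = "x = 1\nwhile x < 10:\n    x = x * 2\n"
edges = syntax_edges(program)
print(len(edges), edges[:3])
```

The python_graphs library layers control-flow and data-flow edges on top of this kind of syntactic skeleton.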
SpreadsheetCoder: Formula Prediction from Semi-structured Context
Rishabh Singh
Proceedings of the 38th International Conference on Machine Learning (ICML) (2021)
Spreadsheet formula prediction has been an important program synthesis problem with many real-world applications. Previous works typically utilize input-output examples as the specification for spreadsheet formula synthesis, where each input-output pair simulates a separate row in the spreadsheet. However, this formulation does not fully capture the rich context in real-world spreadsheets. First, spreadsheet data entries are organized as tables, thus rows and columns are not necessarily independent of each other. In addition, many spreadsheet tables include headers, which provide high-level descriptions of the cell data. However, previous synthesis approaches do not consider headers as part of the specification. In this work, we present the first approach for synthesizing spreadsheet formulas from tabular context, which includes both headers and semi-structured tabular data. In particular, we propose SpreadsheetCoder, a BERT-based model architecture to represent the tabular context in both row-based and column-based formats. We train our model on a large dataset of spreadsheets, and demonstrate that SpreadsheetCoder achieves a top-1 prediction accuracy of 42.51%, which is a considerable improvement over baselines that do not employ rich tabular context. Compared to a rule-based system, SpreadsheetCoder assists 82% more users in composing formulas on Google Sheets.
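How row-based and column-based views of tabular context might be serialized can be sketched simply; the formats below are illustrative assumptions for exposition, not the paper's actual input encoding.

```python
def row_view(headers, rows):
    """Serialize the table one row at a time, headers first."""
    return [" | ".join(headers)] + [" | ".join(map(str, r)) for r in rows]

def column_view(headers, rows):
    """Serialize the table one column at a time, each led by its header."""
    cols = list(zip(*rows))
    return [h + ": " + " | ".join(map(str, c)) for h, c in zip(headers, cols)]

headers = ["Item", "Price", "Qty"]
rows = [["pen", 2, 10], ["pad", 5, 3]]
print(row_view(headers, rows))
print(column_view(headers, rows))
```

Feeding both views to a model gives it access to dependencies across rows as well as within columns, which the row-by-row input-output formulation misses.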
Latent Programmer: Discrete Latent Codes for Program Synthesis
Joey Hong
David Martin Dohan
Rishabh Singh
International Conference on Machine Learning (ICML) (2021)
In many sequence learning tasks, such as program synthesis and document summarization, a key problem is searching over a large space of possible output sequences.
We propose to learn representations of the outputs that are specifically meant for search: rich enough to specify the desired output but compact enough to make search more efficient.
An appealing realization of such a representation is discrete latent codes, as these naturally allow sophisticated combinatorial search strategies.
The latent codes are learned using a self-supervised learning principle, in which first a discrete autoencoder is trained on the output sequences, and then the resulting latent codes are used as intermediate targets for the end-to-end sequence prediction task.
Based on these insights, we introduce the Latent Programmer, a program synthesis method that first predicts a discrete latent code from input/output examples, and then generates the program in the target language.
We evaluate the Latent Programmer on two domains: synthesis of string transformation programs, and generation of programs from natural language descriptions.
We demonstrate that the discrete latent representation significantly improves synthesis accuracy.
Show Your Work: Scratchpads for Intermediate Computation with Language Models
Maxwell Nye
Guy Gur-Ari
Henryk Witold Michalewski
David Martin Dohan
Aitor Lewkowycz
Maarten Paul Bosma
David Luan
Augustus Odena
(2021)
Large pre-trained language models perform remarkably well on tasks that can be done “in one pass”, such as generating realistic text (Brown et al., 2020) or synthesizing computer programs (Chen et al., 2021; Austin et al., 2021). However, they struggle with tasks that require unbounded multi-step computation, such as adding integers (Brown et al., 2020) or executing programs (Austin et al., 2021). Surprisingly, we find that these same models are able to perform complex multistep computations—even in the few-shot regime—when asked to perform the operation “step by step”, showing the results of intermediate computations. In particular, we train Transformers to perform multi-step computations by asking them to emit intermediate computation steps into a “scratchpad”. On a series of increasingly complex tasks ranging from long addition to the execution of arbitrary programs, we show that scratchpads dramatically improve the ability of language models to perform multi-step computations.
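The long-addition case can be made concrete with a small sketch that emits one scratchpad line per digit-and-carry step; the exact line format is an illustrative assumption, not the paper's encoding.

```python
def addition_scratchpad(a, b):
    """Emit one scratchpad line per digit of the long addition a + b,
    plus a final answer line."""
    da, db = str(a)[::-1], str(b)[::-1]  # least-significant digit first
    steps, carry, digits = [], 0, []
    for i in range(max(len(da), len(db))):
        x = int(da[i]) if i < len(da) else 0
        y = int(db[i]) if i < len(db) else 0
        total = x + y + carry
        steps.append(
            f"{x} + {y} + {carry} = {total}, "
            f"write {total % 10}, carry {total // 10}"
        )
        digits.append(str(total % 10))
        carry = total // 10
    if carry:
        digits.append(str(carry))
    steps.append("answer: " + "".join(reversed(digits)))
    return steps

steps = addition_scratchpad(58, 67)
print("\n".join(steps))
```

Training targets of this shape let the model show each bounded step of an otherwise unbounded computation.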
Program Synthesis with Large Language Models
Augustus Odena
David Martin Dohan
Ellen Jiang
Henryk Michalewski
Maarten Paul Bosma
Maxwell Nye
arXiv (2021)
Program synthesis is one of the grand challenges of artificial intelligence, but to date practical successes have focused on narrow settings and restricted domains. Large language models trained on massive corpora of web text, which include open-source code, programming websites, and tutorials, have the potential to break through this barrier. This paper explores the limits of the current generation of large language models for program synthesis in general purpose programming languages. We evaluate the performance of the language model LaMDA PT [Freitas et al., 2021] on several program synthesis tasks, at a variety of scales ranging from 244M to 137B parameters. First, we introduce a new benchmark, Mostly Basic Programming Problems (MBPP), to measure the ability of these models to synthesize short Python programs from natural language descriptions. The benchmark consists of around 1000 crowd-sourced Python programming problems, designed to be solvable by entry-level programmers, covering programming fundamentals, standard library functionality, and so on. Each problem consists of a task description, code solution, and automated test cases. We also introduce a Python version of the MathQA benchmark, which evaluates the ability of the models to synthesize code from more complex text. On both datasets, we evaluate synthesis performance and find that it scales log-linearly with model size. In contrast to some previous work, we find that LaMDA PT achieves non-negligible performance in a few-shot setting, although fine-tuning still performs much better. The largest models we consider can synthesize solutions to 58% of the problems from MBPP using few-shot learning with a well-designed prompt; across model sizes, fine-tuning on a held-out portion of the dataset improves performance by about 10 percentage points.
Finally, we conduct a thorough error analysis, shedding light on where these models fall short as program synthesizers, what types of programs are most difficult to generate, and how the models might be improved. As part of that analysis, we explore the semantic grounding of these models, finding that even our largest models are generally unable to predict the output of a program given a specific input.
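An MBPP-style problem bundles a task description, a reference solution, and automated test cases; the example below is made up to illustrate the format and is not an item from the benchmark.

```python
# Task description (natural language, as given to the model):
task = "Write a function to count the vowels in a string."

# Reference code solution:
def count_vowels(s):
    return sum(1 for ch in s.lower() if ch in "aeiou")

# Automated test cases, as each MBPP problem carries:
assert count_vowels("program") == 2
assert count_vowels("AEIOU") == 5
assert count_vowels("xyz") == 0
print("all tests pass")
```

Synthesis is judged by running the generated code against the problem's test cases.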
BUSTLE: Bottom-Up Program Synthesis Through Learning-Guided Exploration
Augustus Odena
Rishabh Singh
International Conference on Learning Representations (ICLR) (2021)
Program synthesis is challenging largely because of the difficulty of search in a large space of programs. Human programmers routinely tackle the task of writing complex programs by writing sub-programs and then analyzing their intermediate results to compose them in appropriate ways. Motivated by this intuition, we present a new synthesis approach that leverages learning to guide a bottom-up search over programs. In particular, we train a model to prioritize compositions of intermediate values during search conditioned on a given set of input-output examples. This is a powerful combination because of several emergent properties. First, in bottom-up search, intermediate programs can be executed, providing semantic information to the neural network. Second, given the concrete values from those executions, we can exploit rich features based on recent work on property signatures. Finally, bottom-up search allows the system substantial flexibility in what order to generate the solution, allowing the synthesizer to build up a program from multiple smaller sub-programs. Overall, our empirical evaluation finds that the combination of learning and bottom-up search is remarkably effective, even with simple supervised learning approaches. We demonstrate the effectiveness of our technique on two datasets, one from the SyGuS competition and one of our own creation.
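The bottom-up, execution-guided search described above can be sketched in miniature; here the learned prioritization is replaced by a simple size-based cost, so this illustrates the search structure rather than BUSTLE's trained model or property-signature features.

```python
def bustle_sketch(inputs, output, max_cost=6):
    """Bottom-up enumeration over + and * expressions. Sub-programs are
    executed, deduplicated by concrete value, and grown by cost (a
    trained model would reweight these costs)."""
    # concrete value -> (expression, cost); seed with inputs and constants
    pool = {v: (str(v), 1) for v in inputs + [1, 2]}
    for cost in range(2, max_cost + 1):
        for a, (ea, ca) in list(pool.items()):
            for b, (eb, cb) in list(pool.items()):
                if ca + cb + 1 != cost:
                    continue
                for op, value in [("+", a + b), ("*", a * b)]:
                    if value not in pool:  # execution-based deduplication
                        pool[value] = (f"({ea} {op} {eb})", cost)
                        if value == output:
                            return pool[value][0]
    return None

expr = bustle_sketch([3], 10)
print(expr)
```

Because values, not syntax trees, index the pool, semantically equivalent sub-programs are explored only once.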
Learning Semantic Representations to Verify Hardware Designs
Shobha Vasudevan
Rishabh Singh
Hamid Shojaei
Richard Ho
Thirty-fifth Conference on Neural Information Processing Systems (NeurIPS) (2021)
We introduce Design2Vec, a representation learning approach that learns semantic abstractions of hardware designs at the Register Transfer Level (RTL). The key idea of our approach is a graph-convolution-based neural architecture that embeds RTL syntax and semantics. We train the architecture on the task of predicting coverage in the design, given some input test stimulus. We then present an approach that uses the learnt RTL representation to automatically generate new tests for unseen coverage locations in the design. Our experimental results demonstrate that Design2Vec outperforms several baseline approaches that do not incorporate RTL semantics, and that it can generate coverage predictions nearly instantaneously, compared to nightly simulation times. Moreover, the tests generated using Design2Vec cover design points that are difficult for design verification experts to cover using current manual approaches to test generation.
Learning Discrete Energy-based Models via Auxiliary-variable Local Exploration
Rishabh Singh
Advances in Neural Information Processing Systems (NeurIPS) (2020) (to appear)
Discrete structures play an important role in applications like programming language modeling and software engineering. Current approaches to predicting complex structures typically consider autoregressive models for their tractability, at some sacrifice of flexibility.
Energy-based models (EBMs), on the other hand, offer a more flexible and thus more powerful approach to modeling such distributions. Learning and inference with EBMs require estimating the partition function, which is intractable in general. For discrete structured data this is even more challenging due to the absence of gradients. In this paper we propose ALOE, a new algorithm for learning conditional and unconditional EBMs for discrete, structured data, where parameter gradients are estimated using a learned sampler that mimics random local search, thus achieving a better trade-off between flexibility and tractability.
We show that the sampler can still be trained efficiently using a bias reduction principle that alternates between importance reweighted maximum likelihood estimation and Gibbs sampling.
Experimentally, we show that learning local search leads to significant improvements on two different domains, program synthesis and software engineering. Most notably, our energy-model-guided fuzzer for software testing achieves performance comparable to well-engineered fuzzing engines like libFuzzer on some targets.
We present a new program synthesis approach that combines an encoder-decoder based synthesis architecture with a differentiable program fixer. Our approach is inspired by the fact that human developers seldom get their program correct on the first attempt, and perform iterative testing-based program fixing to reach the desired program functionality. Similarly, our approach first learns a distribution over programs conditioned on an encoding of a set of input-output examples, and then iteratively performs fix operations using the program fixer module. The fixer module takes as input the original examples and the current program's outputs on the example inputs, and generates a new distribution over programs with the goal of reducing the discrepancies between the current program outputs and the desired example outputs. We train our architecture end-to-end on the RobustFill domain, and show that the addition of the fixer module leads to a significant improvement of 7% on synthesis compared to using beam search alone.
Learning to represent programs with property signatures
Augustus Odena
International Conference on Learning Representations (ICLR) (2020) (to appear)
We introduce the notion of property signatures, a representation for programs and program specifications meant for consumption by machine learning algorithms. Given a function with input type τ_in and output type τ_out, a property is a function of type (τ_in, τ_out) → Bool that (informally) describes some simple property of the function under consideration. For instance, if τ_in and τ_out are both lists of the same type, one property might ask ‘is the input list the same length as the output list?’. If we have a list of such properties, we can evaluate them all for our function to get a list of outputs that we will call the property signature. Crucially, we can ‘guess’ the property signature for a function given only a set of input/output pairs meant to specify that function. We discuss several potential applications of property signatures and show experimentally that they can be used to improve over a baseline synthesizer so that it emits twice as many programs in less than one-tenth of the time.
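The definition admits a compact sketch; the three properties below are illustrative choices, and collapsing each signature entry to a single boolean is a simplification (the paper also distinguishes mixed outcomes across examples).

```python
# Each property maps an (input, output) pair to a Bool.
properties = [
    ("same length", lambda i, o: len(i) == len(o)),
    ("output shorter", lambda i, o: len(o) < len(i)),
    ("output sorted", lambda i, o: list(o) == sorted(o)),
]

def property_signature(io_pairs):
    """One entry per property: True iff it holds on every example."""
    return [all(p(i, o) for i, o in io_pairs) for _, p in properties]

# "Guessing" the signature from input/output pairs alone:
examples = [([3, 1, 2], [1, 2, 3]), ([5, 4], [4, 5])]
print(property_signature(examples))
```

The same signature can be computed for a candidate program's actual behavior, giving the synthesizer a cheap feature vector to compare against the specification.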
Learning to Execute Programs with Instruction Pointer Attention Graph Neural Networks
Thirty-fourth Conference on Neural Information Processing Systems (2020)
Graph neural networks (GNNs) have emerged as a powerful tool for learning software engineering tasks including code completion, bug finding, and program repair. They benefit from leveraging program structure like control flow graphs, but they are not well-suited to tasks like program execution that require far more sequential reasoning steps than number of GNN propagation steps. Recurrent neural networks (RNNs), on the other hand, are well-suited to long sequential chains of reasoning, but they do not naturally incorporate program structure and generally perform worse on the above tasks. Our aim is to achieve the best of both worlds, and we do so by introducing a novel GNN architecture, the Instruction Pointer Attention Graph Neural Network (IPA-GNN), which achieves systematic generalization on the task of learning to execute programs using control flow graphs. The model arises by developing a spectrum of models between RNNs operating on program traces with branch decisions as latent variables and GNNs. The IPA-GNN can be seen either as a continuous relaxation of the RNN model or as a GNN variant more tailored to execution. To test the models, we propose evaluating systematic generalization on learning to execute using control flow graphs, which tests sequential reasoning and use of program structure. More practically, we evaluate these models on the task of learning to execute partial programs, as might arise if using the model as a value function in program synthesis. Results show that the IPA-GNN outperforms a variety of RNN and GNN baselines on both tasks.
Where should I comment my code? A dataset and model for predicting locations which need comments
Earl T. Barr
Michael Ernst
Santanu Dash
International Conference on Software Engineering (ICSE) (2020)
It is important to write code comments. Programmers should not comment every line of code: doing so would clutter the code, and programmers do not have time to do so in any event. Programmers must judiciously decide where to write code comments.
We have created a machine learning model that suggests locations where a programmer should write a code comment. We trained it on existing high-quality commented code to learn the locations chosen by developers. Once trained, the model can predict locations in new code. We find that our models can achieve good accuracy on this task, but there is a lot of scope for future improvement.
Big Code != Big Vocabulary: Open-Vocabulary Models for Source Code
Rafael-Michael Karampatsis
Hlib Babii
Romain Robbes
Andrea Janes
International Conference on Software Engineering (ICSE) (2020) (to appear)
Statistical language modeling techniques have successfully been applied to large source code corpora, yielding a variety of new software development tools, such as tools for code suggestion, improving readability, and API migration. A major issue with these techniques is that code introduces new vocabulary at a far higher rate than natural language, as new identifier names proliferate. Both large vocabularies and out-of-vocabulary issues severely affect Neural Language Models (NLMs) of source code, degrading their performance and rendering them unable to scale.
In this paper, we address this issue by: 1) studying how various modelling choices impact the resulting vocabulary on a large-scale corpus of 13,362 projects; 2) presenting an open vocabulary source code NLM that can scale to such a corpus, 100 times larger than in previous work; and 3) showing that such models outperform the state of the art on three distinct code corpora (Java, C, Python). To our knowledge, these are the largest NLMs for code that have been reported.
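One standard open-vocabulary remedy is subword segmentation; the toy byte-pair-encoding sketch below (not the paper's implementation) shows how frequent identifier fragments become reusable units, so unseen identifiers no longer need their own vocabulary entries.

```python
from collections import Counter

def bpe_merges(tokens, num_merges):
    """Learn `num_merges` character-pair merges from a list of tokens,
    greedily merging the most frequent adjacent pair each round."""
    words = [list(w) for w in tokens]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for w in words:
            for i in range(len(w) - 1):
                pairs[(w[i], w[i + 1])] += 1
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merges.append(a + b)
        for w in words:  # apply the merge everywhere
            i = 0
            while i < len(w) - 1:
                if w[i] == a and w[i + 1] == b:
                    w[i:i + 2] = [a + b]
                else:
                    i += 1
    return merges

idents = ["getCount", "setCount", "getCount", "getName"]
merges = bpe_merges(idents, 3)
print(merges)
```

Fragments like these compose to cover new identifiers, keeping the vocabulary fixed as the corpus grows.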
Global Relational Models of Source Code
Vincent Josua Hellendoorn
Rishabh Singh
International Conference on Learning Representations (ICLR) (2020)
Models of code can learn distributed representations of a program’s syntax and semantics to predict many non-trivial properties of a program. Recent state-of-the-art models leverage highly structured representations of programs, such as trees, graphs and paths therein (e.g., data-flow relations), which are precise and abundantly available for code. This provides a strong inductive bias towards semantically meaningful relations, yielding more generalizable representations than classical sequence-based models. Unfortunately, these models primarily rely on graph-based message passing to represent relations in code, which makes them de facto local due to the high cost of message-passing steps, quite in contrast to modern, global sequence-based models, such as the Transformer. In this work, we bridge this divide between global and structured models by introducing two new hybrid model families that are both global and incorporate structural bias: Graph Sandwiches, which wrap traditional (gated) graph message-passing layers in sequential message-passing layers; and Graph Relational Embedding Attention Transformers (GREAT for short), which bias traditional Transformers with relational information from graph edge types. By studying a popular, non-trivial program repair task, variable-misuse identification, we explore the relative merits of traditional and hybrid model families for code representation. Starting with a graph-based model that already improves upon the prior state-of-the-art for this task by 20%, we show that our proposed hybrid models improve an additional 10–15%, while training both faster and using fewer parameters.
Incremental Sampling Without Replacement for Sequence Models
Proceedings of the 37th International Conference on Machine Learning (2020)
Sampling is a fundamental technique, and sampling without replacement is often desirable when duplicate samples are not beneficial. Within machine learning, sampling is useful for generating diverse outputs from a trained model. We present an elegant procedure for sampling without replacement from a broad class of randomized programs, including generative neural models that construct outputs sequentially. Our procedure is efficient even for exponentially-large output spaces. Unlike prior work, our approach is incremental, i.e., samples can be drawn one at a time, allowing for increased flexibility. We also present a new estimator for computing expectations from samples drawn without replacement. We show that incremental sampling without replacement is applicable to many domains, e.g., program synthesis and combinatorial optimization.
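The incremental idea can be sketched on a toy model: after each draw, the sampled sequence's probability mass is subtracted along its path, so the next draw renormalizes over what remains. The independent-bit "sequence model" below is an illustrative simplification, not the paper's estimator, which handles general sequential models and exponentially large spaces.

```python
import random

def swor(step_probs, num_samples, seed=0):
    """Draw distinct 0/1 sequences without replacement; bit i is 1 with
    probability step_probs[i] under the toy model."""
    rng = random.Random(seed)
    drawn = {}  # prefix tuple -> probability mass already drawn below it

    def prob(prefix):
        p = 1.0
        for i, b in enumerate(prefix):
            p *= step_probs[i] if b else 1.0 - step_probs[i]
        return p

    samples = []
    for _ in range(num_samples):
        prefix = ()
        while len(prefix) < len(step_probs):
            # Remaining (undrawn) mass under each child of this prefix.
            rem = [prob(prefix + (b,)) - drawn.get(prefix + (b,), 0.0)
                   for b in (0, 1)]
            r = rng.random() * (rem[0] + rem[1])
            prefix += (1,) if r > rem[0] else (0,)
        mass = prob(prefix)
        for i in range(len(prefix) + 1):  # subtract mass along the path
            drawn[prefix[:i]] = drawn.get(prefix[:i], 0.0) + mass
        samples.append(prefix)
    return samples

out = swor([0.9, 0.5, 0.5], 6)
print(out)
```

Each draw touches only one root-to-leaf path, which is what makes the incremental, one-sample-at-a-time interface possible.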
Bayesian nonparametric models provide a principled way to automatically adapt the complexity of a model to the amount of the data available, but computation in such models is difficult. Amortized variational approximations are appealing because of their computational efficiency, but current methods rely on a fixed finite truncation of the infinite model. This truncation level can be difficult to set, and also interacts poorly with amortized methods due to the over-pruning problem. Instead, we propose a new variational approximation, based on a method from statistical physics called Russian roulette sampling. This allows the variational distribution to adapt its complexity during inference, without relying on a fixed truncation level, and while still obtaining an unbiased estimate of the gradient of the original variational objective. We demonstrate this method on infinite sized variational auto-encoders using a Beta-Bernoulli (Indian buffet process) prior.
We present a neurosymbolic framework for the lifelong learning of algorithmic tasks that mix perception and procedural reasoning. Reusing high-level concepts across domains and learning complex procedures are key challenges in lifelong learning. We show that a program synthesis approach that combines gradient descent with combinatorial search over programs can be a more effective response to these challenges than purely neural methods. Our framework, called HOUDINI, represents neural networks as strongly typed, differentiable functional programs that use symbolic higher-order combinators to compose a library of neural functions. Our learning algorithm consists of: (1) a symbolic program synthesizer that performs a type-directed search over parameterized programs, and decides on the library functions to reuse, and the architectures to combine them, while learning a sequence of tasks; and (2) a neural module that trains these programs using stochastic gradient descent. We evaluate HOUDINI on three benchmarks that combine perception with the algorithmic tasks of counting, summing, and shortest-path computation. Our experiments show that HOUDINI transfers high-level concepts more effectively than traditional transfer learning and progressive neural networks, and that the typed representation of networks significantly accelerates the search.