Anders Andreassen
Authored Publications
Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models
Aitor Lewkowycz
Daniel Freeman
Guy Gur-Ari
Jaehoon Lee
Jascha Sohl-Dickstein
Liam B. Fedus
TBD (2022)
Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale. Despite their potentially transformative impact, these new capabilities are as yet poorly characterized. In order to direct future research, prepare for disruptive new model capabilities, and ameliorate socially harmful effects, it is vital that we understand the present and near-future capabilities and limitations of language models.
To address this challenge, we introduce the Beyond the Imitation Game benchmark (BIG-bench). BIG-bench consists of 207 tasks, contributed by over 400 authors across 132 institutions. Task topics are diverse, drawing problems from linguistics, childhood development, math, common sense reasoning, biology, physics, social bias, software development, and beyond. BIG-bench focuses on capabilities that are believed to be beyond current language models. We evaluate the behavior of OpenAI's GPT models, Google-internal dense transformer architectures, and Switch-style sparse transformers on BIG-bench, across model sizes spanning millions to hundreds of billions of parameters. A team of human experts further performed all tasks, to provide a strong baseline. Findings include: model performance and calibration both improve with scale, but are poor in absolute terms (and when compared with human performance); model performance is remarkably similar across model classes; tasks that improve gradually and predictably commonly involve a large knowledge or memorization component, whereas tasks that exhibit “breakthrough” behavior at a critical scale often involve a significant reasoning or algorithmic component; social bias typically increases with scale in settings with ambiguous context, but this can be improved with prompting.
Exploring Length Generalization in Large Language Models
Cem Anil
Yuhuai Wu
Aitor Lewkowycz
Guy Gur-Ari
NeurIPS Oral (2022)
The ability to extrapolate from short problem instances to longer ones is an important form of out-of-distribution generalization in reasoning tasks, and is crucial when learning from datasets where longer problem instances are rare. These include theorem proving, solving quantitative mathematics problems, and reading/summarizing novels. In this paper, we run careful empirical studies exploring the length generalization capabilities of transformer-based language models. We first establish that naively finetuning transformers on length generalization tasks shows significant generalization deficiencies independent of model scale. We then show that combining pretrained large language models' in-context learning abilities with scratchpad prompting (asking the model to output solution steps before producing an answer) results in a dramatic improvement in length generalization. We run careful failure analyses on each of the learning modalities and identify common sources of mistakes that highlight opportunities in equipping language models with the ability to generalize to longer problems.
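The evaluation setup behind these studies — train on short problem instances, test on strictly longer ones — can be sketched as follows. The toy addition task, function names, and digit-count cutoff are illustrative choices, not the paper's code:

```python
import random

def make_addition_examples(n, max_digits, seed=0):
    """Toy addition problems tagged with their operand length (digit count)."""
    rng = random.Random(seed)
    examples = []
    for _ in range(n):
        d = rng.randint(1, max_digits)
        a = rng.randint(10 ** (d - 1), 10 ** d - 1)
        b = rng.randint(10 ** (d - 1), 10 ** d - 1)
        examples.append({"prompt": f"{a}+{b}=", "answer": str(a + b), "length": d})
    return examples

def length_generalization_split(examples, max_train_length):
    """In-distribution: short problems; out-of-distribution: strictly longer ones."""
    train = [e for e in examples if e["length"] <= max_train_length]
    test = [e for e in examples if e["length"] > max_train_length]
    return train, test
```

A model fit only on the `train` split is then scored on the `test` split, so any accuracy it achieves there reflects out-of-distribution length generalization rather than memorized lengths.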
Solving Quantitative Reasoning Problems with Language Models
Aitor Lewkowycz
David Martin Dohan
Henryk Michalewski
Cem Anil
Imanol Schlag
Theo Gutman-Solo
Yuhuai Wu
Guy Gur-Ari
NeurIPS (2022)
Language models have achieved remarkable performance on a wide range of tasks that require natural language understanding. Nevertheless, state-of-the-art models have generally struggled with tasks that require quantitative reasoning, such as solving mathematics, science, and engineering problems at the college level. To help close this gap, we introduce Minerva, a large language model pretrained on general natural language data and further trained on technical content. The model achieves state-of-the-art performance on technical benchmarks without the use of external tools. We also evaluate our model on over two hundred undergraduate-level problems in physics, biology, chemistry, economics, and other sciences that require quantitative reasoning, and find that the model can correctly answer nearly a third of them.
Scaffolding Simulations with Deep Learning for High-dimensional Deconvolution
Patrick T. Komiske
Eric M. Metodiev
Benjamin Nachman
Adi Suresh
Jesse Thaler
ICLR SimDL Workshop (2021)
A common setting for scientific inference is the ability to sample from a high-fidelity forward model (simulation) without having an explicit probability density of the data. We propose a simulation-based maximum likelihood deconvolution approach in this setting called OmniFold. Deep learning enables this approach to be naturally unbinned and variable- and high-dimensional. In contrast to model parameter estimation, the goal of deconvolution is to remove detector distortions in order to enable a variety of downstream inference tasks. Our approach is the deep learning generalization of the common Richardson-Lucy approach, also called Iterative Bayesian Unfolding in particle physics. We show that OmniFold can not only remove detector distortions but can also account for noise processes and acceptance effects.
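For reference, the classical binned Richardson-Lucy / Iterative Bayesian Unfolding update that this approach generalizes fits in a few lines of numpy; the response matrix and bin counts below are illustrative, not from the paper:

```python
import numpy as np

def richardson_lucy(response, measured, n_iter=200):
    """Iteratively unfold a measured spectrum given a detector response.

    response[i, j] = P(measured bin i | true bin j); each column sums to 1.
    """
    n_true = response.shape[1]
    truth = np.full(n_true, measured.sum() / n_true)  # start from a flat prior
    for _ in range(n_iter):
        folded = response @ truth                     # expected measured counts
        ratio = measured / np.maximum(folded, 1e-12)  # data / prediction per bin
        truth = truth * (response.T @ ratio)          # multiplicative RL update
    return truth

# Toy example: mild bin-to-bin migration smears a two-bin truth spectrum
response = np.array([[0.9, 0.1],
                     [0.1, 0.9]])
measured = response @ np.array([100.0, 50.0])         # fold the true spectrum
unfolded = richardson_lucy(response, measured)        # recovers ~[100, 50]
```

The deep-learning version replaces these histogram ratios with classifier-based reweighting of unbinned, high-dimensional events, but the iterative structure is the same.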
The Evolution of Out-of-Distribution Robustness Throughout Fine-Tuning
Yasaman Bahri
Behnam Neyshabur
Rebecca Roelofs
(2021)
Although machine learning models typically experience a drop in performance on out-of-distribution data, accuracies on in- versus out-of-distribution data are widely observed to follow a single linear trend when evaluated across a testbed of models. Models that are more accurate on out-of-distribution data relative to this baseline exhibit “effective robustness” and are exceedingly rare. Identifying such models, and understanding their properties, is key to improving out-of-distribution performance. We conduct a thorough empirical investigation of effective robustness during fine-tuning and, surprisingly, find that models pre-trained on larger datasets exhibit effective robustness during training that vanishes at convergence. We study how properties of the data influence effective robustness, and show that it increases with dataset size, diversity, and example difficulty. We also find that models displaying effective robustness correctly classify 10% of the examples that no other current testbed model gets correct. Finally, we discuss several strategies for scaling effective robustness to the high-accuracy regime to improve the out-of-distribution accuracy of state-of-the-art models.
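The baseline described above can be made concrete with a short numpy sketch: fit the linear in- versus out-of-distribution accuracy trend across a testbed, then score a model by its residual above that line. The accuracies below are made up for illustration, and the literature often fits this trend after a probit transform; a plain linear fit is used here for simplicity:

```python
import numpy as np

# Hypothetical testbed: (in-distribution accuracy, out-of-distribution accuracy)
acc_id = np.array([0.70, 0.75, 0.80, 0.85, 0.90])
acc_ood = np.array([0.55, 0.61, 0.67, 0.73, 0.79])

# Linear in- vs out-of-distribution trend across the testbed
slope, intercept = np.polyfit(acc_id, acc_ood, 1)

def effective_robustness(model_id_acc, model_ood_acc):
    """OOD accuracy in excess of the testbed's fitted linear baseline."""
    baseline = slope * model_id_acc + intercept
    return model_ood_acc - baseline

# A model at 80% ID / 72% OOD sits 5 points above the fitted line
print(round(effective_robustness(0.80, 0.72), 3))  # prints 0.05
```

A model exactly on the trend line scores zero; effectively robust models are the rare ones with a clearly positive residual.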
Show Your Work: Scratchpads for Intermediate Computation with Language Models
Maxwell Nye
Guy Gur-Ari
Henryk Witold Michalewski
David Martin Dohan
Aitor Lewkowycz
Maarten Paul Bosma
David Luan
Augustus Odena
(2021)
Large pre-trained language models perform remarkably well on tasks that can be done “in one pass”, such as generating realistic text (Brown et al., 2020) or synthesizing computer programs (Chen et al., 2021; Austin et al., 2021). However, they struggle with tasks that require unbounded multi-step computation, such as adding integers (Brown et al., 2020) or executing programs (Austin et al., 2021). Surprisingly, we find that these same models are able to perform complex multistep computations—even in the few-shot regime—when asked to perform the operation “step by step”, showing the results of intermediate computations. In particular, we train Transformers to perform multi-step computations by asking them to emit intermediate computation steps into a “scratchpad”. On a series of increasingly complex tasks ranging from long addition to the execution of arbitrary programs, we show that scratchpads dramatically improve the ability of language models to perform multi-step computations.
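As a concrete illustration of the kind of intermediate trace a scratchpad elicits, here is a minimal sketch that emits digit-by-digit long-addition steps with carries; the step formatting is hypothetical, not the paper's exact prompt template:

```python
def addition_scratchpad(a: int, b: int) -> list[str]:
    """Emit digit-by-digit addition steps, least-significant digit first."""
    steps, carry = [], 0
    da, db = str(a)[::-1], str(b)[::-1]
    for i in range(max(len(da), len(db))):
        x = int(da[i]) if i < len(da) else 0
        y = int(db[i]) if i < len(db) else 0
        total = x + y + carry
        steps.append(f"{x} + {y} + carry {carry} = {total}, write {total % 10}")
        carry = total // 10
    if carry:
        steps.append(f"final carry {carry}")
    return steps

for line in addition_scratchpad(29, 57):
    print(line)
# prints:
# 9 + 7 + carry 0 = 16, write 6
# 2 + 5 + carry 1 = 8, write 8
```

Training or prompting a model to produce traces like these, rather than only the final sum, is what lets each step condition on the explicitly written intermediate state.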
The LHC Olympics 2020: A Community Challenge for Anomaly Detection in High Energy Physics
Gregor Kasieczka
Benjamin Nachman
David Shih
Oz Amram
Kees Benkendorfer
Blaz Bortolato
Gustaaf Brooijmans
Florencia Canelli
Jack H. Collins
Biwei Dai
Felipe F. De Freitas
Barry M. Dillon
Ioan-Mihail Dinu
Zhongtian Dong
Julien Donini
Javier Duarte
A. Faroughy
Julia Gonski
Philip Harris
Alan Kahn
Jernej F. Kamenik
Charanjit K. Khosa
Patrick Komiske
Luc Le Pottier
Pablo Martín-Ramiro
Andrej Matevc
Eric Metodiev
Vinicius Mikuni
Inês Ochoa
Sang Eon Park
Maurizio Pierini
Dylan Rankin
Veronica Sanz
Nilai Sarda
Uroš Seljak
Aleks Smolkovic
George Stein
Cristina Mantilla Suarez
Manuel Szewc
Jesse Thaler
Steven Tsan
Silviu-Marian Udrescu
Louis Vaslin
Jean-Roch Vlimant
Daniel Williams
Mikaeel Yunus
Rept. Prog. Phys. 84 (2021)
A new paradigm for data-driven, model-agnostic new physics searches at colliders is emerging, and aims to leverage recent breakthroughs in anomaly detection and machine learning. In order to develop and benchmark new anomaly detection methods within this framework, it is essential to have standard datasets. To this end, we have created the LHC Olympics 2020, a community challenge accompanied by a set of simulated collider events. Participants in these Olympics have developed their methods using an R&D dataset and then tested them on black boxes: datasets that may or may not contain an unknown anomaly. This paper reviews the LHC Olympics 2020 challenge, including an overview of the competition, a description of methods deployed in the competition, lessons learned from the experience, and implications for data analyses with future datasets as well as future colliders.
Understanding the Failure Modes of Out-of-Distribution Generalization
Vaishnavh Nagarajan
Behnam Neyshabur
ICLR (2021)
Empirical studies suggest that machine learning models often rely on features, such as the background, that may be spuriously correlated with the label only during training time, resulting in poor accuracy during test-time. In this work, we identify the fundamental factors that give rise to this behavior, by explaining why models fail this way even in easy-to-learn tasks where one would expect these models to succeed. In particular, through a theoretical study of gradient-descent-trained linear classifiers on some easy-to-learn tasks, we uncover two complementary failure modes. These modes arise from how spurious correlations induce two kinds of skews in the data: one geometric in nature, and another, statistical in nature. Finally, we construct natural modifications of image classification datasets to understand when these failure modes can arise in practice. We also design experiments to isolate the two failure modes when training modern neural networks on these datasets.
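A gradient-descent-trained linear classifier of the kind studied here can be demonstrated in a toy setting. The two-feature construction below — a weak but reliable feature plus a strong feature that is spuriously correlated with the label only at training time — is an illustration under assumed noise scales, not the paper's exact construction:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
y = rng.choice([-1.0, 1.0], n)

# Training set: a weak reliable feature and a strong train-only spurious one
core = y + rng.normal(0.0, 2.0, n)       # noisy but genuinely predictive
spurious = y + rng.normal(0.0, 0.1, n)   # nearly determines the training label
X_train = np.stack([core, spurious], axis=1)

# Test set: the spurious feature decorrelates from the label
X_test = np.stack([y + rng.normal(0.0, 2.0, n),
                   rng.normal(0.0, 1.0, n)], axis=1)

# Train a linear classifier with gradient descent on the logistic loss
w = np.zeros(2)
for _ in range(2000):
    margins = y * (X_train @ w)
    sig = 0.5 * (1.0 - np.tanh(margins / 2.0))  # sigmoid(-margins), numerically stable
    w += 0.1 * (X_train * (y * sig)[:, None]).mean(axis=0)

train_acc = np.mean(np.sign(X_train @ w) == y)
test_acc = np.mean(np.sign(X_test @ w) == y)
```

Gradient descent leans on the low-noise spurious coordinate, so training accuracy is high while test accuracy drops well below it once that coordinate stops tracking the label.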
Parameter Estimation using Neural Networks in the Presence of Detector Effects
Shih-Chieh Hsu
Benjamin Nachman
Natchanon Suaysom
Adi Suresh
Phys. Rev. D, 103 (2020)
Histogram-based template fits are the main technique used for estimating parameters of high energy physics Monte Carlo generators. Parametrized neural network reweighting can be used to extend this fitting procedure to many dimensions and does not require binning. If the fit is to be performed using reconstructed data, then expensive detector simulations must be used for training the neural networks. We introduce a new two-level fitting approach that only requires one dataset with detector simulation and then a set of additional generation-level datasets without detector effects included. This Simulation-level fit based on Reweighting Generator-level events with Neural networks (SRGN) is demonstrated using simulated datasets for a variety of examples including a simple Gaussian random variable, parton shower tuning, and the top quark mass extraction.
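The histogram-based template fit that this work extends can be sketched as a one-parameter chi-square scan; the Gaussian toy model, binning, and parameter grid below are illustrative choices, not the paper's benchmarks:

```python
import numpy as np

rng = np.random.default_rng(0)
edges = np.linspace(-4.0, 5.0, 19)

# "Data": 10k events generated at the true parameter value mu = 1.0
data_hist, _ = np.histogram(rng.normal(1.0, 1.0, 10_000), edges)

def template_chi2(mu):
    """Chi-square between the data histogram and a high-statistics template at mu."""
    t_hist, _ = np.histogram(rng.normal(mu, 1.0, 100_000), edges)
    t_hist = t_hist * data_hist.sum() / t_hist.sum()  # normalize to the data yield
    var = np.maximum(data_hist, 1.0)                  # Poisson variance per bin
    return np.sum((data_hist - t_hist) ** 2 / var)

grid = np.arange(0.5, 1.51, 0.1)                      # candidate parameter values
best_mu = min(grid, key=template_chi2)                # ~1.0, the true value
```

Each grid point requires a fresh simulated template, which is exactly why avoiding repeated detector-level simulation via neural-network reweighting is attractive.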
Asymptotics of Wide Convolutional Neural Networks
Ethan Dyer
(2020)
Wide neural networks have proven to be a rich class of architectures for both theory and practice. Motivated by the observation that finite width convolutional networks appear to outperform infinite width networks, we study scaling laws for wide CNNs and networks with skip connections. Following the approach of (Dyer & Gur-Ari, 2019), we present a simple diagrammatic recipe to derive the asymptotic width dependence for many quantities of interest. These scaling relationships provide a solvable description for the training dynamics of wide convolutional networks. We test these relations across a broad range of architectures. In particular, we find that the difference in performance between finite and infinite width models vanishes at a definite rate with respect to model width. Nonetheless, this relation is consistent with finite width models generalizing either better or worse than their infinite width counterparts, and we provide examples where the relative performance depends on the optimization details.
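Schematically, scaling relations of this kind take the following form (notation assumed here, in the spirit of the 1/width expansions of Dyer & Gur-Ari, rather than quoted from this paper):

```latex
% Expectation of an observable O in a width-n network:
% the leading finite-width correction enters at a definite power of 1/n
\mathbb{E}\left[O_n\right] = O_\infty + \frac{c_O}{n} + \mathcal{O}\!\left(\frac{1}{n^2}\right)
```

so the finite- versus infinite-width gap vanishes at a definite rate in 1/n, while the sign of the coefficient c_O is what allows finite-width models to generalize either better or worse than their infinite-width counterparts.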