

Tackling fundamental questions in deep learning and physics using a scientific approach. Our main focus is on understanding and improving the capabilities of large language models.

Language models



About the team

Our goal is to understand the principles that govern machine learning and to improve the capabilities of machine learning systems. We focus on understanding the limitations of large-scale transformer models and extending their capabilities to solve challenging problems in areas such as mathematics, science, programming, algorithms, and planning.

In these domains, agents can make use of very long context, adaptive inference-time compute (e.g., scratchpad, recurrence, memory), external tools (e.g., a library of functions, a search engine, a calculator, additional models), or other methods to solve out-of-training-domain problems when given instructions and a few examples.

Team focus summaries

Capabilities of large transformers

Designing targeted experiments to systematically identify key areas of improvement in large-scale transformers and using the insights from experiments to develop qualitatively novel abilities.


Science of deep learning

Developing hypotheses, testing them experimentally, and building simple yet predictive theoretical models, with the goal of understanding the principles governing deep learning.

Long-range language models

Pushing large language models to use very long effective context (e.g., millions of tokens) and to generate long, coherent content.

Contemplative language models

Extending the capabilities of large language models to solving challenging problems in areas such as mathematics, science, programming, algorithms, and planning. We are mostly interested in scenarios and domains where all steps of the solution can be expressed in language -- natural or otherwise.


Featured publications

Recent developments in large-scale machine learning have created a tempting picture suggesting that by scaling up data, model size and training time properly, one can obtain a model that can be used successfully in few-shot settings in all downstream tasks. In this work, we investigate this premise empirically and provide a strong case against it. In particular, we consider the image recognition task with large-scale models (Vision Transformers) trained on the largest scale of available data (JFT). We show that as we improve the performance of the upstream task, either by scaling up or through hyper-parameter and architectural choices, the performance of many downstream tasks eventually plateaus. We showcase an even more extreme scenario where upstream and downstream performance contradict each other, i.e., in order to have better downstream performance, we need to hurt upstream accuracy. We delve deeper into understanding the reasons that give rise to these phenomena by designing interventions and investigating different components of the models, which gives us crude yet useful insights into the mechanisms behind these observations.
Inspired by human learning, researchers have proposed ordering examples during training based on their difficulty. Both curriculum learning, exposing a network to easier examples early in training, and anti-curriculum learning, showing the most difficult examples first, have been suggested as improvements to the standard i.i.d. training. In this work, we set out to investigate the relative benefits of ordered learning. We first investigate the implicit curricula resulting from architectural and optimization bias and find that samples are learned in a highly consistent order. Next, to quantify the benefit of explicit curricula, we conduct extensive experiments over thousands of orderings spanning three kinds of learning: curriculum, anti-curriculum, and random-curriculum -- in which the size of the training dataset is dynamically increased over time, but the examples are randomly ordered. We find that for standard benchmark datasets, curricula have only marginal benefits, and that randomly ordered samples perform as well as or better than curricula and anti-curricula, suggesting that any benefit is entirely due to the dynamic training set size. Inspired by common use cases of curriculum learning in practice, we investigate the role of a limited training time budget and noisy data in the success of curriculum learning. Our experiments demonstrate that curriculum, but not anti-curriculum, can indeed improve performance either with a limited training time budget or in the presence of noisy data.
Sharpness-Aware Minimization Improves Language Model Generalization
Yi Tay
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (2022), pp. 7360-7371
The allure of superhuman-level capabilities has led to considerable interest in language models like GPT-3 and T5, wherein the research has, by and large, revolved around new model architectures, training tasks, and loss objectives, along with substantial engineering efforts to scale up model capacity and dataset size. Comparatively little work has been done to improve the generalization of these models through better optimization. In this work, we show that Sharpness-Aware Minimization (SAM), a recently proposed optimization procedure that encourages convergence to flatter minima, can substantially improve the generalization of language models without much computational overhead. We show that SAM is able to boost performance on SuperGLUE, GLUE, Web Questions, Natural Questions, Trivia QA, and TyDiQA, with particularly large gains when training data for these tasks is limited.
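As a rough illustration of the SAM procedure mentioned in the abstract above, the sketch below implements one common formulation of the update: take the gradient at the current weights, step a small distance `rho` along its normalized direction, and then apply the gradient computed at that perturbed point to the original weights. The quadratic loss, the helper name `sam_step`, and the values of `lr` and `rho` are illustrative assumptions, not the configuration used in the paper.

```python
import numpy as np

def sam_step(w, grad_fn, lr=0.1, rho=0.05):
    """One SAM-style update (illustrative sketch, hypothetical helper)."""
    g = grad_fn(w)
    # Move to a nearby "worst-case" point along the normalized gradient direction.
    eps = rho * g / (np.linalg.norm(g) + 1e-12)
    # Use the gradient at the perturbed point to update the original weights.
    g_sam = grad_fn(w + eps)
    return w - lr * g_sam

# Toy quadratic loss 0.5 * w^T A w, a hypothetical stand-in for a model loss.
A = np.diag([10.0, 1.0])

def loss(w):
    return 0.5 * w @ A @ w

def grad(w):
    return A @ w

w0 = np.array([1.0, 1.0])
w = w0.copy()
for _ in range(50):
    w = sam_step(w, grad)

print("initial loss:", loss(w0), "final loss:", loss(w))
```

Each step costs two gradient evaluations instead of one. With a fixed `rho`, the iterate on this toy problem settles in a small neighborhood of the minimum rather than exactly on it, because the perturbation never shrinks.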
We show theoretically and experimentally that both data whitening and second-order optimization erase information about the training dataset, and can prevent any generalization for high-dimensional datasets. First, we show that if the input layer of a model is a dense linear layer, then the datapoint-datapoint second moment matrix contains all information which can be used to make predictions. Second, we show that for high-dimensional datasets, where the number of features is at least as large as the number of datapoints, and where the whitening transform is computed on the full (train+test) dataset, whitening erases all information in this datapoint-datapoint second moment matrix. Generalization is thus completely impossible for models trained on high-dimensional whitened datasets. Second-order optimization of a linear model is identical to first-order optimization of the same model after data whitening. Second-order optimization can thus also prevent any generalization in similar situations. We experimentally verify these predictions for models trained on whitened data, and for linear models trained with an online Newton optimizer. We further experimentally demonstrate that generalization continues to be harmed even when the theoretical constraints on input dimensionality (for whitening), or linearity of the model (for second-order optimization) are relaxed.
On the training dynamics of deep networks with L2 regularization
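The whitening claim in the abstract above can be checked numerically on a toy example. The sketch below assumes uncentered whitening computed from the SVD of the full data matrix, with random Gaussian data and sizes `n = 5`, `d = 20` chosen purely for illustration. Under those assumptions, the datapoint-datapoint second moment matrix of the whitened data comes out as a multiple of the identity, so it no longer distinguishes one datapoint from another.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 20                    # fewer datapoints than features (d >= n)
X = rng.normal(size=(n, d))     # rows are datapoints (train and test together)

# Build a whitening transform from the SVD X = U diag(s) Vt. It rescales each
# principal direction so that the feature-feature second moment of the
# transformed data is the identity on the subspace spanned by the data.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
W = np.sqrt(n) * Vt.T @ np.diag(1.0 / s) @ Vt
X_white = X @ W

# d x d feature-feature second moment: a projection onto the data subspace.
feature_moment = X_white.T @ X_white / n
# n x n datapoint-datapoint second moment of the whitened data.
gram = X_white @ X_white.T

print(np.round(gram, 6))        # n times the identity, up to rounding
```

If the data were mean-centered before whitening, the rank would drop by one and the resulting matrix would be a centered projection rather than an exact multiple of the identity; this sketch omits centering to keep the algebra minimal.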
Aitor Lewkowycz
Guy Gur-Ari
NeurIPS Oral (2020)
We study the role of L2 regularization in deep learning, and uncover simple relations between the performance of the model, the L2 coefficient, the learning rate, and the number of training steps. These empirical relations hold when the network is overparameterized. They can be used to predict the optimal regularization parameter of a given model. In addition, based on these observations, we propose a dynamical schedule for the regularization parameter that improves performance and speeds up training. We test these proposals in modern image classification settings. Finally, we show that these empirical relations can be understood theoretically in the context of infinitely wide networks. We derive the gradient flow dynamics of such networks, and compare the role of L2 regularization in this context with that of linear models.
One desired capability for machines is the ability to transfer their understanding of one domain to another domain where data is (usually) scarce. Despite the wide adoption of transfer learning in many deep learning applications, we do not yet understand what enables a successful transfer and which part of the network is responsible for it. In this paper, we provide new tools and analysis to address these fundamental questions. We separate the effect of feature reuse from learning high-level statistics of data and show that some benefit of transfer learning comes from the latter.
Asymptotics of Wide Networks from Feynman Diagrams
Guy Gur-Ari
ICLR Spotlight (2019) (to appear)
Understanding the asymptotic behavior of wide networks is of considerable interest. In this work, we present a general method for analyzing this large-width behavior. The method is an adaptation of Feynman diagrams, a standard tool for computing multivariate Gaussian integrals. We apply our method to study training dynamics, improving existing bounds and deriving new results on wide network evolution during stochastic gradient descent. Going beyond the strict large-width limit, we present closed-form expressions for higher-order terms governing wide network training, and test these predictions empirically.
We study the phenomenon that some modules of deep neural networks (DNNs) are more critical than others, meaning that rewinding their parameter values back to initialization, while keeping other modules fixed at the trained parameters, results in a large drop in the network's performance. Our analysis reveals interesting properties of the loss landscape which lead us to propose a complexity measure, called module criticality, based on the shape of the valleys that connect the initial and final values of the module parameters. We formulate how generalization relates to the module criticality, and show that this measure is able to explain the superior generalization performance of some architectures over others, whereas earlier measures fail to do so.
Generalization of deep networks has been of great interest in recent years, resulting in a number of theoretically and empirically motivated complexity measures. However, most papers proposing such measures study only a small set of models, leaving open the question of whether the conclusions drawn from those experiments would remain valid in other settings. We present the first large-scale study of generalization in deep networks. We investigate more than 40 complexity measures taken from both theoretical bounds and empirical studies. We train over 10,000 convolutional networks by systematically varying commonly used hyperparameters. Hoping to uncover potentially causal relationships between each measure and generalization, we analyze carefully controlled experiments and show surprising failures of some measures as well as promising measures for further research.
