Tackling fundamental questions in deep learning and physics using a scientific approach. Our main focus is on understanding and improving the capabilities of large language models.
About the team
Our goal is to understand the principles that govern machine learning and other physical systems by developing models and experimentally testing hypotheses. We focus on understanding the limitations of large-scale transformer models and on extending their capabilities to challenging problems in areas such as mathematics, science, programming, algorithms, and planning.
In these domains, agents can draw on very long context, adaptive inference-time compute (e.g., scratchpads, recurrence, memory), external tools (e.g., a library of functions, a search engine, a calculator, additional models), or other methods to solve problems outside their training domain when given instructions and a few examples.
Team focus summaries
Science of Deep Learning
Developing hypotheses, testing them experimentally, and building simple yet predictive theoretical models, with the goal of understanding the principles that govern deep learning.
Long-Range Language Models
Pushing large language models to use very long effective context (e.g., millions of tokens) and to generate long, coherent content.
Contemplative Language Models
Extending the capabilities of large language models to challenging problems in areas such as mathematics, science, programming, algorithms, and planning. We are mostly interested in scenarios and domains where every step of the solution can be expressed in language -- natural or otherwise.
The Beyond the Imitation Game Benchmark (BIG-bench) is a collaborative benchmark consisting of 207 tasks contributed by over 400 authors across 132 institutions, intended to probe large language models and extrapolate their future capabilities.
We show that asking large language models to write their intermediate computations to a scratchpad enables them to perform complex tasks involving multi-step computation.
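To make the idea concrete, here is a minimal sketch of how a scratchpad-style few-shot prompt can be assembled. The `Input:`/`Target:` labels, the `<scratch>` delimiters, the helper function, and the addition task are all illustrative assumptions, not the paper's exact format.

```python
# Hypothetical sketch of scratchpad prompting: few-shot examples whose
# targets include intermediate computations, ending with an open scratchpad
# that the model is expected to fill in before emitting its answer.

def scratchpad_prompt(question, examples):
    """Format few-shot examples whose targets include a scratchpad."""
    parts = []
    for q, steps, answer in examples:
        parts.append(f"Input: {q}")
        parts.append("<scratch>")
        parts.extend(steps)          # intermediate computations, one per line
        parts.append("</scratch>")
        parts.append(f"Target: {answer}")
    parts.append(f"Input: {question}")
    parts.append("<scratch>")        # the model continues from here
    return "\n".join(parts)

examples = [("29 + 57",
             ["9 + 7 = 16, write 6 carry 1",
              "2 + 5 + 1 = 8, write 8"],
             "86")]
prompt = scratchpad_prompt("48 + 76", examples)
```

The prompt ends mid-scratchpad, so the model's continuation naturally produces the intermediate steps before the final target.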
We introduce a way of decomposing hidden states over a sequence into a temporal component (independent of the input) and an input-driven component (independent of sequence position). This reveals how attention matrices are formed.
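One simple way to realize such a decomposition is by averaging: averaging hidden states over a batch of inputs isolates a position-dependent (temporal) component, while averaging over positions isolates an input-dependent component. The shapes, variable names, and averaging scheme below are assumptions for illustration, not the paper's exact construction.

```python
import numpy as np

# Hedged sketch: decompose hidden states H into a global mean, a temporal
# component (varies with position, not input), an input-driven component
# (varies with input, not position), and a residual.

rng = np.random.default_rng(0)
H = rng.normal(size=(8, 16, 4))   # (batch, sequence length, hidden dim)

mean = H.mean(axis=(0, 1), keepdims=True)
temporal = H.mean(axis=0, keepdims=True) - mean      # position-dependent part
input_driven = H.mean(axis=1, keepdims=True) - mean  # input-dependent part
residual = H - mean - temporal - input_driven

# The four pieces reconstruct the original hidden states exactly.
reconstructed = mean + temporal + input_driven + residual
```

By construction the decomposition is exact, and each component can be inspected separately to see how much of the hidden state is explained by position alone versus input content.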
We propose Sharpness-Aware Minimization (SAM), an optimization algorithm that improves generalization by seeking parameters that lie in neighborhoods having uniformly low loss.
In this large-scale study, we demonstrate that as upstream accuracy increases with scale, downstream task performance saturates. We further investigate the reasons behind this phenomenon.
We propose, derive, and investigate a categorization of scaling laws for generalization in neural networks.
We find that robustness to catastrophic forgetting in pretrained models systematically improves with model and dataset scale.
We propose a new framework for reasoning about generalization in deep learning.