Tackling fundamental questions in deep learning and physics using a scientific approach. Our main focus is on understanding and improving the capabilities of large language models.
About the team
Our goal is to understand the principles that govern machine learning and to improve model capabilities. We focus on understanding the limitations of large-scale transformer models and extending their capabilities to challenging problems in areas such as mathematics, science, programming, algorithms, and planning.
In these domains, agents can make use of a very long context, adaptive inference-time compute (e.g., scratchpads, recurrence, memory), external tools (e.g., a library of functions, a search engine, a calculator, additional models), or other methods to solve problems outside their training domain when given instructions and a few examples.
Team focus summaries
Science of Deep Learning
Developing hypotheses, testing them experimentally, and devising simple yet predictive theoretical models, with the goal of understanding the principles that govern deep learning.
Long-Range Language Models
Pushing large language models to use a very long effective context (e.g., millions of tokens) and to generate long, coherent content.
Contemplative Language Models
Extending the capabilities of large language models to solving challenging problems in areas such as mathematics, science, programming, algorithms, and planning. We are mostly interested in scenarios and domains where every step of the solution can be expressed in language, natural or otherwise.
We introduce Minerva, a large language model that achieves state-of-the-art performance on solving mathematics, science, and engineering problems without the use of external tools.
The Beyond the Imitation Game Benchmark (BIG-bench) is a collaborative benchmark of 207 tasks, contributed by over 400 authors across 132 institutions, intended to probe large language models and extrapolate their future capabilities.
We found that a 540-billion-parameter language model shows the continued benefits of scaling, matching or surpassing human performance on a diverse set of tasks.
We show that asking large language models to write their intermediate computations to a scratchpad enables them to perform complex tasks involving multi-step computation.
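To make the scratchpad idea concrete, here is a minimal, hypothetical sketch (not the paper's code or prompts): it formats a multi-digit addition as explicit per-digit steps with carries, the kind of intermediate working a model can emit and then condition on instead of producing the answer in one step.

```python
# Hypothetical illustration of a scratchpad: each carry step of a
# multi-digit addition is written out explicitly, least-significant
# digit first, before the final answer.
def addition_scratchpad(a: int, b: int) -> str:
    steps, carry = [], 0
    da, db = str(a)[::-1], str(b)[::-1]  # reversed digit strings
    for i in range(max(len(da), len(db))):
        x = int(da[i]) if i < len(da) else 0
        y = int(db[i]) if i < len(db) else 0
        total = x + y + carry
        steps.append(
            f"digit {i}: {x} + {y} + carry {carry} = {total}, "
            f"write {total % 10}, carry {total // 10}"
        )
        carry = total // 10
    if carry:
        steps.append(f"final carry: write {carry}")
    steps.append(f"answer: {a + b}")
    return "\n".join(steps)

print(addition_scratchpad(478, 256))
```

The point of the format is that every token of the final answer is preceded by the computation that produced it, so the task is reduced to a chain of easy single-digit steps.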
We run careful empirical studies exploring the length generalization capabilities of transformer-based language models, highlighting the roles of in-context learning and scratchpads.
We introduce the Block-Recurrent Transformer, which applies a transformer layer in a recurrent fashion along a sequence, and has linear complexity with respect to sequence length.
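The linear-complexity claim can be illustrated with a toy sketch (an assumed simplification, not the actual architecture): the sequence is split into fixed-size blocks, each block attends only to itself plus a small recurrent state carried forward from the previous block, so total cost grows with the number of blocks rather than quadratically in sequence length.

```python
# Toy sketch of block-recurrent processing. Assumptions: single head,
# no projections or gating; the "state" is just the last few outputs
# of the previous block carried forward as extra attention context.
import numpy as np

def attention(q, kv):
    """Plain scaled dot-product attention of queries q over context kv."""
    scores = q @ kv.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ kv

def block_recurrent_pass(x, block_size=16, state_size=4):
    d = x.shape[-1]
    state = np.zeros((state_size, d))  # recurrent state across blocks
    outputs = []
    for start in range(0, len(x), block_size):
        block = x[start:start + block_size]
        # Each block attends only to itself plus the carried state:
        # O(block_size^2) work per block, linear in sequence length overall.
        context = np.concatenate([state, block])
        out = attention(block, context)
        state = out[-state_size:]  # summary carried to the next block
        outputs.append(out)
    return np.concatenate(outputs)

y = block_recurrent_pass(np.random.randn(64, 8))
```

The real model learns how to read and update the state; the sketch only shows why the attention cost stays linear in sequence length.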
We propose, derive, and investigate a categorization of scaling laws for generalization in neural networks.
We propose Sharpness-Aware Minimization (SAM), an optimization algorithm that improves generalization by seeking parameters that lie in neighborhoods having uniformly low loss.
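A minimal sketch of a single SAM-style update, under simplifying assumptions (full-batch gradient, plain gradient descent as the base optimizer): first ascend to an approximate worst-case point within an L2 ball of radius rho around the current weights, then descend using the gradient taken there.

```python
# Minimal SAM-style update sketch (assumed simplification of the
# algorithm; the parameter names lr and rho are illustrative).
import numpy as np

def sam_step(w, grad_fn, lr=0.1, rho=0.05):
    g = grad_fn(w)
    # Ascent step: move to the approximate worst case within the rho-ball.
    eps = rho * g / (np.linalg.norm(g) + 1e-12)
    # Descent step: use the gradient evaluated at the perturbed weights.
    g_sharp = grad_fn(w + eps)
    return w - lr * g_sharp

# Usage on a toy quadratic loss L(w) = ||w||^2 / 2, whose gradient is w.
w = np.array([1.0, -2.0])
for _ in range(100):
    w = sam_step(w, grad_fn=lambda w: w)
```

Because the update uses the gradient at the worst nearby point rather than at the current point, it is biased toward regions where the loss is uniformly low in a neighborhood, not just at a single sharp minimum.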