Learning theory

The Learning theory team at Google tackles fundamental learning theory problems significant to Google.

About the team

We are dedicated to advancing the theoretical foundations of machine learning (ML). Our team has extensive expertise in a variety of areas, including learning theory, statistical learning theory, optimization, decision making under uncertainty, reinforcement learning, and theory and algorithms in general. Our mission is twofold: to foster a principled understanding of ML techniques and to leverage this knowledge in designing highly effective algorithms. Ultimately, we aim to deploy these algorithms to achieve significant impact on Google, the wider academic community, and the scientific field of ML as a whole.

Team focus summaries

Optimization for machine learning

We work on optimization methods for machine learning in application areas, such as training large language models and federated learning.

Reinforcement learning

We design theoretically sound algorithms to solve real-world reinforcement learning problems, with applications including recommendation tasks, optimization of computer systems and fine-tuning of generative models.

Online learning, bandits, and active learning

Our research focuses on crafting algorithms and strategies for making sequential decisions in dynamic and uncertain environments based on partial information.

Learning dynamics in games

Multiplayer games provide a framework to understand the way that both humans and algorithms interact in complex systems, and we hope to understand and carefully design these systems to balance efficiency and equity.


We work on developing algorithms for training machine learning models with differential privacy, as well as alternative privacy guarantees.


We develop new learning algorithms with generalization guarantees for various learning scenarios.

Featured publications

On the convergence of Adam and Beyond
Satyen Kale
International Conference on Learning Representations(2018)
Preview abstract Several recently proposed stochastic optimization methods that have been successfully used in training deep networks such as RMSProp, Adam, Adadelta, Nadam are based on using gradient updates scaled by square roots of exponential moving averages of squared past gradients. In many applications, e.g. learning with large output spaces, it has been empirically observed that these algorithms fail to converge to an optimal solution (or a critical point in nonconvex settings). We show that one cause for such failures is the exponential moving average used in the algorithms. We provide an explicit example of a simple convex optimization setting where Adam does not converge to the optimal solution, and describe the precise problems with the previous analysis of Adam algorithm. Our analysis suggests that the convergence issues can be fixed by endowing such algorithms with “long-term memory” of past gradients, and propose new variants of the Adam algorithm which not only fix the convergence issues but often also lead to improved empirical performance. View details
Easy Learning from Label Proportions
Andres Munoz Medina
Robert Busa-Fekete
Travis Dick
Heejin Choi
Preview abstract We present a novel way of training models in theweakly supervised setup of learning frombagsof examples with just aggregate label informa-tion. Unlike previous work, we focus on learninga classifier that can produce accurate predictionsat an individual instance level as opposed to simply matching a bag’s label proportion. The mainstrength of our algorithm lies on the simplicity ofits implementation. We demonstrate that a simplemodification to the loss function can yield accu-rate event level estimates. Moreover we show thatthe sample complexity of doing empirical riskminimization or stochastic gradient descent withlabel proportions increases only by a factor of √k compared to traditional supervised learning sce-narios. Finally, we validate our theoretical resultson multiple datasets and demonstrate that our pro-posed algorithm beats provides state of the artperformance for learning with label proportions. View details
Multiple-policy High-confidence Policy Evaluation
Mohammad Ghavamzadeh
International Conference on Artificial Intelligence and Statistics(2023), pp. 9470-9487
Preview abstract In reinforcement learning applications, we often want to accurately estimate the return of several policies of interest. We study this problem, multiple-policy high-confidence policy evaluation, where the goal is to estimate the return of all given target policies up to a desired accuracy with as few samples as possible. The natural approaches to this problem, i.e., evaluating each policy separately or estimating a model of the MDP, scale with the number of policies to evaluate or the size of the MDP, respectively. We present an alternative approach based on reusing samples from on-policy Monte-Carlo estimators and show that it is more sample-efficient in favorable cases. Specifically, we provide guarantees in terms of a notion of overlap of the set of target policies and shed light on when such an approach is indeed beneficial compared to existing methods. View details
Layerwise Bregman Representation Learning of Neural Networks with Applications to Knowledge Distillation
Ehsan Amid
Rohan Anil
Christopher Fifty
Manfred Warmuth
Transactions on Machine Learning Research, 02/23(2023)
Preview abstract We propose a new method for layerwise representation learning of a trained neural network that conforms to the non-linearity of the layer’s transfer function. In particular, we form a Bregman divergence based on the convex function induced by the layer’s transfer function and construct an extension of the original Bregman PCA formulation by incorporating a mean vector and revising the normalization constraint on the principal directions. These modifications allow exporting the learned representation as a fixed layer with a non-linearity. As an application to knowledge distillation, we cast the learning problem for the student network as predicting the compression coefficients of the teacher’s representations, which is then passed as the input to the imported layer. Our empirical findings indicate that our approach is substantially more effective for transferring information between networks than typical teacher-student training that uses the teacher’s soft labels. View details
Preview abstract POMDPs capture a broad class of decision making problems, but hardness results suggest that learning is intractable even in simple settings due to the inherent partial observability. However, in many realistic problems, more information is either revealed or can be computed during some point of the learning process. Motivated by diverse applications ranging from robotics to data center scheduling, we formulate a Hindsight Observable Markov Decision Process (HOMDP) as a POMDP where the latent states are revealed to the learner in hindsight and only during training. We introduce new algorithms for the tabular and function approximation settings that are provably sample-efficient with hindsight observability, even in POMDPs that would otherwise be statistically intractable. We give a lower bound showing that the tabular algorithm is optimal in its dependence on latent state and observation cardinalities. View details
A Model Selection Approach for Corruption Robust Reinforcement Learning
Chen-Yu Wei
33rd International Conference on Algorithmic Learning Theory (ALT 2022)(2022)
Preview abstract We develop a model selection approach to tackle reinforcement learning with adversarial corruption in both transition and reward. For finite-horizon tabular MDPs, without prior knowledge on the total amount of corruption, our algorithm achieves a regret bound of O(min{1/∆,√T}+C)where T is the number of episodes, C is the total amount of corruption, and ∆ is the reward gap between the best and the second-best policies. This is the first worst-case optimal bound achieved without knowledge of C, improving previous results of Lykouris et al. (2021); Chen et al. (2021); Wu et al.(2021). For finite-horizon linear MDPs, we develop a computationally efficient algorithm with a regret bound of ̃O(√(1 +C)T), and another computationally inefficient one with O(√T+C),improving the result of Lykouris et al. (2021) and answering an open question by Zhang et al.(2021b). Finally, our model selection framework can be easily applied to other settings including linear bandits, linear contextual bandits, and MDPs with general function approximation, leading to several improved or new results. View details
Model-based RL with Optimistic Posterior Sampling: Structural Conditions and Sample Complexity
Alekh Agarwal
Tong Zhang
Neural Information Processing Systems(2023)
Preview abstract We propose a general framework to design posterior sampling methods for model-based RL. We show that the proposed algorithms can be analyzed by reducing regret to Hellinger distance in conditional probability estimation. We further show that optimistic posterior sampling can control this Hellinger distance, when we measure model error via data likelihood. This technique allows us to design and analyze unified posterior sampling algorithms with state-of-the-art sample complexity guarantees for many model-based RL settings. We illustrate our general result in many special cases, demonstrating the versatility of our framework. View details
Preview abstract Large deep learning models have achieved state-of-the-art performance across various natural language processing (NLP) tasks and demonstrated remarkable few-shot learning performance. However, training them is often challenging and resource-intensive. In this paper, we study an efficient approach to train language models using few-shot learners. We show that, by leveraging the fast learning nature of few-shot learners, one can train language models efficiently in a stagewise manner. Our main insight is that stacking a good few-shot learner on a good small language model provides a good initializer for a larger language model. Using this insight and building upon progressive stacking approaches, we develop novel approaches for training such networks in a stagewise manner. Furthermore, we also provide a theoretical framework and accompanying empirical studies to support our insights, thereby creating a theoretical foundation for progressive stacking. Finally, we provide empirical results to demonstrate the effectiveness of our approach in reducing the training time of few-shot learners. View details

Some of our locations