Hossein Mobahi
            I am a Research Scientist at Google DeepMind. I joined Google Research in 2016. Before that, I was a Postdoctoral Researcher at the Computer Science and Artificial Intelligence Laboratory (CSAIL) at MIT, where I had the privilege of working with Professor Bill Freeman and Dr. John Fisher. I earned my PhD in Computer Science from the University of Illinois at Urbana-Champaign (UIUC), where I was fortunate to be supervised by Professor Yi Ma.
My broad interest lies in Artificial Intelligence, with a specific focus on the intersection of Machine Learning and Optimization. My research is often guided by mathematical principles, with the aim of developing practically successful methods that are more clearly understood or that perform better.
I am a co-creator of the Sharpness-Aware Minimization (SAM) method. My work also includes contributions to the theory of self-distillation, optimization by the continuation method, and understanding the loss surface of neural networks. I have a long-standing interest in radically new approaches to creating neural architectures, particularly those inspired by biology and the human brain, as well as in curriculum learning strategies that aim to enhance training efficiency and generalization through more effective data presentation.
      Authored Publications
              PlanGEN: A Framework Utilizing Inference-Time Algorithms with LLM Agents for Planning and Reasoning
Hootan Nakhost, Mihir Parmar, Swaroop Mishra, Chitta Baral, Jindong Gu
2025
Scaling inference-time computation in Large Language Models (LLMs) dramatically improves their capabilities for solving complex problems. While test-time scaling has shown promise in many tasks such as code generation and mathematical reasoning, the integration of inference-time algorithms into multi-agent frameworks for planning and reasoning remains under-explored. To this end, we explore popular inference-time algorithms—Best of N, Tree of Thought (ToT), and REward BAlanced SEarch (REBASE)—with a proposed feedback-driven refinement. Our feedback-driven refinement employs specialized agents: a constraint agent to enforce task instance-specific constraints, and a verifier agent to evaluate plan quality. Furthermore, we hypothesize that test-time scaling should be proportional to instance-level complexity, and thus propose an additional selection agent to dynamically optimize the algorithm choice. We evaluate our proposed approaches on four different benchmarks, i.e., NATURAL PLAN, GPQA, OlympiadBench, and DocFinQA. Experimental results show that our methods outperform strong baselines, achieving state-of-the-art results on NATURAL PLAN, OlympiadBench, and DocFinQA. Our key findings demonstrate that constraint-guided iterative refinement and algorithm selection improve both planning and downstream reasoning in LLMs.
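As a concrete illustration of the simplest of these inference-time algorithms, Best of N paired with a verifier agent, here is a minimal Python sketch. The functions `generate_plan` and `score_plan` are hypothetical stand-ins for LLM-backed planner and verifier agents, not part of any released PlanGEN code.

```python
import random

# Hypothetical stand-in for an LLM-backed planning agent: returns one
# candidate plan for the task, sampled at the given temperature.
def generate_plan(task: str, temperature: float) -> str:
    return f"plan for {task!r} (sample id {random.random():.4f}, t={temperature})"

# Hypothetical stand-in for a verifier agent: returns a scalar quality
# score for a candidate plan (random here, purely for illustration).
def score_plan(task: str, plan: str) -> float:
    return random.random()

def best_of_n(task: str, n: int = 8) -> str:
    """Sample n candidate plans and keep the one the verifier scores highest."""
    candidates = [generate_plan(task, temperature=0.9) for _ in range(n)]
    return max(candidates, key=lambda plan: score_plan(task, plan))

print(best_of_n("schedule five meetings across three time zones"))
```

The feedback-driven variant described above would additionally pass each candidate through a constraint agent and feed its critique back into the next round of generation.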
              Sharpness-Aware Minimization Improves Language Model Generalization
Yi Tay
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (2022), pp. 7360-7371
              The allure of superhuman-level capabilities has led to considerable interest in language models like GPT-3 and T5, wherein the research has, by and large, revolved around new model architectures, training tasks, and loss objectives, along with substantial engineering efforts to scale up model capacity and dataset size. Comparatively little work has been done to improve the generalization of these models through better optimization. In this work, we show that Sharpness-Aware Minimization (SAM), a recently proposed optimization procedure that encourages convergence to flatter minima, can substantially improve the generalization of language models without much computational overhead. We show that SAM is able to boost performance on SuperGLUE, GLUE, Web Questions, Natural Questions, Trivia QA, and TyDiQA, with particularly large gains when training data for these tasks is limited.
Sharpness-Aware Minimization for Efficiently Improving Generalization
International Conference on Learning Representations (ICLR) (2021)
In today's heavily overparameterized models, the value of the training loss provides few guarantees on model generalization ability. Indeed, optimizing only the training loss value, as is commonly done, can easily lead to suboptimal model quality. Motivated by prior work connecting the geometry of the loss landscape and generalization, we introduce a novel, effective procedure for instead simultaneously minimizing loss value and loss sharpness. In particular, our procedure, Sharpness-Aware Minimization (SAM), seeks parameters that lie in neighborhoods having uniformly low loss; this formulation results in a min-max optimization problem on which gradient descent can be performed efficiently. We present empirical results showing that SAM improves model generalization across a variety of benchmark datasets (e.g., CIFAR-10, CIFAR-100, ImageNet, finetuning tasks) and models, yielding novel state-of-the-art performance for several. Additionally, we find that SAM natively provides robustness to label noise on par with that provided by state-of-the-art procedures that specifically target learning with noisy labels. We open-source our code at https://github.com/google-research/sam.
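In practice the min-max problem is approximated by a two-step update: an ascent step to the first-order worst point within a ρ-ball around the current weights, followed by a descent step using the gradient evaluated at that perturbed point. Below is a minimal numpy sketch of this update; the toy quadratic loss, learning rate, and ρ are illustrative choices, not values from the paper.

```python
import numpy as np

def sam_step(w, loss_grad, lr=0.1, rho=0.05):
    """One SAM update: ascend to the worst-case neighbor, then descend.

    loss_grad: function returning the gradient of the training loss at w.
    rho:       radius of the neighborhood over which sharpness is measured.
    """
    g = loss_grad(w)
    # Ascent step: first-order approximation of the worst point in the rho-ball.
    eps = rho * g / (np.linalg.norm(g) + 1e-12)
    # Descent step: apply the gradient taken at the perturbed parameters.
    return w - lr * loss_grad(w + eps)

# Toy example: a "sharp" quadratic loss f(w) = 0.5 * w @ A @ w.
A = np.diag([10.0, 1.0])
loss_grad = lambda w: A @ w
w = np.array([1.0, 1.0])
for _ in range(100):
    w = sam_step(w, loss_grad)
print(w)  # approaches the minimum at the origin (up to the small SAM perturbation)
```

In a deep learning framework the same two steps wrap an existing optimizer, at roughly the cost of one extra forward-backward pass per update.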
              Methods and Analysis of The First Competition in Predicting Generalization of Deep Learning
Yiding Jiang, Parth Natekar, Manik Sharma, Sumukh K. Aithal, Dhruva Kashyap, Natarajan Subramanyam, Carlos Lassance, Daniel M. Roy, Gintare Karolina Dziugaite, Suriya Gunasekar, Isabelle Guyon, Pierre Foret, Scott Yak, Behnam Neyshabur, Samy Bengio
Proceedings of the NeurIPS 2020 Competition and Demonstration Track, PMLR (2021)
Deep learning has recently been successfully applied to an ever larger number of problems, ranging from pattern recognition to complex decision making. However, several concerns have been raised, including guarantees of good generalization, which is of foremost importance. Despite numerous attempts, conventional statistical learning approaches fall short of providing a satisfactory explanation of why deep learning works. In a competition hosted at the Thirty-Fourth Conference on Neural Information Processing Systems (NeurIPS 2020), we invited the community to design robust and general complexity measures that can accurately predict the generalization of models. In this paper, we describe the competition design, the protocols, and the solutions of the top three teams in detail. In addition, we discuss the outcomes, common failure modes, and potential future directions for the competition.
A Unifying View on Implicit Bias in Training Linear Neural Networks
We study the implicit bias of gradient flow (i.e., gradient descent with infinitesimal step size) on linear neural network training. We consider separable classification and underdetermined linear regression problems where there exist many solutions that achieve zero training error, and characterize how the network architecture and initialization affect the final solution found by gradient flow. Our results apply to a general tensor formulation of neural networks that includes linear fully-connected networks, linear diagonal networks, and linear convolutional networks as special cases, while removing convergence assumptions required by prior research. We also provide experiments that corroborate our theoretical analysis.
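A small numpy experiment in this spirit, with illustrative sizes and step sizes not taken from the paper: on an underdetermined regression problem, gradient descent on the directly parameterized model recovers the minimum ℓ2-norm interpolant, while reparameterizing the same linear predictor as a diagonal linear network (w = u*u - v*v) biases it toward a different interpolant.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 20))   # 5 equations, 20 unknowns: many zero-error solutions
y = rng.normal(size=5)

# Direct parameterization: gradient descent from zero converges to the
# minimum l2-norm interpolating solution.
w = np.zeros(20)
for _ in range(20000):
    w -= 0.01 * X.T @ (X @ w - y)
w_min_norm = X.T @ np.linalg.solve(X @ X.T, y)   # closed-form min-norm solution
print(np.linalg.norm(w - w_min_norm))            # ~0: the same interpolant

# Diagonal linear network: predictor w = u*u - v*v, trained on (u, v).
# The architecture changes the implicit bias, favoring sparser (l1-like)
# interpolants when initialized near zero.
u = np.full(20, 0.1)
v = np.full(20, 0.1)
for _ in range(200000):
    r = X @ (u * u - v * v) - y        # residual
    u -= 0.005 * 2 * u * (X.T @ r)     # chain rule through w = u*u - v*v
    v += 0.005 * 2 * v * (X.T @ r)
print(np.linalg.norm(X @ (u * u - v * v) - y))   # ~0 if run long enough
print(np.linalg.norm(u * u - v * v, 1), np.linalg.norm(w_min_norm, 1))
```

The last line typically shows the diagonal network finding an interpolant with smaller ℓ1 norm than the minimum ℓ2-norm solution, illustrating how the architecture alone changes which zero-error solution is reached.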
              NeurIPS 2020 Competition: Predicting Generalization in Deep Learning
Yiding Jiang, Pierre Foret, Scott Yak, Daniel M. Roy, Gintare Karolina Dziugaite, Samy Bengio, Suriya Gunasekar, Isabelle Guyon, Behnam Neyshabur
arXiv (2020)
Understanding generalization is arguably one of the most important open questions in deep learning. Deep learning has been successfully applied to a large number of problems ranging from pattern recognition to complex decision making, but researchers have raised many concerns about it, among which the most important is generalization. Despite numerous attempts, conventional statistical learning approaches have not yet been able to provide a satisfactory explanation of why deep learning works. A recent line of work aims to address the problem by trying to predict generalization performance through complexity measures. In this competition, we invite the community to propose complexity measures that can accurately predict the generalization of models. A robust and general complexity measure would potentially lead to a better understanding of deep learning's underlying mechanisms and the behavior of deep models on unseen data, or shed light on better generalization bounds. All of these outcomes would be important for making deep learning more robust and reliable.
Fantastic Generalization Measures and Where to Find Them
International Conference on Learning Representations (ICLR) (2020)
Generalization of deep networks has been of great interest in recent years, resulting in a number of theoretically and empirically motivated complexity measures. However, most papers proposing such measures study only a small set of models, leaving open the question of whether the conclusions drawn from those experiments would remain valid in other settings. We present the first large-scale study of generalization in deep networks. We investigate more than 40 complexity measures taken from both theoretical bounds and empirical studies. We train over 10,000 convolutional networks by systematically varying commonly used hyperparameters. Hoping to uncover potentially causal relationships between each measure and generalization, we analyze carefully controlled experiments and show surprising failures of some measures, as well as promising measures for further research.
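The evaluation behind such a study is straightforward to sketch: compute one complexity-measure value per trained model and check how well it ranks the models by their measured generalization gap, for example with a Kendall-style rank correlation. A minimal sketch with made-up numbers (a real study aggregates such correlations over thousands of models and controlled hyperparameter axes):

```python
import numpy as np
from itertools import combinations

def kendall_tau(a, b):
    """Kendall rank correlation between measure values a and gaps b (no ties)."""
    pairs = list(combinations(range(len(a)), 2))
    concordant = sum(np.sign(a[i] - a[j]) == np.sign(b[i] - b[j]) for i, j in pairs)
    return 2.0 * concordant / len(pairs) - 1.0

# Hypothetical data: one complexity-measure value and one measured
# generalization gap per trained model (these numbers are made up).
measure = np.array([3.1, 4.5, 2.2, 5.0, 3.8])
gen_gap = np.array([0.05, 0.09, 0.03, 0.12, 0.06])
print(kendall_tau(measure, gen_gap))  # 1.0: the measure ranks the models perfectly
```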
              Self-Distillation Amplifies Regularization in Hilbert Space
Mehrdad Farajtabar, Peter Bartlett
Neural Information Processing Systems (NeurIPS) (2020)
Knowledge distillation, introduced in the deep learning context, is a method to transfer knowledge from one architecture to another. In particular, when the architectures are identical, this is called self-distillation. The idea is to feed the predictions of the trained model back in as new target values for retraining (and possibly iterate this loop a few times). It has been empirically observed that the self-distilled model often achieves higher accuracy on held-out data. Why this happens, however, has been a mystery: the self-distillation dynamics does not receive any new information about the task and evolves solely by looping over training. To the best of our knowledge, there is no rigorous understanding of this phenomenon. This work provides the first theoretical analysis of self-distillation. We focus on fitting a nonlinear function to training data, where the model space is a Hilbert space and fitting is subject to ℓ2 regularization in this function space. We show that self-distillation iterations modify regularization by progressively limiting the number of basis functions that can be used to represent the solution. This implies (as we also verify empirically) that while a few rounds of self-distillation may reduce over-fitting, further rounds may lead to under-fitting and thus worse performance.
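The analyzed setting is easy to reproduce in a few lines: fit kernel ridge regression (an ℓ2-regularized fit in a reproducing-kernel Hilbert space), feed the predictions back as targets, and repeat. In this minimal numpy sketch, with an illustrative RBF kernel, λ, and data, the norm of the fitted function shrinks every round, since each round multiplies the solution's coefficients in the kernel eigenbasis by factors below one.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 30)
y = np.sin(2 * np.pi * x) + 0.3 * rng.normal(size=30)   # noisy training targets

# RBF kernel Gram matrix: the Hilbert-space model of the paper.
K = np.exp(-((x[:, None] - x[None, :]) ** 2) / 0.02)

lam = 0.1            # l2 regularization strength in function space
targets = y.copy()
for round_idx in range(5):
    # Kernel ridge regression fit against the current targets.
    alpha = np.linalg.solve(K + lam * np.eye(30), targets)
    preds = K @ alpha
    print(round_idx, float(np.linalg.norm(preds)))   # norm shrinks each round
    targets = preds  # self-distillation: predictions become next round's targets
```

Each round applies the operator K(K + λI)^{-1}, whose eigenvalues lie in (0, 1), so components along small-eigenvalue basis functions are damped fastest: early rounds denoise, later rounds under-fit.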
Predicting the Generalization Gap in Deep Networks with Margin Distributions
International Conference on Learning Representations (ICLR) (2019)
Recent research has demonstrated that deep neural networks can perfectly fit randomly labeled data, but with very poor accuracy on held-out data. This phenomenon indicates that loss functions such as cross-entropy are not a reliable indicator of generalization, which leads to the crucial question of how the generalization gap can be predicted from training data and network parameters. In this paper, we propose such a measure and conduct extensive empirical studies on how well it can predict the generalization gap. Our measure is based on the concept of the margin distribution, which is the distribution of distances of training points to the decision boundary. We find that it is necessary to use margin distributions at multiple layers of a deep network. On the CIFAR-10 and CIFAR-100 datasets, our proposed measure correlates very strongly with the generalization gap. In addition, we find the following other factors to be of importance: normalizing margin values for scale independence, using characterizations of the margin distribution rather than just the margin (closest distance to the decision boundary), and working in log space instead of linear space (effectively using a product of margins rather than a sum). Our measure can be easily applied to feedforward deep networks with any architecture and may point towards new training loss functions that could enable better generalization.
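For a linear classifier head, the normalized margin has a closed form, which makes the construction easy to sketch. A minimal numpy example with made-up activations and weights (in the approach described above, such distributions are computed at several layers, normalized, summarized in log space, and the resulting statistics used to predict the gap):

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up setup: layer activations h and a linear classifier head W.
h = rng.normal(size=(100, 64))             # activations for 100 examples
W = rng.normal(size=(64, 10))              # weights mapping to 10 class logits
labels = rng.integers(0, 10, size=100)

logits = h @ W
idx = np.arange(100)

# Runner-up class per example: the nearest competing decision boundary.
masked = logits.copy()
masked[idx, labels] = -np.inf
runner_up = masked.argmax(axis=1)

# Normalized margin: logit gap divided by the norm of its gradient w.r.t.
# the activations, which for a linear head is exactly ||w_y - w_j||.
gap = logits[idx, labels] - logits[idx, runner_up]
grad_norm = np.linalg.norm(W[:, labels] - W[:, runner_up], axis=0)
margins = gap / (grad_norm + 1e-12)

# Summary statistics of the margin distribution (quartiles here).
print(np.percentile(margins, [25, 50, 75]))
```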
Large Margin Deep Networks for Classification
Neural Information Processing Systems (NeurIPS) (2018)
We present a formulation of deep learning that aims at producing a large margin classifier. The notion of margin has served as the foundation of several theoretically profound and empirically successful results for both classification and regression tasks. However, most large margin algorithms are applicable only to shallow models with a preset feature representation, and existing margin methods for neural networks either enforce margin only at the output layer or are formulated with weak approximations to the true margin. This keeps margin methods inaccessible to models like deep networks. In this paper, we propose a novel loss function that imposes a margin on any set of layers of a deep network, and show promising empirical results that consistently outperform cross-entropy based models across different application scenarios such as adversarial examples and generalization from small training sets. Our formulation allows choosing any norm for the margin. The resulting loss is general and complementary to existing regularization techniques such as weight decay, dropout and batch norm. It is applicable to any classification task where cross-entropy is used.
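A minimal numpy sketch of such a first-order margin penalty at a single layer, using made-up logits and gradient norms; a real implementation would compute the gradient norms with autograd at each chosen layer and support other norms via the corresponding dual norm.

```python
import numpy as np

def large_margin_loss(logits, labels, grad_norms, gamma=1.0):
    """First-order large-margin penalty at one layer (sketch).

    The distance of example i to the decision boundary against class j is
    approximated by (f_y - f_j) / ||grad_h (f_y - f_j)||; distances smaller
    than the target margin gamma are penalized with a hinge.
    grad_norms[i, j] holds that gradient norm (via autograd in practice).
    """
    n = len(labels)
    true_logit = logits[np.arange(n), labels][:, None]
    dist = (true_logit - logits) / (grad_norms + 1e-12)   # signed distances
    hinge = np.maximum(0.0, gamma - dist)                 # margin violations
    hinge[np.arange(n), labels] = 0.0                     # skip j == y
    return hinge.sum(axis=1).mean()

# Tiny illustrative call with made-up numbers.
logits = np.array([[2.0, 1.6, -1.0], [0.1, 1.5, 0.9]])
labels = np.array([0, 1])
grad_norms = np.ones((2, 3))
print(large_margin_loss(logits, labels, grad_norms))      # 0.5
```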