Hossein Mobahi

I have been a Research Scientist on the Machine Perception team at Google since May 2016. Before that, I was a Postdoctoral Researcher in the Computer Science and Artificial Intelligence Laboratory (CSAIL) at MIT, where I was privileged to work with Bill Freeman and John Fisher. I am broadly interested in Artificial Intelligence; specifically, my research lies at the intersection of Computer Vision, Machine Learning, and Optimization, and is often guided by mathematical principles. I received my PhD in Computer Science from the University of Illinois at Urbana-Champaign (UIUC), where I was fortunate to be supervised by Prof. Yi Ma.
Authored Publications
    Sharpness-Aware Minimization Improves Language Model Generalization
    Yi Tay
    Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (2022), pp. 7360-7371
    Abstract: The allure of superhuman-level capabilities has led to considerable interest in language models like GPT-3 and T5, wherein the research has, by and large, revolved around new model architectures, training tasks, and loss objectives, along with substantial engineering efforts to scale up model capacity and dataset size. Comparatively little work has been done to improve the generalization of these models through better optimization. In this work, we show that Sharpness-Aware Minimization (SAM), a recently proposed optimization procedure that encourages convergence to flatter minima, can substantially improve the generalization of language models without much computational overhead. We show that SAM is able to boost performance on SuperGLUE, GLUE, Web Questions, Natural Questions, Trivia QA, and TyDiQA, with particularly large gains when training data for these tasks is limited.
    Abstract: We study the implicit bias of gradient flow (i.e., gradient descent with infinitesimal step size) on linear neural network training. We consider separable classification and underdetermined linear regression problems where there exist many solutions that achieve zero training error, and characterize how the network architecture and initialization affects the final solution found by gradient flow. Our results apply to a general tensor formulation of neural networks that includes linear fully-connected networks, linear diagonal networks, and linear convolutional networks as special cases, while removing convergence assumptions required by prior research. We also provide experiments that corroborate our theoretical analysis.
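    A minimal Python sketch of the simplest special case covered by this kind of analysis: gradient descent with a small fixed step size (a stand-in for gradient flow), started from zero on an underdetermined least-squares problem, converges to the minimum-L2-norm interpolator among the infinitely many zero-error solutions. The data, step size, and iteration count below are illustrative assumptions, not the paper's setup.

        import numpy as np

        rng = np.random.default_rng(0)
        n, d = 5, 20                      # underdetermined: fewer equations than unknowns
        X = rng.normal(size=(n, d))
        y = rng.normal(size=n)

        w = np.zeros(d)                   # zero initialization
        lr = 0.05                         # small step size approximating gradient flow
        for _ in range(20_000):
            w -= lr * X.T @ (X @ w - y) / n   # gradient step on 0.5/n * ||Xw - y||^2

        w_min_norm = X.T @ np.linalg.solve(X @ X.T, y)   # closed-form min-norm interpolator
        print("training error:", np.linalg.norm(X @ w - y))
        print("distance to min-norm solution:", np.linalg.norm(w - w_min_norm))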
    Abstract: In today's heavily overparameterized models, the value of the training loss provides few guarantees on model generalization ability. Indeed, optimizing only the training loss value, as is commonly done, can easily lead to suboptimal model quality. Motivated by prior work connecting the geometry of the loss landscape and generalization, we introduce a novel, effective procedure for instead simultaneously minimizing loss value and loss sharpness. In particular, our procedure, Sharpness-Aware Minimization (SAM), seeks parameters that lie in neighborhoods having uniformly low loss; this formulation results in a min-max optimization problem on which gradient descent can be performed efficiently. We present empirical results showing that SAM improves model generalization across a variety of benchmark datasets (e.g., CIFAR-10, CIFAR-100, ImageNet, finetuning tasks) and models, yielding novel state-of-the-art performance for several. Additionally, we find that SAM natively provides robustness to label noise on par with that provided by state-of-the-art procedures that specifically target learning with noisy labels. We open source our code at https://github.com/google-research/sam.
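    The two-step structure of SAM described here (take a first-order ascent step of radius rho to an approximately worst-case nearby point, then descend using the gradient computed there) can be sketched in a few lines. The sketch below, written against a toy logistic-regression loss in NumPy, only illustrates that structure; the model, data, and hyperparameters are assumptions, and the authors' actual implementation is the open-source code linked in the abstract.

        import numpy as np

        # Toy problem: logistic regression on synthetic binary data (illustrative only).
        rng = np.random.default_rng(0)
        X = rng.normal(size=(64, 10))
        y = rng.integers(0, 2, size=64)

        def loss_and_grad(w):
            """Logistic loss and its gradient with respect to the weights w."""
            p = 1.0 / (1.0 + np.exp(-(X @ w)))
            loss = -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
            return loss, X.T @ (p - y) / len(y)

        def sam_step(w, lr=0.1, rho=0.05):
            """One sharpness-aware update: perturb toward the worst-case point in an
            L2 ball of radius rho, then apply the gradient computed at that point."""
            _, g = loss_and_grad(w)
            eps = rho * g / (np.linalg.norm(g) + 1e-12)   # first-order ascent direction
            _, g_adv = loss_and_grad(w + eps)             # gradient at the perturbed weights
            return w - lr * g_adv

        w = np.zeros(10)
        for _ in range(200):
            w = sam_step(w)
        print("final training loss:", loss_and_grad(w)[0])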
    Methods and Analysis of The First Competition in Predicting Generalization of Deep Learning
    Yiding Jiang
    Parth Natekar
    Manik Sharma
    Sumukh K. Aithal
    Dhruva Kashyap
    Natarajan Subramanyam
    Carlos Lassance
    Daniel M. Roy
    Gintare Karolina Dziugaite
    Suriya Gunasekar
    Isabelle Guyon
    Pierre Foret
    Scott Yak
    Samy Bengio
    Proceedings of the NeurIPS 2020 Competition and Demonstration Track, PMLR (2021)
    Abstract: Deep learning has recently been applied successfully to an ever larger number of problems, ranging from pattern recognition to complex decision making. However, several concerns have been raised, chief among them guarantees of good generalization. Despite numerous attempts, conventional statistical learning approaches fall short of providing a satisfactory explanation of why deep learning works. In a competition hosted at the Thirty-Fourth Conference on Neural Information Processing Systems (NeurIPS 2020), we invited the community to design robust and general complexity measures that can accurately predict the generalization of models. In this paper, we describe the competition design, the protocols, and the solutions of the top three teams in detail. In addition, we discuss the outcomes, common failure modes, and potential future directions for the competition.
    NeurIPS 2020 Competition: Predicting Generalization in Deep Learning
    Yiding Jiang
    Pierre Foret
    Scott Yak
    Daniel M. Roy
    Gintare Karolina Dziugaite
    Samy Bengio
    Suriya Gunasekar
    Isabelle Guyon
    arXiv (2020)
    Abstract: Understanding generalization is arguably one of the most important open questions in deep learning. Deep learning has been successfully applied to a large number of problems, ranging from pattern recognition to complex decision making, yet many concerns have been raised about it, the most important being generalization. Despite numerous attempts, conventional statistical learning approaches have not yet been able to provide a satisfactory explanation of why deep learning works. A recent line of work aims to address the problem by trying to predict generalization performance through complexity measures. In this competition, we invite the community to propose complexity measures that can accurately predict the generalization of models. A robust and general complexity measure would potentially lead to a better understanding of deep learning's underlying mechanisms and the behavior of deep models on unseen data, or shed light on better generalization bounds. All of these outcomes would be important for making deep learning more robust and reliable.
    Abstract: Generalization of deep networks has been of great interest in recent years, resulting in a number of theoretically and empirically motivated complexity measures. However, most papers proposing such measures study only a small set of models, leaving open the question of whether the conclusions drawn from those experiments would remain valid in other settings. We present the first large-scale study of generalization in deep networks. We investigate more than 40 complexity measures taken from both theoretical bounds and empirical studies. We train over 10,000 convolutional networks by systematically varying commonly used hyperparameters. Hoping to uncover potentially causal relationships between each measure and generalization, we analyze carefully controlled experiments and show surprising failures of some measures as well as promising measures for further research.
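    A natural way to score a candidate complexity measure in a study like this is its rank correlation with the observed generalization gap across trained models. The snippet below sketches that scoring step with made-up numbers; the values, and the choice of Kendall's tau as the statistic, are illustrative assumptions rather than the paper's exact protocol.

        import numpy as np
        from scipy.stats import kendalltau

        # Hypothetical records for six trained models: a complexity measure evaluated
        # on each model and its observed generalization gap (train minus test accuracy).
        # The numbers are made up purely to show the computation.
        complexity = np.array([3.1, 5.7, 2.4, 8.9, 4.2, 7.5])
        gen_gap    = np.array([0.04, 0.09, 0.03, 0.15, 0.06, 0.11])

        # Does ordering models by the measure predict the ordering of their gaps?
        tau, p_value = kendalltau(complexity, gen_gap)
        print(f"Kendall tau = {tau:.2f} (p = {p_value:.3f})")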
    Self-Distillation Amplifies Regularization in Hilbert Space
    Mehrdad Farajtabar
    Peter Bartlett
    Neural Information Processing Systems (NeurIPS) (2020)
    Abstract: Knowledge distillation, introduced in the deep learning context, is a method to transfer knowledge from one architecture to another. In particular, when the architectures are identical, this is called self-distillation. The idea is to feed the predictions of the trained model back in as new target values for retraining (and possibly iterate this loop a few times). It has been empirically observed that the self-distilled model often achieves higher accuracy on held-out data. Why this happens, however, has been a mystery: the self-distillation dynamics does not receive any new information about the task and evolves solely by looping over training. To the best of our knowledge, there is no rigorous understanding of this phenomenon. This work provides the first theoretical analysis of self-distillation. We focus on fitting a nonlinear function to training data, where the model space is a Hilbert space and fitting is subject to ℓ2 regularization in this function space. We show that self-distillation iterations modify regularization by progressively limiting the number of basis functions that can be used to represent the solution. This implies (as we also verify empirically) that while a few rounds of self-distillation may reduce over-fitting, further rounds may lead to under-fitting and thus worse performance.
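    The loop analyzed here is easy to reproduce in miniature: fit a function by kernel ridge regression (ridge regularization in a reproducing-kernel Hilbert space), then refit using the model's own predictions as the new targets, and repeat. The sketch below does exactly that; the RBF kernel, the data, and the regularization strength are illustrative choices. The prediction norm shrinks from round to round, consistent with the progressive restriction of basis functions described in the abstract.

        import numpy as np

        rng = np.random.default_rng(0)
        x = np.linspace(0, 1, 40)
        y = np.sin(2 * np.pi * x) + 0.3 * rng.normal(size=x.size)   # noisy training targets

        def rbf_kernel(a, b, gamma=20.0):
            return np.exp(-gamma * (a[:, None] - b[None, :]) ** 2)

        K = rbf_kernel(x, x)
        lam = 1e-2                       # L2 (ridge) regularization strength in the RKHS
        targets = y.copy()
        for round_idx in range(5):
            # Kernel ridge regression: alpha = (K + lam * I)^{-1} targets.
            alpha = np.linalg.solve(K + lam * np.eye(len(x)), targets)
            preds = K @ alpha
            print(f"round {round_idx}: prediction norm = {np.linalg.norm(preds):.3f}")
            targets = preds              # self-distillation: predictions become next targets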
    A Margin-Based Measure of Generalization for Deep Networks
    Yiding Jiang
    Dilip Krishnan
    Samy Bengio
    ICLR (2019)
    Abstract: Recent research has demonstrated that deep neural networks can perfectly fit randomly labeled data, but with very poor accuracy on held-out data. This phenomenon indicates that loss functions such as cross-entropy are not a reliable indicator of generalization. This leads to the crucial question of how the generalization gap can be predicted from training data and network parameters. In this paper, we propose such a measure and conduct extensive empirical studies on how well it can predict the generalization gap. Our measure is based on the margin distribution, i.e., the distances of training points to the decision boundary. We find that it is necessary to use margin distributions at multiple layers of a deep network. On the CIFAR-10 and CIFAR-100 datasets, our proposed measure correlates very strongly with the generalization gap. In addition, we find the following other factors to be important: normalizing margin values for scale independence, using characterizations of the margin distribution rather than just the margin (closest distance to the decision boundary), and working in log space instead of linear space (effectively using a product of margins rather than a sum). Our measure can be easily applied to feedforward deep networks with any architecture and may point towards new training loss functions that could enable better generalization.
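    To make the ingredients concrete, the sketch below computes a normalized margin distribution in the simplest case, a linear classifier, where the distance from a point to the decision boundary between its predicted class i and the strongest competing class j has the closed form (f_i(x) - f_j(x)) / ||w_i - w_j||; it then summarizes the distribution by quartiles of log-margins. The data, weights, and the quartile summary are illustrative assumptions, not the paper's exact featurization across layers.

        import numpy as np

        rng = np.random.default_rng(0)
        num_classes, dim, n = 3, 5, 200
        W = rng.normal(size=(num_classes, dim))     # class weight vectors
        b = rng.normal(size=num_classes)
        X = rng.normal(size=(n, dim))
        logits = X @ W.T + b
        labels = np.argmax(logits, axis=1)          # classifier's own labels, so all margins > 0

        margins = []
        for logit, label in zip(logits, labels):
            j = np.argsort(logit)[-2]               # strongest competing class
            margins.append((logit[label] - logit[j]) / np.linalg.norm(W[label] - W[j]))
        margins = np.array(margins)

        # Characterize the margin distribution in log space (quartiles of log-margins).
        quartiles = np.percentile(np.log(margins + 1e-12), [25, 50, 75])
        print("log-margin quartiles:", quartiles)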
    Large Margin Deep Networks for Classification
    Gamaleldin Fathy Elsayed
    Dilip Krishnan
    Samy Bengio
    NeurIPS (2018)
    Abstract: We present a formulation of deep learning that aims at producing a large-margin classifier. The notion of margin has served as the foundation of several theoretically profound and empirically successful results for both classification and regression tasks. However, most large-margin algorithms are applicable only to shallow models with a preset feature representation, and existing margin methods for neural networks either enforce the margin only at the output layer or are formulated with weak approximations to the true margin. This keeps margin methods inaccessible to models like deep networks. In this paper, we propose a novel loss function to impose a margin on any chosen set of layers of a deep network and show promising empirical results that consistently outperform cross-entropy-based models across different application scenarios such as adversarial examples and generalization from small training sets. Our formulation allows choosing any norm for the margin. The resulting loss is general and complementary to existing regularization techniques such as weight decay, dropout, and batch norm. It is applicable to any classification task where cross-entropy is used.
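    In the same spirit, a margin can be turned into a training objective by penalizing points whose distance to a decision boundary falls below a target gamma. The sketch below does this for a linear model, where the distance between classes i and j is exactly (f_i(x) - f_j(x)) / ||w_i - w_j||, and optimizes the resulting hinge-style penalty with finite-difference gradients; the data, gamma, optimizer, and the restriction to a linear model (rather than intermediate layers of a deep network with approximated distances) are all illustrative simplifications.

        import numpy as np

        rng = np.random.default_rng(1)
        num_classes, dim, n = 3, 2, 90
        centers = rng.normal(scale=3.0, size=(num_classes, dim))
        labels = np.repeat(np.arange(num_classes), n // num_classes)
        X = centers[labels] + rng.normal(size=(n, dim))      # three noisy clusters

        def large_margin_loss(params, gamma=1.0):
            """Hinge penalty on the exact distance to each i-vs-j decision boundary."""
            W = params[:num_classes * dim].reshape(num_classes, dim)
            b = params[num_classes * dim:]
            logits = X @ W.T + b
            total = 0.0
            for logit, i in zip(logits, labels):
                for j in range(num_classes):
                    if j == i:
                        continue
                    dist = (logit[i] - logit[j]) / (np.linalg.norm(W[i] - W[j]) + 1e-12)
                    total += max(0.0, gamma - dist)
            return total / len(X)

        def numerical_grad(f, p, eps=1e-5):
            g = np.zeros_like(p)
            for k in range(p.size):
                step = np.zeros_like(p)
                step[k] = eps
                g[k] = (f(p + step) - f(p - step)) / (2 * eps)
            return g

        params = rng.normal(scale=0.1, size=num_classes * (dim + 1))
        for _ in range(300):
            params -= 0.1 * numerical_grad(large_margin_loss, params)
        print("final large-margin loss:", large_margin_loss(params))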
    Homotopy Analysis for Tensor PCA
    Anima Anandkumar
    Rong Ge
    Conference on Learning Theory (2017), pp. 79-104
    Abstract: Developing efficient and guaranteed nonconvex algorithms has been an important challenge in modern machine learning. Algorithms with good empirical performance, such as stochastic gradient descent, often lack theoretical guarantees. In this paper, we analyze the class of homotopy or continuation methods for global optimization of nonconvex functions. These methods start from an objective function that is efficient to optimize (e.g., convex) and progressively modify it to obtain the required objective, passing the solutions along the homotopy path. For the challenging problem of tensor PCA, we prove global convergence of the homotopy method in the “high noise” regime. The signal-to-noise requirement for our algorithm is tight in the sense that it matches the recovery guarantee of the best degree-4 sum-of-squares algorithm. In addition, we prove a phase transition along the homotopy path for tensor PCA. This allows us to simplify the homotopy method to a local search algorithm, namely tensor power iterations with a specific initialization and a noise injection procedure, while retaining the theoretical guarantees.
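    The local-search algorithm referred to at the end, tensor power iteration, is a few lines of NumPy for the spiked rank-one model T = beta * v⊗v⊗v + noise. The sketch below uses plain random restarts rather than the specific initialization and noise-injection procedure the paper analyzes, so it should be read only as an illustration of the power-iteration step itself; the dimensions and signal strength are likewise illustrative.

        import numpy as np

        rng = np.random.default_rng(0)
        d, beta = 30, 50.0
        v = rng.normal(size=d); v /= np.linalg.norm(v)          # planted signal, unit norm
        T = beta * np.einsum('i,j,k->ijk', v, v, v) + rng.normal(size=(d, d, d))

        def power_iterate(u, num_steps=100):
            """Repeatedly apply u <- T(I, u, u) / ||T(I, u, u)||."""
            for _ in range(num_steps):
                u = np.einsum('ijk,j,k->i', T, u, u)
                u /= np.linalg.norm(u)
            return u

        best_u, best_val = None, -np.inf
        for _ in range(10):                                     # random restarts
            u0 = rng.normal(size=d); u0 /= np.linalg.norm(u0)
            u = power_iterate(u0)
            val = np.einsum('ijk,i,j,k->', T, u, u, u)          # objective value T(u, u, u)
            if val > best_val:
                best_u, best_val = u, val

        print("correlation with planted signal:", abs(np.dot(best_u, v)))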