Hossein Mobahi
I have been a Research Scientist on the Machine Perception team at Google since May 2016. Prior to that, I was a Postdoctoral Researcher at the Computer Science and Artificial Intelligence Laboratory (CSAIL) at MIT, where I was privileged to work with Bill Freeman and John Fisher.
I am broadly interested in Artificial Intelligence. Specifically, my research lies at the intersection of Computer Vision, Machine Learning, and Optimization. My work is often guided by mathematical principles.
I graduated from the University of Illinois at Urbana-Champaign (UIUC) with a PhD in Computer Science, where I was fortunate to be supervised by Prof. Yi Ma.
Authored Publications
Sharpness-Aware Minimization Improves Language Model Generalization
Yi Tay
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (2022), pp. 7360-7371
Abstract
The allure of superhuman-level capabilities has led to considerable interest in language models like GPT-3 and T5, wherein the research has, by and large, revolved around new model architectures, training tasks, and loss objectives, along with substantial engineering efforts to scale up model capacity and dataset size. Comparatively little work has been done to improve the generalization of these models through better optimization. In this work, we show that Sharpness-Aware Minimization (SAM), a recently proposed optimization procedure that encourages convergence to flatter minima, can substantially improve the generalization of language models without much computational overhead. We show that SAM is able to boost performance on SuperGLUE, GLUE, Web Questions, Natural Questions, Trivia QA, and TyDiQA, with particularly large gains when training data for these tasks is limited.
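For context, the objective that SAM optimizes, in its standard min-max formulation (w are the model weights, L_train the training loss, and ρ the neighborhood radius), is:

```latex
% Sharpness-aware training objective: minimize the worst-case training loss
% within an L2 ball of radius rho around the weights w.
\min_{w} \; \max_{\|\epsilon\|_2 \le \rho} \; L_{\text{train}}(w + \epsilon)
% In practice the inner maximization is approximated by a single first-order step:
% \hat{\epsilon}(w) = \rho \, \nabla_w L_{\text{train}}(w) / \|\nabla_w L_{\text{train}}(w)\|_2
```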
Abstract
We study the implicit bias of gradient flow (i.e., gradient descent with infinitesimal step size) on linear neural network training. We consider separable classification and underdetermined linear regression problems where there exist many solutions that achieve zero training error, and characterize how the network architecture and initialization affect the final solution found by gradient flow. Our results apply to a general tensor formulation of neural networks that includes linear fully-connected networks, linear diagonal networks, and linear convolutional networks as special cases, while removing convergence assumptions required by prior research. We also provide experiments that corroborate our theoretical analysis.
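A minimal numerical illustration of the setting studied here (underdetermined regression with many interpolating solutions), not of the paper's general tensor formulation or its theorems; the problem sizes, step sizes, and the diagonal parameterization below are my own demo choices:

```python
# Illustration only: contrast the solutions reached by (nearly) gradient flow under
# two parameterizations of the same underdetermined regression problem. A plain
# linear model started at zero tends to the minimum-L2-norm interpolator, while a
# "diagonal network" w = u*u - v*v with small identical initialization tends
# toward a sparser (more L1-like) interpolator.
import jax
import jax.numpy as jnp

key = jax.random.PRNGKey(0)
k1, k2 = jax.random.split(key)
n, d = 10, 50                                   # underdetermined: fewer equations than unknowns
X = jax.random.normal(k1, (n, d))
w_true = jnp.zeros(d).at[:3].set(jnp.array([3.0, -2.0, 1.5]))   # sparse planted solution
y = X @ w_true

def loss_linear(w):                             # plain linear model
    return 0.5 * jnp.mean((X @ w - y) ** 2)

def loss_diag(params):                          # diagonal-network parameterization: w = u*u - v*v
    u, v = params
    return 0.5 * jnp.mean((X @ (u * u - v * v) - y) ** 2)

def run(loss, params, lr=1e-3, steps=500_000):  # tiny step size as a proxy for gradient flow
    grad = jax.grad(loss)
    body = lambda _, p: jax.tree_util.tree_map(lambda x, g: x - lr * g, p, grad(p))
    return jax.lax.fori_loop(0, steps, body, params)

w_lin = run(loss_linear, jnp.zeros(d))
u, v = run(loss_diag, (1e-3 * jnp.ones(d), 1e-3 * jnp.ones(d)))
w_diag = u * u - v * v
w_min_norm = X.T @ jnp.linalg.solve(X @ X.T, y) # closed-form minimum-L2-norm interpolator

print("distance of linear-model solution to min-norm solution:",
      float(jnp.linalg.norm(w_lin - w_min_norm)))
print("L1 norms (linear vs diagonal):",
      float(jnp.linalg.norm(w_lin, ord=1)), float(jnp.linalg.norm(w_diag, ord=1)))
```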
Abstract
In today's heavily overparameterized models, the value of the training loss provides few guarantees on model generalization ability. Indeed, optimizing only the training loss value, as is commonly done, can easily lead to suboptimal model quality. Motivated by prior work connecting the geometry of the loss landscape and generalization, we introduce a novel, effective procedure for instead simultaneously minimizing loss value and loss sharpness. In particular, our procedure, Sharpness-Aware Minimization (SAM), seeks parameters that lie in neighborhoods having uniformly low loss; this formulation results in a min-max optimization problem on which gradient descent can be performed efficiently. We present empirical results showing that SAM improves model generalization across a variety of benchmark datasets (e.g., CIFAR-10, CIFAR-100, ImageNet, finetuning tasks) and models, yielding novel state-of-the-art performance for several. Additionally, we find that SAM natively provides robustness to label noise on par with that provided by state-of-the-art procedures that specifically target learning with noisy labels. We open source our code at https://github.com/google-research/sam.
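A minimal JAX sketch of the two-step SAM update implied by this formulation; the flat parameter vector, toy loss, and hyperparameters are placeholder choices, and the authors' reference implementation lives at the repository linked above:

```python
# One SAM step: (1) take a first-order ascent step of radius rho toward the
# (approximately) worst-case weights in the neighborhood, (2) apply the gradient
# computed there to the original weights.
import jax
import jax.numpy as jnp

def sam_update(params, batch, loss_fn, lr=0.1, rho=0.05):
    """One SAM step on a flat parameter vector (sketch)."""
    x, y = batch
    g = jax.grad(loss_fn)(params, x, y)               # gradient at the current weights
    eps = rho * g / (jnp.linalg.norm(g) + 1e-12)      # first-order solution of the inner max
    g_sharp = jax.grad(loss_fn)(params + eps, x, y)   # gradient at the perturbed "worst-case" weights
    return params - lr * g_sharp                      # descend the original weights with that gradient

# Toy usage: binary logistic regression on random data, labels in {-1, +1}.
def loss_fn(w, x, y):
    return jnp.mean(jnp.log1p(jnp.exp(-y * (x @ w))))

key = jax.random.PRNGKey(0)
x = jax.random.normal(key, (128, 16))
y = jnp.sign(x[:, 0] + 0.1)
w = jnp.zeros(16)
for _ in range(100):
    w = sam_update(w, (x, y), loss_fn)
```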
Methods and Analysis of The First Competition in Predicting Generalization of Deep Learning
Yiding Jiang
Parth Natekar
Manik Sharma
Sumukh K. Aithal
Dhruva Kashyap
Natarajan Subramanyam
Carlos Lassance
Daniel M. Roy
Gintare Karolina Dziugaite
Suriya Gunasekar
Isabelle Guyon
Pierre Foret
Scott Yak
Samy Bengio
Proceedings of the NeurIPS 2020 Competition and Demonstration Track, PMLR (2021)
Abstract
Deep learning has recently been applied successfully to an ever larger number of problems, ranging from pattern recognition to complex decision making. However, several concerns have been raised, including guarantees of good generalization, which is of foremost importance. Despite numerous attempts, conventional statistical learning approaches fall short of providing a satisfactory explanation of why deep learning works. In a competition hosted at the Thirty-Fourth Conference on Neural Information Processing Systems (NeurIPS 2020), we invited the community to design robust and general complexity measures that can accurately predict the generalization of models. In this paper, we describe the competition design, the protocols, and the solutions of the top three teams in detail. In addition, we discuss the outcomes, common failure modes, and potential future directions for the competition.
NeurIPS 2020 Competition: Predicting Generalization in Deep Learning
Yiding Jiang
Pierre Foret
Scott Yak
Daniel M. Roy
Gintare Karolina Dziugaite
Samy Bengio
Suriya Gunasekar
Isabelle Guyon
arXiv (2020)
Abstract
Understanding generalization is arguably one of the most important open questions in deep learning. Deep learning has been successfully applied to a large number of problems, ranging from pattern recognition to complex decision making, yet many concerns have been raised about it, the most important being generalization. Despite numerous attempts, conventional statistical learning approaches have not yet been able to provide a satisfactory explanation of why deep learning works. A recent line of work aims to address the problem by trying to predict generalization performance through complexity measures. In this competition, we invite the community to propose complexity measures that can accurately predict the generalization of models. A robust and general complexity measure would potentially lead to a better understanding of deep learning's underlying mechanisms and of the behavior of deep models on unseen data, or shed light on better generalization bounds. All of these outcomes would be important for making deep learning more robust and reliable.
Abstract
Generalization of deep networks has been of great interest in recent years, resulting in a number of theoretically and empirically motivated complexity measures. However, most papers proposing such measures study only a small set of models, leaving open the question of whether the conclusions drawn from those experiments would remain valid in other settings. We present the first large-scale study of generalization in deep networks. We investigate more than 40 complexity measures taken from both theoretical bounds and empirical studies. We train over 10,000 convolutional networks by systematically varying commonly used hyperparameters. Hoping to uncover potentially causal relationships between each measure and generalization, we analyze carefully controlled experiments and show surprising failures of some measures as well as promising measures for further research.
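As a toy illustration of this style of analysis (not the paper's protocol or its set of measures), one can compute a candidate complexity measure per trained model and check its rank correlation with the measured generalization gaps; the "models" and gaps below are synthetic stand-ins:

```python
import itertools
import jax
import jax.numpy as jnp

def spectral_complexity(weights):
    """Product of spectral norms of the weight matrices -- one classic candidate measure."""
    return float(jnp.prod(jnp.array([jnp.linalg.norm(W, ord=2) for W in weights])))

def kendall_tau(a, b):
    """Simple rank correlation between two equal-length sequences (no tie handling)."""
    concordant = discordant = 0
    for i, j in itertools.combinations(range(len(a)), 2):
        s = (a[i] - a[j]) * (b[i] - b[j])
        concordant += s > 0
        discordant += s < 0
    return (concordant - discordant) / max(concordant + discordant, 1)

# Synthetic stand-ins so the snippet runs end to end: random "models" (lists of
# weight matrices) whose scale grows with the index, and placeholder gaps that do
# the same. In the real study the gaps come from train/test runs of actual networks.
key = jax.random.PRNGKey(0)
models = [[0.5 * (1 + 0.2 * m) * jax.random.normal(jax.random.fold_in(key, 10 * m + l), (20, 20))
           for l in range(3)] for m in range(8)]
gaps = [0.05 * m for m in range(8)]
measures = [spectral_complexity(w) for w in models]
print("Kendall tau between candidate measure and generalization gap:", kendall_tau(measures, gaps))
```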
Self-Distillation Amplifies Regularization in Hilbert Space
Mehrdad Farajtabar
Peter Bartlett
Neural Information Processing Systems (NeurIPS) (2020)
Abstract
Knowledge distillation, as introduced in the deep learning context, is a method for transferring knowledge from one architecture to another. In particular, when the architectures are identical, this is called self-distillation. The idea is to feed the predictions of the trained model back in as new target values for retraining (and possibly iterate this loop a few times). It has been empirically observed that the self-distilled model often achieves higher accuracy on held-out data. Why this happens, however, has been a mystery: the self-distillation dynamics does not receive any new information about the task and evolves solely by looping over training. To the best of our knowledge, there is no rigorous understanding of this phenomenon. This work provides the first theoretical analysis of self-distillation. We focus on fitting a nonlinear function to training data, where the model space is a Hilbert space and fitting is subject to ℓ2 regularization in this function space. We show that self-distillation iterations modify regularization by progressively limiting the number of basis functions that can be used to represent the solution. This implies (as we also verify empirically) that while a few rounds of self-distillation may reduce over-fitting, further rounds may lead to under-fitting and thus worse performance.
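A small numerical sketch of the self-distillation loop analyzed here, using kernel ridge regression as the ℓ2-regularized Hilbert-space fit; the RBF kernel, data, and regularization strength are arbitrary demo choices:

```python
# Each round solves kernel ridge regression and then re-fits on its own predictions.
# In the kernel eigenbasis, every round multiplies each coefficient by d_i/(d_i + lam),
# so low-eigenvalue directions die off progressively -- the shrinking count printed
# below is a crude proxy for the paper's "fewer usable basis functions" effect.
import jax
import jax.numpy as jnp

def rbf_kernel(X, gamma=10.0):
    sq = jnp.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return jnp.exp(-gamma * sq)

key = jax.random.PRNGKey(0)
kx, kn = jax.random.split(key)
X = jax.random.uniform(kx, (60, 1))
y = jnp.sin(6.0 * X[:, 0]) + 0.3 * jax.random.normal(kn, (60,))   # noisy 1-D regression targets

K = rbf_kernel(X)
lam = 1e-2
evals, evecs = jnp.linalg.eigh(K)              # kernel eigenbasis, used only for diagnostics
targets = y
for step in range(5):
    alpha = jnp.linalg.solve(K + lam * jnp.eye(len(y)), targets)  # L2-regularized fit in the RKHS
    preds = K @ alpha
    coeffs = evecs.T @ preds                   # predictions expressed in the kernel eigenbasis
    effective = int(jnp.sum(jnp.abs(coeffs) > 1e-3 * jnp.max(jnp.abs(coeffs))))
    print(f"round {step}: ~{effective} kernel eigen-directions still carry signal")
    targets = preds                            # self-distillation: predictions become the next targets
```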
Abstract
Recent research has demonstrated that deep neural networks can perfectly fit randomly labeled data, but with very poor accuracy on held-out data. This phenomenon indicates that loss functions such as cross-entropy are not a reliable indicator of generalization. This leads to the crucial question of how the generalization gap can be predicted from training data and network parameters. In this paper, we propose such a measure and conduct extensive empirical studies on how well it can predict the generalization gap. Our measure is based on the concept of margin distribution, that is, the distances of training points to the decision boundary. We find that it is necessary to use margin distributions at multiple layers of a deep network. On the CIFAR-10 and CIFAR-100 datasets, our proposed measure correlates very strongly with the generalization gap. In addition, we find the following other factors to be important: normalizing margin values for scale independence, using characterizations of the margin distribution rather than just the margin (closest distance to the decision boundary), and working in log space instead of linear space (effectively using a product of margins rather than a sum). Our measure can be easily applied to feedforward deep networks with any architecture and may point toward new training loss functions that could enable better generalization.
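A rough sketch of the core quantity described above, a first-order approximation of a training point's distance to the decision boundary measured at a chosen layer; the tiny network, the runner-up-class choice, and the normalization are placeholder choices rather than the paper's exact recipe:

```python
# Signed first-order margin at a hidden layer: (f_y - f_j) divided by the norm of
# the gradient of that difference with respect to the layer's activations, where j
# is the most confusing competing class. The paper aggregates normalized statistics
# of such margins across several layers to predict the generalization gap.
import jax
import jax.numpy as jnp

def hidden(params, x):                       # first layer: the "chosen layer"
    W1, b1, _, _ = params
    return jnp.tanh(x @ W1 + b1)

def logits_from_hidden(params, h):           # rest of the network
    _, _, W2, b2 = params
    return h @ W2 + b2

def layer_margin(params, x, label):
    h = hidden(params, x)
    logits = logits_from_hidden(params, h)
    other = jnp.argmax(jnp.where(jnp.arange(logits.size) == label, -jnp.inf, logits))
    diff_fn = lambda hh: logits_from_hidden(params, hh)[label] - logits_from_hidden(params, hh)[other]
    numer = diff_fn(h)
    denom = jnp.linalg.norm(jax.grad(diff_fn)(h)) + 1e-12
    return numer / denom                     # signed first-order distance to the boundary

# Hypothetical usage with a tiny random 8 -> 16 -> 3 network: margins of all training
# points at this layer would form the layer's "margin distribution".
key = jax.random.PRNGKey(0)
k1, k2, k3 = jax.random.split(key, 3)
params = (0.3 * jax.random.normal(k1, (8, 16)), jnp.zeros(16),
          0.3 * jax.random.normal(k2, (16, 3)), jnp.zeros(3))
x = jax.random.normal(k3, (8,))
print(float(layer_margin(params, x, label=1)))
```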
Abstract
We present a formulation of deep learning that aims at producing a large-margin classifier. The notion of margin has served as the foundation of several theoretically profound and empirically successful results for both classification and regression tasks. However, most large-margin algorithms are applicable only to shallow models with a preset feature representation, and existing margin methods for neural networks either enforce margin only at the output layer or are formulated with weak approximations to the true margin. This keeps margin methods inaccessible to models like deep networks. In this paper, we propose a novel loss function to impose a margin on any set of layers of a deep network and show promising empirical results that consistently outperform cross-entropy-based models across different application scenarios such as adversarial examples and generalization from small training sets. Our formulation allows choosing any norm for the margin. The resulting loss is general and complementary to existing regularization techniques such as weight decay, dropout, and batch norm. It is applicable to any classification task where cross-entropy is used.
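Schematically, and with my own notation rather than the paper's exact formulation, a per-layer large-margin penalty consistent with this description can be written as:

```latex
% Schematic per-layer large-margin penalty (notation and aggregation are mine; see
% the paper for the exact formulation). h_l: activations at layer l, y: true class,
% gamma_l: desired margin at layer l, ||.||_q: dual of the chosen margin norm.
\ell_l(x, y) \;=\; \sum_{j \neq y} \max\!\Big\{0,\; \gamma_l \;-\;
    \frac{f_y(x) - f_j(x)}{\big\| \nabla_{h_l} f_y(x) - \nabla_{h_l} f_j(x) \big\|_q} \Big\}
```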
Abstract
Developing efficient and guaranteed nonconvex algorithms has been an important challenge in modern machine learning. Algorithms with good empirical performance, such as stochastic gradient descent, often lack theoretical guarantees. In this paper, we analyze the class of homotopy or continuation methods for global optimization of nonconvex functions. These methods start from an objective function that is efficient to optimize (e.g., convex) and progressively modify it into the required objective, passing solutions along the homotopy path. For the challenging problem of tensor PCA, we prove global convergence of the homotopy method in the "high noise" regime. The signal-to-noise requirement for our algorithm is tight in the sense that it matches the recovery guarantee for the best degree-4 sum-of-squares algorithm. In addition, we prove a phase transition along the homotopy path for tensor PCA. This allows us to simplify the homotopy method to a local search algorithm, viz., tensor power iterations, with a specific initialization and a noise injection procedure, while retaining the theoretical guarantees.
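A small numerical sketch of the spiked tensor PCA model and plain tensor power iteration, the local-search routine the homotopy analysis reduces to; the signal strength here is set well above the hard regime the paper analyzes, and a spectral initialization from the tensor unfolding stands in for the paper's specific initialization and noise-injection procedure:

```python
# Spiked tensor PCA demo: recover a planted rank-1 direction from a noisy 3-tensor
# with tensor power iterations. Dimensions and signal strength are demo values.
import jax
import jax.numpy as jnp

key = jax.random.PRNGKey(0)
k1, k2 = jax.random.split(key)
n, beta = 50, 100.0                               # dimension and (deliberately generous) signal strength

v = jax.random.normal(k1, (n,))
v = v / jnp.linalg.norm(v)                        # planted unit-norm spike
Z = jax.random.normal(k2, (n, n, n))              # Gaussian noise tensor (left unsymmetrized for brevity)
T = beta * jnp.einsum('i,j,k->ijk', v, v, v) + Z  # spiked 3-tensor

def power_iteration(T, u, steps=30):
    for _ in range(steps):
        u = jnp.einsum('ijk,j,k->i', T, u, u)     # u <- T(I, u, u)
        u = u / jnp.linalg.norm(u)
    return u

# Spectral initialization from the n x n^2 unfolding -- a stand-in for the paper's
# specific initialization plus noise injection.
u0 = jnp.linalg.svd(T.reshape(n, n * n), full_matrices=False)[0][:, 0]
u = power_iteration(T, u0)
print("correlation with the planted spike:", float(jnp.abs(jnp.dot(u, v))))
```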