Hanie Sedghi
I am a senior research scientist at Brain team. My approach is to bond theory and practice in large-scale machine learning. Over the recent years. I have been working on understanding deep learning phenomena and improving the training algorithms. I lead the DeepPhenomena team at Google.
Research Areas
Authored Publications
Sort By
Preview abstract
Distribution shift is a prevalent problem in the real-world deployment of machine learning models. Typically a mismatch between the source (training) and target (test) distribution leads to a gap between the source and target performance of the model. In this work, we investigate methods that leverage only unlabeled target data to predict accuracy under distribution shift. We propose a simple and effective method called Average Thresholded Confidence (ATC) that learns a scalar \emph{threshold} on model confidence on source data and predicts model performance as the average number of unlabeled target examples above the identified threshold. ATC outperforms previous approaches across several model architectures and various types of distribution shifts (e.g. synthetic corruptions, shifts due to dataset reproduction, or shifts due to novel subpopulations) applied to FMoW-\textsc{wilds}, ImageNet, CIFAR, and MNIST datasets. ATC estimates target performance up to $2\text{--}3\times$ more accurately compared to recently proposed methods. Finally, we theoretically analyze our proposed method on a toy distribution shift model with varying degrees of spurious correlation.
View details
Preview abstract
Recent developments in large-scale machine learning have created a tempting picture suggesting that by scaling up data, model size and training time properly, one can obtain a model that can be used successfully in few-shot settings in all downstream tasks. In this work, we investigate this premise empirically and provide a strong case against it. In particular, we consider image recognition task with large scale models (Vision Transformers) trained on the largest scale of available data (JFT). We show that as we improve the performance of upstream task either by scaling up or hyper-parameter and architectural choices, the performance of many downstream tasks eventually plateau. We showcase an even more extreme scenario where performance on upstream and downstream contradict each other, i.e., in order to have a better downstream performance, we need to hurt upstream accuracy. We delve deeper into understanding the reasons that give rise to these phenomena by designing interventions and investigating different components of the models which gives us crude yet useful insights into the mechanisms behind these observations.
View details
Preview abstract
Understanding the loss landscape of deep neural networks has been the subject of many studies due to its close connections to optimization and generalization. Prior work has shown that there is often a performance barrier along the linear interpolation of the weights of two models trained with different initial seeds. In this work, we first empirically investigate how different model parameters and data distributions impact such performance barriers. Next, we consider the invariances in the function space of neural networks that arise from permutation of hidden units. We investigate this through extensive experiments and provide several pieces of evidence that if these invariances are taken into account, many of the barriers vanish.
View details
Preview abstract
We propose a new framework for reasoning about generalization in deep learning. The core idea is to couple the Real World, where optimizers take stochastic gradient steps on the empirical loss, to an Ideal World, where optimizers take steps on the population loss. This leads to an alternate decomposition of test error into: (1) the Ideal World test error plus (2) the gap between the two worlds. If the gap (2) is universally small, this reduces the problem of generalization in offline learning to the problem of optimization in online learning. We then give empirical evidence that this gap between worlds can be small in realistic deep learning settings, in particular supervised image classification. For example, CNNs generalize better than MLPs on image distributions in the Real World, but this is "because" they optimize faster on the population loss in the Ideal World. This suggests our framework is a useful tool for understanding generalization in deep learning, and lays a foundation for future research in the area.
View details
Avoiding Spurious Correlations: Bridging Theory and Practice
Thao Nguyen
Vaishnavh Nagarajan
Behnam Neyshabur
NeurIPS 2021 Workshop on Distribution Shifts: Connecting Methods and Applications
Preview abstract
Distribution shifts in the wild jeopardize the performance of machine learning models as they tend to pick up spurious correlations during training. Recent work \cite{nagarajan2020understanding} has characterized two specific failure modes of out-of-distribution (OOD) generalization, and we extend this theoretical framework by interpreting existing algorithms as solutions to these failure modes. We then evaluate them on different image classification datasets, and in the process surface two issues that are central to existing robustness techniques. For those that rely on group annotations, we show how the group information in standard benchmark datasets is unable to fully capture the spurious correlations present. For those that don't require group annotations, the validation set utilized for model selection still carries assumptions that are not realistic in real-world settings, and we show how this choice of shifts in validation set could impact performance of different OOD algorithms.
View details
Preview abstract
We study the phenomenon that some modules of deep neural networks (DNNs) are more critical than others. Meaning that rewinding their parameter values back to initialization, while keeping other modules fixed at the trained parameters, results in a large drop in the network's performance. Our analysis reveals interesting properties of the loss landscape which leads us to propose a complexity measure, called module criticality, based on the shape of the valleys that connects the initial and final values of the module parameters. We formulate how generalization relates to the module criticality, and show that this measure is able to explain the superior generalization performance of some architectures over others, whereas earlier measures fail to do so.
View details
Preview abstract
We prove bounds on the generalization error of convolutional networks.
The bounds are in terms of the training loss, the number of
parameters, the Lipschitz constant of the loss and the distance from
the weights to the initial weights. They are independent of the
number of pixels in the input, and the height and width of hidden
feature maps. We present experiments
with CIFAR-10 and a scaled-down variant, along with varying hyperparameters
of a deep convolutional network, comparing our bounds with practical
generalization gaps.
View details
Preview abstract
One desired capability for machines is the ability to transfer their understanding of one domain to another domain where data is (usually) scarce. Despite ample adaptation of transfer learning in many deep learning applications, we yet do not understand what enables a successful transfer and which part of the network is responsible for that. In this paper, we provide new tools and analysis to address these fundamental questions.
We separate the effect of feature reuse from learning high-level statistics of data and show that some benefit of transfer learning comes from the latter.
View details
Preview abstract
We characterize the singular values of the linear transformation associated with
a standard 2D multi-channel convolutional layer, enabling their
efficient computation.
This characterization also leads to an algorithm for projecting a convolutional layer onto
an operator-norm ball.
We show that this is an effective regularizer;
for example, it improves the test error of a deep residual network
using batch normalization
on CIFAR-10 from 6.2% to 5.3%.
View details
Preview abstract
We analyze the joint probability distribution on the lengths of the vectors of hidden variables
in different layers of a fully connected deep network, when the weights and biases are chosen
randomly according to Gaussian distributions. We show that, if the activation function φ satisfies a minimal set of assumptions, satisfied by all activation functions that we know that are used in practice, then, as the width of the network gets large, the “length process” converges in probability to a length map that is determined as a simple function of the variances of the random weights and biases, and the activation function φ. We also show that this convergence may fail for φ that violate our assumptions.
View details