Srinadh Bhojanapalli
I am a research scientist at Google Research in New York. Earlier, I was a research assistant professor at TTI Chicago. I obtained my PhD at The University of Texas at Austin, where I was advised by Prof. Sujay Sanghavi.
My research is primarily focused on designing statistically efficient algorithms for large-scale machine learning problems. I am interested in non-convex optimization, matrix and tensor factorization, neural networks, and sub-linear time algorithms.
Authored Publications
Efficient Language Model Architectures for Differentially Private Federated Learning
Yanxiang Zhang
Privacy Regulation and Protection in Machine Learning Workshop at ICLR 2024 (2024) (to appear)
Abstract
Cross-device federated learning (FL) is a technique that trains a model on data distributed across typically millions of edge devices without the data ever leaving the devices. SGD is the standard client optimizer for on-device training in cross-device FL, favored for its memory and computational efficiency. However, in centralized training of neural language models, adaptive optimizers are preferred as they offer improved stability and performance. In light of this, we ask whether language models can be modified such that they can be efficiently trained with SGD client optimizers, and answer this affirmatively. We propose a scale-invariant \emph{Coupled Input Forget Gate} (SI CIFG) recurrent network, obtained by modifying the sigmoid and tanh activations in the recurrent cell, and show in large-scale experiments that this new model converges faster and achieves better utility than the standard CIFG recurrent model in cross-device FL. We further show that the proposed scale-invariant modification also helps in federated learning of larger transformer models. Finally, we demonstrate that the scale-invariant modification is also compatible with other non-adaptive algorithms. In particular, our results suggest an improved privacy-utility trade-off in federated learning with differential privacy.
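The exact scale-invariant activations are defined in the paper; as background only, the sketch below shows a standard Coupled Input Forget Gate (CIFG) cell in plain numpy, with the gate and cell activations passed in as arguments so that a scale-invariant variant could be substituted. The activation choices and shapes here are illustrative assumptions, not the paper's SI CIFG definition.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cifg_cell_step(x, h_prev, c_prev, W, U, b,
                   gate_act=sigmoid, cell_act=np.tanh):
    """One step of a Coupled Input Forget Gate (CIFG) LSTM cell.

    The input gate is tied to (1 - forget gate), so only forget,
    cell-candidate, and output pre-activations are computed.
    `gate_act` / `cell_act` are the activations that a scale-invariant
    variant would replace (placeholders here, not the paper's SI CIFG).
    """
    z = W @ x + U @ h_prev + b            # stacked pre-activations, shape (3*d,)
    d = h_prev.shape[0]
    f = gate_act(z[:d])                   # forget gate
    g = cell_act(z[d:2 * d])              # cell candidate
    o = gate_act(z[2 * d:])               # output gate
    c = f * c_prev + (1.0 - f) * g        # coupled input gate = 1 - f
    h = o * cell_act(c)
    return h, c

# toy usage with illustrative sizes
d_in, d_hid = 4, 3
rng = np.random.default_rng(0)
W = rng.normal(size=(3 * d_hid, d_in))
U = rng.normal(size=(3 * d_hid, d_hid))
b = np.zeros(3 * d_hid)
h, c = cifg_cell_step(rng.normal(size=d_in), np.zeros(d_hid), np.zeros(d_hid), W, U, b)
```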
On Emergence of Activation Sparsity in Trained Transformers
Zonglin Li
Chong You
Daliang Li
Ke Ye
International Conference on Learning Representations (2023) (to appear)
Abstract
This paper reveals a curious observation that modern large-scale machine learning models with Transformer architectures have sparse activation maps. By activation map we refer to the intermediate output of the multi-layer perceptrons (MLPs) after a ReLU activation function, and by ``sparse'' we mean that on average very few entries (e.g., 3.0% for T5-Base and 6.3% for ViT-B16) are nonzero for each input to the MLP. Through extensive experiments we demonstrate that the emergence of sparsity is a prevalent phenomenon that occurs for both natural language processing and vision tasks, on both training and evaluation data, for Transformers of various configurations, and at layers of all depths. Moreover, larger Transformers with more layers and higher MLP hidden dimensions are sparser as measured by the percentage of nonzero entries. To probe why sparsity emerges, we design experiments with random labels, random images, and infinite data, and find that sparsity may be due primarily to optimization and has little to do with the properties of the training dataset. We discuss how sparsity immediately implies a means for significantly reducing the FLOP count and improving efficiency for Transformers. Moreover, we demonstrate, perhaps surprisingly, that explicitly enforcing an even sparser activation via Top-K thresholding with a small value of k brings a collection of desired but missing properties to Transformers, namely less sensitivity to noisy training data, more robustness to input corruptions, and better calibration of their prediction confidence.
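As a concrete illustration of the Top-K thresholding mentioned above, here is a minimal numpy sketch that keeps only the k largest entries of each post-ReLU MLP activation vector; where exactly it sits inside the Transformer MLP is an assumption for illustration.

```python
import numpy as np

def topk_activation(x, k):
    """Keep the k largest entries per row of `x`, zero out the rest.

    `x` has shape (batch, hidden); mimics enforcing sparse MLP
    activations after the ReLU, as described above.
    """
    x = np.maximum(x, 0.0)                               # ReLU
    if k >= x.shape[-1]:
        return x
    # threshold = k-th largest value in each row
    thresh = np.partition(x, -k, axis=-1)[..., -k][..., None]
    return np.where(x >= thresh, x, 0.0)

acts = np.random.default_rng(0).normal(size=(2, 8))
sparse_acts = topk_activation(acts, k=2)
print((sparse_acts != 0).sum(axis=-1))  # about 2 nonzeros per row (ties aside)
```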
Teacher's pet: understanding and mitigating biases in distillation
Aditya Krishna Menon
Transactions on Machine Learning Research (2022)
Abstract
Knowledge distillation is widely used as a means of improving the performance of a relatively simple student model using the predictions of a complex teacher model. Several works have shown that distillation significantly boosts the student's overall performance; however, are these gains uniform across all data subgroups? In this paper, we show that distillation can harm performance on certain subgroups, e.g., classes with few associated samples, compared to the vanilla student trained using one-hot labels. We trace this behaviour to errors made by the teacher distribution being transferred to and amplified by the student model. To mitigate this problem, we present techniques that soften the teacher's influence for subgroups where it is less reliable. Experiments on several image classification benchmarks show that these modifications of distillation maintain the boost in overall accuracy while additionally ensuring improvement in subgroup performance.
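The paper's mitigation techniques are its own; purely as a rough sketch, the code below shows one generic way to soften the teacher's influence per class, blending teacher probabilities with one-hot labels using a hypothetical per-class trust weight `alpha`.

```python
import numpy as np

def softened_distillation_targets(teacher_probs, labels, alpha):
    """Blend teacher predictions with one-hot labels per class.

    teacher_probs: (batch, num_classes) teacher output distribution.
    labels:        (batch,) integer ground-truth labels.
    alpha:         (num_classes,) trust in the teacher for each class
                   (e.g., lower for rare classes where it is unreliable).
    Returns per-example soft targets for the student's cross-entropy loss.
    """
    num_classes = teacher_probs.shape[1]
    one_hot = np.eye(num_classes)[labels]
    a = alpha[labels][:, None]                        # per-example teacher weight
    return a * teacher_probs + (1.0 - a) * one_hot

# toy usage: trust the teacher less on class 2
teacher = np.array([[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]])
targets = softened_distillation_targets(teacher, np.array([0, 2]), np.array([0.9, 0.9, 0.3]))
```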
Coping with label shift via distributionally robust optimisation
Jingzhao Zhang
Aditya Krishna Menon
Suvrit Sra
International Conference on Learning Representations (2021)
Abstract
The label shift problem refers to the supervised learning setting wherein the train and test label distributions do not match. Existing work on this problem largely assumes access to an unlabelled test sample, which may be used to estimate the test label distribution. While such techniques have proven effective, it is not always feasible to access the target domain; further, this requires retraining if the model is to be deployed in multiple test environments. Can one instead learn a single classifier that is robust to arbitrary shifts from a certain family? In this paper, we propose such a technique based on distributionally robust optimization (DRO) using f-divergences. We design a gradient descent-proximal mirror ascent algorithm tailored for large-scale finite-sum problems to efficiently optimize this objective, and establish its convergence. We show through experiments on CIFAR-100 and ImageNet that our technique can significantly improve performance over a number of baselines in settings where the test label distribution is varied.
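As an illustration of the DRO idea (not the paper's exact algorithm or divergence choice), the sketch below finds a worst-case reweighting of per-class losses under a KL-divergence penalty, one member of the f-divergence family, using a simple exponentiated-gradient (mirror ascent) loop; the step size, penalty weight, and iteration count are arbitrary.

```python
import numpy as np

def worst_case_class_weights(per_class_loss, prior, lam=1.0, lr=0.1, steps=200):
    """Mirror ascent on q -> sum_c q_c * loss_c - lam * KL(q || prior).

    per_class_loss: (C,) current loss of the model on each class.
    prior:          (C,) training label distribution.
    Returns a reweighting q over classes that up-weights poorly
    predicted (e.g., rare) classes, penalised by a KL term.
    """
    q = prior.copy()
    for _ in range(steps):
        grad = per_class_loss - lam * (np.log(q / prior) + 1.0)
        q = q * np.exp(lr * grad)          # exponentiated-gradient step
        q = q / q.sum()                    # project back onto the simplex
    return q

prior = np.array([0.5, 0.3, 0.2])
losses = np.array([0.2, 0.4, 1.5])         # the rare class is hardest
q = worst_case_class_weights(losses, prior)
robust_loss = float(q @ losses)            # reweighted loss the model would then minimise
```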
Understanding Robustness of Transformers for Image Classification
Daliang Li
Thomas Unterthiner
Proceedings of the IEEE/CVF International Conference on Computer Vision (2021) (to appear)
Abstract
Deep Convolutional Neural Networks (CNNs) have long been the architecture of choice for computer vision tasks. Recently, Transformer-based architectures like Vision Transformer (ViT) have matched or even surpassed ResNets for image classification. However, details of the Transformer architecture such as the use of non-overlapping patches lead one to wonder whether these networks are as robust. In this paper, we perform an extensive study of a variety of different measures of robustness of ViT models and compare the findings to ResNet baselines. We investigate robustness to input perturbations as well as robustness to model perturbations. We find that when pre-trained with a sufficient amount of data, ViT models are at least as robust as the ResNet counterparts on a broad range of perturbations. We also find that Transformers are robust to the removal of almost any single layer, and that while activations from later layers are highly correlated with each other, they nevertheless play an important role in classification.
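One of the model-perturbation tests mentioned above is removing individual layers; the sketch below shows a generic version of that evaluation loop in numpy, with `blocks`, `head`, and the toy data all being illustrative stand-ins rather than anything from the paper.

```python
import numpy as np

def layer_removal_study(blocks, x, y, head):
    """Measure accuracy when each residual block is skipped in turn.

    blocks: list of callables mapping activations -> activations
            (assumed residual, so skipping one keeps shapes valid).
    head:   callable mapping final activations -> predicted labels.
    """
    def run(skip_idx=None):
        h = x
        for i, block in enumerate(blocks):
            if i != skip_idx:
                h = block(h)
        return float(np.mean(head(h) == y))

    baseline = run()
    return baseline, [run(skip_idx=i) for i in range(len(blocks))]

# toy usage with near-identity blocks and a trivial head
x = np.random.default_rng(0).normal(size=(8, 4))
y = (x.sum(axis=1) > 0).astype(int)
blocks = [lambda h: h + 0.01 * h for _ in range(3)]
head = lambda h: (h.sum(axis=1) > 0).astype(int)
base_acc, ablated_accs = layer_removal_study(blocks, x, y, head)
```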
Low-Rank Bottleneck in Multi-head Attention Models
Chulhee Yun
International Conference on Machine Learning (ICML) 2020
Abstract
The attention-based Transformer architecture has enabled significant advances in the field of natural language processing. In addition to new pre-training techniques, recent improvements crucially rely on working with a relatively large embedding dimension for tokens. Unfortunately, this leads to models that are prohibitively large to be employed in downstream tasks. In this paper we identify one of the important factors contributing to the large embedding size requirement. In particular, our analysis highlights that the scaling between the number of heads and the size of each head in the current architecture gives rise to a low-rank bottleneck in attention heads, causing this limitation, which we further validate with our experiments. As a solution, we propose setting the head size of an attention unit to the input sequence length, independent of the number of heads, resulting in multi-head attention layers with provably more expressive power. We empirically show that this allows us to train models with a relatively smaller embedding dimension and with better performance scaling.
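To make the proposed change concrete, here is a minimal numpy sketch of multi-head attention in which the per-head size is a free parameter (e.g., set to the sequence length) rather than being tied to the embedding dimension divided by the number of heads; the weight shapes and toy sizes are illustrative assumptions.

```python
import numpy as np

def multi_head_attention(X, Wq, Wk, Wv, Wo):
    """Multi-head attention with head size decoupled from embed_dim.

    X:  (n, d) input sequence (n tokens, embedding size d).
    Wq, Wk, Wv: (num_heads, d, head_dim) projections; head_dim need not
                equal d / num_heads -- e.g., set head_dim = n as proposed.
    Wo: (num_heads * head_dim, d) output projection back to dimension d.
    """
    heads = []
    for q, k, v in zip(Wq, Wk, Wv):
        Q, K, V = X @ q, X @ k, X @ v                    # (n, head_dim) each
        scores = Q @ K.T / np.sqrt(q.shape[-1])          # (n, n)
        scores = scores - scores.max(axis=-1, keepdims=True)
        attn = np.exp(scores)
        attn = attn / attn.sum(axis=-1, keepdims=True)   # softmax over keys
        heads.append(attn @ V)                           # (n, head_dim)
    return np.concatenate(heads, axis=-1) @ Wo           # (n, d)

n, d, num_heads, head_dim = 6, 8, 4, 6                   # head_dim = n, not d // num_heads
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.normal(size=(num_heads, d, head_dim)) for _ in range(3))
Wo = rng.normal(size=(num_heads * head_dim, d))
out = multi_head_attention(rng.normal(size=(n, d)), Wq, Wk, Wv, Wo)
```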
Does label smoothing mitigate label noise?
Aditya Krishna Menon
International Conference on Machine Learning (2020) (to appear)
Abstract
Label smoothing is commonly used in training deep learning models, wherein one-hot training labels are mixed with uniform label vectors. Empirically, smoothing has been shown to improve both predictive performance and model calibration. In this paper, we study whether label smoothing is also effective as a means of coping with label noise. While label smoothing apparently amplifies this problem --- being equivalent to injecting symmetric noise into the labels --- we show how it relates to a general family of loss-correction techniques from the label noise literature. Building on this connection, we show that label smoothing can be competitive with loss-correction techniques under label noise. Further, we show that when performing distillation under label noise, label smoothing of the teacher can be beneficial; this is in contrast to recent findings for noise-free problems, and sheds further light on settings where label smoothing is beneficial.
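For reference, label smoothing with parameter alpha over K classes mixes the one-hot target with the uniform distribution, i.e., (1 - alpha) * one_hot + alpha / K; a minimal numpy sketch:

```python
import numpy as np

def smooth_labels(labels, num_classes, alpha=0.1):
    """Return (1 - alpha) * one_hot + alpha / num_classes for each label."""
    one_hot = np.eye(num_classes)[labels]
    return (1.0 - alpha) * one_hot + alpha / num_classes

print(smooth_labels(np.array([2]), num_classes=4, alpha=0.1))
# [[0.025 0.025 0.925 0.025]]
```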
Modifying Memories in Transformer Models
Chen Zhu
Daliang Li
International Conference on Machine Learning (ICML) 2021 (2020)
Abstract
Large Transformer models have achieved impressive performance in many natural language tasks. In particular, Transformer-based language models have been shown to have great capability for encoding factual knowledge in their vast number of parameters. While the tasks of improving the memorization and generalization of Transformers have been widely studied, it is not well known how to make Transformers forget specific old facts and memorize new ones. In this paper, we propose a new task of \emph{explicitly modifying specific factual knowledge in Transformer models while ensuring the model performance does not degrade on the unmodified facts}. This task is useful in many scenarios, such as updating stale knowledge, protecting privacy, and eliminating unintended biases stored in the models. We benchmark several approaches that provide natural baseline performance on this task, which leads to the discovery of key components of a Transformer model that are especially effective for knowledge modification. The work also provides insights into the role that different training phases (such as pretraining and fine-tuning) play in memorization and knowledge modification.
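The benchmarked approaches are described in the paper; purely as an illustration of the general idea, the sketch below performs a fine-tuning update on the modified facts with an L2 pull toward the original weights, so that behaviour on unmodified facts drifts less. The penalty form and hyperparameters are assumptions, not necessarily any of the paper's baselines.

```python
import numpy as np

def constrained_update(params, grads_on_modified_facts, params_orig,
                       lr=1e-3, penalty=1.0):
    """One gradient step on the modified facts with an L2 pull toward
    the original weights, limiting drift on unmodified facts.

    All arguments are flat numpy arrays of the same shape; in practice
    this would be applied per parameter tensor of the Transformer.
    """
    total_grad = grads_on_modified_facts + penalty * (params - params_orig)
    return params - lr * total_grad

# toy usage with stand-in gradients on the edited facts
params_orig = np.zeros(4)
params = params_orig.copy()
for _ in range(100):
    grads = np.array([1.0, -2.0, 0.5, 0.0])
    params = constrained_update(params, grads, params_orig)
```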
$O(n)$ Connections are Expressive Enough: Universal Approximability of Sparse Transformers
Chulhee Yun
Advances in Neural Information Processing Systems (2020)
Abstract
Transformer networks use pairwise attention to compute contextual embeddings of their inputs, and have achieved state-of-the-art performance on many NLP tasks. However, these models suffer from a quadratic computational cost in the input sequence length $n$ to compute attention in each layer. This has prompted recent research into faster attention models, with a predominant approach involving sparsifying the connections in the attention layers. While empirically promising for long sequences, several fundamental questions remain unanswered: Can sparse transformers approximate any arbitrary sequence-to-sequence function, similar to their dense counterparts? How do the sparsity pattern and the sparsity level affect their performance? In this paper, we provide a \emph{unifying framework} that captures existing sparse attention models. Our analysis establishes sufficient conditions under which a sparse attention model can provably \emph{universally approximate} any sequence-to-sequence function. Surprisingly, our results show the existence of attention models with only $O(n)$ connections per attention layer that can approximate the same function class as the dense model with $n^2$ connections. Lastly, we present experiments comparing different patterns and levels of sparsity on standard NLP tasks.
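As one concrete example of a sparsity pattern with $O(n)$ connections (an illustrative choice, not the paper's specific construction), the sketch below builds a sliding-window attention mask in which each query attends to at most 2w + 1 nearby keys.

```python
import numpy as np

def sliding_window_mask(n, w):
    """Boolean (n, n) mask: query i may attend to keys j with |i - j| <= w.

    Each row has at most 2*w + 1 True entries, so the layer has O(n)
    connections for a fixed window width w (vs. n^2 for dense attention).
    """
    idx = np.arange(n)
    return np.abs(idx[:, None] - idx[None, :]) <= w

mask = sliding_window_mask(n=8, w=1)
print(mask.sum())   # 22 connections instead of 64
```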
Are Transformers universal approximators of sequence-to-sequence functions?
Chulhee Yun
International Conference on Learning Representations (ICLR) (2020)
Abstract
Despite the widespread adoption of Transformer models for NLP tasks, the expressive power of these models is not well understood. In this paper, we establish that Transformer models are universal approximators of continuous permutation-equivariant sequence-to-sequence functions with compact support, which is quite surprising given the amount of parameter sharing in these models. Furthermore, using positional encodings, we circumvent the restriction of permutation equivariance and show that Transformer models can universally approximate arbitrary continuous sequence-to-sequence functions on a compact domain. Interestingly, our proof techniques clearly highlight the different roles of the self-attention and the feed-forward layers in Transformers. In particular, we prove that fixed-width self-attention layers can compute contextual mappings of the input sequences, playing a key role in the universal approximation property of Transformers. Based on this insight from our analysis, we consider other architectures that can compute contextual mappings and empirically evaluate them.
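To illustrate the permutation equivariance discussed above, the numpy sketch below checks that a generic single-head self-attention layer without positional encodings satisfies f(PX) = P f(X) for a random permutation; the layer and weights here are written just for this check, not taken from the paper.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head self-attention without positional encodings."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(Wq.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)
    A = np.exp(scores)
    A /= A.sum(axis=-1, keepdims=True)
    return A @ V

rng = np.random.default_rng(0)
n, d = 5, 4
X = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
perm = rng.permutation(n)

out_then_perm = self_attention(X, Wq, Wk, Wv)[perm]
perm_then_out = self_attention(X[perm], Wq, Wk, Wv)
print(np.allclose(out_then_perm, perm_then_out))  # True: permutation equivariant
```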