Sashank Reddi
Research Areas
Authored Publications
Sort By
Bipartite Ranking From Multiple Labels: On Loss Versus Label Aggregation
Lin Chen
Aditya Menon
2025
Preview abstract
Bipartite ranking is a fundamental supervised learning problem, with the goal of learning a ranking over instances with maximal Area Under the ROC Curve (AUC) against a single binary target label. However, one may often observe multiple binary target labels, e.g., from distinct human annotators. How can one synthesize such labels into a single coherent ranking? In this work, we formally analyze two approaches to this problem -- loss aggregation and label aggregation -- by characterizing their Bayes-optimal solutions. We show that while both approaches can yield Pareto-optimal solutions, loss aggregation can exhibit label dictatorship: one can inadvertently (and undesirably) favor one label over others. This suggests that label aggregation can be preferable to loss aggregation, which we empirically verify.
View details
Bipartite Ranking From Multiple Labels: On Loss Versus Label Aggregation
Lin Chen
Aditya Menon
Forty-second International Conference on Machine Learning (2025)
Preview abstract
Bipartite ranking is a fundamental supervised learning problem, with the goal of learning a ranking over instances with maximal area under the ROC curve (AUC) against a single binary target label. However, one may often observe multiple binary target labels, e.g., from distinct human annotators. How can one synthesize such labels into a single coherent ranking? In this work, we formally analyze two approaches to this problem—loss aggregation and label aggregation—by characterizing their Bayes-optimal solutions. We show that while both approaches can yield Pareto-optimal solutions, loss aggregation can exhibit label dictatorship: one can inadvertently (and undesirably) favor one label over others. This suggests that label aggregation can be preferable to loss aggregation, which we empirically verify.
View details
Bipartite Ranking From Multiple Labels: On Loss Versus Label Aggregation
Lin Chen
Aditya Menon
2025
Preview abstract
Bipartite ranking is a fundamental supervised learning problem, with the goal of learning a ranking over instances with maximal area under the ROC curve (AUC) against a single binary target label. However, one may often observe multiple binary target labels, e.g., from distinct human annotators. How can one synthesize such labels into a single coherent ranking? In this work, we formally analyze two approaches to this problem—loss aggregation and label aggregation—by characterizing their Bayes-optimal solutions. We show that while both approaches can yield Pareto-optimal solutions, loss aggregation can exhibit label dictatorship: one can inadvertently (and undesirably) favor one label over others. This suggests that label aggregation can be preferable to loss aggregation, which we empirically verify.
View details
A Little Help Goes a Long Way: Efficient LLM Training by Leveraging Small LMs
Vlad Feinberg
Afshin Rostamizadeh
Nikunj Saunshi
Seungyeon Kim
Veeru Sadhanala
Rakesh Shivanna
Rohan Anil
Aditya Menon
Hrayr Harutyunyan
ArXiv (2024)
Preview abstract
A primary challenge in large language model (LLM) development is their onerous pre-training cost. This paper explores a promising paradigm to improve LLM pre-training efficiency and quality by leveraging a small language model (SLM). In particular, this paradigm relies on an SLM to both (1) provide soft labels as additional supervision, and (2) select a small subset of valuable training examples. Put together, this enables an effective transfer of the SLM's predictive distribution to the LLM, while prioritizing specific regions of the training data distribution. Empirically, this leads to reduced LLM training time compared to standard training, while improving the overall quality. Theoretically, we develop a statistical framework to study the utility of SLMs in enabling efficient training of high-quality LLMs. Our framework characterizes how the SLM's seemingly low-quality supervision can enhance the training of a much more capable LLM. Furthermore, it also highlights the need for an adaptive utilization of such supervision, by striking a balance between the bias and variance introduced by the SLM-provided soft labels. We corroborate our theoretical framework by improving the pre-training of LLMs with 2.8B and 8.6B parameters by utilizing smaller LMs on the Pile dataset.
View details
On Emergence of Activation Sparsity in Trained Transformers
Zonglin Li
Chong You
Daliang Li
Ke Ye
International Conference on Learning Representations (ICLR) (2023)
Preview abstract
This paper reveals a curious observation that modern large-scale machine learning models with Transformer architectures have sparse activation maps. By activation map we refer to the intermediate output of the multi-layer perceptrons (MLPs) after a ReLU activation function, and by ``sparse'' we mean that on average very few entries (e.g., 3.0% for T5-Base and 6.3% for ViT-B16) are nonzero for each input to MLP. Through extensive experiments we demonstrate that the emergence of sparsity is a prevalent phenomenon that occurs for both natural language processing and vision tasks, on both training and evaluation data, for Transformers of various configurations, at layers of all depth levels, etc. Moreover, larger Transformers with more layers and higher MLP hidden dimensions are sparser as measured by the percentage of nonzero entries. To probe why sparsity emerges, we design experiments with random labels, random images, and infinite data, and find that sparsity may be due primarily to optimization while has little to do with the properties of training dataset. We discuss how sparsity immediately implies a means for significantly reducing the FLOP count and improving efficiency for Transformers. Moreover, we demonstrate perhaps surprisingly that explicitly enforcing an even sparser activation via Top-K thresholding with a small value of k brings a collection of desired but missing properties for Transformers, namely less sensitivity to noisy training data, more robustness to input corruptions, and better calibration for their prediction confidence.
View details
Efficient Training of Language Models using Few-Shot Learning
Shankar Krishnan
Satyen Kale
Seungyeon Kim
ICML (2023)
Preview abstract
Large deep learning models have achieved state-of-the-art performance across various natural language processing (NLP) tasks and demonstrated remarkable few-shot learning performance. However, training them is often challenging and resource-intensive. In this paper, we study an efficient approach to train language models using few-shot learners. We show that, by leveraging the fast learning nature of few-shot learners, one can train language models efficiently in a stagewise manner. Our main insight is that stacking a good few-shot learner on a good small language model provides a good initializer for a larger language model. Using this insight and building upon progressive stacking approaches, we develop novel approaches for training such networks in a stagewise manner. Furthermore, we also provide a theoretical framework and accompanying empirical studies to support our insights, thereby creating a theoretical foundation for progressive stacking. Finally, we provide empirical results to demonstrate the effectiveness of our approach in reducing the training time of few-shot learners.
View details
In defense of dual-encoders for neural ranking
Aditya Krishna Menon
Sadeep Jayasumana
Seungyeon Kim
International Conference on Machine Learning (ICML) (2022)
Preview abstract
Transformer-based models such as BERT have proven successful in information retrieval problem, which seek to identify relevant documents for a given query. There are two broad flavours of such models: cross-attention (CA) models, which learn a joint embedding for the query and document, and dual-encoder (DE) models, which learn separate embeddings for the query and document. Empirically, CA models are often found to be more accurate, which has motivated a series of works seeking to bridge this gap. However, a more fundamental question remains less explored: does this performance gap reflect an inherent limitation in the capacity of DE models, or a limitation in the training of such models? And does such an understanding suggest a principled means of improving DE models? In this paper, we study these questions, with three contributions. First, we establish theoretically that with a sufficiently large embedding dimension, DE models have the capacity to model a broad class of score distributions. Second, we show empirically that on real-world problems, DE models may overfit to spurious correlations in the training set, and thus under-perform on test samples. To mitigate this behaviour, we propose a novel distillation strategy that leverages confidence margins, and confirm its practical efficacy on the MSMARCO-Passage benchmark.
View details
RankDistil: Distillation for Ranking
Aditya Krishna Menon
Seungyeon Kim
AISTATS 2021 (2021)
Preview abstract
Knowledge distillation is an approach to improve the performance of a student model by using the knowledge of a complex teacher. Despite its success in several deep learning applications, the study of distillation is mostly confined to classification settings. In particular, the use of distillation in top-k ranking settings, where the goal is to rank k most relevant items correctly, remains largely unexplored. In this paper, we study such ranking problems through the lens of distillation. We present a framework for distillation for top-k ranking and establish connections with the existing ranking methods. The core idea of this framework is to preserve the ranking at the top by matching the k largest scores of student and teacher while penalizing large scores for items ranked low by the teacher. Building on our framework, we develop a novel distillation approach, RankDistil, specifically catered towards ranking problems with a large number of items to rank. Finally, we conduct experiments which demonstrate that RankDistil yields benefits over commonly used baselines for ranking problems.
View details
Adaptive Federated Optimization
Manzil Zaheer
Jakub Konečný
(2021)
Preview abstract
Federated learning is a distributed machine learning paradigm in which a large number of clients coordinate with a central server to learn a model without sharing their own training data. Due to the heterogeneity of the client datasets, standard federated optimization methods such as Federated Averaging (FedAvg) are often difficult to tune and exhibit unfavorable convergence behavior. In non-federated settings, adaptive optimization methods have had notable success in combating such issues. In this work, we propose federated versions of adaptive optimizers, including Adagrad, Yogi and Adam, and analyze their convergence in the presence of heterogeneous data for general nonconvex settings. Our results highlight the interplay between client heterogeneity and communication efficiency. We also perform extensive experiments on these methods and show that the use of adaptive optimizers can improve the performance of federated learning.
View details
A statistical perspective on distillation
Aditya Krishna Menon
Seungyeon Kim
International Conference on Machine Learning (ICML) 2021
Preview abstract
Knowledge distillation is a technique for improving a ``student'' model by replacing its one-hot training labels with a label distribution obtained from a ``teacher'' model. Despite its broad success, several basic questions --- e.g., Why does distillation help? Why do more accurate teachers not necessarily distill better? --- have received limited formal study. In this paper, we present a statistical perspective on distillation which provides an answer to these questions. Our core observation is that a ``Bayes teacher'' providing the true class-probabilities can lower the variance of the student objective, and thus improve performance. We then establish a bias-variance tradeoff that quantifies the value of teachers that approximate the Bayes class-probabilities. This provides a formal criterion as to what constitutes a ``good'' teacher, namely, the quality of its probability estimates. Finally, we illustrate how our statistical perspective facilitates novel applications of distillation to bipartite ranking and multiclass retrieval.
View details