Sanjiv Kumar
Hi! I work in the area of large-scale machine learning and computer vision. You can find more information about me including a complete list of papers at: www.sanjivk.com.
Authored Publications
In-context Ranking (ICR) is an emerging paradigm for Information Retrieval (IR), which leverages the contextual understanding of LLMs by directly incorporating the task description, candidate documents, and the query into the model's input prompt and tasking the LLM to identify the relevant document(s). While it is effective, efficiency is a significant challenge in this paradigm, especially as the candidate list grows, due to the quadratic/super-linear scaling of the attention operation with context length. To this end, this paper first identifies inherent and exploitable structures in the attention of LLMs fine-tuned for ICR: (1) inter-document block sparsity: attention is dense within each document block but sparse across different documents in the context; and (2) query-document block relevance: the attention scores from certain query tokens to a document block in middle layers strongly correlate with that document's actual relevance. Motivated by these observations, we introduce BlockRank (Blockwise In-context Ranking), a novel method that adapts the attention operation in an LLM by (a) architecturally enforcing the observed inter-document block sparsity, reducing attention complexity from quadratic to linear without loss in performance, and (b) optimizing query-document block relevance for true relevant documents during fine-tuning using an auxiliary contrastive training objective, improving retrieval in attention. Experiments on BEIR, MSMarco and NQ with Mistral-7B demonstrate that BlockRank Mistral matches or outperforms existing SOTA listwise rankers and a controlled fine-tuned baseline while being significantly more efficient at inference (4.7x for 100 MSMarco documents in context) and scaling gracefully to long-context shortlists, around 500 documents in-context (approximately 100K context length) within a second, presenting a scalable and effective solution for ICR.
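The inter-document block sparsity described above can be illustrated with a toy attention mask. This is only a sketch under stated assumptions, not the BlockRank implementation: the segment ids `INSTRUCTION` and `QUERY`, the helper names, and the choice to let every token see the instruction and to let query tokens see everything are all hypothetical choices made for illustration.

```python
from typing import List

INSTRUCTION = 0   # hypothetical segment id for the task description
QUERY = -1        # hypothetical segment id for the query tokens
# Documents use positive segment ids 1, 2, 3, ...


def layout(num_docs: int, doc_len: int, instr_len: int = 2, query_len: int = 2) -> List[int]:
    """Segment id per token: instruction, then each document, then the query."""
    segments = [INSTRUCTION] * instr_len
    for doc_id in range(1, num_docs + 1):
        segments += [doc_id] * doc_len
    segments += [QUERY] * query_len
    return segments


def block_mask(segments: List[int]) -> List[List[bool]]:
    """mask[i][j] is True when token i may attend to token j.

    Assumed rule: causal attention, restricted to (a) tokens in the same
    block, (b) instruction tokens, or (c) anything earlier when the
    attending token belongs to the query.
    """
    n = len(segments)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):  # causal: only positions j <= i
            same_block = segments[i] == segments[j]
            to_instruction = segments[j] == INSTRUCTION
            from_query = segments[i] == QUERY
            mask[i][j] = same_block or to_instruction or from_query
    return mask


def allowed_pairs(mask: List[List[bool]]) -> int:
    """Number of (i, j) pairs for which attention is computed."""
    return sum(sum(row) for row in mask)
```

With a fixed document length, each extra document in this toy layout adds the same number of allowed pairs, so the count grows linearly in the number of documents, whereas a full causal mask grows quadratically.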
Analyzing Similarity Metrics for Data Selection for Language Model Pretraining
Dylan Sam
Afshin Rostamizadeh
Gui Citovsky
Advances in Neural Information Processing Systems (NeurIPS) (2025) (to appear)
Measuring similarity between training examples is critical for curating high-quality and diverse pretraining datasets for language models. However, similarity is typically computed with a generic off-the-shelf embedding model that has been trained for tasks such as retrieval. Whether these embedding-based similarity metrics are well-suited for pretraining data selection remains largely unexplored. In this paper, we propose a new framework to assess the suitability of a similarity metric specifically for data curation in language model pretraining applications. Our framework's first evaluation criterion captures how well distances reflect generalization in pretraining loss between different training examples. Next, we use each embedding model to guide a standard diversity-based data curation algorithm and measure its utility by pretraining a language model on the selected data and evaluating downstream task performance. Finally, we evaluate the capabilities of embeddings to distinguish between examples from different data sources. With these evaluations, we demonstrate that standard off-the-shelf embedding models are not well-suited for the pretraining data curation setting, underperforming even remarkably simple embeddings that are extracted from models trained on the same pretraining corpus. Our experiments are performed on the Pile, for pretraining a 1.7B parameter language model on 200B tokens. We believe our analysis and evaluation framework serves as a foundation for the future design of embeddings that specifically reason about similarity in pretraining datasets.
Better autoregressive regression with LLMs via regression-aware fine-tuning
Zhao Meng
Aditya Menon
The Thirteenth International Conference on Learning Representations (2025)
Decoder-based large language models (LLMs) have proven highly versatile, with remarkable successes even on problems ostensibly removed from traditional language generation. One such example is solving regression problems, where the targets are real numbers rather than textual tokens. A common approach to using LLMs on such problems is to perform fine-tuning based on the cross-entropy loss, and use autoregressive sampling at inference time. Another approach relies on fine-tuning a separate predictive head with a suitable loss such as squared error. While each approach has had success, there has been limited study of principled ways of using decoder LLMs for regression. In this work, we compare different prior works under a unified view, and introduce regression-aware fine-tuning (RAFT), a novel approach based on the Bayes-optimal decision rule. We demonstrate how RAFT improves over established baselines on several benchmarks and model families.
Bipartite Ranking From Multiple Labels: On Loss Versus Label Aggregation
Lin Chen
Aditya Menon
Forty-second International Conference on Machine Learning (2025)
Bipartite ranking is a fundamental supervised learning problem, with the goal of learning a ranking over instances with maximal area under the ROC curve (AUC) against a single binary target label. However, one may often observe multiple binary target labels, e.g., from distinct human annotators. How can one synthesize such labels into a single coherent ranking? In this work, we formally analyze two approaches to this problem—loss aggregation and label aggregation—by characterizing their Bayes-optimal solutions. We show that while both approaches can yield Pareto-optimal solutions, loss aggregation can exhibit label dictatorship: one can inadvertently (and undesirably) favor one label over others. This suggests that label aggregation can be preferable to loss aggregation, which we empirically verify.
Large language models (LLMs) have shown strong results on a range of applications, including regression and scoring tasks. Typically, one obtains outputs from an LLM via autoregressive sampling from the model's output distribution. We show that this inference strategy can be sub-optimal for common regression and scoring evaluation metrics. As a remedy, we build on prior work on Minimum Bayes Risk decoding, and propose alternate inference strategies that estimate the Bayes-optimal solution for regression and scoring metrics in closed form from sampled responses. We show that our proposal significantly improves over baselines across datasets and models.
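As a rough illustration of estimating a Bayes-optimal prediction from sampled responses, the sketch below uses two textbook facts: the mean minimizes expected squared error and the median minimizes expected absolute error. The function names, the metric labels, and the parsing of responses as plain numeric strings are assumptions for this example; the paper's own estimators may differ.

```python
import statistics
from typing import List


def parse_numeric(samples: List[str]) -> List[float]:
    """Keep only the sampled responses that parse as numbers."""
    values = []
    for text in samples:
        try:
            values.append(float(text.strip()))
        except ValueError:
            continue  # skip responses that are not numeric
    return values


def mbr_estimate(samples: List[str], metric: str = "squared_error") -> float:
    """Closed-form estimate of the risk-minimizing prediction from samples.

    squared_error  -> sample mean   (minimizer of expected squared error)
    absolute_error -> sample median (minimizer of expected absolute error)
    """
    values = parse_numeric(samples)
    if not values:
        raise ValueError("no numeric samples to aggregate")
    if metric == "squared_error":
        return statistics.fmean(values)
    if metric == "absolute_error":
        return statistics.median(values)
    raise ValueError(f"unsupported metric: {metric}")
```

For example, `mbr_estimate(["3", "4", "4", "9", "n/a"])` averages the four numeric samples, while passing `metric="absolute_error"` returns their median instead; a single sampled response would return just one of these values.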
DistillSpec: Improving speculative decoding via knowledge distillation
Yongchao Zhou
Kaifeng Lyu
Aditya Menon
Afshin Rostamizadeh
Jean-François Kagy
Rishabh Agarwal
International Conference on Learning Representations (ICLR) (2024)
Speculative decoding proves highly effective in expediting Large Language Model inference by employing a smaller draft model for token generation and a larger model for parallel token verification. Nonetheless, identifying an accurate and compact draft model aligned with the target model presents challenges. To address this, we propose leveraging white-box knowledge distillation, significantly improving draft model alignment with the larger target model, thereby enhancing speculative decoding. Our findings underscore the pivotal role of on-policy data generation and a suitable divergence function tailored to the task and decoding scheme for successful distillation. In practice, our refined distillation approach yields a 20% speedup over standard speculative decoding across five distinct tasks, using both greedy decoding and temperature sampling. Furthermore, we extend the concept of lossless speculative decoding to incorporate a lenience factor in the rejection sampling step, offering fine-grained control over the trade-off between quality and latency in lossy decoding. Finally, adopting a strategy of "distilling for performance first and distillation for speculative decoding second" enables a remarkable 8x reduction in latency with minimal performance compromise, compared to no distillation and speculative decoding baseline.
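To make the role of draft-target alignment concrete, here is a minimal sketch of the standard lossless verification step from the speculative decoding literature, not of DistillSpec itself and without the lenience factor, whose exact form the abstract does not give. The function names and the list-of-probabilities representation are assumptions for illustration.

```python
from typing import List


def accept_drafted_token(token: int, p_target: List[float], q_draft: List[float], u: float) -> bool:
    """Accept the drafted token with probability min(1, p/q).

    `u` is a uniform draw in [0, 1), passed in so the step is deterministic.
    """
    accept_prob = min(1.0, p_target[token] / q_draft[token])
    return u < accept_prob


def residual_distribution(p_target: List[float], q_draft: List[float]) -> List[float]:
    """Distribution to resample from after a rejection: normalized max(0, p - q)."""
    residual = [max(0.0, p - q) for p, q in zip(p_target, q_draft)]
    total = sum(residual)
    if total == 0.0:
        return list(p_target)  # draft and target already agree
    return [r / total for r in residual]


def expected_acceptance(p_target: List[float], q_draft: List[float]) -> float:
    """Probability that a token sampled from the draft is accepted: sum of min(p, q)."""
    return sum(min(p, q) for p, q in zip(p_target, q_draft))
```

The expected acceptance rate rises as the draft distribution moves closer to the target distribution, which is why a better-aligned draft model translates into faster decoding.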
Think before you speak: Training language models with pause tokens
Sachin Goyal
Ziwei Ji
Aditya Menon
Vaishnavh Nagarajan
International Conference on Learning Representations (ICLR) (2024)
The present-day language model generates its response by producing a series of tokens in immediate succession: the (K+1)th token is an outcome of manipulating exactly K hidden values in each layer corresponding to each of the K previous tokens. Is it possible to somehow allow the model to manipulate more hidden values before committing to an answer? If yes, would this help? We explore these questions by training models with learnable pause tokens. Besides feeding the usual prefix to the model, our idea is to feed the model with an additional sequence of pause tokens. On these tokens, the model's output is ignored all the way until the last pause token, where we begin extracting the answer. We explore this idea of "delayed answering" in a 1B model, where we consider both pre-training and/or fine-tuning with pause tokens. We find that while merely fine-tuning a standard model is not very helpful, pause-pretrained models show promise on some downstream tasks such as GSM (reasoning) and SQuAD, CommonSenseQA and LAMBADA (question-answering tasks). We also conduct various ablations to explore the effect of the number of pause tokens. While our work takes a preliminary exploration in delayed computations for language models by focusing on a 1B model, we hope it inspires future work that can make this idea practically feasible without pre-training and for models trained with other pretraining objectives and other sizes.
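The input construction described above can be sketched in a few lines. This is a toy illustration only: the pause token id, the helper names, and the convention that `outputs[i]` is the model's prediction after reading position `i` are assumptions, and no real model is involved.

```python
from typing import List, Sequence

PAUSE_ID = 99  # hypothetical vocabulary id reserved for the pause token


def add_pause_tokens(prefix_ids: List[int], num_pauses: int, pause_id: int = PAUSE_ID) -> List[int]:
    """Append `num_pauses` pause tokens after the usual prefix."""
    return prefix_ids + [pause_id] * num_pauses


def first_answer_position(prefix_len: int, num_pauses: int) -> int:
    """Index of the last input token; its output is the first one that is kept.

    With num_pauses == 0 this is the last prefix token, i.e. standard decoding.
    """
    return prefix_len + num_pauses - 1


def keep_answer_outputs(outputs: Sequence[str], prefix_len: int, num_pauses: int) -> List[str]:
    """Ignore outputs on the prefix and on all pause tokens except the last."""
    return list(outputs[first_answer_position(prefix_len, num_pauses):])
```

In this sketch the model processes `prefix_len + num_pauses` positions before its first output is used, rather than `prefix_len` positions in standard decoding.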
Language Model Cascades: Token-Level Uncertainty And Beyond
Neha Gupta
Aditya Menon
International Conference on Learning Representations (2024)
Recent advances in language model (LM) design have yielded a series of models with remarkably improved quality on complex NLP tasks, but significantly increased inference cost. A simple strategy to achieve more favourable cost-quality tradeoffs is cascading: here, a small model is invoked for most “easy” instances, while a large model is invoked for a few “hard” instances. Typically, “easy” instances are those where the small model has high confidence in its prediction. While the principles underpinning effective cascading are well-studied for classification problems, a similar understanding is lacking for generative tasks. The extension of the simple “Chow” rule, which defers based on the probability of predicting an answer, is not straightforward for generative tasks where the number of output tokens is variable. Moreover, LMs are known to suffer from length bias, where longer answers are penalized more than shorter answers, which complicates things further. In this work, we initiate a systematic study of deferral rules for cascades for language models. For example, how does one best summarise model confidence across a variable number of output tokens? We show experimentally that there is no one straightforward extension of probability-based uncertainty for LMs which works well across all tasks. Via experiments on a range of benchmarks with FLAN-T5 models, we find that incorporating token-level uncertainty can significantly improve the cost-quality tradeoff of cascades. We further show that incorporating embeddings from the smaller model and intermediate layer embeddings from the larger model can further boost performance.
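The length bias and the appeal of token-level uncertainty can be seen in a small sketch that compares three ways of summarising the small model's per-token log-probabilities into one confidence score. The function names, the threshold, and the specific quantile rule are illustrative assumptions, not the deferral rules evaluated in the paper.

```python
import math
from typing import Callable, List


def sequence_logprob(token_logprobs: List[float]) -> float:
    """Chow-style score: log-probability of the whole output sequence."""
    return sum(token_logprobs)


def mean_logprob(token_logprobs: List[float]) -> float:
    """Length-normalized score: average per-token log-probability."""
    return sum(token_logprobs) / len(token_logprobs)


def quantile_logprob(token_logprobs: List[float], q: float = 0.0) -> float:
    """Token-level score: the q-quantile of per-token log-probabilities (q=0 is the minimum)."""
    ordered = sorted(token_logprobs)
    index = min(len(ordered) - 1, int(q * len(ordered)))
    return ordered[index]


def should_defer(token_logprobs: List[float], score: Callable[[List[float]], float], threshold: float) -> bool:
    """Send the instance to the large model when the small model's confidence is low."""
    return score(token_logprobs) < threshold


# Two answers with identical per-token confidence but different lengths,
# and one long answer containing a single very uncertain token.
short_answer = [math.log(0.9)] * 2
long_answer = [math.log(0.9)] * 10
one_bad_token = [math.log(0.9)] * 9 + [math.log(0.05)]
```

With a threshold of `-0.5`, the sequence-level score defers `long_answer` but not `short_answer` purely because of length, the mean score defers neither and also misses `one_bad_token`, and the minimum-token score defers only `one_bad_token`.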
Promises and Pitfalls of Generative Masked Language Modeling: Theoretical Framework and Practical Guidelines
Yuchen Li
Alexandre Kirchmeyer
Aashay Mehta
Yilong Qin
Andrej Risteski
International Conference on Machine Learning (2024) (to appear)
Autoregressive language models are currently the dominant paradigm for text generation; however, they have some fundamental limitations that cannot be remedied by scale, for example inherently sequential and unidirectional generation. While alternate classes of models have been explored, we have limited mathematical understanding of their fundamental power and limitations. In this paper we focus on Generative Masked Language Models (GMLMs), a non-autoregressive paradigm in which we train a model to fit conditional probabilities of the data distribution via masking, which are subsequently used as inputs to a Markov Chain to draw samples from the model. These models empirically strike a promising speed-quality trade-off, as each step can typically be parallelized by decoding the entire sequence in parallel. We develop a mathematical framework for analyzing and improving such models which sheds light on questions of sample complexity and inference speed and quality. Empirically, we adapt the T5 model for iteratively-refined parallel decoding, achieving 2-3x speedup in machine translation with minimal sacrifice in quality compared with autoregressive models. We run careful ablation experiments to give recommendations on key design choices, and make fine-grained observations on the common error modes in connection with our theory. Our mathematical analyses and empirical observations characterize both the potential and the limitations of this approach, and can be applied to future work on improving the understanding and performance of GMLMs.