Ananda Theertha Suresh
Ananda Theertha Suresh is a research scientist at Google. He obtained his PhD from the University of California, San Diego, where he was advised by Prof. Alon Orlitsky. His research interests lie at the intersection of machine learning, information theory, and statistics. More details can be found at theertha.info.
Authored Publications
Abstract
Suppose Alice has a distribution ${P}$ and Bob has a distribution ${Q}$. Alice wants to draw a sample $a\sim {P}$ and Bob a sample $b \sim {Q}$ such that $a = b$ with as high a probability as possible. It is well-known that, by sampling from an optimal coupling between the distributions, Alice and Bob can achieve $\Pr[a = b] = 1 - D_{\text{tv}}({P},{Q})$, where $D_{\text{tv}}({P},{Q})$ is the total variation distance between ${P}$ and ${Q}$.
What if Alice and Bob must solve this same problem \emph{without communicating at all?} Perhaps surprisingly, with access to public randomness, they can still achieve $\Pr[a = b] \geq \frac{1 - D_{\text{tv}}({P},{Q})}{1 + D_{\text{tv}}({P},{Q})} \geq 1-2D_{\text{tv}}({P},{Q})$ using a simple protocol based on the Weighted MinHash algorithm. This bound was shown to be optimal in the worst case by Bavarian, Ghazi, Haramaty, Kamath, Rivest, and Sudan [ToC 2020].
In this work, we revisit the ``communication-free coupling'' problem. We provide a simpler proof of the optimality result from [Bavarian et al., 2020]. Moreover, we show that, while the \emph{worst-case} success probability of Weighted MinHash cannot be improved, an equally simple protocol based on Gumbel sampling offers a Pareto improvement: for every pair of distributions ${P}$ and ${Q}$, Gumbel sampling achieves an equal or higher value of $\Pr[a = b]$ than Weighted MinHash.
Importantly, this improvement translates to practice. We demonstrate an application of communication-free coupling to \emph{speculative decoding}, a recent method for accelerating autoregressive large language models [Leviathan, Kalman, Matias, ICML 2023].
We show that communication-free protocols can be used to construct \emph{drafter-invariant speculative decoding} schemes, which have the desirable property that their output is fixed given a fixed random seed, regardless of which drafter is used for speculation. In experiments on a language generation task, Gumbel sampling outperforms Weighted MinHash.
Code is available at \url{https://github.com/majid-daliri/DISD}.
Finally, we study the coupling problem in the setting where communication is \emph{bounded}, rather than completely eliminated. We describe a protocol that uses just $O(\log(n/\epsilon))$ bits of communication to achieve $\Pr[a = b] = 1 - D_{\text{tv}}({P},{Q}) - \epsilon$, i.e., to essentially match the optimal coupling.
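To make the protocols above concrete, here is a minimal NumPy sketch of the Gumbel-sampling protocol (a sketch, not the reference implementation from the linked repository): both parties use a shared public seed, add the same Gumbel noise to their own log-probabilities, and output the argmax. By the Gumbel-max trick each output is an exact sample from that party's own distribution, and the two outputs coincide with the probability guarantees stated above.

```python
import numpy as np

def gumbel_coupling_sample(probs, shared_seed):
    """Sample from `probs` using only public randomness (Gumbel-max trick).

    Two parties holding different distributions P and Q who call this with the
    same `shared_seed` each obtain an exact sample from their own distribution,
    and the samples agree with probability at least (1 - Dtv) / (1 + Dtv),
    with no communication between them.
    """
    rng = np.random.default_rng(shared_seed)      # public/shared randomness
    gumbels = rng.gumbel(size=len(probs))         # identical noise for both parties
    logits = np.log(np.asarray(probs, dtype=float))
    return int(np.argmax(logits + gumbels))

# Alice holds P, Bob holds Q; both use the same public seed and never communicate.
P = [0.5, 0.3, 0.2]
Q = [0.45, 0.35, 0.2]
a = gumbel_coupling_sample(P, shared_seed=1234)   # Alice's sample, a ~ P
b = gumbel_coupling_sample(Q, shared_seed=1234)   # Bob's sample,   b ~ Q
```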
Abstract
In the private set union problem each user owns a bag of at most $k$ items (from some large universe of items), and we are interested in computing the union of the items in the bags of all of the users. This is trivial without privacy, but a differentially private algorithm must be careful about reporting items contained in only a small number of bags. We consider differentially private algorithms that always report a subset of the union, and define the utility of an algorithm to be the expected size of the subset that it reports.
Because the achievable utility varies significantly with the dataset, we introduce the utility ratio, which normalizes utility by a dataset-specific upper bound and characterizes a mechanism by its lowest normalized utility across all datasets. We then develop algorithms with guaranteed utility ratios and complement them with bounds on the best possible utility ratio. Prior work has shown that a single algorithm can be simultaneously optimal for all datasets when $k=1$, but we show that instance-optimal algorithms do not exist when $k>1$, and characterize how performance degrades as $k$ grows. At the same time, we design a private algorithm that achieves the maximum possible utility, regardless of $k$, when the item histogram matches a prior prediction (for instance, from a previous data release) and degrades gracefully with the $\ell_\infty$ distance between the prediction and the actual histogram when the prediction is imperfect.
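As background for the setting, the sketch below shows a standard noisy-threshold baseline for private set union, not the utility-ratio-optimal mechanisms developed in the paper: each user contributes bounded weight to at most $k$ items, Laplace noise is added to the resulting counts, and only items whose noisy count clears a threshold are released. The noise scale and threshold here are illustrative and would need to be calibrated to the exact $(\epsilon, \delta)$ guarantee.

```python
import numpy as np
from collections import Counter

def dp_set_union_baseline(user_bags, k, eps, threshold, seed=0):
    """Noisy-threshold baseline for differentially private set union.

    Each user contributes weight 1 to at most k distinct items, so adding or
    removing a user changes at most k counts, each by at most 1. Items are
    released only if their Laplace-noised count exceeds `threshold`, which
    suppresses items held by very few users. Illustrative only: a deployment
    must calibrate the noise scale and threshold to the target (eps, delta).
    """
    rng = np.random.default_rng(seed)
    counts = Counter()
    for bag in user_bags:
        for item in list(dict.fromkeys(bag))[:k]:   # enforce the per-user bound of k items
            counts[item] += 1.0
    released = set()
    for item, c in counts.items():
        noisy = c + rng.laplace(scale=k / eps)      # scale reflects up to k changed counts
        if noisy > threshold:
            released.add(item)
    return released

# Example: three users, bags of size at most k=2.
bags = [["apple", "pear"], ["apple", "plum"], ["apple", "pear"]]
print(dp_set_union_baseline(bags, k=2, eps=1.0, threshold=3.0))
```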
Block level verification accelerates speculative decoding
Ziteng Sun
Uri Mendlovic
Yaniv Leviathan
Asaf Aharoni
Ahmad Beirami
2025
Abstract
Speculative decoding has been shown to be an effective method for lossless acceleration of large language models during inference. In each iteration, the algorithm first uses a smaller model to draft a block of tokens. The tokens are then verified by the large model in parallel, and only a subset of them is kept to guarantee that the distribution of the final output is identical to the distribution of the large model. In prior speculative decoding work, draft verification is performed token by token, independently. Somewhat surprisingly, we show that this approach is sub-optimal. We propose a simple, easy-to-implement draft verification algorithm that provides additional wall-clock speedup by verifying entire blocks jointly, without incurring additional computation cost or requiring additional draft tokens. We show that the proposed mechanism is never worse than standard token-level verification and is optimal in the expected number of accepted tokens. We also provide a variant verification algorithm which might be of independent interest. We empirically evaluate our block-level verification algorithm on a wide range of tasks and datasets, and observe consistent improvements in wall-clock speedup compared to the standard token-level verification algorithm. While the improvements are not large, the change is minimal and adds no code complexity or other overhead. We recommend that block verification be used by default in all speculative decoding implementations.
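For reference, the sketch below shows the standard token-by-token verification rule of Leviathan et al. that block-level verification improves upon; the paper's block verification algorithm itself accepts or rejects the drafted block jointly and is not reproduced here.

```python
import numpy as np

def token_level_verify(draft_tokens, draft_dists, target_dists, rng):
    """Standard token-by-token speculative decoding verification (the baseline).

    draft_dists[i] / target_dists[i] are the full next-token distributions of
    the draft and target models at position i, and draft_tokens[i] was sampled
    from draft_dists[i]. Each token is accepted with probability
    min(1, p(x) / q(x)); on the first rejection a replacement token is drawn
    from the normalized residual max(p - q, 0). (The extra "bonus" token drawn
    when every draft token is accepted is omitted for brevity.)
    """
    output = []
    for i, tok in enumerate(draft_tokens):
        p, q = np.asarray(target_dists[i]), np.asarray(draft_dists[i])
        if rng.random() < min(1.0, p[tok] / q[tok]):
            output.append(tok)                          # token accepted
        else:
            residual = np.maximum(p - q, 0.0)
            residual /= residual.sum()
            output.append(int(rng.choice(len(p), p=residual)))
            break                                       # stop at the first rejection
    return output
```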
FedAQT: Accurate Quantized Training with Federated Learning
Renkun Ni
Yonghui Xiao
Oleg Rybakov
Phoenix Meadowlark
Tom Goldstein
2024
Abstract
Federated learning has been widely used to train automatic speech recognition models, where the training procedure is decentralized to client devices to avoid data privacy concerns by keeping the training data local. However, the limited computation resources on client devices prevent training with large models. Recently, quantization-aware training has shown the potential to train a quantized neural network with performance similar to the full-precision model while keeping the model size small and making inference faster. However, these quantization methods do not save memory during training, since they still keep the full-precision model. To address this issue, we propose a new quantized training framework for federated learning that saves memory by training with quantized variables directly on local devices. We empirically show that our method achieves comparable WER while using only 60% of the memory of the full-precision model.
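As a small illustration of the kind of quantization involved, the sketch below performs symmetric per-tensor int8 quantization of a weight tensor. Which variables FedAQT quantizes on-device, and how gradients are handled during training, are described in the paper and not reproduced here.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor int8 quantization: int8 values plus one float scale.

    Storing weights this way uses roughly a quarter of the memory of float32.
    Illustrative only; this is not the exact FedAQT on-device training recipe.
    """
    scale = np.max(np.abs(w)) / 127.0 + 1e-12
    return np.round(w / scale).astype(np.int8), np.float32(scale)

def dequantize_int8(w_q, scale):
    """Recover an approximate float32 tensor for use in the forward pass."""
    return w_q.astype(np.float32) * scale
```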
Efficient Language Model Architectures for Differentially Private Federated Learning
Zheng Xu
Yanxiang Zhang
Privacy Regulation and Protection in Machine Learning Workshop at ICLR 2024 (to appear)
Abstract
Cross-device federated learning (FL) is a technique that trains a model on data distributed across typically millions of edge devices without the data ever leaving the devices. SGD is the standard client optimizer for on-device training in cross-device FL, favored for its memory and computational efficiency. However, in centralized training of neural language models, adaptive optimizers are preferred because they offer improved stability and performance. In light of this, we ask whether language models can be modified so that they can be trained efficiently with SGD client optimizers, and we answer this affirmatively. We propose a scale-invariant \emph{Coupled Input Forget Gate} (SI CIFG) recurrent network, obtained by modifying the sigmoid and tanh activations in the recurrent cell, and show in large-scale experiments that this new model converges faster and achieves better utility than the standard CIFG recurrent model in cross-device FL. We further show that the proposed scale-invariant modification also helps in federated learning of larger transformer models. Finally, we demonstrate that the scale-invariant modification is also compatible with other non-adaptive algorithms. In particular, our results suggest an improved privacy-utility trade-off in federated learning with differential privacy.
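For context, the sketch below shows one step of a standard CIFG recurrent cell, an LSTM variant in which the input gate is coupled to the forget gate as $i_t = 1 - f_t$; the paper's SI CIFG further replaces the sigmoid and tanh activations with scale-invariant variants, which are not reproduced here.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cifg_step(x, h, c, Wf, Uf, bf, Wo, Uo, bo, Wc, Uc, bc):
    """One step of a standard CIFG cell (input gate coupled to the forget gate)."""
    f = sigmoid(x @ Wf + h @ Uf + bf)          # forget gate
    o = sigmoid(x @ Wo + h @ Uo + bo)          # output gate
    c_cand = np.tanh(x @ Wc + h @ Uc + bc)     # candidate cell state
    c_new = f * c + (1.0 - f) * c_cand         # coupled input gate: i_t = 1 - f_t
    h_new = o * np.tanh(c_new)                 # hidden state / cell output
    return h_new, c_new
```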
Scaling Language Model Size in Cross-Device Federated Learning
FL4NLP@ACL2022 (to appear)
Abstract
Most studies in cross-device federated learning focus on small models, due to the server-client communication and on-device computation bottlenecks. In this work, we leverage various techniques for mitigating these bottlenecks to train larger language models in cross-device federated learning. With systematic applications of partial model training, quantization, efficient transfer learning, and communication-efficient optimizers, we are able to train a 21M-parameter Transformer that achieves the same perplexity as a similarly sized LSTM with ~10x smaller client-to-server communication cost, and 11% lower perplexity than smaller LSTMs commonly studied in the literature.
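One representative ingredient is reducing communication by quantizing client updates; the sketch below shows unbiased stochastic uniform quantization, which is illustrative of the kind of technique combined in the paper rather than its exact recipe.

```python
import numpy as np

def stochastic_quantize(update, num_bits=8, seed=0):
    """Unbiased stochastic uniform quantization of a client model update.

    Values are mapped onto 2**num_bits evenly spaced levels with randomized
    rounding, so the dequantized result equals the input in expectation while
    needing far fewer bits per coordinate to transmit to the server.
    """
    rng = np.random.default_rng(seed)
    levels = 2 ** num_bits - 1
    lo, hi = float(update.min()), float(update.max())
    scaled = (update - lo) / (hi - lo + 1e-12) * levels
    lower = np.floor(scaled)
    q = lower + (rng.random(update.shape) < (scaled - lower))  # randomized rounding
    return q / levels * (hi - lo) + lo                         # dequantized update
```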
Abstract
We propose a practical maximum-likelihood-estimation (MLE) framework for regression as an alternative to the typical approach of Empirical Risk Minimization (ERM) over a specific loss metric. Our approach is better suited to capturing inductive biases in datasets and can output post-hoc estimators at inference time that optimize different types of loss metrics. We present theoretical evidence (in the fixed design setting) that our approach is always competitive with ERM over the loss metric and, in many practical scenarios, can be much superior. For time series forecasting, we propose an end-to-end MLE-based training and inference approach that can flexibly capture various inductive biases and optimize prediction accuracy for a variety of typical loss metrics, without having to choose a specific loss metric at training time. We demonstrate empirically that our method, instantiated with a well-designed general-purpose likelihood, can obtain superior performance over ERM for a variety of time-series forecasting and regression datasets with different inductive biases and data distributions.
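The idea of decoupling training from the inference-time metric can be illustrated with a small sketch (a hypothetical helper, not the paper's implementation): after fitting a predictive distribution by MLE, the point prediction reported at inference time is simply the statistic that is optimal for the chosen loss metric.

```python
import numpy as np

def post_hoc_estimate(predictive_samples, metric="mse", tau=0.9):
    """Optimal point prediction under a learned predictive distribution.

    `predictive_samples` are draws from a distribution fit by maximum likelihood.
    The mean minimizes expected squared error, the median minimizes expected
    absolute error, and the tau-quantile minimizes the pinball (quantile) loss,
    so different metrics can be served from a single trained model.
    """
    s = np.asarray(predictive_samples, dtype=float)
    if metric == "mse":
        return float(s.mean())
    if metric == "mae":
        return float(np.median(s))
    if metric == "pinball":
        return float(np.quantile(s, tau))
    raise ValueError(f"unknown metric: {metric}")
```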
Abstract
In distributed learning settings such as federated learning, the training algorithm can be potentially biased towards different clients. Mohri et al. (2019) proposed a domain-agnostic learning algorithm, where the model is optimized for any target distribution formed by a mixture of the client distributions, in order to overcome this bias. They further proposed an algorithm for the cross-silo federated learning setting, where the number of clients is small. We consider this problem in the cross-device setting, where the number of clients is much larger. We propose a communication-efficient distributed algorithm called Agnostic Federated Averaging (AgnosticFedAvg) to minimize the domain-agnostic objective proposed in (Mohri et al., 2019), which is amenable to privacy mechanisms such as secure aggregation. We highlight two types of naturally occurring domains in federated learning and argue that AgnosticFedAvg performs well on both. To demonstrate the practical effectiveness of AgnosticFedAvg, we report positive results for large-scale language modeling tasks in both simulation and live experiments, where the latter involve training language models for a Spanish virtual keyboard for millions of user devices.
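For reference, the domain-agnostic objective of Mohri et al. (2019) that AgnosticFedAvg minimizes can be written (up to notation) as a min-max problem over mixtures of the $p$ client (or domain) distributions:

$$\min_{w} \; \max_{\lambda \in \Lambda} \; \sum_{k=1}^{p} \lambda_k \, \mathcal{L}_k(w),$$

where $\mathcal{L}_k(w)$ is the expected loss of model $w$ on the $k$-th distribution and $\Lambda$ is a subset of the probability simplex over the $p$ domains.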
A Field Guide to Federated Optimization
Jianyu Wang
Zheng Xu
Gauri Joshi
Maruan Al-Shedivat
Galen Andrew
A. Salman Avestimehr
Katharine Daly
Deepesh Data
Suhas Diggavi
Hubert Eichner
Advait Gadhikar
Antonious M. Girgis
Filip Hanzely
Chaoyang He
Samuel Horvath
Martin Jaggi
Tara Javidi
Satyen Chandrakant Kale
Sai Praneeth Karimireddy
Jakub Konečný
Sanmi Koyejo
Tian Li
Peter Richtarik
Karan Singhal
Virginia Smith
Mahdi Soltanolkotabi
Weikang Song
Sebastian Stich
Ameet Talwalkar
Hongyi Wang
Blake Woodworth
Honglin Yuan
Manzil Zaheer
Mi Zhang
Tong Zhang
Chunxiang (Jake) Zheng
Chen Zhu
arXiv (2021)
Abstract
Federated learning and analytics are a distributed approach for collaboratively learning models (or statistics) from decentralized data, motivated by and designed for privacy protection. The distributed learning process can be formulated as solving federated optimization problems, which emphasize communication efficiency, data heterogeneity, compatibility with privacy and system requirements, and other constraints that are not primary considerations in other problem settings. This paper provides recommendations and guidelines on formulating, designing, evaluating and analyzing federated optimization algorithms through concrete examples and practical implementation, with a focus on conducting effective simulations to infer real-world performance. The goal of this work is not to survey the current literature, but to inspire researchers and practitioners to design federated learning algorithms that can be used in various practical applications.
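As a concrete anchor for the federated optimization formulation discussed above, here is a minimal sketch of one round of generalized FedAvg (clients train locally, the server aggregates weighted model deltas); the `client_update` function is a placeholder for any local optimizer, such as a few steps of SGD.

```python
import numpy as np

def fedavg_round(global_weights, client_datasets, client_update):
    """One round of (generalized) FedAvg, the canonical federated optimization template.

    `client_update(weights, data)` is a placeholder for local training that returns
    the locally updated weights. The server aggregates the clients' model deltas,
    weighted by the amount of local data, and applies the averaged update.
    """
    total_examples = sum(len(d) for d in client_datasets)
    aggregate_delta = np.zeros_like(global_weights)
    for data in client_datasets:
        local_weights = client_update(np.copy(global_weights), data)   # local training
        weight = len(data) / total_examples                            # example-weighted average
        aggregate_delta += weight * (local_weights - global_weights)
    return global_weights + aggregate_delta                            # server update
```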