Ananda Theertha Suresh

Ananda Theertha Suresh

Ananda Theertha Suresh is a research scientist at Google. He obtained PhD from University of California, San Diego where he was advised by Prof. Alon Orlitsky. His research interests lie in the intersection of machine learning, information theory, and statistics. More details can be found at theertha.info
Authored Publications
Sort By
  • Title
  • Title, descending
  • Year
  • Year, descending
    Preview abstract Federated learning has been widely used to train automatic speech recognition models, where the training procedure is decentralized to client devices to avoid data privacy concerns by keeping the training data locally. However, the limited computation resources on client devices prevent training with large models. Recently, quantization-aware training has shown the potential to train a quantized neural network with similar performance to the full-precision model while keeping the model size small and inference faster. However, these quantization methods will not save memory during training since they still keep the full-precision model. To address this issue, we propose a new quantization training framework for federated learning which saves the memory usage by training with quantized variables directly on local devices. We empirically show that our method can achieve comparable WER while only using 60% memory of the full-precision model. View details
    Efficient Language Model Architectures for Differentially Private Federated Learning
    Yanxiang Zhang
    Privacy Regulation and Protection in Machine Learning Workshop at ICLR 2024(2024) (to appear)
    Preview abstract Cross-device federated learning (FL) is a technique that trains a model on data distributed across typically millions of edge devices without data ever leaving the devices. SGD is the standard client optimizer for on device training in cross-device FL, favored for its memory and computational efficiency. However, in centralized training of neural language models, adaptive optimizers are preferred as they offer improved stability and performance. In light of this, we ask if language models can be modified such that they can be efficiently trained with SGD client optimizers and answer this affirmatively. We propose a scale-invariant \emph{Coupled Input Forget Gate} (SI CIFG) recurrent network by modifying the sigmoid and tanh activations in the recurrent cell and show that this new model converges faster and achieves better utility than the standard CIFG recurrent model in cross-device FL in large scale experiments. We further show that the proposed scale invariant modification also helps in federated learning of larger transformer models. Finally, we demonstrate the scale invariant modification is also compatible with other non-adaptive algorithms. Particularly, our results suggest an improved privacy utility trade-off in federated learning with differential privacy. View details
    Preview abstract We propose a practical maximum-likelihood-estimation framework for regression as an alternative to the typical approach of Empirical Risk Minimization (ERM) over a specific loss metric. Our approach is better suited to capture inductive biases in datasets, and can output post-hoc estimators at inference time that can optimize different types of loss metrics. We present theoretical evidence (in the fixed design setting) to demonstrate that our approach is always competitive with using ERM over the loss metric, and in many practical scenarios can be much superior to ERM. For time series forecasting, we propose an end-to-end MLE based training and inference approach that can flexibly capture various inductive biases, and optimize prediction accuracy for a variety of typical loss metrics, without having to choose a specific loss metric at training time. We demonstrate empirically that our method instantiated with a well-designed general purpose likelihood can obtain superior performance over ERM for a variety of time-series forecasting and regression datasets with different inductive biases and data distributions. View details
    Preview abstract Most studies in cross-device federated learning focus on small models, due to the server-client communication and on-device computation bottlenecks. In this work, we leverage various techniques for mitigating these bottlenecks to train larger language models in cross-device federated learning. With systematic applications of partial model training, quantization, efficient transfer learning, and communication-efficient optimizers, we are able to train a 21M parameter Transformer that achieves the same perplexity as that of a similarly sized LSTM with ~10x smaller client-to-server communication cost and 11% lower perplexity than smaller LSTMs commonly studied in literature. View details
    Preview abstract We study multiple-source domain adaptation, when the learner has access to abundant labeled data from multiple source domains and limited labeled data from the target domain. We analyze existing algorithms and propose an instance-optimal approach based on model selection. We provide efficient algorithms and empirically demonstrate the benefits of our approach. View details
    A Field Guide to Federated Optimization
    Jianyu Wang
    Gauri Joshi
    Maruan Al-Shedivat
    Galen Andrew
    A. Salman Avestimehr
    Katharine Daly
    Deepesh Data
    Suhas Diggavi
    Hubert Eichner
    Advait Gadhikar
    Antonious M. Girgis
    Filip Hanzely
    Chaoyang He
    Samuel Horvath
    Martin Jaggi
    Tara Javidi
    Sai Praneeth Karimireddy
    Jakub Konečný
    Sanmi Koyejo
    Tian Li
    Peter Richtarik
    Karan Singhal
    Virginia Smith
    Mahdi Soltanolkotabi
    Weikang Song
    Sebastian Stich
    Ameet Talwalkar
    Hongyi Wang
    Blake Woodworth
    Honglin Yuan
    Mi Zhang
    Tong Zhang
    Chunxiang (Jake) Zheng
    Chen Zhu
    arxiv(2021)
    Preview abstract Federated learning and analytics are a distributed approach for collaboratively learning models (or statistics) from decentralized data, motivated by and designed for privacy protection. The distributed learning process can be formulated as solving federated optimization problems, which emphasize communication efficiency, data heterogeneity, compatibility with privacy and system requirements, and other constraints that are not primary considerations in other problem settings. This paper provides recommendations and guidelines on formulating, designing, evaluating and analyzing federated optimization algorithms through concrete examples and practical implementation, with a focus on conducting effective simulations to infer real-world performance. The goal of this work is not to survey the current literature, but to inspire researchers and practitioners to design federated learning algorithms that can be used in various practical applications. View details
    FedJAX: Federated learning simulation with JAX
    Ke Wu
    1st NeurIPS Workshop on New Frontiers in Federated Learning (NFFL 2021)(2021)
    Preview abstract Federated learning is a machine learning technique that enables training across decentralized data. Recently, federated learning has become an active area of research due to an increased focus on privacy and security. In light of this, a variety of open source federated learning libraries have been developed and released. We introduce FedJAX, a JAX-based open source library for federated learning simulations that emphasizes ease-of-use in research. With its simple primitives for implementing federated learning algorithms, prepackaged datasets, models and algorithms, and fast simulation speed, FedJAX aims to make developing and evaluating federated algorithms faster and easier for researchers. Our benchmark results show that FedJAX can be used to train models with federated averaging on the EMNIST dataset in a few minutes and the Stack Overflow dataset in roughly an hour with standard hyperparameters using TPUs. View details
    Preview abstract Weighted finite automata (WFA) are often used to represent probabilistic models, such as n- gram language models, since they are efficient for recognition tasks in time and space. The probabilistic source to be represented as a WFA, however, may come in many forms. Given a generic probabilistic model over sequences, we propose an algorithm to approximate it as a weighted finite automaton such that the Kullback-Leiber divergence between the source model and the WFA target model is minimized. The proposed algorithm involves a counting step and a difference of convex optimization step, both of which can be performed efficiently. We demonstrate the usefulness of our approach on various tasks, including distilling n-gram models from neural models, building compact language models, and building open-vocabulary character models. The algorithms used for these experiments are available in an open-source software library. View details
    Preview abstract In distributed learning settings such as federated learning, the training algorithm can be potentially biased towards different clients. Mohri et al. (2019) proposed a domain-agnostic learning algorithm, where the model is optimized for any target distribution formed by a mixture of the client distributions in order to overcome this bias. They further proposed an algorithm for the cross-silo federated learning setting, where the number of clients is small. We consider this problem in the cross-device setting, where the number of clients is much larger. We propose a communication-efficient distributed algorithm called Agnostic Federated Averaging (or AgnosticFedAvg) to minimize the domain-agnostic objective proposed in (Mohri et al., 2019), which is amenable to other private mechanisms such as secure aggregation. We highlight two types of naturally occurring domains in federated learning and argue that AgnosticFedAvg performs well on both. To demonstrate the practical effectiveness of AgnosticFedAvg, we report positive results for large-scale language modeling tasks in both simulation and live experiments, where the latter involves training language models for Spanish virtual keyboard for millions of user devices. View details
    Preview abstract We present a theoretical and algorithmic study of the multiple-source domain adaptation problem in the common scenario where the learner has access only to a limited amount of labeled target data, but where he has at his disposal a large amount of labeled data from multiple source domains. We show that a new family algorithms based on model selection ideas benefit from very favorable guarantees in this scenario and discuss some theoretical obstacles affecting some alternative techniques. We also report the results of several experiments with our algorithms that demonstrate their practical effectiveness in several tasks View details