Jump to Content
H. Brendan McMahan

H. Brendan McMahan

Authored Publications
Google Publications
Other Publications
Sort By
  • Title
  • Title, desc
  • Year
  • Year, desc
    Federated Learning of Gboard Language Models with Differential Privacy
    Yanxiang Zhang
    Galen Andrew
    Jesse Rosenstock
    Yuanbo Zhang
    ACL industry track (2023) (to appear)
    Preview abstract We train language models (LMs) with federated learning (FL) and differential privacy (DP) in the Google Keyboard (Gboard). We apply the DP-Follow-the-Regularized-Leader (DP-FTRL)~\citep{kairouz21b} algorithm to achieve meaningfully formal DP guarantees without requiring uniform sampling of client devices. To provide favorable privacy-utility trade-offs, we introduce a new client participation criterion and discuss the implication of its configuration in large scale systems. We show how quantile-based clip estimation~\citep{andrew2019differentially} can be combined with DP-FTRL to adaptively choose the clip norm during training or reduce the hyperparameter tuning in preparation for training. With the help of pretraining on public data, we train and deploy more than twenty Gboard LMs that achieve high utility and $\rho-$zCDP privacy guarantees with $\rho \in (0.2, 2)$, with two models additionally trained with secure aggregation~\citep{bonawitz2017practical}. We are happy to announce that all the next word prediction neural network LMs in Gboard now have DP guarantees, and all future launches of Gboard neural network LMs will require DP guarantees. We summarize our experience and provide concrete suggestions on DP training for practitioners. View details
    Learning to Generate Image Embeddings with User-level Differential Privacy
    Maxwell D. Collins
    Yuxiao Wang
    Sewoong Oh
    IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2023) (to appear)
    Preview abstract We consider training feature extractors with user-level differential privacy to map images to embeddings from large-scale supervised data. To achieve user-level differential privacy, federated learning algorithms are extended and applied to aggregate user partitioned data, together with sensitivity control and noise addition. We demonstrate a variant of federated learning algorithm with partial aggregation and private reconstruction can achieve strong privacy utility trade-offs. When a large scale dataset is provided, it is possible to train feature extractors with both strong utility and privacy guarantees by combining techniques such as public pretraining, virtual clients, and partial aggregation. View details
    A Field Guide to Federated Optimization
    Jianyu Wang
    Gauri Joshi
    Maruan Al-Shedivat
    Galen Andrew
    A. Salman Avestimehr
    Katharine Daly
    Deepesh Data
    Suhas Diggavi
    Hubert Eichner
    Advait Gadhikar
    Antonious M. Girgis
    Filip Hanzely
    Chaoyang He
    Samuel Horvath
    Martin Jaggi
    Tara Javidi
    Sai Praneeth Karimireddy
    Jakub Konečný
    Sanmi Koyejo
    Tian Li
    Peter Richtarik
    Virginia Smith
    Mahdi Soltanolkotabi
    Weikang Song
    Sebastian Stich
    Ameet Talwalkar
    Hongyi Wang
    Blake Woodworth
    Honglin Yuan
    Mi Zhang
    Tong Zhang
    Chunxiang (Jake) Zheng
    Chen Zhu
    arxiv (2021)
    Preview abstract Federated learning and analytics are a distributed approach for collaboratively learning models (or statistics) from decentralized data, motivated by and designed for privacy protection. The distributed learning process can be formulated as solving federated optimization problems, which emphasize communication efficiency, data heterogeneity, compatibility with privacy and system requirements, and other constraints that are not primary considerations in other problem settings. This paper provides recommendations and guidelines on formulating, designing, evaluating and analyzing federated optimization algorithms through concrete examples and practical implementation, with a focus on conducting effective simulations to infer real-world performance. The goal of this work is not to survey the current literature, but to inspire researchers and practitioners to design federated learning algorithms that can be used in various practical applications. View details
    Preview abstract Building privacy-preserving systems for machine learning and data science on decentralized data View details
    Preview abstract Federated learning is a distributed machine learning paradigm in which a large number of clients coordinate with a central server to learn a model without sharing their own training data. Due to the heterogeneity of the client datasets, standard federated optimization methods such as Federated Averaging (FedAvg) are often difficult to tune and exhibit unfavorable convergence behavior. In non-federated settings, adaptive optimization methods have had notable success in combating such issues. In this work, we propose federated versions of adaptive optimizers, including Adagrad, Yogi and Adam, and analyze their convergence in the presence of heterogeneous data for general nonconvex settings. Our results highlight the interplay between client heterogeneity and communication efficiency. We also perform extensive experiments on these methods and show that the use of adaptive optimizers can improve the performance of federated learning. View details
    Preview abstract This paper presents the first consumer-scale next-word prediction (NWP) model trained with Federated Learning (FL) while leveraging the Differentially Private Federated Averaging (DP-FedAvg) technique. There has been prior work on building practical FL infrastructure, including work demonstrating the feasibility of training language models on mobile devices using such infrastructure. It has also been shown (in simulations on a public corpus) that it is possible to train NWP models with user-level differential privacy using the DP-FedAvg algorithm. Nevertheless, training production-quality NWP models with DP-FedAvg in a real-world production environment on a heterogeneous fleet of mobile phones requires addressing numerous challenges. For instance, the coordinating central server has to keep track of the devices available at the start of each round and sample devices uniformly at random from them, while ensuring \emph{secrecy of the sample}, etc. Unlike all prior privacy-focused FL work of which we are aware, for the first time we demonstrate the deployment of a differentially private mechanism for the training of a production neural network in FL, as well as the instrumentation of the production training infrastructure to perform an end-to-end empirical measurement of unintended memorization. View details
    Preview abstract To improve real-world applications of machine learning, experienced modelers develop intuition about their datasets, their models, and how the two interact. Manual inspection of raw data—of representative samples, of outliers, of misclassifications—is an essential tool in a) identifying and fixing problems in the data, b) generating new modeling hypotheses, and c) assigning or refining human-provided labels. However, manual data inspection is risky for privacy-sensitive datasets, such as those representing the behavior of real-world individuals. Furthermore, manual data inspection is impossible in the increasingly important setting of federated learning, where raw examples are stored at the edge and the modeler may only access aggregated outputs such as metrics or model parameters. This paper demonstrates that generative models—trained using federated methods and with formal differential privacy guarantees—can be used effectively to debug data issues even when the data cannot be directly inspected. We explore these methods in applications to text with differentially private federated RNNs and to images using a novel algorithm for differentially private federated GANs. View details
    Federated Heavy Hitters with Differential Privacy
    Haicheng Sun
    Vivian (Wei) Li
    International Conference on Artificial Intelligence and Statistics (AISTATS) 2020
    Preview abstract The discovery of heavy hitters (most frequent items) in user-generated data streams drives improvements in the app and web ecosystems, but can incur substantial privacy risks if not done with care. To address these risks, we propose a distributed and privacy-preserving algorithm for discovering the heavy hitters in a population of user-generated data streams. We leverage the sampling property of our distributed algorithm to prove that it is inherently differentially private, without requiring additional noise. We also examine the trade-off between privacy and utility, and show that our algorithm provides excellent utility while also achieving strong privacy guarantees. A significant advantage of this approach is that it eliminates the need to centralize raw data while also avoiding the significant loss in utility incurred by local differential privacy. We validate our findings both theoretically, using worst-case analyses, and practically, using a Twitter dataset with 1.6M tweets and over 650k users. Finally, we carefully compare our approach to Apple's local differential privacy method for discovering heavy hitters. View details
    Privacy Amplification via Random Check-Ins
    Borja Balle
    Abhradeep Thakurta
    Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020
    Preview abstract Differentially Private Stochastic Gradient Descent (DP-SGD) forms a fundamental building block in many applications for learning over sensitive data. Two standard approaches, privacy amplification by subsampling, and privacy amplification by shuffling, permit adding lower noise in DP-SGD than via na\"{\i}ve schemes. A key assumption in both these approaches is that the elements in the data set can be uniformly sampled, or be uniformly permuted --- constraints that may become prohibitive when the data is processed in a decentralized or distributed fashion. In this paper, we focus on conducting iterative methods like DP-SGD in the setting of federated learning (FL) wherein the data is distributed among many devices (clients). Our main contribution is the random check-in distributed protocol, which crucially relies only on randomized participation decisions made locally and independently by each client. It has privacy/accuracy trade-offs similar to privacy amplification by subsampling/shuffling. However, our method does not require server-initiated communication, or even knowledge of the population size. To our knowledge, this is the first privacy amplification tailored for a distributed learning framework, and it may have broader applicability beyond FL. Along the way, we extend privacy amplification by shuffling to incorporate $(\epsilon,\delta)$-DP local randomizers, and exponentially improve its guarantees. In practical regimes, this improvement allows for similar privacy and utility using data from an order of magnitude fewer users. View details
    Federated Learning with Autotuned Communication-Efficient Secure Aggregation
    Fariborz Salehi
    Jakub Konečný
    Marco Gruteser
    Asilomar Conference on Signals, Systems, and Computers (2019)
    Preview abstract Federated Learning enables mobile devices to collaboratively learn a shared inference model while keeping all the training data on device, decoupling the ability to do machine learning from the need to store the data in the cloud. Existing work on federated learning with limited communication demonstrates how random rotation can enable users' model updates to be quantized much more efficiently, reducing the communication cost between users and the server. Meanwhile, secure aggregation enable the server to learn an aggregate of at least a threshold number of device's model contributions without observing any individual device's contribution in unaggregated form. In this paper, we highlight some of the challenges of setting the parameters for secure aggregation to achieve communication efficiency, especially in the the context of the aggressively quantized inputs enabled by random rotation. We then develop a recipe for auto-tuning communication-efficient secure aggregation, based on specific properties of random rotation and secure aggregation -- namely, the predictable distribution of vector entries post-rotation and the modular wrapping inherent in secure aggregation. We present both theoretical results and initial experiments. View details
    Towards Federated Learning at Scale: System Design
    Hubert Eichner
    Wolfgang Grieskamp
    Dzmitry Huba
    Vladimir Ivanov
    Chloé M Kiddon
    Jakub Konečný
    Stefano Mazzocchi
    Timon Van Overveldt
    David Petrou
    Jason Roselander
    SysML 2019
    Preview abstract Federated Learning is a distributed machine learning approach which enables training on a large corpus of data which never needs to leave user devices. We have spent some effort over the last two years building a scalable production system for FL. In this paper, we report about the resulting high-level design, sketching the challenges and the solutions, as well as touching the open problems and future directions. View details
    Advances and Open Problems in Federated Learning
    Brendan Avent
    Aurélien Bellet
    Mehdi Bennis
    Arjun Nitin Bhagoji
    Graham Cormode
    Rachel Cummings
    Rafael G.L. D'Oliveira
    Salim El Rouayheb
    David Evans
    Josh Gardner
    Adrià Gascón
    Phillip B. Gibbons
    Marco Gruteser
    Zaid Harchaoui
    Chaoyang He
    Lie He
    Zhouyuan Huo
    Justin Hsu
    Martin Jaggi
    Tara Javidi
    Gauri Joshi
    Mikhail Khodak
    Jakub Konečný
    Aleksandra Korolova
    Farinaz Koushanfar
    Sanmi Koyejo
    Tancrède Lepoint
    Yang Liu
    Prateek Mittal
    Richard Nock
    Ayfer Özgür
    Rasmus Pagh
    Ramesh Raskar
    Dawn Song
    Weikang Song
    Sebastian U. Stich
    Ziteng Sun
    Florian Tramèr
    Praneeth Vepakomma
    Jianyu Wang
    Li Xiong
    Qiang Yang
    Felix X. Yu
    Han Yu
    Arxiv (2019)
    Preview abstract Federated learning (FL) is a machine learning setting where many clients (e.g., mobile devices or whole organizations) collaboratively train a model under the orchestration of a central server (e.g., service provider), while keeping the training data decentralized. FL embodies the principles of focused data collection and minimization, and mitigates many of the systemic privacy risks and costs resulting from traditional, centralized machine learning and data science approaches. Motivated by the explosive growth in FL research, this paper discusses recent advances and presents a comprehensive list of open problems and challenges. View details
    Semi-Cyclic Stochastic Gradient Descent
    Hubert Eichner
    Tomer Koren
    Nathan Srebro
    Kunal Talwar
    Accepted to ICML 2019.
    Preview abstract We consider convex SGD updates with a block-cyclic structure, i.e. where each cycle consists of a small number of blocks, each with many samples from a possibly different, block-specific, distribution. This situation arises, e.g., in Federated Learning where the mobile devices available for updates at different times during the day have different characteristics. We show that such block-cyclic structure can significantly deteriorate the performance of SGD, but propose a simple approach that allows prediction with the same performance guarantees as for i.i.d., non-cyclic, sampling. View details
    Preview abstract The decentralized nature of federated learning makes detecting and defending against adversarial attacks a challenging task. This paper focuses on backdoor attacks in the federated learning setting, where the goal of the adversary is to reduce the performance of the model on targeted tasks while maintaining a good performance on the main task. Unlike existing works, we allow non-malicious clients to have correctly labeled samples from the targeted tasks. We conduct a comprehensive study of backdoor attacks and defenses for the EMNIST dataset, a real-life, user-partitioned, and non-iid dataset. We observe that in the absence of defenses, the performance of the attack largely depends on the fraction of adversaries present and the “complexity” of the targeted task. Moreover, we show that norm clipping and “weak” differential privacy mitigate the attacks without hurting the overall performance. We have implemented the attacks and defenses in TensorFlow Federated (TFF), a TensorFlow framework for federated learning. In open sourcing our code, our goal is to encourage researchers to contribute new attacks and defenses and evaluate them on standard federated datasets. View details
    Preview abstract Distributed stochastic gradient descent is an important subroutine in distributed learning. A setting of particular interest is when the clients are mobile devices, where two important concerns are communication efficiency and the privacy of the clients. Several recent works have focused on reducing the communication cost or introducing privacy guarantees, but none of the proposed communication efficient methods are known to be privacy preserving and none of the known privacy mechanisms are known to be communication efficient. To this end, we study algorithms that achieve both communication efficiency and differential privacy. For d variables and n \approx d clients, the proposed method uses \cO(\log \log(nd)) bits of communication per client per coordinate and ensures constant privacy. We also improve previous analysis of the \emph{Binomial mechanism} showing that it achieves nearly the same utility as the Gaussian mechanism, while requiring fewer representation bits, which can be of independent interest. View details
    Learning Differentially Private Recurrent Language Models
    Kunal Talwar
    Li Zhang
    International Conference on Learning Representations (ICLR) (2018)
    Preview abstract We demonstrate that it is possible to train large recurrent language models with user-level differential privacy guarantees with only a negligible cost in predictive accuracy. Our work builds on recent advances in the training of deep networks on user-partitioned data and privacy accounting for stochastic gradient descent. In particular, we add user-level privacy protection to the federated averaging algorithm, which makes "large step" updates from user-level data. Our work demonstrates that given a dataset with a sufficiently large number of users (a requirement easily met by even small internet-scale datasets), achieving differential privacy comes at the cost of increased computation, rather than in decreased utility as in most prior work. We find that our private LSTM language models are quantitatively and qualitatively similar to un-noised models when trained on a large dataset. View details
    Preview abstract In this work we address the practical challenges of training machine learning models on privacy-sensitive datasets by introducing a modular approach that minimizes changes to training algorithms, provides a variety of configuration strategies for the privacy mechanism, and then isolates and simplifies the critical logic that computes the final privacy guarantees. A key challenge is that training algorithms often require estimating many different quantities (vectors) from the same set of examples --- for example, gradients of different layers in a deep learning architecture, as well as metrics and batch normalization parameters. Each of these may have different properties like dimensionality, magnitude, and tolerance to noise. By extending previous work on the Moments Accountant for the subsampled Gaussian mechanism, we can provide privacy for such heterogeneous sets of vectors, while also structuring the approach to minimize software engineering challenges. View details
    Graph Oracle Models, Lower Bounds, and Gaps for Parallel Stochastic Optimization
    Blake Woodworth
    Jialei Wang
    Nathan Srebro
    Advances in Neural Information Processing Systems (NIPS) (2018)
    Preview abstract We suggest a general oracle-based framework that captures different parallel stochastic optimization settings described by a dependency graph, and derive generic lower bounds in terms of this graph. We then use the framework and derive lower bounds for several specific parallel optimization settings, including delayed updates and parallel processing with intermittent communication. We highlight gaps between lower and upper bounds on the oracle complexity, and cases where the “natural” algorithms are not known to be optimal. View details
    Preview abstract Communication on heterogeneous edge networks is a fundamental bottleneck in Federated Learning (FL), restricting both model capacity and user participation. To address this issue, we introduce two novel strategies to reduce communication costs: (1) the use of lossy compression on the global model sent server-to-client; and (2) Federated Dropout, which allows users to efficiently train locally on smaller subsets of the global model and also provides a reduction in both client-to-server communication and local computation. We empirically show that these strategies, combined with existing compression approaches for client-to-server communication, collectively provide up to a 14× reduction in server-to-client communication, a 1.7× reduction in local computation, and a 28× reduction in upload communication, all without degrading the quality of the final model. We thus comprehensively reduce FL's impact on client device resources, allowing higher capacity models to be trained, and a more diverse set of users to be reached. View details
    Preview abstract Motivated by the need for distributed optimization algorithms with low communication cost, we study communication efficient algorithms to perform distributed mean estimation. We study the scenarios in which each client sends one bit per dimension. We first show that for d dimensional data with n clients, a naive stochastic rounding approach yields a mean squared error Theta(d/n). We then show by applying a structured random rotation of the data (an O(dlogd) algorithm), the error can be reduced to O(logd/n). The methods we show in this paper do not depend on the distribution of the data. View details
    Communication-Efficient Learning of Deep Networks from Decentralized Data
    Eider Moore
    Seth Hampson
    Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS) (2017)
    Preview abstract Modern mobile devices have access to a wealth of data suitable for learning models, which in turn can greatly improve the user experience on the device. For example, language models can improve speech recognition and text entry, and image models can automatically select good photos. However, this rich data is often privacy sensitive, large in quantity, or both, which may preclude logging to the data center and training there using conventional approaches. We advocate an alternative that leaves the training data distributed on the mobile devices, and learns a shared model by aggregating locally-computed updates. We term this decentralized approach Federated Learning. We present a practical method for the federated learning of deep networks based on iterative model averaging, and conduct an extensive empirical evaluation, considering five different model architectures and four datasets. These experiments demonstrate the approach is robust to the unbalanced and non-IID data distributions that are a defining characteristic of this setting. Communication costs are the principal constraint, and we show a reduction in required communication rounds by 10-100x as compared to synchronized stochastic gradient descent. View details
    Preview abstract We present tools for the analysis of Follow-The-Regularized-Leader (FTRL), Dual Averaging, and Mirror Descent algorithms when the regularizer (equivalently, prox-function or learning rate schedule) is chosen adaptively based on the data. Adaptivity can be used to prove regret bounds that hold on every round, and also allows for data-dependent regret bounds as in AdaGrad-style algorithms (e.g., Online Gradient Descent with adaptive per-coordinate learning rates). We present results from a large number of prior works in a unified manner, using a modular and tight analysis that isolates the key arguments in easily re-usable lemmas. This approach strengthens previously known FTRL analysis techniques to produce bounds as tight as those achieved by potential functions or primal-dual analysis. Further, we prove a general and exact equivalence between an arbitrary adaptive Mirror Descent algorithm and a correspond- ing FTRL update, which allows us to analyze any Mirror Descent algorithm in the same framework. The key to bridging the gap between Dual Averaging and Mirror Descent algorithms lies in an analysis of the FTRL-Proximal algorithm family. Our regret bounds are proved in the most general form, holding for arbitrary norms and non-smooth regularizers with time-varying weight. View details
    Preview abstract We design a novel, communication-efficient, failure-robust protocol for secure aggregation of high-dimensional data. Our protocol allows a server to collect an aggregate of user-held data from mobile devices in a privacy-preserving manner, and can be used, for example, in a federated learning setting, to aggregate user-provided model updates for a deep neural network. We prove the security of our protocol in the honest-but-curious and malicious server settings, and show that privacy is preserved even if an arbitrarily chosen subset of users drop out at any time. We evaluate the efficiency of our protocol and show, by complexity analysis and a concrete implementation, that its runtime and communication overhead remain low even on large data sets and client pools. For 16-bit input values, our protocol offers 1.73× communication expansion for 2^10 users and 2^20-dimensional vectors, and 1.98× expansion for 2^14 users and 2^24-dimensional vectors. View details
    On the Protection of Private Information in Machine Learning Systems: Two Recent Approaches
    Úlfar Erlingsson
    Ian Goodfellow
    Ilya Mironov
    Kunal Talwar
    Li Zhang
    Proceedings of 30th IEEE Computer Security Foundations Symposium (CSF) (2017), pp. 1-6
    Preview abstract The recent, remarkable growth of machine learning has led to intense interest in the privacy of the data on which machine learning relies, and to new techniques for preserving privacy. However, older ideas about privacy may well remain valid and useful. This note reviews two recent works on privacy in the light of the wisdom of some of the early literature, in particular the principles distilled by Saltzer and Schroeder in the 1970s. View details
    Practical Secure Aggregation for Federated Learning on User-Held Data
    Vladimir Ivanov
    Ben Kreuter
    Antonio Marcedone
    Sarvar Patel
    NIPS Workshop on Private Multi-Party Machine Learning (2016)
    Preview abstract Secure Aggregation is a class of Secure Multi-Party Computation algorithms wherein a group of mutually distrustful parties u ∈ U each hold a private value x_u and collaborate to compute an aggregate value, such as the sum_{u∈U} x_u, without revealing to one another any information about their private value except what is learnable from the aggregate value itself. In this work, we consider training a deep neural network in the Federated Learning model, using distributed gradient descent across user-held training data on mobile devices, wherein Secure Aggregation protects the privacy of each user’s model gradient. We identify a combination of efficiency and robustness requirements which, to the best of our knowledge, are unmet by existing algorithms in the literature. We proceed to design a novel, communication-efficient Secure Aggregation protocol for high-dimensional data that tolerates up to 1/3 users failing to complete the protocol. For 16-bit input values, our protocol offers 1.73x communication expansion for 2^10 users and 2^20-dimensional vectors, and 1.98x expansion for 2^14 users and 2^24 dimensional vectors. View details
    Deep Learning with Differential Privacy
    Andy Chu
    Ian Goodfellow
    Ilya Mironov
    Kunal Talwar
    Li Zhang
    23rd ACM Conference on Computer and Communications Security (ACM CCS) (2016), pp. 308-318
    Preview abstract Machine learning techniques based on neural networks are achieving remarkable results in a wide variety of domains. Often, the training of models requires large, representative datasets, which may be crowdsourced and contain sensitive information. The models should not expose private information in these datasets. Addressing this goal, we develop new algorithmic techniques for learning and a refined analysis of privacy costs within the framework of differential privacy. Our implementation and experiments demonstrate that we can train deep neural networks with non-convex objectives, under a modest privacy budget, and at a manageable cost in software complexity, training efficiency, and model quality. View details
    Preview abstract We introduce a new and increasingly relevant setting for distributed optimization in machine learning, where the data defining the optimization are unevenly distributed over an extremely large number of nodes. The goal is to train a high-quality centralized model. We refer to this setting as Federated Optimization. In this setting, communication efficiency is of the utmost importance and minimizing the number of rounds of communication is the principal goal. A motivating example arises when we keep the training data locally on users' mobile devices instead of logging it to a data center for training. In federated optimization, the devices are used as compute nodes performing computation on their local data in order to update a global model. We suppose that we have extremely large number of devices in the network --- as many as the number of users of a given service, each of which has only a tiny fraction of the total data available. In particular, we expect the number of data points available locally to be much smaller than the number of devices. Additionally, since different users generate data with different patterns, it is reasonable to assume that no device has a representative sample of the overall distribution. We show that existing algorithms are not suitable for this setting, and propose a new algorithm which shows encouraging experimental results for sparse convex problems. This work also sets a path for future research needed in the context of federated optimization. View details
    Preview abstract Federated Learning is a machine learning setting where the goal is to train a high-quality centralized model with training data distributed over a large number of clients each with unreliable and relatively slow network connections. We consider learning algorithms for this setting where on each round, each client independently computes an update to the current model based on its local data, and communicates this update to a central server, where the client-side updates are aggregated to compute a new global model. The typical clients in this setting are mobile phones, and communication efficiency is of utmost importance. In this paper, we propose two ways to reduce the uplink communication costs. The proposed methods are evaluated on the application of training a deep neural network to perform image classification. Our best approach reduces the upload communication required to train a reasonable model by two orders of magnitude. View details
    Federated Optimization: Distributed Optimization Beyond the Datacenter
    Jakub Konečný
    NIPS Optimization for Machine Learning Workshop (2015), pp. 5
    Preview abstract We introduce a new and increasingly relevant setting for distributed optimization in machine learning, where the data defining the optimization are distributed (unevenly) over an extremely large number of nodes, but the goal remains to train a high-quality centralized model. We refer to this setting as Federated Optimization. In this setting, communication efficiency is of utmost importance. A motivating example for federated optimization arises when we keep the training data locally on users' mobile devices rather than logging it to a data center for training. Instead, the mobile devices are used as nodes performing computation on their local data in order to update a global model. We suppose that we have an extremely large number of devices in our network, each of which has only a tiny fraction of data available totally; in particular, we expect the number of data points available locally to be much smaller than the number of devices. Additionally, since different users generate data with different patterns, we assume that no device has a representative sample of the overall distribution. We show that existing algorithms are not suitable for this setting, and propose a new algorithm which shows encouraging experimental results. This work also sets a path for future research needed in the context of federated optimization. View details
    Unconstrained Online Linear Learning in Hilbert Spaces: Minimax Algorithms and Normal Approximations
    Francesco Orabona
    Proceedings of the 27th Annual Conference on Learning Theory (COLT) (2014)
    Preview abstract We study algorithms for online linear optimization in Hilbert spaces, focusing on the case where the player is unconstrained. We develop a novel characterization of a large class of minimax algorithms, recovering, and even improving, several previous results as immediate corollaries. Moreover, using our tools, we develop an algorithm that provides a regret bound of $O(U \sqrt{T \log( U \sqrt{T} \log^2 T +1)})$, where $U$ is the $L_2$ norm of an arbitrary comparator and both $T$ and $U$ are unknown to the player. This bound is optimal up to $\sqrt{\log \log T}$ terms. When $T$ is known, we derive an algorithm with an optimal regret bound (up to constant factors). For both the known and unknown $T$ case, a Normal approximation to the conditional value of the game proves to be the key analysis tool. View details
    Preview abstract We analyze new online gradient descent algorithms for distributed systems with large delays between gradient computations and the corresponding updates. Using insights from adaptive gradient methods, we develop algorithms that adapt not only to the sequence of gradients, but also to the precise update delays that occur. We first give an impractical algorithm that achieves a regret bound that precisely quantifies the impact of the delays. We then analyze AdaptiveRevision, an algorithm that is efficiently implementable and achieves comparable guarantees. The key algorithmic technique is appropriately and efficiently revising the learning rate used for previous gradient steps. Experimental results show when the delays grow large (1000 updates or more), our new algorithms perform significantly better than standard adaptive gradient methods. View details
    Estimation, Optimization, and Parallelism when Data is Sparse
    John C. Duchi
    Michael I. Jordan
    Advances in Neural Information Processing Systems (NIPS) (2013)
    Preview abstract We study stochastic optimization problems when the \emph{data} is sparse, which is in a sense dual to current perspectives on high-dimensional statistical learning and optimization. We highlight both the difficulties---in terms of increased sample complexity that sparse data necessitates---and the potential benefits, in terms of allowing parallelism and asynchrony in the design of algorithms. Concretely, we derive matching upper and lower bounds on the minimax rate for optimization and learning with sparse data, and we exhibit algorithms achieving these rates. We also show how leveraging sparsity leads to (still minimax optimal) parallel and asynchronous algorithms, providing experimental evidence complementing our theoretical results on several medium to large-scale learning tasks. View details
    Ad Click Prediction: a View from the Trenches
    Michael Young
    Dietmar Ebner
    Julian Grady
    Lan Nie
    Eugene Davydov
    Sharat Chikkerur
    Dan Liu
    Arnar Mar Hrafnkelsson
    Tom Boulos
    Jeremy Kubica
    Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD) (2013)
    Preview abstract Predicting ad click--through rates (CTR) is a massive-scale learning problem that is central to the multi-billion dollar online advertising industry. We present a selection of case studies and topics drawn from recent experiments in the setting of a deployed CTR prediction system. These include improvements in the context of traditional supervised learning based on an FTRL-Proximal online learning algorithm (which has excellent sparsity and convergence properties) and the use of per-coordinate learning rates. We also explore some of the challenges that arise in a real-world system that may appear at first to be outside the domain of traditional machine learning research. These include useful tricks for memory savings, methods for assessing and visualizing performance, practical methods for providing confidence estimates for predicted probabilities, calibration methods, and methods for automated management of features. Finally, we also detail several directions that did not turn out to be beneficial for us, despite promising results elsewhere in the literature. The goal of this paper is to highlight the close relationship between theoretical advances and practical engineering in this industrial setting, and to show the depth of challenges that appear when applying traditional machine learning methods in a complex dynamic system. View details
    Minimax Optimal Algorithms for Unconstrained Linear Optimization
    Jacob Abernethy
    Advances in Neural Information Processing Systems (NIPS) (2013)
    Preview abstract We design and analyze minimax-optimal algorithms for online linear optimization games where the player's choice is unconstrained. The player strives to minimize regret, the difference between his loss and the loss of a post-hoc benchmark strategy. While the standard benchmark is the loss of the best strategy chosen from a bounded comparator set, we consider a very broad range of benchmark functions. The problem is cast as a sequential multi-stage zero-sum game, and we give a thorough analysis of the minimax behavior of the game, providing characterizations for the value of the game, as well as both the player's and the adversary's optimal strategy. We show how these objects can be computed efficiently under certain circumstances, and by selecting an appropriate benchmark, we construct a novel hedging strategy for an unconstrained betting game. View details
    Large-Scale Learning with Less RAM via Randomization
    Michael Young
    Proceedings of the 30 International Conference on Machine Learning (ICML) (2013), pp. 10
    Preview abstract We reduce the memory footprint of popular large-scale online learning methods by projecting our weight vector onto a coarse discrete set using randomized rounding. Compared to standard 32-bit float encodings, this reduces RAM usage by more than 50% during training and by up to 95% when making predictions from a fixed model, with almost no loss in accuracy. We also show that randomized counting can be used to implement per-coordinate learning rates, improving model quality with little additional RAM. We prove these memory-saving methods achieve regret guarantees similar to their exact variants. Empirical evaluation confirms excellent performance, dominating standard approaches across memory versus accuracy tradeoffs. View details
    Preview abstract Some of the most compelling applications of online convex optimization, including online prediction and classification, are unconstrained: the natural feasible set is R^n. Existing algorithms fail to achieve sub-linear regret in this setting unless constraints on the comparator point x* are known in advance. We present algorithms that, without such prior knowledge, offer near-optimal regret bounds with respect to any choice of x*. In particular, regret with respect to x* = 0 is constant. We then prove lower bounds showing that our guarantees are near-optimal in this setting. View details
    Open Problem: Better Bounds for Online Logistic Regression
    COLT/ICML Joint Open Problem Session, JMLR: Workshop and Conference Proceedings (2012)
    Preview abstract Known algorithms applied to online logistic regression on a feasible set of L2 diameter D achieve regret bounds like O(eD log T) in one dimension, but we show a bound of O(sqrt(D) + log T) is possible in a binary 1-dimensional problem. Thus, we pose the following question: Is it possible to achieve a regret bound for online logistic regression that is O(poly(D)log(T))? Even if this is not possible in general, it would be interesting to have a bound that reduces to our bound in the one-dimensional case. View details
    Follow-the-Regularized-Leader and Mirror Descent: Equivalence Theorems and L1 Regularization
    Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS) (2011)
    Preview abstract We prove that many mirror descent algorithms for online convex optimization (such as online gradient descent) have an equivalent interpretation as follow-the-regularized-leader (FTRL) algorithms. This observation makes the relationships between many commonly used algorithms explicit, and provides theoretical insight on previous experimental observations. In particular, even though the FOBOS composite mirror descent algorithm handles L1 regularization explicitly, it has been observed that the FTRL-style Regularized Dual Averaging (RDA) algorithm is even more effective at producing sparsity. Our results demonstrate that the key difference between these algorithms is how they handle the cumulative L1 penalty. While FOBOS handles the $L_1$ term exactly on any given update, we show that it is effectively using subgradient approximations to the L1 penalty from previous rounds, leading to less sparsity than RDA, which handles the cumulative penalty in closed form. The FTRL-Proximal algorithm, which we introduce, can be seen as a hybrid of these two algorithms, and significantly outperforms both on a large, real-world dataset. View details
    Adaptive Bound Optimization for Online Convex Optimization
    Proceedings of the 23rd Annual Conference on Learning Theory (COLT) (2010)
    Preview abstract We introduce a new online convex optimization algorithm that adaptively chooses its regularization function based on the loss functions observed so far. This is in contrast to previous algorithms that use a fixed regularization function such as L2-squared, and modify it only via a single time-dependent parameter. Our algorithm's regret bounds are worst-case optimal, and for certain realistic classes of loss functions they are much better than existing bounds. These bounds are problem-dependent, which means they can exploit the structure of the actual problem instance. Critically, however, our algorithm does not need to know this structure in advance. Rather, we prove competitive guarantees that show the algorithm provides a bound within a constant factor of the best possible bound (of a certain functional form) in hindsight. View details
    Tighter Bounds for Multi-Armed Bandits with Expert Advice
    Proceedings of the 22nd Annual Conference on Learning Theory (COLT) (2009)
    Sleeping Experts and Bandits with Stochastic Action Availability and Adversarial Rewards
    Varun Kanade
    Proceedings of the 12th International Conference on Artificial Intelligence and Statistic (AISTATS) (2009)
    Robust Submodular Observation Selection
    Andreas Krause
    Carlos Guestrin
    Anupam Gupta
    Journal of Machine Learning Research (JMLR), vol. 9 (2008), pp. 2761-2801
    Selecting Observations Against Adversarial Objectives
    Andreas Krause
    Carlos Guestrin
    Anupam Gupta
    Advances in Neural Information Processing Systems (NIPS 2007)
    Efficiently Computing Minimax Expected-Size Confidence Regions
    Chad M. Schafer
    Jeff Schneider
    Proc. 24th ICML, ACM, Corvalis (2007), pp. 97-104
    A Fast Bundle-based Anytime Algorithm for Poker and other Convex Games
    Geoffrey J. Gordon
    Proceedings of the 11th International Conference on Artificial Intelligence and Statistics (AISTATS) (2007)
    A Unification of Extensive-Form Games and Markov Decision Processes
    Geoffrey J. Gordon
    AAAI 2007
    Robust Planning in Domains with Stochastic Outcomes, Adversaries, and Partial Observability
    Bounded Real-Time Dynamic Programming: RTDP with monotone upper bounds and performance guarantees
    Maxim Likhachev
    Geoffrey J. Gordon
    Proceedings of the 22nd International Conference on Machine Learning (ICML) (2005)
    Online convex optimization in the bandit setting: gradient descent without a gradient
    Abraham Flaxman
    Adam Tauman Kalai
    Proceedings of the Sixteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA) (2005)
    Generalizing Dijkstra's algorithm and Gaussian Elimination for solving MDPs
    Geoffrey J. Gordon
    Carnegie Mellon University (2005)
    Fast Exact Planning in Markov Decision Processes
    Geoffrey J. Gordon
    International Conference on Automated Planning and Scheduling (ICAPS) (2005)
    Online Geometric Optimization in the Bandit Setting Against an Adaptive Adversary
    Avrim Blum
    Proceedings of the Seventeenth Annual Conference on Learning Theory (COLT) (2004), pp. 109-123
    Multi-source spanning trees: algorithms for minimizing source eccentricities
    A. Proskurowski
    Discrete Applied Mathematics, vol. 137/2 (2003), pp. 213-222
    Planning in the Presence of Cost Functions Controlled By An Adversary
    Geoffrey J. Gordon
    Avrim Blum
    In Proceedings of the 20th International Conference on Machine Learning (ICML) (2003)