Úlfar Erlingsson
Úlfar Erlingsson has worked across many areas of computer systems, security, privacy, and machine learning. He is currently focused on the security of cloud software.
Authored Publications
Reducing Permission Requests in Mobile Apps
Martin Pelikan
Giles Hogben
Proceedings of ACM Internet Measurement Conference (IMC) (2019)
Users of mobile apps sometimes express discomfort or concerns with what they see as unnecessary or intrusive permission requests by certain apps. However, encouraging mobile app developers to request fewer permissions is challenging because there are many reasons why permissions are requested; furthermore, prior work has shown it is hard to disambiguate the purpose of a particular permission with high certainty. In this work we describe a novel, algorithmic mechanism intended to discourage mobile-app developers from asking for unnecessary permissions. Developers are incentivized by an automated alert, or "nudge", shown in the Google Play Console when their apps ask for permissions that are requested by very few functionally-similar apps---in other words, by their competition. Empirically, this incentive is effective, with significant developer response since its deployment: permissions have been redacted by 59% of apps that were warned, and this attenuation has occurred broadly across both app categories and app popularity levels. Importantly, billions of users' app installs from Google Play have benefited from these redactions.
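As a rough sketch of the peer-comparison idea behind the nudge, the snippet below flags any permission that few functionally similar apps request. The peer grouping, the 5% rarity threshold, and all names here are illustrative placeholders, not the deployed Play Console logic.

```python
from collections import Counter
from typing import Dict, List, Set

def unusual_permissions(app_permissions: Set[str],
                        peer_permissions: List[Set[str]],
                        rarity_threshold: float = 0.05) -> Dict[str, float]:
    """Return permissions this app requests that are rare among its peers.

    `peer_permissions` holds the permission sets of functionally similar apps
    (the app's "competition"); the 5% threshold is made up for illustration.
    """
    counts = Counter(p for peers in peer_permissions for p in peers)
    n = max(len(peer_permissions), 1)
    return {p: counts[p] / n
            for p in app_permissions
            if counts[p] / n < rarity_threshold}

# Example: a camera-style app that also asks for fine location,
# which almost none of its peers request.
peers = [{"android.permission.CAMERA"} for _ in range(99)]
peers.append({"android.permission.CAMERA", "android.permission.ACCESS_FINE_LOCATION"})
app = {"android.permission.CAMERA", "android.permission.ACCESS_FINE_LOCATION"}
print(unusual_permissions(app, peers))
# -> {'android.permission.ACCESS_FINE_LOCATION': 0.01}
```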
This paper describes a testing methodology for quantitatively assessing the risk of unintended memorization of rare or unique sequences in generative sequence models---a common type of neural network. Such models are sometimes trained on sensitive data (e.g., the text of users' private messages); our methodology allows deep-learning practitioners to choose configurations that minimize memorization during training, thereby benefiting privacy.
In experiments, we show that unintended memorization is a persistent, hard-to-avoid issue that can have serious consequences. Specifically, if not addressed during training, we show that new, efficient procedures can allow extracting unique, secret sequences such as credit card numbers from trained models. We also show that our testing strategy is practical and easy-to-apply, e.g., by describing its use for quantitatively preventing data exposure in a production, commercial neural network---a predictive email-composition assistant trained on millions of users' email messages.
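For intuition, here is a minimal sketch of the exposure-style test the methodology is built around: insert a randomly chosen canary into the training data, then measure how highly the trained model ranks it among all candidate canaries. The model here is a stand-in (`fake_log_perplexity`), and the canary format is hypothetical.

```python
import math
import random
from typing import Callable, List

def exposure(canary: str,
             candidates: List[str],
             log_perplexity: Callable[[str], float]) -> float:
    """Exposure of an inserted canary: log2 of the candidate-space size minus
    log2 of the canary's rank when candidates are ordered by the model's
    log-perplexity (lower log-perplexity means more likely, more memorized)."""
    ranked = sorted(candidates, key=log_perplexity)
    rank = ranked.index(canary) + 1              # 1-based rank of the canary
    return math.log2(len(candidates)) - math.log2(rank)

# Toy candidate space: "my PIN is DDDD" for every 4-digit PIN; one of them is
# the canary that was (hypothetically) inserted into the training data.
candidates = [f"my PIN is {i:04d}" for i in range(10_000)]
canary = random.choice(candidates)

# Stand-in for a trained model: pretend it assigns the canary an unusually
# low log-perplexity, as a model that memorized it would.
def fake_log_perplexity(s: str) -> float:
    return 0.0 if s == canary else 10.0 + random.random()

print(f"exposure: {exposure(canary, candidates, fake_log_perplexity):.1f} "
      f"of {math.log2(len(candidates)):.1f} possible bits")
```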
Amplification by Shuffling: From Local to Central Differential Privacy via Anonymity
Vitaly Feldman
Ilya Mironov
Ananth Raghunathan
Kunal Talwar
Abhradeep Thakurta
ACM-SIAM Symposium on Discrete Algorithms (SODA) (2019)
Sensitive statistics are often collected across sets of users, with repeated collection of reports done over time. For example, trends in users' private preferences or software usage may be monitored via such reports. We study the collection of such statistics in the local differential privacy (LDP) model, and describe an algorithm whose privacy cost is polylogarithmic in the number of changes to a user's value.
More fundamentally—by building on anonymity of the users' reports—we also demonstrate how the privacy cost of our LDP algorithm can actually be much lower when viewed in the central model of differential privacy. We show, via a new and general privacy amplification technique, that any permutation-invariant algorithm satisfying ε-local differential privacy will satisfy (O(ε√(log(1/δ))/√n), δ)-central differential privacy. By this, we explain how the high noise and √n overhead of LDP protocols is a consequence of them being significantly more private in the central model. As a practical corollary, our results imply that several LDP-based industrial deployments may have much lower privacy cost than their advertised ε would indicate—at least if reports are anonymized.
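To make the stated bound concrete, here is a tiny numeric illustration; the constant hidden by the O(·) is taken to be 1 purely for illustration, so the numbers indicate scale rather than an exact guarantee.

```python
import math

def amplified_epsilon(local_epsilon: float, n: int, delta: float, c: float = 1.0) -> float:
    """Illustrative evaluation of the stated O(ε·sqrt(log(1/δ))/sqrt(n)) central-DP
    bound; the constant `c` stands in for the O(·) factor and is set to 1 here."""
    return c * local_epsilon * math.sqrt(math.log(1.0 / delta)) / math.sqrt(n)

# One anonymized report from each of a million users, each ε=0.5 locally private.
print(amplified_epsilon(local_epsilon=0.5, n=1_000_000, delta=1e-6))
# ≈ 0.0019: far below the per-user local ε of 0.5, illustrating the amplification.
```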
Scalable Private Learning with PATE
Ilya Mironov
Ananth Raghunathan
Kunal Talwar
International Conference on Learning Representations (ICLR) (2018)
The rapid adoption of machine learning has increased concerns about the privacy implications of machine learning models trained on sensitive data, such as medical records or other personal information. To address those concerns, one promising approach is Private Aggregation of Teacher Ensembles, or PATE, which transfers to a "student" model the knowledge of an ensemble of "teacher" models, with intuitive privacy provided by training teachers on disjoint data and strong privacy guaranteed by noisy aggregation of teachers’ answers. However, PATE has so far been evaluated only on simple classification tasks like MNIST, leaving unclear its utility when applied to larger-scale learning tasks and real-world datasets.
In this work, we show how PATE can scale to learning tasks with large numbers of output classes and uncurated, imbalanced training data with errors. For this, we introduce new noisy aggregation mechanisms for teacher ensembles that are more selective and add less noise, and prove their tighter differential-privacy guarantees. Our new mechanisms build on two insights: the chance of teacher consensus is increased by using more concentrated noise and, lacking consensus, no answer need be given to a student. The consensus answers used are more likely to be correct, offer better intuitive privacy, and incur lower differential-privacy cost. Our evaluation shows our mechanisms improve on the original PATE on all measures, and scale to larger tasks with both high utility and very strong privacy (ε < 1.0).
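Very roughly, the selective aggregation can be pictured as below: answer a student query only when a noisy check indicates strong teacher consensus, and then release the noisy plurality vote. The threshold and noise scales are illustrative placeholders, and the sketch omits the paper's differential-privacy accounting.

```python
import numpy as np
from typing import Optional

def selective_noisy_aggregate(votes: np.ndarray,
                              threshold: float = 150.0,
                              sigma_check: float = 150.0,
                              sigma_answer: float = 40.0,
                              rng: Optional[np.random.Generator] = None) -> Optional[int]:
    """votes[c] = number of teachers voting for class c.

    Step 1: noisy consensus check; if it fails, no label is released (and,
    per the paper's insight, comparatively little privacy is spent).
    Step 2: otherwise release the argmax of the Gaussian-noised vote counts.
    All noise scales and the threshold are illustrative placeholders."""
    rng = rng or np.random.default_rng()
    if votes.max() + rng.normal(0.0, sigma_check) < threshold:
        return None                                  # weak consensus: stay silent
    noisy = votes + rng.normal(0.0, sigma_answer, size=votes.shape)
    return int(np.argmax(noisy))

# 250 teachers with strong consensus on class 3.
votes = np.zeros(10)
votes[3], votes[7] = 230, 20
print(selective_noisy_aggregate(votes))              # usually 3; None if the check fails
```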
This paper describes a testing methodology for quantitatively assessing the risk that rare or unique training-data sequences are unintentionally memorized by generative sequence models---a common type of machine-learning model. Because such models are sometimes trained on sensitive data (e.g., the text of users' private messages), this methodology can benefit privacy by allowing deep-learning practitioners to select means of training that minimize such memorization.
In experiments, we show that unintended memorization is a persistent, hard-to-avoid issue that can have serious consequences. Specifically, for models trained without consideration of memorization, we describe new, efficient procedures that can extract unique, secret sequences, such as credit card numbers. We show that our testing strategy is a practical and easy-to-use first line of defense, e.g., by describing its application to quantitatively limit data exposure in Google's Smart Compose, a commercial text-completion neural network trained on millions of users' email messages.
A General Approach to Adding Differential Privacy to Iterative Training Procedures
Galen Andrew
Ilya Mironov
Steve Chien
NIPS (2018)
In this work we address the practical challenges of training machine learning models on privacy-sensitive datasets by introducing a modular approach that minimizes changes to training algorithms, provides a variety of configuration strategies for the privacy mechanism, and then isolates and simplifies the critical logic that computes the final privacy guarantees. A key challenge is that training algorithms often require estimating many different quantities (vectors) from the same set of examples --- for example, gradients of different layers in a deep learning architecture, as well as metrics and batch normalization parameters. Each of these may have different properties like dimensionality, magnitude, and tolerance to noise. By extending previous work on the Moments Accountant for the subsampled Gaussian mechanism, we can provide privacy for such heterogeneous sets of vectors, while also structuring the approach to minimize software engineering challenges.
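A minimal numpy sketch of the core privatization step the paper modularizes: clip each estimated quantity to its own norm bound, add correspondingly scaled Gaussian noise, and average. The clip norms, noise multiplier, and batch here are made up, and the privacy accounting (the extended Moments Accountant) is omitted entirely.

```python
import numpy as np
from typing import Dict, List, Optional

def clip_and_noise_average(per_example: List[Dict[str, np.ndarray]],
                           clip_norms: Dict[str, float],
                           noise_multiplier: float = 1.1,
                           rng: Optional[np.random.Generator] = None
                           ) -> Dict[str, np.ndarray]:
    """Privatize a heterogeneous set of per-example vectors (e.g., gradients of
    different layers, metrics) in Gaussian-mechanism style: each named quantity
    gets its own clipping norm and matching noise scale.  Computing the
    resulting (epsilon, delta) guarantee is omitted from this sketch."""
    rng = rng or np.random.default_rng()
    n = len(per_example)
    result = {}
    for name, bound in clip_norms.items():
        clipped = []
        for example in per_example:
            v = example[name]
            factor = min(1.0, bound / (np.linalg.norm(v) + 1e-12))  # per-example L2 clipping
            clipped.append(v * factor)
        total = np.sum(clipped, axis=0)
        total += rng.normal(0.0, noise_multiplier * bound, size=total.shape)  # calibrated noise
        result[name] = total / n
    return result

# A batch of 4 examples, each contributing two quantities of different shape.
batch = [{"layer1_grad": np.random.randn(8), "loss_metric": np.random.randn(1)}
         for _ in range(4)]
print(clip_and_noise_average(batch, clip_norms={"layer1_grad": 1.0, "loss_metric": 0.1}))
```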
Semi-supervised Knowledge Transfer for Deep Learning from Private Training Data
Nicolas Papernot
Ian Goodfellow
Kunal Talwar
Proceedings of the International Conference on Learning Representations (2017)
Some machine learning applications involve training data that is sensitive, such as the medical histories of patients in a clinical trial. A model may inadvertently and implicitly store some of its training data; careful analysis of the model may therefore reveal sensitive information.
To address this problem, we demonstrate a generally applicable approach to providing strong privacy guarantees for training data: Private Aggregation of Teacher Ensembles (PATE). The approach combines, in a black-box fashion, multiple models trained with disjoint datasets, such as records from different subsets of users. Because they rely directly on sensitive data, these models are not published, but instead used as “teachers” for a “student” model. The student learns to predict an output chosen by noisy voting among all of the teachers, and cannot directly access an individual teacher or the underlying data or parameters. The student’s privacy properties can be understood both intuitively (since no single teacher and thus no single dataset dictates the student’s training) and formally, in terms of differential privacy. These properties hold even if an adversary can not only query the student but also inspect its internal workings.
Compared with previous work, the approach imposes only weak assumptions on how teachers are trained: it applies to any model, including non-convex models like DNNs. We achieve state-of-the-art privacy/utility trade-offs on MNIST and SVHN thanks to an improved privacy analysis and semi-supervised learning.
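As a minimal sketch of the noisy voting the student learns from, the snippet below counts teacher votes, perturbs the counts with Laplace noise, and returns the winning class; the noise scale (and everything else here) is illustrative rather than the paper's tuned configuration.

```python
import numpy as np
from typing import Optional

def noisy_teacher_vote(teacher_predictions: np.ndarray,
                       num_classes: int,
                       gamma: float = 0.05,
                       rng: Optional[np.random.Generator] = None) -> int:
    """Aggregate teacher predictions by noisy plurality vote: count votes per
    class, add Laplace noise of scale 1/gamma to each count, and return the
    argmax.  Smaller gamma means more noise and stronger privacy per query;
    the value here is illustrative only."""
    rng = rng or np.random.default_rng()
    counts = np.bincount(teacher_predictions, minlength=num_classes).astype(float)
    counts += rng.laplace(0.0, 1.0 / gamma, size=num_classes)
    return int(np.argmax(counts))

# 100 teachers trained on disjoint data shards label one student query.
teacher_predictions = np.array([7] * 83 + [1] * 17)   # most teachers say "7"
print(noisy_teacher_vote(teacher_predictions, num_classes=10))   # usually 7
```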
On the Protection of Private Information in Machine Learning Systems: Two Recent Approaches
Ian Goodfellow
Ilya Mironov
Kunal Talwar
Li Zhang
Proceedings of the 30th IEEE Computer Security Foundations Symposium (CSF) (2017), pp. 1-6
The recent, remarkable growth of machine learning has led to intense interest in the privacy of the data on which machine learning relies, and to new techniques for preserving privacy. However, older ideas about privacy may well remain valid and useful. This note reviews two recent works on privacy in the light of the wisdom of some of the early literature, in particular the principles distilled by Saltzer and Schroeder in the 1970s.
Prochlo: Strong Privacy for Analytics in the Crowd
Andrea Bittau
Ilya Mironov
Ananth Raghunathan
David Lie
Ushasree Kode
Julien Tinnes
Bernhard Seefeld
Proceedings of the Symposium on Operating Systems Principles (SOSP) (2017), pp. 441-459
The large-scale monitoring of computer users’ software activities has become commonplace, e.g., for application telemetry, error reporting, or demographic profiling. This paper describes a principled systems architecture—Encode, Shuffle, Analyze (ESA)—for performing such monitoring with high utility while also protecting user privacy. The ESA design, and its Prochlo implementation, are informed by our practical experiences with an existing, large deployment of privacy-preserving software monitoring.
With ESA, the privacy of monitored users’ data is guaranteed by its processing in a three-step pipeline. First, the data is encoded to control scope, granularity, and randomness. Second, the encoded data is collected in batches subject to a randomized threshold, and blindly shuffled, to break linkability and to ensure that individual data items get “lost in the crowd” of the batch. Third, the anonymous, shuffled data is analyzed by a specific analysis engine that further prevents statistical inference attacks on analysis results.
ESA extends existing best-practice methods for sensitive-data analytics, by using cryptography and statistical techniques to make explicit how data is elided and reduced in precision, how only common-enough, anonymous data is analyzed, and how this is done for only specific, permitted purposes. As a result, ESA remains compatible with the established workflows of traditional database analysis.
Strong privacy guarantees, including differential privacy, can be established at each processing step to defend against malice or compromise at one or more of those steps. Prochlo develops new techniques to harden those steps, including the Stash Shuffle, a novel scalable and efficient oblivious-shuffling algorithm based on Intel’s SGX, and new applications of cryptographic secret sharing and blinding. We describe ESA and Prochlo, as well as experiments that validate their ability to balance utility and privacy.
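The three-step pipeline can be caricatured in a few lines of toy code; all cryptography, the SGX-based Stash Shuffle, the randomized thresholding, and any differential-privacy noise are omitted, so only the encode / shuffle / analyze shape and the "lost in the crowd" idea survive, and the threshold of 10 is made up.

```python
import random
from collections import Counter
from typing import List, Tuple

def encode(url: str) -> str:
    """Encode step: reduce granularity (keep only the domain) so each report
    carries no more detail than the analysis needs."""
    return url.split("/")[2] if "://" in url else url

def shuffle(reports: List[str], crowd_threshold: int = 10) -> List[str]:
    """Shuffle step: randomly permute reports to break linkability with their
    senders, and drop values too rare to be 'lost in the crowd' of the batch."""
    random.shuffle(reports)
    counts = Counter(reports)
    return [r for r in reports if counts[r] >= crowd_threshold]

def analyze(reports: List[str]) -> List[Tuple[str, int]]:
    """Analyze step: aggregate the anonymous, shuffled reports."""
    return Counter(reports).most_common()

# Toy telemetry: which sites users visited.
raw = (["https://example.com/a"] * 40
       + ["https://news.example/b"] * 25
       + ["https://rare.example/x"])
print(analyze(shuffle([encode(r) for r in raw])))
# -> [('example.com', 40), ('news.example', 25)]; the rare value is dropped.
```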
Building a RAPPOR with the Unknown: Privacy-Preserving Learning of Associations and Data Dictionaries
Giulia Fanti
Vasyl Pihur
Proceedings on Privacy Enhancing Technologies (PoPETS), issue 3 (2016)
Techniques based on randomized response enable the collection of potentially sensitive data from clients in a privacy-preserving manner with strong local differential privacy guarantees. A recent such technology, RAPPOR, enables estimation of the marginal frequencies of a set of strings via privacy-preserving crowdsourcing. However, this original estimation process relies on a known dictionary of possible strings; in practice, this dictionary can be extremely large and/or unknown. In this paper, we propose a novel decoding algorithm for the RAPPOR mechanism that enables the estimation of "unknown unknowns," i.e., strings we do not know we should be estimating. To enable learning without explicit dictionary knowledge, we develop methodology for estimating the joint distribution of multiple variables collected with RAPPOR. Our contributions are not RAPPOR-specific, and can be generalized to other local differential privacy mechanisms for learning distributions of string-valued random variables.
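For context, a one-shot report in the style of RAPPOR's randomized response can be sketched as below: hash the value into a small Bloom filter, then flip each bit with the "permanent randomized response" probabilities. The bit count, hash count, and flip probability are illustrative, and the instantaneous randomization step is omitted.

```python
import hashlib
import random
from typing import List

def rappor_style_report(value: str,
                        num_bits: int = 32,
                        num_hashes: int = 2,
                        flip_prob: float = 0.5) -> List[int]:
    """Set Bloom-filter bits for `value`, then apply randomized response: each
    bit is forced to 1 with probability flip_prob/2, forced to 0 with
    probability flip_prob/2, and kept as-is otherwise.  These parameters are
    illustrative, not RAPPOR's deployed configuration."""
    bits = [0] * num_bits
    for i in range(num_hashes):
        digest = hashlib.sha256(f"{i}:{value}".encode()).digest()
        bits[int.from_bytes(digest[:4], "big") % num_bits] = 1
    report = []
    for b in bits:
        r = random.random()
        report.append(1 if r < flip_prob / 2 else 0 if r < flip_prob else b)
    return report

print(rappor_style_report("example.com"))
```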