Úlfar Erlingsson
Úlfar Erlingsson has worked across many areas of computer systems, security, privacy and machine learning. He is currently focused on the security of Cloud software.
Authored Publications
Sort By
Amplification by Shuffling: From Local to Central Differential Privacy via Anonymity
Vitaly Feldman
Ilya Mironov
Ananth Raghunathan
Kunal Talwar
Abhradeep Thakurta
ACM-SIAM Symposium on Discrete Algorithms (SODA) (2019)
Preview abstract
Sensitive statistics are often collected across sets of users, with repeated collection of reports done over time. For example, trends in users' private preferences or software usage may be monitored via such reports. We study the collection of such statistics in the local differential privacy (LDP) model, and describe an algorithm whose privacy cost is polylogarithmic in the number of changes to a user's value.
More fundamentally—by building on anonymity of the users' reports—we also demonstrate how the privacy cost of our LDP algorithm can actually be much lower when viewed in the central model of differential privacy. We show, via a new and general privacy amplification technique, that any permutation-invariant algorithm satisfying ε-local differential privacy will satisfy O(ε log(1/δ)/√n, δ)-central differential privacy. By this, we explain how the high noise and √n overhead of LDP protocols is a consequence of them being significantly more private in the central model. As a practical corollary, our results imply that several LDP-based industrial deployments may have much lower privacy cost than their advertised ε would indicate—at least if reports are anonymized.
View details
Preview abstract
This paper describes a testing methodology for quantitatively assessing the risk of \emph{unintended memorization} of rare or unique sequences in generative sequence models---a common type of neural network. Such models are sometimes trained on sensitive data (e.g., the text of users' private messages); our methodology allows deep-learning to choose configurations that minimize memorization during training, thereby benefiting privacy.
In experiments, we show that unintended memorization is a persistent, hard-to-avoid issue that can have serious consequences. Specifically, if not addressed during training, we show that new, efficient procedures can allow extracting unique, secret sequences such as credit card numbers from trained models. We also show that our testing strategy is practical and easy-to-apply, e.g., by describing its use for quantitatively preventing data exposure in a production, commercial neural network---a predictive email-composition assistant trained on millions of users' email messages.
View details
Reducing Permission Requests in Mobile Apps
Martin Pelikan
Giles Hogben
Proceedings of ACM Internet Measurement Conference (IMC) (2019)
Preview abstract
Users of mobile apps sometimes express discomfort or concerns with what they see as unnecessary or intrusive permission requests by certain apps. However encouraging mobile app developers to request fewer permissions is challenging because there are many reasons why permissions are requested; furthermore, prior work has shown it is hard to disambiguate the purpose of a particular permission with high certainty. In this work we describe a novel, algorithmic mechanism intended to discourage mobile-app developers from asking for unnecessary permissions. Developers are incentivized by an automated alert, or "nudge", shown in the Google Play Console when their apps ask for permissions that are requested by very few functionally-similar apps---in other words, by their competition. Empirically, this incentive is effective, with significant developer response since its deployment. Permissions have been redacted by 59% of apps that were warned, and this attenuation has occurred broadly across both app categories and app popularity levels. Importantly, billions of users' app installs from the Google Play have benefited from these redactions
View details
Preview abstract
This paper describes a testing methodology for quantitatively assessing the risk that rare or unique training-data sequences are unintentionally memorized by generative sequence models---a common type of machine-learning model. Because such models are sometimes trained on sensitive data (e.g., the text of users' private messages), this methodology can benefit privacy by allowing deep-learning practitioners to select means of training that minimize such memorization.
In experiments, we show that unintended memorization is a persistent, hard-to-avoid issue that can have serious consequences. Specifically, for models trained without consideration of memorization, we describe new, efficient procedures that can extract unique, secret sequences, such as credit card numbers. We show that our testing strategy is a practical and easy-to-use first line of defense, e.g., by describing its application to quantitatively limit data exposure in Google's Smart Compose, a commercial text-completion neural network trained on millions of users' email messages.
View details
A General Approach to Adding Differential Privacy to Iterative Training Procedures
Galen Andrew
Ilya Mironov
Steve Chien
NIPS (2018)
Preview abstract
In this work we address the practical challenges of training machine learning models on privacy-sensitive datasets by introducing a modular approach that minimizes changes to training algorithms, provides a variety of configuration strategies for the privacy mechanism, and then isolates and simplifies the critical logic that computes the final privacy guarantees. A key challenge is that training algorithms often require estimating many different quantities (vectors) from the same set of examples --- for example, gradients of different layers in a deep learning architecture, as well as metrics and batch normalization parameters. Each of these may have different properties like dimensionality, magnitude, and tolerance to noise. By extending previous work on the Moments Accountant for the subsampled Gaussian mechanism, we can provide privacy for such heterogeneous sets of vectors, while also structuring the approach to minimize software engineering challenges.
View details
Scalable Private Learning with PATE
Ilya Mironov
Ananth Raghunathan
Kunal Talwar
International Conference on Learning Representations (ICLR) (2018)
Preview abstract
The rapid adoption of machine learning has increased concerns about the privacy implications of machine learning models trained on sensitive data, such as medical records or other personal information. To address those concerns, one promising approach is Private Aggregation of Teacher Ensembles, or PATE, which transfers to a "student" model the knowledge of an ensemble of "teacher" models, with intuitive privacy provided by training teachers on disjoint data and strong privacy guaranteed by noisy aggregation of teachers’ answers. However, PATE has so far been evaluated only on simple classification tasks like MNIST, leaving unclear its utility when applied to larger-scale learning tasks and real-world datasets.
In this work, we show how PATE can scale to learning tasks with large numbers of output classes and uncurated, imbalanced training data with errors. For this, we introduce new noisy aggregation mechanisms for teacher ensembles that are more selective and add less noise, and prove their tighter differential-privacy guarantees. Our new mechanisms build on two insights: the chance of teacher consensus is increased by using more concentrated noise and, lacking consensus, no answer need be given to a student. The consensus answers used are more likely to be correct, offer better intuitive privacy, and incur lower-differential privacy cost. Our evaluation shows our mechanisms improve on the original PATE on all measures, and scale to larger tasks with both high utility and very strong privacy (ε < 1.0).
View details
Semi-supervised Knowledge Transfer for Deep Learning from Private Training Data
Nicolas Papernot
Ian Goodfellow
Kunal Talwar
Proceedings of the International Conference on Learning Representations (2017)
Preview abstract
Some machine learning applications involve training data that is sensitive, such as the medical histories of patients in a clinical trial. A model may inadvertently and implicitly store some of its training data; careful analysis of the model may therefore reveal sensitive information.
To address this problem, we demonstrate a generally applicable approach to providing strong privacy guarantees for training data: Private Aggregation of Teacher Ensembles (PATE). The approach combines, in a black-box fashion, multiple models trained with disjoint datasets, such as records from different subsets of users. Because they rely directly on sensitive data, these models are not published, but instead used as “teachers” for a “student” model. The student learns to predict an output chosen by noisy voting among all of the teachers, and cannot directly access an individual teacher or the underlying data or parameters. The student’s privacy properties can be understood both intuitively (since no single teacher and thus no single dataset dictates the student’s training) and formally, in terms of differential privacy. These properties hold even if an adversary can not only query the student but also inspect its internal workings.
Compared with previous work, the approach imposes only weak assumptions on how teachers are trained: it applies to any model, including non-convex models like DNNs. We achieve state-of-the-art privacy/utility trade-offs on MNIST and SVHN thanks to an improved privacy analysis and semi-supervised learning.
View details
On the Protection of Private Information in Machine Learning Systems: Two Recent Approaches
Ian Goodfellow
Ilya Mironov
Kunal Talwar
Li Zhang
Proceedings of 30th IEEE Computer Security Foundations Symposium (CSF) (2017), pp. 1-6
Preview abstract
The recent, remarkable growth of machine learning has led to intense interest in the privacy of the data on which machine learning relies, and to new techniques for preserving privacy. However, older ideas about privacy may well remain valid and useful. This note reviews two recent works on privacy in the light of the wisdom of some of the early literature, in particular the principles distilled by Saltzer and Schroeder in the 1970s.
View details
Prochlo: Strong Privacy for Analytics in the Crowd
Andrea Bittau
Ilya Mironov
Ananth Raghunathan
David Lie
Ushasree Kode
Julien Tinnes
Bernhard Seefeld
Proceedings of the Symposium on Operating Systems Principles (SOSP) (2017), pp. 441-459
Preview abstract
The large-scale monitoring of computer users’ software activities has become commonplace, e.g., for application telemetry, error reporting, or demographic profiling. This paper describes a principled systems architecture—Encode, Shuffle, Analyze (ESA)—for performing such monitoring with high utility while also protecting user privacy. The ESA design, and its Prochlo implementation, are informed by our practical experiences with an existing, large deployment of privacy-preserving software monitoring.
With ESA, the privacy of monitored users’ data is guaranteed by its processing in a three-step pipeline. First, the data is encoded to control scope, granularity, and randomness. Second, the encoded data is collected in batches subject to a randomized threshold, and blindly shuffled, to break linkability and to ensure that individual data items get “lost in the crowd” of the batch. Third, the anonymous, shuffled data is analyzed by a specific analysis engine that further prevents statistical inference attacks on analysis results.
ESA extends existing best-practice methods for sensitive-data analytics, by using cryptography and statistical techniques to make explicit how data is elided and reduced in precision, how only common-enough, anonymous data is analyzed, and how this is done for only specific, permitted purposes. As a result, ESA remains compatible with the established workflows of traditional database analysis.
Strong privacy guarantees, including differential privacy, can be established at each processing step to defend against malice or compromise at one or more of those steps. Prochlo develops new techniques to harden those steps, including the Stash Shuffle, a novel scalable and efficient oblivious-shuffling algorithm based on Intel’s SGX, and new applications of cryptographic secret sharing and blinding. We describe ESA and Prochlo, as well as experiments that validate their ability to balance utility and privacy.
View details
Data-driven software security: Models and methods
IEEE Computer Security Foundations Symposium (2016)
Preview abstract
For computer software, our security models, policies,
mechanisms, and means of assurance were primarily conceived
and developed before the end of the 1970’s. However,
since that time, software has changed radically: it is thousands
of times larger, comprises countless libraries, layers, and services,
and is used for more purposes, in far more complex ways. It is
worthwhile to revisit our core computer security concepts. For
example, it is unclear whether the Principle of Least Privilege
can help dictate security policy, when software is too complex for
either its developers or its users to explain its intended behavior.
One possibility is to take an empirical, data-driven approach
to modern software, and determine its exact, concrete behavior
via comprehensive, online monitoring. Such an approach can be
a practical, effective basis for security—as demonstrated by its
success in spam and abuse fighting—but its use to constrain
software behavior raises many questions. In particular, three
questions seem critical. First, can we efficiently monitor the
details of how software is behaving, in the large? Second, is it
possible learn those details without intruding on users’ privacy?
Third, are those details a good foundation for security policies
that constrain how software should behave?
This paper outlines what a data-driven model for software
security could look like, and describes how the above three
questions can be answered affirmatively. Specifically, this paper
briefly describes methods for efficient, detailed software monitoring,
as well as methods for learning detailed software statistics
while providing differential privacy for its users, and, finally, how
machine learning methods can help discover users’ expectations
for intended software behavior, and thereby help set security
policy. Those methods can be adopted in practice, even at very
large scales, and demonstrate that data-driven software security
models can provide real-world benefits.
View details