Kareem Amin
Authored Publications
Google’s Approach to Protecting Privacy in the Age of AI
Google, 1600 Amphitheatre Parkway, Mountain View, CA, 94043 (2025)
Preview abstract
AI products introduce new privacy challenges. Finding the right privacy solution is central to developing innovative products, especially as AI models increasingly handle user data. In this paper, we propose a framework for reasoning about privacy in AI, and discuss how Privacy Enhancing Technologies (PETs) enable novel user experiences by reducing privacy risks in the AI development lifecycle. We argue that privacy protections are not inherently at odds with utility; on the contrary, we discuss how building privacy into products from the start can create better, more trustworthy experiences for everyone.
View details
Escaping Collapse: The Strength of Weak Data for Large Language Model Training
Sara Babakniya
Advances in Neural Information Processing Systems 38 (NeurIPS 2025)
Preview abstract
Synthetically-generated data plays an increasingly large role in training large language models. However, while synthetic data has been found to be useful, studies have also shown that without proper curation it can cause LLM performance to plateau, or even "collapse", after many training iterations. In this paper, we formalize this question and develop a theoretical framework to investigate how much curation is needed in order to ensure that LLM performance continually improves. Our analysis is inspired by boosting, a classic machine learning technique that leverages a very weak learning algorithm to produce an arbitrarily good classifier. The approach we analyze subsumes many recently proposed methods for training LLMs on synthetic data, and thus our analysis sheds light on why they are successful and suggests opportunities for future improvement. We present experiments that validate our theory, and show that dynamically focusing labeling resources on the most challenging examples, in much the same way that boosting focuses the efforts of the weak learner, leads to improved performance.
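As a rough illustration of the boosting-style curation loop described above, the sketch below trains a small classifier and spends a fixed labeling budget each round on the synthetic candidates the current model is least certain about. The generator, labeling oracle, and budget are hypothetical stand-ins, not the paper's setup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def generate_candidates(n, d=2):
    """Stand-in for a synthetic-data generator (e.g. an LLM)."""
    return rng.normal(size=(n, d))

def oracle_label(X):
    """Stand-in for the scarce, high-quality curation/labeling resource."""
    return (X[:, 0] + X[:, 1] > 0).astype(int)

# Small curated seed set.
X_train = generate_candidates(20)
y_train = oracle_label(X_train)
model = LogisticRegression().fit(X_train, y_train)

budget_per_round = 10
for _ in range(5):
    candidates = generate_candidates(500)
    # Spend the labeling budget on the candidates the current model is
    # least sure about -- the boosting-style "hard" examples.
    probs = model.predict_proba(candidates)[:, 1]
    hardest = np.argsort(np.abs(probs - 0.5))[:budget_per_round]
    X_train = np.vstack([X_train, candidates[hardest]])
    y_train = np.concatenate([y_train, oracle_label(candidates[hardest])])
    model = LogisticRegression().fit(X_train, y_train)
```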
View details
Private prediction for large-scale synthetic text generation
Natalia Ponomareva
Findings of EMNLP 2024
Preview abstract
We present an approach for generating differentially private synthetic text using large language models (LLMs), via private prediction. In the private prediction framework, we only require the output synthetic data to satisfy differential privacy guarantees. This is in contrast to approaches that train a generative model on potentially sensitive user-supplied source data and seek to ensure the model itself is safe to release.
We prompt a pretrained LLM with source data, but ensure that next-token predictions are made with differential privacy guarantees. Previous work in this paradigm reported generating a small number of examples (<10) at reasonable privacy levels, an amount of data that is useful only for downstream in-context learning or prompting. In contrast, we make changes that allow us to generate thousands of high-quality synthetic data points, greatly expanding the set of potential applications. Our improvements come from an improved privacy analysis and a better private selection mechanism, which makes use of the equivalence between the softmax layer for sampling tokens in LLMs and the exponential mechanism. Furthermore, we introduce a novel use of public predictions via the sparse vector technique, in which we do not pay privacy costs for tokens that are predictable without sensitive data; we find this to be particularly effective for structured data.
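The softmax/exponential-mechanism equivalence mentioned above can be sketched in a few lines: sampling a token from logits rescaled by epsilon / (2 * sensitivity) is an exponential-mechanism draw with the logits as utility scores. The logits, epsilon, and sensitivity values below are illustrative assumptions; the paper's full privacy analysis, composition across tokens, and sparse vector step are not reproduced here.

```python
import numpy as np

def exponential_mechanism_token(logits, epsilon, sensitivity, rng):
    """Sample one token index via the exponential mechanism.

    With per-token logits playing the role of utility scores, sampling from
    softmax(logits * epsilon / (2 * sensitivity)) is exactly the exponential
    mechanism -- the equivalence the abstract refers to.
    """
    scaled = (epsilon / (2.0 * sensitivity)) * np.asarray(logits, dtype=float)
    scaled -= scaled.max()                       # numerical stability
    probs = np.exp(scaled) / np.exp(scaled).sum()
    return rng.choice(len(probs), p=probs)

rng = np.random.default_rng(0)
# Hypothetical next-token logits computed from prompts containing sensitive
# source data; epsilon and sensitivity here are illustrative values.
logits = [2.1, 0.3, -1.0, 4.2]
token = exponential_mechanism_token(logits, epsilon=1.0, sensitivity=1.0, rng=rng)
print("sampled token index:", token)
```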
View details
Preview abstract
Consider a setting where we wish to automate an expensive task with a machine learning algorithm using a limited labeling resource. In such settings, examples routed for labeling are often out of scope for the machine learning algorithm. For example, in a spam detection setting, human reviewers not only provide labeled data but are such high-quality detectors of spam that examples routed to them no longer require machine evaluation. A consequence is that the distribution of examples routed to the machine is intimately tied to the process generating labels. We introduce a formalization of this setting, and give an algorithm that simultaneously learns a model and decides when to request a label by leveraging ideas from both the abstention and active learning literatures. We prove an upper bound on the algorithm's label complexity and a matching lower bound for any algorithm in this setting. We conduct a thorough set of experiments, including an ablation study, to test different components of our algorithm. We demonstrate the effectiveness of an efficient version of our algorithm over margin sampling on a variety of datasets.
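A minimal sketch of the routing dynamic described in this abstract, assuming a margin-style confidence rule: low-confidence examples go to the human reviewer, who both labels and handles them, while the rest are handled by the model. The threshold, data, and update rule are placeholders, not the paper's algorithm.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)

def human_review(x):
    """Stand-in for the human reviewer, who both labels and handles the example."""
    return int(x[0] + x[1] > 0)

# Warm-start the model on a handful of human-labeled examples.
model = SGDClassifier(loss="log_loss", random_state=0)
X0 = rng.normal(size=(20, 2))
model.partial_fit(X0, [human_review(x) for x in X0], classes=[0, 1])

confidence_threshold = 0.7   # hypothetical routing cutoff
labels_requested = 0

for _ in range(1000):
    x = rng.normal(size=(1, 2))
    if model.predict_proba(x).max() < confidence_threshold:
        # Route to the human: we pay for a label, and the example never
        # reaches the machine at serving time, so the machine's input
        # distribution is shaped by this very routing rule.
        model.partial_fit(x, [human_review(x[0])])
        labels_requested += 1
    # Otherwise the model handles the example and no label is requested.

print("labels requested:", labels_requested)
```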
View details
Preview abstract
A differentially private algorithm guarantees privacy against an adversary that sees the output of the algorithm. We study pan-privacy, which guarantees privacy against an adversary that sees both the output and any single internal state of the algorithm during its computation. First, we motivate the single-intrusion assumption by showing that pan-privacy against multiple intrusions is equivalent to sequentially interactive local privacy. Next, we contextualize pan-privacy by analyzing the sample complexity of uniformity testing. We show that this sample complexity sits strictly between that of the central and (sequentially interactive) local models.
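For intuition, the classic noisy-state counter illustrates what pan-privacy asks of an algorithm: the internal state is perturbed from the start, so a single intrusion that inspects it sees only a differentially private view. This is a standard textbook construction, not the uniformity-testing algorithm analyzed in the paper.

```python
import numpy as np

class PanPrivateCounter:
    """Toy pan-private counter (a standard construction, for illustration only).

    The stored state is noisy from the moment it is created, so an adversary
    who inspects it once mid-stream sees a differentially private view; a
    second noise draw protects the published output. The privacy budget is
    split evenly between the two.
    """

    def __init__(self, epsilon, rng):
        self.epsilon = epsilon
        self.rng = rng
        self.state = rng.laplace(scale=2.0 / epsilon)   # noisy internal state

    def update(self, bit):
        self.state += bit                               # bit in {0, 1}

    def output(self):
        return self.state + self.rng.laplace(scale=2.0 / self.epsilon)

rng = np.random.default_rng(0)
counter = PanPrivateCounter(epsilon=1.0, rng=rng)
stream = rng.integers(0, 2, size=1000)
for bit in stream:
    counter.update(int(bit))
print("true count:", int(stream.sum()), " pan-private estimate:", round(counter.output(), 1))
```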
View details
Understanding the Effects of Batching in Online Active Learning
Afshin Rostamizadeh
Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics (2020)
Preview abstract
Online active learning (AL) algorithms often assume immediate access to a label once a query has been made. However, due to practical constraints, the labels of these queried examples are generally only available in "batches". In this work, we present a novel analysis for a generic class of batch online AL algorithms and reveal that the effects of batching are in fact mild, resulting only in an additional term in the label complexity that is linear in the batch size. To our knowledge, this provides the first theoretical justification for such algorithms, and we show how they can be applied to batch variants of three canonical online AL algorithms: IWAL, ORIWAL, and DHM. We also conduct an empirical study that corroborates the novel theoretical insights.
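The batching effect analyzed here can be pictured with a simple loop, assuming a margin-style query rule: queried examples accumulate until a batch fills, and only then do labels arrive and the model update. The rule and batch size below are illustrative, not the IWAL, ORIWAL, or DHM variants studied in the paper.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(1)
oracle = lambda x: int(x[0] - x[1] > 0)

model = SGDClassifier(loss="log_loss", random_state=1)
X0 = rng.normal(size=(20, 2))
model.partial_fit(X0, [oracle(x) for x in X0], classes=[0, 1])

batch_size = 25      # labels only arrive once a full batch has been queried
pending = []

for _ in range(2000):
    x = rng.normal(size=2)
    if model.predict_proba(x.reshape(1, -1)).max() < 0.7:   # margin-style query rule
        pending.append(x)
    if len(pending) == batch_size:
        X = np.array(pending)
        y = [oracle(row) for row in pending]                 # batched label arrival
        model.partial_fit(X, y)                              # one update per batch
        pending = []
```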
View details
Preview abstract
Differentially private learning algorithms protect individual participants in the training dataset by guaranteeing that their presence does not significantly change the resulting model. In order to make this promise, such algorithms need to know the maximum contribution that can be made by a single user: the more data an individual can contribute, the more noise will need to be added to protect them. While most existing analyses assume that the maximum contribution is known and fixed in advance (indeed, it is often assumed that each user contributes only a single example), we argue that in practice there is a meaningful choice to be made. On the one hand, if we allow users to contribute large amounts of data, we may end up adding excessive noise to protect a few outliers, even when the majority contribute only modestly. On the other hand, limiting users to small contributions keeps noise levels low, at the cost of potentially discarding significant amounts of excess data and thus introducing bias. Here, we characterize this trade-off for an empirical risk minimization setting, showing that in general there is a "sweet spot" that depends on measurable properties of the dataset, but that there is also a concrete cost to privacy that cannot be avoided simply by collecting more data.
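A toy example of this contribution trade-off, assuming a simple differentially private mean under replacement-style privacy: capping each user at k examples discards data (bias) when k is small, but forces a larger noise scale when k is large, since sensitivity grows with the cap. The cap, clipping bound, and sensitivity accounting below are illustrative assumptions, not the paper's ERM analysis.

```python
import numpy as np

def dp_mean_with_contribution_cap(user_data, k, epsilon, clip=1.0, rng=None):
    """Toy differentially private mean with a per-user contribution cap.

    Illustrative only: assumes replacement-style DP with the number of kept
    examples treated as fixed. A small cap k discards data (bias); a large
    cap inflates sensitivity, and hence the noise scale.
    """
    rng = rng or np.random.default_rng(0)
    kept = []
    for examples in user_data:
        kept.extend(np.clip(examples[:k], -clip, clip))   # excess data discarded
    kept = np.asarray(kept)
    sensitivity = 2.0 * clip * k / len(kept)   # one user changes at most k values
    return kept.mean() + rng.laplace(scale=sensitivity / epsilon)

rng = np.random.default_rng(0)
# Hypothetical users with very different amounts of data.
users = [rng.normal(loc=0.3, size=s) for s in (3, 60, 5, 8)]
for k in (1, 5, 60):
    print(f"cap k={k}: estimate {dp_mean_with_contribution_cap(users, k, epsilon=1.0):.3f}")
```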
View details
Preview abstract
We introduce a novel repeated Inverse Reinforcement Learning problem: the agent must act on behalf of a human in a sequence of tasks and wishes to minimize the number of tasks in which it surprises the human by acting suboptimally with respect to how the human would have acted. Each time the human is surprised, the human provides the agent with a demonstration of the desired behavior. We formalize this problem, including how the sequence of tasks is chosen, in a few different ways and provide some foundational results.
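The interaction protocol can be sketched as follows, with linear rewards over action features and a perceptron-style update standing in for the paper's actual algorithm: the agent acts with its current reward estimate, and whenever the human would have acted differently (a surprise), the human's demonstration is used to update the estimate.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
true_reward = rng.normal(size=d)   # the human's (unknown) reward weights
estimate = np.zeros(d)             # the agent's current guess

surprises = 0
for task in range(200):
    # Each task offers a small menu of candidate behaviours, summarized by features.
    actions = rng.normal(size=(4, d))
    agent_choice = actions[np.argmax(actions @ estimate)]
    human_choice = actions[np.argmax(actions @ true_reward)]
    if true_reward @ agent_choice < true_reward @ human_choice - 1e-9:
        # The human is surprised and demonstrates the desired behaviour;
        # the estimate moves toward preferring that behaviour.
        surprises += 1
        estimate += human_choice - agent_choice

print("tasks on which the human was surprised:", surprises)
```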
View details
Preview abstract
Motivated by real-time advertising exchanges, we analyze the problem of pricing inventory in a repeated posted-price auction. We consider both the case of a truthful buyer and that of a surplus-maximizing buyer, where the former makes decisions myopically on every round, and the latter may strategically react to our algorithm, forgoing short-term surplus in order to trick the algorithm into setting better prices in the future. We further assume a buyer's valuation of a good is a function of a context vector that describes the good being sold. We give the first algorithm attaining sublinear O(T^{2/3}) regret in the contextual setting against a surplus-maximizing buyer. We also extend this result to repeated second-price auctions with multiple buyers.
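To make the feedback structure concrete, here is a sketch of repeated posted pricing against a truthful (myopic) buyer with a fixed, unknown valuation: only accept/reject feedback is observed, and a bisection-style search quickly localizes the value. A surplus-maximizing buyer could deliberately reject such prices to keep them low, which is exactly the strategic behavior the paper's contextual algorithm has to guard against; this sketch is not that algorithm.

```python
import numpy as np

true_value = 0.63        # buyer's fixed, unknown valuation of the good
low, high = 0.0, 1.0     # known bounds on the valuation
revenue = 0.0
T = 1000

for t in range(T):
    price = (low + high) / 2.0          # post a price; only accept/reject is seen
    if price <= true_value:             # a truthful (myopic) buyer accepts
        revenue += price
        low = price                     # the valuation is at least this price
    else:
        high = price                    # the valuation is below this price

print(f"revenue over {T} rounds: {revenue:.1f} (buyer value {true_value})")
```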
View details
Preview abstract
Inspired by real-time ad exchanges for online display advertising, we consider the problem of inferring a buyer's value for a good when the buyer is repeatedly interacting with the seller through a posted-price mechanism. We model the buyer as a strategic agent interested in maximizing her long-term surplus, and we seek to optimize seller revenue. We show conditions under which the seller cannot hope to gain an advantage by learning the buyer's value; that is, the buyer can always manipulate the exchange to hide her value. This result is accompanied by a seller algorithm that achieves no-regret when the buyer is unable to incur the short-term costs of such manipulation.
View details