It's My Data Too: Private ML for Datasets with Multi-User Training Examples

Fan Wu
Arun Ganesh
Adam Smith
Ryan McKenna
ICML 2025 (to appear)

Abstract

We initiate a study of algorithms for model training with user-level differential privacy (DP) in which each example may be associated with multiple users, a setting we call the multi-attribution model. We first provide a carefully chosen definition of user-level DP under the multi-attribution model. We then study the contribution bounding problem, i.e., the problem of selecting a subset of the dataset in which each user is associated with a limited number of examples, and propose a greedy baseline algorithm for it. We evaluate this algorithm on a synthetic logistic regression task and a transformer training task, including a number of variants of the baseline that optimize the chosen subset in various ways. We find that the baseline algorithm remains competitive with its variants in most settings, and we build a better understanding of the practical importance of the bias-variance tradeoff inherent in the contribution bounding problem.
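The abstract does not spell out the paper's exact greedy procedure. As a rough illustration of what contribution bounding looks like in the multi-attribution setting, the sketch below implements one natural greedy heuristic under hypothetical names (`greedy_contribution_bounding`, `user_ids`, `max_per_user`): scan the examples in order and keep an example only if every user attributed to it still has remaining budget.

```python
from collections import defaultdict
from collections.abc import Hashable, Sequence


def greedy_contribution_bounding(
    user_ids: Sequence[Sequence[Hashable]],
    max_per_user: int,
) -> list[int]:
    """Greedily select example indices so that no user is associated
    with more than `max_per_user` of the selected examples.

    user_ids[i] lists the users attributed to example i; in the
    multi-attribution setting an example may have several users.
    """
    counts: dict[Hashable, int] = defaultdict(int)  # selected examples per user
    selected: list[int] = []
    for i, users in enumerate(user_ids):
        # Keep the example only if every attributed user still has budget.
        if all(counts[u] < max_per_user for u in users):
            selected.append(i)
            for u in users:
                counts[u] += 1
    return selected


# Example: three examples, two users, at most one selected example per user.
print(greedy_contribution_bounding([["alice", "bob"], ["alice"], ["bob"]], 1))
# -> [0]  (examples 1 and 2 are dropped once alice and bob are exhausted)
```

Note that such a scan is order-dependent and may keep fewer examples than necessary; the variants mentioned in the abstract optimize the chosen subset in various ways, though the abstract does not specify how.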