
Generative Models for Effective ML on Private, Decentralized Datasets

Abstract

To improve real-world applications of machine learning, experienced practitioners develop intuition about their datasets, their models, and how the two interact. Manual inspection of raw data - of representative samples, of outliers, of misclassifications, and the like - is an essential tool for a) identifying and fixing problems in the data, b) generating new modeling hypotheses, and c) assigning human-provided labels. However, manual data inspection is risky for privacy-sensitive datasets, such as those representing the behavior of real-world individuals. Furthermore, manual data inspection is impossible in the increasingly important setting of federated learning, where raw examples remain stored at the edge and the practitioner may access only aggregated outputs such as metrics or model parameters. This paper outlines a research agenda to address the data-oriented tooling needs of ML practitioners who work with privacy-sensitive or decentralized datasets. We demonstrate that generative models - trained using federated methods and with formal differential privacy guarantees - can be used to effectively debug data issues even when the raw data cannot be directly inspected.
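
As a rough illustration of the approach the abstract describes, the sketch below trains a toy generative model with a DP-FedAvg-style loop: each simulated client computes a model delta on its own private data, the server clips each delta to bound its norm, averages, and adds Gaussian noise calibrated to that bound before updating. Everything here is a hypothetical illustration, not the paper's implementation: the names (local_update, dp_fedavg_round), the toy Gaussian generator, and the noise scale (which is schematic, not a formal (epsilon, delta) accounting) are all assumptions for the sketch.

# Minimal, hypothetical sketch of DP-FedAvg training a toy generative
# model. Illustrative only; not the paper's actual method or code.
import numpy as np

rng = np.random.default_rng(0)

def local_update(weights, client_data, lr=0.1):
    # One client's local gradient step on a toy Gaussian generative
    # model; the "model" is just (mean, log_std), fit by descending
    # the negative log-likelihood of the client's own samples.
    mean, log_std = weights
    std = np.exp(log_std)
    g_mean = np.mean((mean - client_data) / std**2)
    g_log_std = np.mean(1.0 - ((client_data - mean) ** 2) / std**2)
    new_weights = weights - lr * np.array([g_mean, g_log_std])
    return new_weights - weights  # send the model delta, not raw data

def dp_fedavg_round(weights, client_datasets, clip=1.0, noise_mult=0.5):
    # One round: clip each client delta to L2 norm <= clip, average,
    # then add Gaussian noise scaled to the clip bound. Clipping limits
    # any single client's influence; the noise is what makes the
    # released aggregate differentially private.
    deltas = []
    for data in client_datasets:
        delta = local_update(weights, data)
        norm = np.linalg.norm(delta)
        deltas.append(delta * min(1.0, clip / (norm + 1e-12)))
    avg = np.mean(deltas, axis=0)
    noise = rng.normal(0.0, noise_mult * clip / len(client_datasets),
                       size=avg.shape)
    return weights + avg + noise

# Toy decentralized data: each client's samples never leave the client;
# the server only ever sees clipped, noised aggregates.
clients = [rng.normal(3.0, 1.0, size=50) for _ in range(100)]
weights = np.array([0.0, 0.0])  # (mean, log_std) of the generator
for _ in range(200):
    weights = dp_fedavg_round(weights, clients)

# The trained generator can now emit synthetic samples for a
# practitioner to inspect in place of the raw, private data.
print("learned mean/std:", weights[0], np.exp(weights[1]))

The key design point the sketch shows is that every per-client contribution has a bounded norm before aggregation, so the Gaussian noise can be calibrated to that bound; inspecting samples drawn from the resulting generator then stands in for the manual raw-data inspection that privacy or decentralization rules out.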
