
From Data to Models and Back

Evan Rosen
Gene Huang
Mike Dreves
Neoklis Polyzotis
Zhuo Peng

Abstract

Production ML is more than writing the code for the trainer. It requires processes and tooling that enable a larger team to share, track, analyze, and monitor not only the ML code but also the artifacts (Datasets, Models, etc.) that are manipulated and generated in these production ML pipelines. In this paper we describe the tools we developed at Google for the analysis and validation of two of the most important types of artifacts: Datasets and Models. These tools are currently deployed in production at Google and other large organizations. Our approach is heavily inspired by well-known principles of data-management systems. Ultimately, we want to enable users to trust their data and models, and to understand how data properties affect the quality of the generated ML models.
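To make the dataset-analysis-and-validation workflow concrete, the sketch below shows the typical pattern of inferring a schema from training-data statistics and checking serving data against it. It is a minimal illustration, assuming a tool with an interface like Google's open-source TensorFlow Data Validation library; the abstract itself does not name the specific tools, and the file names are placeholders.

```python
# Minimal sketch of dataset validation, assuming an interface like
# TensorFlow Data Validation (an assumption; the abstract does not
# name the tools). File paths are hypothetical placeholders.
import tensorflow_data_validation as tfdv

# Compute summary statistics over the training data.
train_stats = tfdv.generate_statistics_from_csv(data_location='train.csv')

# Infer a schema (expected feature types, domains, presence) from those statistics.
schema = tfdv.infer_schema(statistics=train_stats)

# Compute statistics over new (e.g., serving) data and validate them
# against the schema to surface anomalies such as missing features,
# unexpected values, or distribution drift.
serving_stats = tfdv.generate_statistics_from_csv(data_location='serving.csv')
anomalies = tfdv.validate_statistics(statistics=serving_stats, schema=schema)

# Inspect any detected anomalies.
tfdv.display_anomalies(anomalies)
```

In this pattern, the schema acts as a versioned artifact shared by the team: it encodes expectations about the data and is checked automatically on each pipeline run, in line with the data-management principles the abstract alludes to.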