Data Validation for Machine Learning

Eric Breck; Marty Zinkevich; Neoklis Polyzotis; Steven Whang; Sudip Roy

Data Validation for Machine Learning

Eric Breck

Marty Zinkevich

Neoklis Polyzotis

Steven Whang

Sudip Roy

Proceedings of SysML (2019) (to appear)

Download Google Scholar

Listen with Illuminate

Abstract

Machine learning is a powerful tool for gleaning knowledge from massive amounts of data.
While a great deal of machine learning research has focused on improving the accuracy and efficiency of training and inference algorithms, there is less attention in the equally important problem of monitoring the quality of data fed to machine learning. The importance of this problem is hard to dispute: errors in the input data can nullify any benefits on speed and accuracy for training and inference. This argument points to a data-centric approach to machine learning that treats
training and serving data as an important production asset, on par with the algorithm and infrastructure used for learning.

In this paper, we tackle this problem and present a data validation system that is designed to detect anomalies specifically in data fed into machine learning pipelines.
This system is deployed in production as an integral part of TFX (Baylor, 2017) -- an end-to-end machine learning platform at Google. It is used by hundreds of product teams use it to continuously monitor and validate several petabytes of production data per day.
We faced several challenges in developing our system, most notably around the ability of ML pipelines to soldier on in the face of unexpected patterns, schema-free data, or training/serving skew. We discuss these challenges, the techniques we used to address them,
and the various design choices that we made in implementing the system. Finally,
we present evidence from the system's deployment in production that illustrate the tangible benefits of data validation in the context of ML: early detection of errors, model-quality wins from using better data, savings in engineering hours to debug problems, and a shift towards data-centric workflows in model development.

Research Areas

Machine intelligence

Explore our many areas of focus

Building a collaborative ecosystem

Shaping the future together

Translating discovery into real-world impact

Data Validation for Machine Learning

Abstract

Research Areas

Meet the teams driving innovation

Google AI

Google Cloud

Google DeepMind

Google Labs