Machine Learning (ML) research has primarily focused on improving the accuracy and efficiency of training algorithms, while paying much less attention to the equally important problem of understanding the data fed to ML pipelines and monitoring its quality. Irrespective of the ML algorithm used, data errors can adversely affect the quality of the generated model. This calls for a data-centric approach to ML that treats data as a first-class citizen in ML pipelines, on par with algorithms and infrastructure.
In this paper, we focus on the problem of validating the input data fed to ML pipelines. Specifically, we demonstrate TensorFlow Data Validation (TFDV), a scalable data analysis and validation system developed at Google and released as open source. The system is deployed in production as an integral part of TFX (Baylor et al., 2017), an end-to-end machine learning platform at Google. It is used by hundreds of product teams at Google and has also received significant attention from the open-source community.