Paul Suganthan
I have been a Software Engineer at Google Research since March 2018, working on the Google Brain team on problems at the intersection of data management and machine learning (ML). Specifically, I am one of the core contributors to TensorFlow Data Validation, an open-source library that helps developers understand, validate, and monitor their ML data at scale.
Research Areas
Authored Publications
Production ML is more than writing the code for the trainer. It requires processes and tooling that enable a larger team to share, track, analyze, and monitor not only the code for ML but also the artifacts (Datasets, Models, ...) that are manipulated and generated in these production ML pipelines.
In this paper we describe the tools we developed at Google for the analysis and validation of two of the most important types of artifacts: Datasets and Models. These tools are currently deployed in production at Google and other large organizations. Our approach is heavily inspired by well-known principles of data-management systems. Ultimately, we want to enable users to trust their data and models, and understand how data properties affect the quality of the generated ML models.
TensorFlow Data Validation: Data Analysis and Validation in Continuous ML Pipelines
Emily Caveness
Marty Zinkevich
Neoklis Polyzotis
Sudip Roy
Zhuo Peng
SIGMOD (2020) (to appear)
Machine Learning (ML) research has primarily focused on improving the accuracy and efficiency of the training algorithms while paying much less attention to the equally important problem of understanding the data and monitoring the quality of the data fed to ML. Irrespective of the ML algorithms used, data errors can adversely affect the quality of the generated model. This indicates that we need to adopt a data-centric approach to ML that treats data as a first-class citizen in ML pipelines, on par with algorithms and infrastructure.
In this paper we focus on the problem of validating the input data fed to ML pipelines. Specifically, we demonstrate TensorFlow Data Validation (TFDV), a scalable data analysis and validation system developed at Google and open-sourced. This system is deployed in production as an integral part of TFX (Baylor et al., 2017) – an end-to-end machine learning platform at Google. It is used by hundreds of product teams at Google and has received significant attention from the open-source community as well.
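The core idea described above, inferring a schema from training data and then flagging serving data that deviates from it, can be illustrated with a minimal, self-contained sketch. Note that this is a conceptual illustration only, not the TFDV API; the function names (`infer_schema`, `validate`) and the type-set schema representation are assumptions chosen for brevity:

```python
def infer_schema(records):
    """Infer, per feature, the set of value types seen in training data."""
    schema = {}
    for record in records:
        for feature, value in record.items():
            schema.setdefault(feature, set()).add(type(value).__name__)
    return schema


def validate(records, schema):
    """Return anomalies: missing features, unknown features, unexpected types."""
    anomalies = []
    for i, record in enumerate(records):
        for feature in schema:
            if feature not in record:
                anomalies.append((i, feature, "missing feature"))
        for feature, value in record.items():
            expected = schema.get(feature)
            if expected is None:
                anomalies.append((i, feature, "unknown feature"))
            elif type(value).__name__ not in expected:
                anomalies.append((i, feature, "unexpected type"))
    return anomalies


# Infer a schema from (toy) training data, then validate serving data.
train = [{"age": 34, "country": "US"}, {"age": 28, "country": "DE"}]
schema = infer_schema(train)
serving = [{"age": "n/a", "country": "US"}, {"country": "FR"}]
print(validate(serving, schema))
```

The real system additionally computes rich per-feature statistics at scale and supports schema evolution, but the validation loop follows the same principle: data errors are reported as anomalies against an explicit, inferred schema rather than discovered downstream as degraded model quality.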