Paul Suganthan

I have been a Software Engineer at Google Research since March 2018, working as part of the Google Brain team on problems at the intersection of data management and machine learning (ML). Specifically, I am one of the core contributors to TensorFlow Data Validation, an open-source library that helps developers understand, validate, and monitor their ML data at scale.
Authored Publications
    Validating Data and Models in Continuous ML pipelines
    Evan Rosen
    Gene Huang
    Mike Dreves
    Neoklis Polyzotis
    Zhuo Peng
    IEEE TCDE (2021)
    Abstract: Production ML is more than writing the code for the trainer. It requires processes and tooling that enable a larger team to share, track, analyze, and monitor not only the code for ML but also the artifacts (Datasets, Models, ...) that are manipulated and generated in these production ML pipelines. In this paper we describe the tools we developed at Google for the analysis and validation of two of the most important types of artifacts: Datasets and Models. These tools are currently deployed in production at Google and other large organizations. Our approach is heavily inspired by well-known principles of data-management systems. Ultimately, we want to enable users to trust their data and models, and understand how data properties affect the quality of the generated ML models.
    From Data to Models and Back
    Evan Rosen
    Gene Huang
    Mike Dreves
    Neoklis Polyzotis
    Zhuo Peng
    ACM
    Abstract: Production ML is more than writing the code for the trainer. It requires processes and tooling that enable a larger team to share, track, analyze, and monitor not only the code for ML but also the artifacts (Datasets, Models, ...) that are manipulated and generated in these production ML pipelines. In this paper we describe the tools we developed at Google for the analysis and validation of two of the most important types of artifacts: Datasets and Models. These tools are currently deployed in production at Google and other large organizations. Our approach is heavily inspired by well-known principles of data-management systems. Ultimately, we want to enable users to trust their data and models, and understand how data properties affect the quality of the generated ML models.
    TensorFlow Data Validation: Data Analysis and Validation in Continuous ML Pipelines
    Emily Caveness
    Marty Zinkevich
    Neoklis Polyzotis
    Sudip Roy
    Zhuo Peng
    SIGMOD (2020) (to appear)
    Abstract: Machine Learning (ML) research has primarily focused on improving the accuracy and efficiency of the training algorithms while paying much less attention to the equally important problem of understanding the data and monitoring the quality of the data fed to ML. Irrespective of the ML algorithms used, data errors can adversely affect the quality of the generated model. This indicates that we need to adopt a data-centric approach to ML that treats data as a first-class citizen in ML pipelines, on par with algorithms and infrastructure. In this paper we focus on the problem of validating the input data fed to ML pipelines. Specifically, we demonstrate TensorFlow Data Validation (TFDV), a scalable data analysis and validation system developed at Google and open-sourced. This system is deployed in production as an integral part of TFX (Baylor et al., 2017) – an end-to-end machine learning platform at Google. It is used by hundreds of product teams at Google and has received significant attention from the open-source community as well.