The Data Linter: Lightweight Automated Sanity Checking for ML Data Sets

Nick Hynes; D. Sculley; Michael Terry

NIPS Workshop on ML Systems (2017)

Abstract

Data cleaning and feature engineering are both common practices when developing
machine learning (ML) models. However, developers are not always aware of best
practices for preparing or transforming data for a given model type, which can lead
to suboptimal representations of input features. To address this issue, we introduce
the data linter, a new class of ML tool that automatically inspects ML data sets
to 1) identify potential issues in the data and 2) suggest potentially useful feature
transforms, for a given model type. As with traditional code linting, data linting
automatically identifies potential issues or inefficiencies; codifies best practices and
educates end-users about these practices through tool use; and can lead to quality
improvements. In this paper, we provide a detailed description of data linting,
describe our initial implementation of a data linter for deep neural networks, and
report results suggesting the utility of using a data linter during ML model design.
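
As a rough illustration of the idea described in the abstract, the following Python sketch runs two lint-style checks over a small in-memory data set: one flags a potential issue (numbers stored as strings) and one suggests a feature transform (standardizing a large-scale numeric feature). The function names (lint_column, lint_dataset), the choice of checks, and the scale_threshold value are assumptions made for this example; they are not taken from the authors' implementation.

```python
from statistics import mean, pstdev


def is_number_string(value):
    """Return True if value is a str that parses as a float."""
    if not isinstance(value, str):
        return False
    try:
        float(value)
        return True
    except ValueError:
        return False


def lint_column(name, values, scale_threshold=10.0):
    """Return a list of warnings for one feature column.

    Both checks and the threshold are hypothetical examples of lint rules.
    """
    warnings = []

    # Check 1 (potential issue): every value is a string that looks numeric.
    if values and all(is_number_string(v) for v in values):
        warnings.append(
            f"{name}: numbers appear to be encoded as strings; "
            "consider parsing them to floats"
        )
        return warnings

    # Check 2 (suggested transform): numeric feature far from unit scale.
    numeric = [
        v for v in values
        if isinstance(v, (int, float)) and not isinstance(v, bool)
    ]
    if numeric and len(numeric) == len(values):
        mu, sigma = mean(numeric), pstdev(numeric)
        if abs(mu) > scale_threshold or sigma > scale_threshold:
            warnings.append(
                f"{name}: mean={mu:.1f}, std={sigma:.1f}; "
                "consider standardizing before training"
            )
    return warnings


def lint_dataset(columns):
    """Run lint_column over a dict mapping column name to a list of values."""
    findings = []
    for name, values in columns.items():
        findings.extend(lint_column(name, values))
    return findings


if __name__ == "__main__":
    data = {
        "age": ["31", "45", "27"],               # numeric strings
        "income": [52000.0, 61000.0, 48000.0],   # large-scale feature
        "ratio": [0.1, 0.5, 0.9],                # already small-scale
    }
    for finding in lint_dataset(data):
        print(finding)
```

Like a code linter, the sketch only reports warnings and leaves the decision to the user; which rules apply would in practice depend on the model type.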

Research Areas

Machine intelligence
