The Data Linter: Lightweight Automated Sanity Checking for ML Data Sets
Abstract
Data cleaning and feature engineering are both common practices when developing
machine learning (ML) models. However, developers are not always aware of best
practices for preparing or transforming data for a given model type, which can lead
to suboptimal representations of input features. To address this issue, we introduce
the data linter, a new class of ML tool that automatically inspects ML data sets
to 1) identify potential issues in the data and 2) suggest potentially useful feature
transforms, for a given model type. As with traditional code linting, data linting
automatically identifies potential issues or inefficiencies; codifies best practices and
educates end-users about these practices through tool use; and can lead to quality
improvements. In this paper, we provide a detailed description of data linting,
describe our initial implementation of a data linter for deep neural networks, and
report results suggesting the utility of using a data linter during ML model design.
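
To make the idea of a lint over data concrete, the following minimal sketch shows two illustrative checks in plain Python. It is not the implementation described in this paper: the function names (`lint_column`, `lint_dataset`), the two checks, and the threshold of 10 are all assumptions chosen for illustration. Each check pairs a potential issue with a suggested transform, mirroring the two roles of a data linter described above.

```python
from statistics import mean, pstdev


def _is_number(text):
    """Return True if the string parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def lint_column(name, values, scale_limit=10.0):
    """Return warnings for one feature column (hypothetical checks).

    scale_limit is an arbitrary illustrative threshold, not a value
    taken from the paper.
    """
    warnings = []
    if not values:
        return warnings

    # Check 1: every value is a string that parses as a number.
    if all(isinstance(v, str) for v in values):
        if all(_is_number(v) for v in values):
            warnings.append(
                f"{name}: numbers stored as strings; consider casting to float"
            )
        return warnings

    # Check 2: numeric feature whose mean or spread is far from unit scale.
    if all(isinstance(v, (int, float)) for v in values):
        mu, sigma = mean(values), pstdev(values)
        if abs(mu) > scale_limit or sigma > scale_limit:
            warnings.append(
                f"{name}: mean={mu:.1f}, std={sigma:.1f}; consider normalizing"
            )
    return warnings


def lint_dataset(columns):
    """Run every check over a dict mapping feature name -> list of values."""
    warnings = []
    for name, values in columns.items():
        warnings.extend(lint_column(name, values))
    return warnings


if __name__ == "__main__":
    data = {
        "age": ["31", "45", "27"],               # numeric, but stored as text
        "income": [52000.0, 61000.0, 48000.0],   # large scale, unnormalized
        "score": [0.1, -0.2, 0.3],               # already near unit scale
    }
    for warning in lint_dataset(data):
        print(warning)
```

As with a code linter, the output is advisory: each warning names a feature, states what looks suspicious, and proposes a transform that the developer may accept or ignore.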