Eric Breck
I am interested in practical issues in large-scale application of machine learning, including considerations of fairness as well as testing and validation of ML systems.
Authored Publications
Data Validation for Machine Learning
Marty Zinkevich
Neoklis Polyzotis
Steven Whang
Sudip Roy
Proceedings of SysML (2019) (to appear)
Abstract
Machine learning is a powerful tool for gleaning knowledge from massive amounts of data.
While a great deal of machine learning research has focused on improving the accuracy and efficiency of training and inference algorithms, far less attention has been paid to the equally important problem of monitoring the quality of the data fed to machine learning. The importance of this problem is hard to dispute: errors in the input data can nullify any gains in speed and accuracy for training and inference. This argument points to a data-centric approach to machine learning that treats
training and serving data as an important production asset, on par with the algorithm and infrastructure used for learning.
In this paper, we tackle this problem and present a data validation system that is designed to detect anomalies specifically in data fed into machine learning pipelines.
This system is deployed in production as an integral part of TFX (Baylor et al., 2017) -- an end-to-end machine learning platform at Google. Hundreds of product teams use it to continuously monitor and validate several petabytes of production data per day.
We faced several challenges in developing our system, most notably around the ability of ML pipelines to soldier on in the face of unexpected patterns, schema-free data, or training/serving skew. We discuss these challenges, the techniques we used to address them,
and the various design choices that we made in implementing the system. Finally,
we present evidence from the system's deployment in production that illustrates the tangible benefits of data validation in the context of ML: early detection of errors, model-quality wins from using better data, savings in engineering hours spent debugging problems, and a shift towards data-centric workflows in model development.
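The open-source TensorFlow Data Validation (TFDV) library implements the core workflow described above: compute statistics over training data, infer a schema, and validate later batches against that schema. A minimal sketch, assuming hypothetical CSV files for training and serving data:

```python
# Minimal sketch using TensorFlow Data Validation (TFDV), the open-source
# counterpart of the system described above. File names are hypothetical.
import tensorflow_data_validation as tfdv

# Summarize the training data and infer a schema (expected features,
# types, and value domains) from those statistics.
train_stats = tfdv.generate_statistics_from_csv(data_location='train.csv')
schema = tfdv.infer_schema(statistics=train_stats)

# Later, validate a fresh batch of serving data against the schema to
# surface anomalies such as missing features, unexpected values, or skew.
serving_stats = tfdv.generate_statistics_from_csv(data_location='serving.csv')
anomalies = tfdv.validate_statistics(statistics=serving_stats, schema=schema)
tfdv.display_anomalies(anomalies)
```

In practice the inferred schema is typically curated by engineers and versioned alongside the pipeline, so a validation failure points to either a data bug or a deliberate schema change.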
The Inclusive Images Competition
Igor Ivanov
Miha Skalic
Pallavi Baljekar
Pavel Ostyakov
Roman Solovyev
Weimin Wang
Yoni Halpern
Springer Series (2019)
Abstract
Popular large image classification datasets that are drawn from the web present Eurocentric and Americentric biases that negatively impact the generalizability of models trained on them. In order to encourage the development of modeling approaches that generalize well to images drawn from locations and cultural contexts that are unseen or poorly represented at the time of training, we organized the Inclusive Images competition in association with Kaggle and the NeurIPS 2018 Competition Track Workshop. In this chapter, we describe the motivation and design of the competition, present reports from the top three competitors, and provide high-level takeaways from the competition results.
No Classification without Representation: Assessing Geodiversity Issues in Open Data Sets for the Developing World
Shreya Shankar
Yoni Halpern
NIPS 2017 workshop: Machine Learning for the Developing World
Abstract
Modern machine learning systems such as image classifiers rely heavily on large-scale data sets for training. Such data sets are costly to create, thus in practice a small number of freely available, open-source data sets are widely used. Such strategies may be particularly important for ML applications in the developing world, where resources may be constrained and the cost of creating suitable large-scale data sets may be a blocking factor. However, we suggest that examining the geo-diversity of open data sets is critical before adopting a data set for such use cases. In particular, we analyze two large, publicly available image data sets to assess geo-diversity and find that these data sets appear to exhibit an observable amerocentric and eurocentric representation bias. Further, we perform targeted analysis on classifiers that use these data sets as training data to assess the impact of these training distributions, and find strong differences in the relative performance on images from different locales. These results emphasize the need to ensure geo-representation when constructing data sets for use in the developing world.
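The kind of geo-diversity audit described here can be approximated with a few lines of analysis code. An illustrative sketch (not the paper's code), assuming a hypothetical metadata file with one row per image and a country column derived from geolocation:

```python
# Illustrative sketch of a per-country representation audit, assuming a
# hypothetical 'image_metadata.csv' with a 'country' column per image.
import pandas as pd

metadata = pd.read_csv('image_metadata.csv')

# Fraction of images contributed by each country, most represented first.
shares = metadata['country'].value_counts(normalize=True)
print(shares.head(10))

# A crude concentration summary: how much of the data set comes from the
# five most represented countries.
print('Top-5 country share:', shares.head(5).sum())
```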
TFX: A TensorFlow-Based Production-Scale Machine Learning Platform
Akshay Naresh Modi
Chiu Yuen Koo
Chuan Yu Foo
Clemens Mewald
Denis M. Baylor
Jarek Wilkiewicz
Levent Koc
Lukasz Lew
Martin A. Zinkevich
Mustafa Ispir
Neoklis Polyzotis
Steven Whang
Sudip Roy
Sukriti Ramesh
Vihan Jain
Xin Zhang
Zakaria Haque
KDD 2017
Abstract
Creating and maintaining a platform for reliably producing and deploying machine learning models requires careful orchestration of many components—a learner for generating models based on training data, modules for analyzing and validating both data as well as models, and finally infrastructure for serving models in production. This becomes particularly challenging when data changes over time and fresh models need to be produced continuously. Unfortunately, such orchestration is often done ad hoc using glue code and custom scripts developed by individual teams for specific use cases, leading to duplicated effort and fragile systems with high technical debt.
We present TensorFlow Extended (TFX), a TensorFlow-based general-purpose machine learning platform implemented at Google. By integrating the aforementioned components into one platform, we were able to standardize the components, simplify the platform configuration, and reduce the time to production from the order of months to weeks, while providing platform stability that minimizes disruptions.
We present the case study of one deployment of TFX in the Google Play app store, where the machine learning models are refreshed continuously as new data arrive. Deploying TFX led to reduced custom code, faster experiment cycles, and a 2% increase in app installs resulting from improved data and model analysis.
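The open-source TFX libraries released after this paper expose the same component structure. A minimal local pipeline sketch, assuming TFX 1.x, hypothetical paths, and a user-supplied trainer module:

```python
# Sketch of a local TFX 1.x pipeline wiring together the standard components
# described above. Paths and the trainer module file are hypothetical.
from tfx import v1 as tfx

DATA_ROOT = 'data/'                 # directory of CSV training data
MODULE_FILE = 'trainer_module.py'   # user-provided model code for Trainer
SERVING_DIR = 'serving_model/'
PIPELINE_ROOT = 'pipeline_root/'
METADATA_PATH = 'metadata.sqlite'

# Ingest raw data and emit tf.Example records.
example_gen = tfx.components.CsvExampleGen(input_base=DATA_ROOT)

# Analyze and validate the data: statistics, schema, anomaly detection.
statistics_gen = tfx.components.StatisticsGen(
    examples=example_gen.outputs['examples'])
schema_gen = tfx.components.SchemaGen(
    statistics=statistics_gen.outputs['statistics'])
example_validator = tfx.components.ExampleValidator(
    statistics=statistics_gen.outputs['statistics'],
    schema=schema_gen.outputs['schema'])

# Train a model using the user-provided module file.
trainer = tfx.components.Trainer(
    module_file=MODULE_FILE,
    examples=example_gen.outputs['examples'],
    schema=schema_gen.outputs['schema'],
    train_args=tfx.proto.TrainArgs(num_steps=1000),
    eval_args=tfx.proto.EvalArgs(num_steps=100))

# Push the trained model to a serving directory.
pusher = tfx.components.Pusher(
    model=trainer.outputs['model'],
    push_destination=tfx.proto.PushDestination(
        filesystem=tfx.proto.PushDestination.Filesystem(
            base_directory=SERVING_DIR)))

pipeline = tfx.dsl.Pipeline(
    pipeline_name='example_pipeline',
    pipeline_root=PIPELINE_ROOT,
    components=[example_gen, statistics_gen, schema_gen,
                example_validator, trainer, pusher],
    metadata_connection_config=tfx.orchestration.metadata.sqlite_metadata_connection_config(
        METADATA_PATH))

tfx.orchestration.LocalDagRunner().run(pipeline)
```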
The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction
IEEE International Conference on Big Data (2017)
Abstract
Creating reliable, production-level machine learning systems brings on a host of concerns not found in small toy examples or even large offline research experiments. Testing and monitoring are key considerations for ensuring the production-readiness of an ML system and for reducing its technical debt. But it can be difficult to formulate specific tests, given that the actual prediction behavior of any given model is difficult to specify a priori. In this paper, we present 28 specific tests and monitoring needs, drawn from experience with a wide range of production ML systems, to help quantify these issues and provide an easy-to-follow road map to improve production readiness and pay down ML technical debt.
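The tests in the rubric are checklist items rather than code, but many reduce to ordinary unit tests. A self-contained illustration in the spirit of one such test (a toy scikit-learn model, not the paper's code): require that a trained model clearly beats a trivial baseline on held-out data before release.

```python
# Illustrative, self-contained example of a checklist-style model test:
# a trained model must clearly beat a majority-class baseline on held-out
# data. The toy data and model are stand-ins, not from the paper.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def test_model_beats_majority_baseline():
    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    model_accuracy = model.score(X_test, y_test)

    # Accuracy of always predicting the most common training label.
    majority_label = np.bincount(y_train).argmax()
    baseline_accuracy = np.mean(y_test == majority_label)

    # Require a clear margin over the baseline, not just a tie.
    assert model_accuracy > baseline_accuracy + 0.05
```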
TensorFlow Debugger: Debugging Dataflow Graphs for Machine Learning
Eric Nielsen
Michael Salib
Proceedings of the Reliable Machine Learning in the Wild - NIPS 2016 Workshop (2016)
Abstract
Debuggability is important in the development of machine-learning (ML) systems. Several widely used ML libraries, such as TensorFlow and Theano, are based on dataflow graphs. While it offers important benefits such as facilitating distributed training, the dataflow graph paradigm makes debugging model issues more challenging than debugging in the more conventional procedural paradigm. In this paper, we present the design of the TensorFlow Debugger (tfdbg), a specialized debugger for ML models written in TensorFlow. tfdbg provides features to inspect runtime dataflow graphs and the state of intermediate graph elements ("tensors"), as well as simulated stepping through the graph. We discuss the application of this debugger in development and testing use cases.
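For concreteness, attaching tfdbg to a TensorFlow 1.x session looks roughly as follows; the tiny graph is a toy example (not from the paper) that deliberately produces inf values for the debugger's filter to catch.

```python
# Minimal sketch of wrapping a TensorFlow 1.x session with the tfdbg
# command-line interface. The toy graph below deliberately produces inf.
import tensorflow as tf
from tensorflow.python import debug as tf_debug

a = tf.constant([1.0, 2.0])
b = tf.constant([0.0, 2.0])
c = tf.divide(a, b)  # division by zero yields inf

sess = tf.Session()
# Each sess.run() now drops into the interactive tfdbg CLI.
sess = tf_debug.LocalCLIDebugWrapperSession(sess)
# Register a filter that flags any tensor containing inf or nan values.
sess.add_tensor_filter("has_inf_or_nan", tf_debug.has_inf_or_nan)
print(sess.run(c))
```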
What’s your ML test score? A rubric for ML production systems
Eric Nielsen
Michael Salib
Reliable Machine Learning in the Wild - NIPS 2016 Workshop (2016)
Abstract
Using machine learning in real-world production systems is complicated by a host of issues not found in small toy examples or even large offline research experiments. Testing and monitoring are key considerations for assessing the production-readiness of an ML system. But how much testing and monitoring is enough? We present an ML Test Score rubric based on a set of actionable tests to help quantify these issues.
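As an illustration of the monitoring side of the rubric (not code from the paper), a simple check that a feature's serving-time distribution has not drifted too far from its training-time distribution:

```python
# Illustrative monitoring-style check: compare training and serving
# distributions of one feature via the L-infinity distance between
# normalized histograms, and alert if it exceeds a threshold.
import numpy as np

def linf_histogram_distance(train_values, serving_values, bins=20):
    lo = min(train_values.min(), serving_values.min())
    hi = max(train_values.max(), serving_values.max())
    train_hist, _ = np.histogram(train_values, bins=bins, range=(lo, hi))
    serving_hist, _ = np.histogram(serving_values, bins=bins, range=(lo, hi))
    train_frac = train_hist / train_hist.sum()
    serving_frac = serving_hist / serving_hist.sum()
    return float(np.max(np.abs(train_frac - serving_frac)))

# Simulated example: serving data has drifted relative to training data.
rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, size=10_000)
serving = rng.normal(1.5, 1.0, size=10_000)

distance = linf_histogram_distance(train, serving)
print(f"L-infinity skew: {distance:.3f}")
if distance > 0.10:  # the threshold is a per-feature tuning choice
    print("ALERT: serving distribution has drifted from training")
```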