Eric Breck

I am interested in practical issues in large-scale application of machine learning, including considerations of fairness as well as testing and validation of ML systems.
Authored Publications
    Data Validation for Machine Learning
    Marty Zinkevich
    Neoklis Polyzotis
    Steven Whang
    Sudip Roy
    Proceedings of SysML (2019) (to appear)
    Machine learning is a powerful tool for gleaning knowledge from massive amounts of data. While a great deal of machine learning research has focused on improving the accuracy and efficiency of training and inference algorithms, less attention has been paid to the equally important problem of monitoring the quality of data fed to machine learning. The importance of this problem is hard to dispute: errors in the input data can nullify any benefits in speed and accuracy for training and inference. This argument points to a data-centric approach to machine learning that treats training and serving data as an important production asset, on par with the algorithm and infrastructure used for learning. In this paper, we tackle this problem and present a data validation system that is designed to detect anomalies specifically in data fed into machine learning pipelines. This system is deployed in production as an integral part of TFX (Baylor, 2017) -- an end-to-end machine learning platform at Google. Hundreds of product teams use it to continuously monitor and validate several petabytes of production data per day. We faced several challenges in developing our system, most notably around the ability of ML pipelines to soldier on in the face of unexpected patterns, schema-free data, or training/serving skew. We discuss these challenges, the techniques we used to address them, and the various design choices that we made in implementing the system. Finally, we present evidence from the system's deployment in production that illustrates the tangible benefits of data validation in the context of ML: early detection of errors, model-quality wins from using better data, savings in engineering hours to debug problems, and a shift towards data-centric workflows in model development.
    The Inclusive Images Competition
    Igor Ivanov
    Miha Skalic
    Pallavi Baljekar
    Pavel Ostyakov
    Roman Solovyev
    Weimin Wang
    Yoni Halpern
    Springer Series (2019)
    Popular large image classification datasets that are drawn from the web present Eurocentric and Americentric biases that negatively impact the generalizability of models trained on them. To encourage the development of modeling approaches that generalize well to images drawn from locations and cultural contexts that are unseen or poorly represented at the time of training, we organized the Inclusive Images competition in association with Kaggle and the NeurIPS 2018 Competition Track Workshop. In this chapter, we describe the motivation and design of the competition, present reports from the top three competitors, and provide high-level takeaways from the competition results.
    Modern machine learning systems such as image classifiers rely heavily on large-scale data sets for training. Such data sets are costly to create, thus in practice a small number of freely available, open source data sets are widely used. Such strategies may be particularly important for ML applications in the developing world, where resources may be constrained and the cost of creating suitable large-scale data sets may be a blocking factor. However, we suggest that examining the geo-diversity of open data sets is critical before adopting a data set for such use cases. In particular, we analyze two large, publicly available image data sets to assess geo-diversity and find that these data sets appear to exhibit an observable Amerocentric and Eurocentric representation bias. Further, we perform targeted analysis on classifiers that use these data sets as training data to assess the impact of these training distributions, and find strong differences in the relative performance on images from different locales. These results emphasize the need to ensure geo-representation when constructing data sets for use in the developing world.
    TFX: A TensorFlow-Based Production-Scale Machine Learning Platform
    Akshay Naresh Modi
    Chiu Yuen Koo
    Chuan Yu Foo
    Clemens Mewald
    Denis M. Baylor
    Jarek Wilkiewicz
    Levent Koc
    Lukasz Lew
    Martin A. Zinkevich
    Mustafa Ispir
    Neoklis Polyzotis
    Steven Whang
    Sudip Roy
    Sukriti Ramesh
    Vihan Jain
    Xin Zhang
    Zakaria Haque
    KDD 2017
    Creating and maintaining a platform for reliably producing and deploying machine learning models requires careful orchestration of many components: a learner for generating models based on training data, modules for analyzing and validating both data and models, and finally infrastructure for serving models in production. This becomes particularly challenging when data changes over time and fresh models need to be produced continuously. Unfortunately, such orchestration is often done ad hoc using glue code and custom scripts developed by individual teams for specific use cases, leading to duplicated effort and fragile systems with high technical debt. We present TensorFlow Extended (TFX), a TensorFlow-based general-purpose machine learning platform implemented at Google. By integrating the aforementioned components into one platform, we were able to standardize the components, simplify the platform configuration, and reduce the time to production from the order of months to weeks, while providing platform stability that minimizes disruptions. We present the case study of one deployment of TFX in the Google Play app store, where the machine learning models are refreshed continuously as new data arrive. Deploying TFX led to reduced custom code, faster experiment cycles, and a 2% increase in app installs resulting from improved data and model analysis.
    Creating reliable, production-level machine learning systems brings on a host of concerns not found in small toy examples or even large offline research experiments. Testing and monitoring are key considerations for ensuring the production-readiness of an ML system, and for reducing technical debt of ML systems. But it can be difficult to formulate specific tests, given that the actual prediction behavior of any given model is difficult to specify a priori. In this paper, we present 28 specific tests and monitoring needs, drawn from experience with a wide range of production ML systems, to help quantify these issues and present an easy-to-follow roadmap to improve production readiness and pay down ML technical debt.
    TensorFlow Debugger: Debugging Dataflow Graphs for Machine Learning
    Eric Nielsen
    Michael Salib
    Proceedings of the Reliable Machine Learning in the Wild - NIPS 2016 Workshop (2016)
    Debuggability is important in the development of machine-learning (ML) systems. Several widely used ML libraries, such as TensorFlow and Theano, are based on dataflow graphs. While offering important benefits such as facilitating distributed training, the dataflow graph paradigm makes the debugging of model issues more challenging compared to debugging in the more conventional procedural paradigm. In this paper, we present the design of the TensorFlow Debugger (tfdbg), a specialized debugger for ML models written in TensorFlow. tfdbg provides features to inspect runtime dataflow graphs and the state of the intermediate graph elements ("tensors"), and to simulate stepping on the graph. We will discuss the application of this debugger in development and testing use cases.
    What’s your ML test score? A rubric for ML production systems
    Eric Nielsen
    Michael Salib
    Reliable Machine Learning in the Wild - NIPS 2016 Workshop (2016)
    Using machine learning in real-world production systems is complicated by a host of issues not found in small toy examples or even large offline research experiments. Testing and monitoring are key considerations for assessing the production-readiness of an ML system. But how much testing and monitoring is enough? We present an ML Test Score rubric based on a set of actionable tests to help quantify these issues.