- Alkis Polyzotis
- Martin A. Zinkevich
- Steven Whang
- Sudip Roy
Abstract
This tutorial discusses data-management issues that arise in production ML pipelines. Informed by our own experience with such large-scale pipelines, we focus on validating, debugging, cleaning, understanding, and enriching training data. The goal of the tutorial is to bring these issues to the fore, draw connections to prior work in the database literature, and outline the open research questions that prior art does not address. We believe that the data management community is well positioned to address these issues, and we hope to motivate the audience to look more closely at this area.