Data Management Challenges in Production Machine Learning
Abstract
This tutorial discusses data-management issues that
arise in the context of production ML pipelines. Informed
by our own experience with such large-scale pipelines, we
focus on issues related to validating, debugging, cleaning,
understanding, and enriching training data. The goal of the
tutorial is to bring forth these issues, draw connections to
prior work in the database literature, and outline the open
research questions that are not addressed by prior art. We
believe that the data management community is well positioned
to address these issues, and we hope to motivate the
audience to look more closely at this area.