Production Machine Learning Pipelines: Empirical Analysis and Optimization Opportunities
Abstract
Machine learning (ML) is now commonplace, powering data-driven applications in a host of industries and organizations. Unlike the traditional perception of ML in research, ML production pipelines are complex, with many interlocking analytical components beyond training, whose sub-parts are often run multiple times, over overlapping subsets of data. However, there is a lack of quantitative evidence regarding the lifespan, architecture, frequency, and complexity of these pipelines to understand how data management research can be used to make them more efficient, effective, robust, and reproducible. To that end, we analyze the provenance graphs of over 10K production ML pipelines at Google spanning a period of over four months, in an effort to understand the complexity and challenges underlying production ML. Our analysis reveals the characteristics, components, and topologies of typical industry-strength ML pipelines at various granularities. Along the way, we introduce a new specialized data model for representing and reasoning about repeatedly run components (or sub-pipelines) of these ML pipelines, which we call model graphlets. We identify several rich opportunities for optimization, leveraging traditional data management ideas. We show how targeting even one of these opportunities, i.e., identifying and pruning wasted computation that does not translate to deployment, can reduce overall computation costs by 30-50% without compromising the freshness of deployed models.