Production Machine Learning Pipelines: Empirical Analysis and Optimization Opportunities

Doris Xin

Hui Miao

Aditya Parameswaran

Neoklis Polyzotis

ACM SIGMOD 2021

Abstract

Machine learning (ML) is now commonplace, powering data-driven applications in a host of industries and organizations. Unlike the traditional perception of ML in research, ML production pipelines are complex, with many interlocking analytical components beyond training, whose sub-parts are often run multiple times, over overlapping subsets of data. However, there is a lack of quantitative evidence regarding the lifespan, architecture, frequency, and complexity of these pipelines to understand how data management research can be used to make them more efficient, effective, robust, and reproducible. To that end, we analyze the provenance graphs of over 10K production ML pipelines at Google spanning a period of over four months, in an effort to understand the complexity and challenges underlying production ML. Our analysis reveals the characteristics, components, and topologies of typical industry-strength ML pipelines at various granularities. Along the way, we introduce a new specialized data model for representing and reasoning about repeatedly run components (or sub-pipelines) of these ML pipelines, that we call model graphlets. We identify several rich opportunities for optimization, leveraging traditional data management ideas. We show how targeting even one of these opportunities, i.e., that of identifying and pruning wasted computation that doesn’t translate to deployment, can reduce overall computation costs by between 30-50% without compromising the overall freshness of deployed models.
