Practical Performance Guarantees for Pipelined DNN Inference

Kuikui Liu
Proceedings of the 41st International Conference on Machine Learning (2024)

Abstract

This work optimizes pipeline parallelism of machine learning model inference by
partitioning computation graphs into $k$ stages and minimizing the running time of the bottleneck stage.
We design practical algorithms for this NP-complete problem
and prove they are nearly optimal in practice by comparing against lower bounds
obtained from solving novel mixed-integer programming (MIP) formulations.
We apply these algorithms and lower-bound techniques
to production models to achieve substantial improvements in the approximation guarantees,
compared to simple combinatorial lower bounds.
For example, our new MIP formulations improve the lower bounds enough to
drop the geometric mean approximation ratio from $2.175$ to $1.082$ across
production data with $k=16$ pipeline stages.
This work shows that while bottleneck partitioning is theoretically hard,
in practice we have a handle on the algorithmic side of the problem and
much of the remaining challenge is in developing more accurate cost models
to give to the partitioning algorithms.