Practical Performance Guarantees for Pipelined DNN Inference

Kuikui Liu
Prakash Prabhu
Proceedings of the 41st International Conference on Machine Learning(2024) (to appear)


This work optimizes pipeline parallelism of machine learning model inference by partitioning computation graphs into $k$ stages and minimizing the running time of the bottleneck stage. We design practical algorithms for this NP-complete problem and prove they are nearly optimal in practice by comparing against lower bounds obtained from solving novel mixed-integer programming (MIP) formulations. We apply these algorithms and lower-bound techniques to production models to achieve substantial improvements in the approximation guarantees, compared to simple combinatorial lower bounds. For example, our new MIP formulations improve the lower bounds enough to drop the geometric mean approximation ratio from $2.175$ to $1.082$ across production data with $k=16$ pipeline stages. This work shows that while bottleneck partitioning is theoretically hard, in practice we have a handle on the algorithmic side of the problem and much of the remaining challenge is in developing more accurate cost models to give to the partitioning algorithms.