Prakash Prabhu
I am a Sofware Engineer at Google with experience working in various aspects of search & advertising, ranging from mobile app query and app understanding, search over private corpora in the Cloud, and video ads feature engineering. My interests include information retrieval, distributed systems, parallel computing, program analysis & compiler optimizations. I received a PhD in Computer Science from Princeton University.
Authored Publications
Sort By
Practical Performance Guarantees for Pipelined DNN Inference
Kuikui Liu
Proceedings of the 41st International Conference on Machine Learning (2024), pp. 1655-1671
Preview abstract
This work optimizes pipeline parallelism of machine learning model inference by
partitioning computation graphs into $k$ stages and minimizing the running time of the bottleneck stage.
We design practical algorithms for this NP-complete problem
and prove they are nearly optimal in practice by comparing against lower bounds
obtained from solving novel mixed-integer programming (MIP) formulations.
We apply these algorithms and lower-bound techniques
to production models to achieve substantial improvements in the approximation guarantees,
compared to simple combinatorial lower bounds.
For example, our new MIP formulations improve the lower bounds enough to
drop the geometric mean approximation ratio from $2.175$ to $1.082$ across
production data with $k=16$ pipeline stages.
This work shows that while bottleneck partitioning is theoretically hard,
in practice we have a handle on the algorithmic side of the problem and
much of the remaining challenge is in developing more accurate cost models
to give to the partitioning algorithms.
View details
MemoDyn: Exploiting Weakly Consistent Data Structures for Dynamic Parallel Memoization
Stephen R. Beard
Ayal Zaks
David I. August
27th IEEE International Conference on Parallel Architecture and Compilation Techniques (PACT) (2018)
Preview abstract
Several classes of algorithms for combinatorial search and optimization problems employ memoization data structures to speed up their serial convergence. However, accesses to these data structures impose dependences that obstruct program parallelization. Such programs often continue to function correctly even when queries into these data structures return a partial view of their contents. Weakening the consistency of these data structures can unleash new parallelism opportunities, potentially at the cost of additional computation. These opportunities must, therefore, be carefully exploited for overall speedup. This paper presents MEMODYN, a framework for parallelizing loops that access data structures with weakly consistent semantics. MEMODYN provides programming abstractions to express weak semantics, and consists of a parallelizing compiler and a runtime system that automatically and adaptively exploit the semantics for optimized parallel execution. Evaluation of MEMODYN shows that it achieves efficient parallelization, providing significant improvements over competing techniques in terms of both runtime performance and solution quality.
View details
Speculatively Exploiting Cross-Invocation Parallelism
Jialu Huang
Thomas B. Jablin
Soumyadeep Ghosh
Jae W. Lee
David I. August
25th IEEE International Conference on Parallel Architecture and Compilation Techniques (PACT) (2016)
Preview abstract
Automatic parallelization has shown promise in producing
scalable multi-threaded programs for multi-core architectures.
Most existing automatic techniques parallelize independent
loops and insert global synchronization between
loop invocations. For programs with many loop invocations,
frequent synchronization often becomes the performance
bottleneck. Some techniques exploit cross-invocation
parallelism to overcome this problem. Using static analysis,
they partition iterations among threads to avoid crossthread
dependences. However, this approach may fail if
dependence pattern information is not available at compile
time. To address this limitation, this work proposes
SpecCross–the first automatic parallelization technique to
exploit cross-invocation parallelism using speculation. With
speculation, iterations from different loop invocations can
execute concurrently, and the program synchronizes only on
misspeculation. This allows SpecCross to adapt to dependence
patterns that only manifest on particular inputs at
runtime. Evaluation on eight programs shows that SpecCross
achieves a geomean speedup of 3.43× over parallel
execution without cross-invocation parallelization.
View details