Jump to Content

On the Representation of Partially Specified Implementations and its Application to the Optimization of Linear Algebra Kernels on GPU

Ulysse Beaugnon
arXiv (2019)


The development of high-performance numerical libraries has long been an art reserved for experts. Domain-specific code generators, auto-tuners, and methodologies promise to raise the level of abstraction and improve productivity, or even fully automate the process of porting and tuning numerical kernels. Yet, the state of the art seems to be trapped in a productivity vs. performance trade-off. This is most unsatis- factory for hardware accelerators, where reaching near-peak performance provides much of the rationale for their deploy- ment, and where performance is highly sensitive to decisions crossing multiple layers of abstraction, parallelism and data movement orchestration. Focusing on Nvidia GPUs, we investigate the direct synthe- sis of loop nest and array-based optimizations. Rather than composing program transformations on semantics-preserving intermediate representations, optimization synthesis leverages principles of program synthesis to explore a search space tailored to a given algorithmic kernel specification and a tar- get GPU architecture. This search space does not make any heuristic assumption on the profitability of individual code gen- eration choices or on the semantic and performance interplay of these, nor does it involve any rewriting rule. Its exploration is driven by an original performance model providing a lower bound on the execution time of a set of candidate implementa- tions. Unlike models for program transformation systems, our approach is unaffected by pending transformations clogging the performance estimation horizon. The exploration also uses feedback from running the generated code, albeit many orders of magnitude less often than querying the performance model. For semantics preservation, the search is filtered by the control and data flow constraints derived from the algorithmic specifi- cation. Candidate implementations are formally modeled as bounded sets of code generation choices whose instantiations commute, facilitating and accelerating the exploration. We evaluate our approach on matrix computations occurring in scientific computing and convolutional neural networks.