On the Representation of Partially Specified Implementations and Its Application to the Optimization of Linear Algebra Kernels on GPU
Abstract
The development of high-performance numerical libraries has long been an art reserved for experts. Domain-specific code generators, auto-tuners, and methodologies promise to raise the level of abstraction and improve productivity, or even fully automate the process of porting and tuning numerical kernels. Yet, the state of the art seems to be trapped in a productivity vs. performance trade-off. This is most unsatisfactory for hardware accelerators, where reaching near-peak performance provides much of the rationale for their deployment, and where performance is highly sensitive to decisions crossing multiple layers of abstraction, parallelism and data movement orchestration.
Focusing on Nvidia GPUs, we investigate the direct synthesis of loop nest and array-based optimizations. Rather than composing program transformations on semantics-preserving intermediate representations, optimization synthesis leverages principles of program synthesis to explore a search space tailored to a given algorithmic kernel specification and a target GPU architecture. This search space does not make any heuristic assumption on the profitability of individual code generation choices or on their semantic and performance interplay, nor does it involve any rewriting rule. Its exploration is driven by an original performance model providing a lower bound on the execution time of a set of candidate implementations. Unlike models for program transformation systems, our approach is unaffected by pending transformations clogging the performance estimation horizon. The exploration also uses feedback from running the generated code, albeit many orders of magnitude less often than querying the performance model. For semantics preservation, the search is filtered by the control and data flow constraints derived from the algorithmic specification. Candidate implementations are formally modeled as bounded sets of code generation choices whose instantiations commute, facilitating and accelerating the exploration. We evaluate our approach on matrix computations occurring in scientific computing and convolutional neural networks.
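
To make the central mechanism concrete, the following minimal Python sketch illustrates a branch-and-bound exploration over partially specified implementations: a candidate is a set of open code generation choices, a lower-bound model prunes whole sets soundly, and fixing choices in any order reaches the same implementations (instantiations commute). This is purely illustrative, not the paper's system; the choice names, the cost table, and the functions lower_bound, measure, and explore are invented for this sketch.

    import math

    # Toy cost per (choice, value) pair. A real lower bound would reason
    # about the target GPU; these numbers are purely illustrative.
    COST = {
        ("tile", 16): 4.0, ("tile", 32): 2.0, ("tile", 64): 3.0,
        ("unroll", 1): 3.0, ("unroll", 2): 1.0, ("unroll", 4): 2.0,
    }

    def lower_bound(cand):
        # Optimistic bound: cheapest still-allowed value of every choice.
        # No fully specified implementation in the set can run faster.
        return sum(min(COST[(name, v)] for v in vals)
                   for name, vals in cand.items())

    def measure(cand):
        # Stand-in for compiling and running a fully specified candidate.
        return sum(COST[(name, next(iter(vals)))]
                   for name, vals in cand.items())

    def is_fixed(cand):
        # Fully specified once every choice has a single value left.
        return all(len(vals) == 1 for vals in cand.values())

    def split(cand):
        # Branch on one open choice. Instantiations commute: fixing
        # choices in any order reaches the same implementations; only
        # the shape of the search tree differs.
        name = next(n for n, vals in cand.items() if len(vals) > 1)
        for v in sorted(cand[name]):
            child = {n: set(vals) for n, vals in cand.items()}
            child[name] = {v}
            yield child

    def explore(root):
        best, best_time = None, math.inf
        stack = [root]
        while stack:
            cand = stack.pop()
            if lower_bound(cand) >= best_time:
                continue  # prune the whole set: provably no better
            if is_fixed(cand):
                t = measure(cand)  # actual runs, far rarer than model queries
                if t < best_time:
                    best, best_time = cand, t
            else:
                stack.extend(split(cand))
        return best, best_time

    best, t = explore({"tile": {16, 32, 64}, "unroll": {1, 2, 4}})
    print(best, t)  # {'tile': {32}, 'unroll': {2}} 3.0

Because the model under-approximates execution time, discarding a set whose bound exceeds the best measured time never loses the true optimum; in the paper's setting the bound comes from a performance model of the target GPU rather than a cost table.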