Jacques Pienaar
Research Areas
Authored Publications
Sort By
MLIR: Scaling Compiler Infrastructure for Domain Specific Computation
Chris Lattner
Mehdi Amini
Uday Bondhugula
River Riddle
Tatiana Shpeisman
Nicolas Vasilache
Oleksandr Zinenko
CGO 2021
Preview abstract
This work presents the MLIR compiler infrastructure, which is a novel approach to building reusable compiler infrastructure. MLIR aims to address software fragmentation, improve compilation for heterogeneous hardware, significantly reduces the cost of building domain specific compilers, and aid
in connecting existing compilers together. MLIR facilitates the design and implementation of code
generators, translators and optimizer at different levels of abstraction and also across application domains, hardware targets and execution environments. The scientific perspective on these challenges is twofold: 1) evaluating MLIR as an infrastructure that enables new research and educational approaches on programming languages, compilers, code generators, execution environments, hardware acceleration and codesign; and 2) discussing MLIR as a research artifact built for extension and evolution, raising its own design, semantics, algorithmic, system, engineering, and multi-disciplinary challenges. The paper presents the rationale for MLIR, its
original design principles, structures and semantics, and validates these by surveying some applications of it.
View details
Equality Saturation for Tensor Graph Superoptimization
Mangpo Phothilimthana
Max Willsey
Remy Wang
Sudip Roy
Yichen Yang
MLSys (2021)
Preview abstract
One of the major optimizations employed in deep learning frameworks is graph rewriting. Production frameworks rely on heuristics to decide if rewrite rules should be applied and in which order. Prior research has shown that one can discover more optimal tensor computation graphs if we search for a better sequence of substitutions instead of relying on heuristics. However, we observe that existing approaches for tensor graph superoptimization both in production and research frameworks apply substitutions in a sequential manner. Such sequential search methods are sensitive to the order in which the substitutions are applied and often only explore a small fragment of the exponential space of equivalent graphs. This paper presents a novel technique for tensor graph superoptimization that employs equality saturation to apply all possible substitutions at once. We show that our approach can find optimized graphs with up to 16% speedup over state-of-the-art, while spending on average 48x less time optimizing.
View details
Preview abstract
MLIR - Multi-Level Intermediate Representation compiler infrastructure - was announced at C4ML last year and has since become an official LLVM subproject. It continues to grow as an open community of academic and industry collaborators building common infrastructure for compilers that operate on high level abstractions. In this talk I will focus on uses of MLIR specific to Machine Learning and in particular the TensorFlow ecosystem. This talk will go beyond current TensorFlow usage of MLIR, covering infrastructure work to enable future use cases (such as, ongoing work on dynamic pattern rewrites) and highlighting some community-driven efforts (in particular the tensor compute working group).
View details
Preview abstract
The growing diversity of domain-specific accelerators spans all scales from mobile devices to data centers. It constitutes a global challenge across the high-performance computing stack and is particularly visible in the field of Machine Learning (ML). Program representations and compilers need to support a variety of devices at multiple levels of abstraction, from scalar instructions to coarse-grain parallelism and large scale distribution of computation graphs. This puts great pressure on the construction of both generic and target-specific optimizations, with domain specific language support, interfaces with legacy and future infrastructure, and special attention to future-proofness, modularity and code reuse. This motivates the construction of a new infrastructure, unifying graph representations, ML operators, optimizations at different levels and also across levels, targets, ML frameworks, training and inference, quantization, tightly interacting with runtime systems. Compilers are expected to readily support new applications, to easily port to new hardware, to bridge many levels of abstraction from dynamic, managed languages to vector accelerators and software-managed memories, while exposing high level knobs for autotuning, enable just-in-time operation, provide diagnostics and propagate functional and performance debugging information across the entire stack, and delivering performance close enough to hand-written assembly in most cases. We will share our vision, progress and plans towards the design and public release of such a compiler infrastructure.
View details
Optimization Space Pruning without Regrets
Ulysse Beaugnon
Antoine Pouille
Marc Pouzet
Proceedings of the 26th International Conference on Compiler Construction, ACM, Austin, TX, USA (2017)
Preview abstract
Many computationally-intensive algorithms benefit from the wide parallelism offered by Graphical Processing Units (GPUs). However, the search for a close-to-optimal implementation remains extremely tedious due to the specialization and complexity of GPU architectures.
We present a novel approach to automatically discover the best performing code from a given set of possible implementations. It involves a branch and bound algorithm with two distinctive features: (1) an analytic performance model of a lower bound on the execution time, and (2) the ability to estimate such bounds on a partially-specified implementation.
The unique features of this performance model allow to aggressively prune the optimization space without eliminating the best performing implementation. While the space considered in this paper focuses on GPUs, the approach is generic enough to be applied to other architectures.
We implemented our algorithm in a tool called Telamon and demonstrate its effectiveness on a huge, architecture-specific and input-sensitive optimization space. The information provided by the performance model also helps to identify ways to enrich the search space to consider better candidates, or to highlight architectural bottlenecks.
View details
GPUCC - An Open-Source GPGPU Compiler
Jingyue Wu
Mark Heffernan
Chris Leary
Bjarke Roune
Rob Springer
Xuetian Weng
Proceedings of the 2016 International Symposium on Code Generation and Optimization, ACM, New York, NY, pp. 105-116
Preview abstract
Graphics Processing Units have emerged as powerful accelerators for massively parallel, numerically intensive workloads. The two dominant software models for these devices are NVIDIA’s CUDA and the cross-platform OpenCL standard. Until now, there has not been a fully open-source compiler targeting the CUDA environment, hampering general compiler and architecture research and making deployment difficult in datacenter or supercomputer environments. In this paper, we present gpucc, an LLVM-based, fully open-source, CUDA compatible compiler for high performance computing. It performs various general and CUDA-specific optimizations to generate high performance code. The Clang-based frontend supports modern language features such as those in C++11 and C++14. Compile time is 8% faster than NVIDIA’s toolchain (nvcc) and it reduces compile time by up to 2.4x for pathological compilations (>100 secs), which tend to dominate build times in parallel build environments. Compared to nvcc, gpucc’s runtime performance is on par for several open-source benchmarks, such as Rodinia (0.8% faster), SHOC (0.5% slower), or Tensor (3.7% faster). It outperforms nvcc on internal large-scale end-to-end benchmarks by up to 51.0%, with a geometric mean of 22.9%.
View details
JSWhiz - Static Analysis for JavaScript Memory Leaks
Proceedings of the 10th annual IEEE/ACM international symposium on Code generation and optimization, IEEE (2013)
Preview abstract
JavaScript is the dominant language for implementing dynamic web pages in browsers. Even though it is standardized, many browsers implement language and browser bindings in different and incompatible ways. As a result, a plethora of web development frameworks were developed to hide cross-browser issues and to ease development of large web applications. An unwelcome side-effect of these frameworks is that they can introduce memory leaks, despite the fact that JavaScript is garbage collected. Memory bloat is a major issue for web applications, as it affects user perceived latency and may even prevent large web applications from running on devices with limited resources.
In this paper we present JSWhiz, an extension to the open-source Closure JavaScript compiler. Based on experiences analyzing memory leaks in Gmail, JSWhiz detects five identified common problem patterns. JSWhiz found a total of 89 memory leaks across Google's Gmail, Docs, Spreadsheets, Books, and Closure itself. It contributed significantly in a recent effort to reduce Gmail memory footprint, which resulted in bloat reduction of 75% at the 99th percentile, and by roughly 50% at the median.
View details