Albert Cohen
Albert is a research scientist at Google. An alumnus of École Normale Supérieure de Lyon and the University of Versailles, he has been a research scientist at Inria, a visiting scholar at the University of Illinois, an invited professor at Philips Research, and a visiting scientist at Facebook Artificial Intelligence Research. Albert Cohen works on parallelizing and optimizing compilers, machine learning compilers, parallel and synchronous programming languages, with applications to high-performance computing, artificial intelligence and reactive control.
Authored Publications
Code Generation for Data-Dependent Stencils
Mohammed Essadki
Bertrand Michel
Bruno Maugars
Oleksandr Zinenko
Nicolas Vasilache
CGO, IEEE(2023)
Preview abstract
Numerical simulation often resorts to iterative in-place stencils such as the Gauss-Seidel or Successive Overrelaxation (SOR) methods. Writing high performance implementations of such stencils requires significant effort and time; it also involves non-local transformations beyond the stencil kernel itself. While automated code generation is a mature technology for image processing stencils, convolutions and out-of-place iterative stencils (such as the Jacobi method), the optimization of in-place stencils requires manual craftsmanship. Building on recent advances in tensor compiler construction, we propose the first domain-specific code generator for iterative in-place stencils. Starting from a generic tensor compiler implemented in the MLIR framework, tensor abstractions are incrementally refined and lowered down to parallel, tiled, fused and vectorized code. We used our generator to implement a realistic, implicit solver for structured meshes, and demonstrate results competitive with an industrial computational fluid dynamics framework. We also compare with stand-alone stencil kernels for dense tensors.
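To make the distinction between out-of-place and in-place stencils concrete, here is a minimal sketch in plain Python. It is not taken from the paper or its MLIR-based generator; the function names and the 1D averaging stencil are illustrative assumptions. The Jacobi sweep reads only the previous iterate, so all of its updates are independent. The Gauss-Seidel sweep reads values written earlier in the same sweep, which creates the loop-carried dependence that makes such stencils harder to parallelize and tile.

```python
# Illustrative 1D averaging stencil with fixed boundary values.
# All names here are hypothetical; this is not the paper's code generator.

def jacobi_sweep(u):
    """Out-of-place sweep: every update reads only the previous iterate."""
    new = list(u)
    for i in range(1, len(u) - 1):
        new[i] = 0.5 * (u[i - 1] + u[i + 1])
    return new


def gauss_seidel_sweep(u):
    """In-place sweep: u[i - 1] was already overwritten during this sweep."""
    for i in range(1, len(u) - 1):
        u[i] = 0.5 * (u[i - 1] + u[i + 1])
    return u


def solve(sweep, n=9, sweeps=50):
    """Apply `sweep` repeatedly to a grid with boundary values 0 and 1."""
    u = [0.0] * n
    u[-1] = 1.0
    for _ in range(sweeps):
        u = sweep(u)
    return u


def max_error(u):
    """Distance to the exact solution, a straight line from 0 to 1."""
    n = len(u)
    return max(abs(u[i] - i / (n - 1)) for i in range(n))
```

Starting from `[1.0, 0.0, 0.0, 0.0]`, one Jacobi sweep leaves the third entry at `0.0`, while one Gauss-Seidel sweep sets it to `0.25` because it reuses the freshly updated second entry.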
RL4ReAl: Reinforcement Learning for Register Allocation
S. VenkataKeerthy
Siddharth Jain
Anilava Kundu
Rohit Aggarwal
Ramakrishna Upadrasta
CC 2023, ACM
Preview abstract
We aim to automate decades of research and experience in register allocation, leveraging machine learning. We tackle this problem by embedding a multi-agent reinforcement learning algorithm within LLVM, training it with state-of-the-art techniques. We formalize the constraints that precisely define the problem for a given instruction-set architecture, while ensuring that the generated code preserves semantic correctness. We also develop a gRPC-based framework providing a modular and efficient compiler interface for training and inference. Our approach is architecture independent: we show experimental results targeting Intel x86 and ARM AArch64. Our results match or outperform the heavily tuned, production-grade register allocators of LLVM.
Structured Operations: Modular Design of Code Generators for Tensor Compilers
Nicolas Vasilache
Oleksandr Zinenko
Mahesh Ravishankar
Thomas Raoux
Alexander Belyaev
Matthias Springer
Tobias Gysi
Diego Caballero
Stephan Herhut
Stella Laurenzo
LCPC 2022, Springer(2023)
Preview abstract
The performance of machine learning systems heavily relies on code generators tailored to tensor computations.
We propose an approach to the design and implementation of such code generators leveraging the natural structure of tensor algebra and illustrating the progressive lowering of domain-specific abstractions in the MLIR infrastructure.
Preview abstract
We investigate the programming of reactive systems combining closed-loop control with performance-intensive components such as Machine Learning (ML). Reactive control systems are often safety-critical and associated with real-time execution requirements, a domain of predilection for synchronous programming languages. Extending the high levels of assurance found in reactive control systems to computationally-intensive code remains an open issue. We tackle it by unifying concepts and algorithms from synchronous languages with abstractions commonly found in general-purpose and ML compilers. This unification across embedded and high-performance computing enables a high degree of reuse of compiler abstractions and code. We first recall commonalities between dataflow synchronous languages and the static single assignment (SSA) form of general-purpose/ML compilers. We highlight the key mechanisms of synchronous languages that SSA does not cover: denotational concepts such as synchronizing computations with an external time base, cyclic and reactive I/O, as well as the operational notions of relaxing control flow dominance and the modeling of absent values. We discover that initialization-related static analyses and code generation aspects can be fully decoupled from other aspects of synchronous semantics such as memory management and causality analysis, the latter being covered by existing dominance-based algorithms of SSA-form compilers. We show how the SSA form can be seamlessly extended to enable all SSA-based transformations and optimizations on reactive programs with synchronous concurrency. We derive a compilation flow suitable for both high-performance and reactive aspects of a control application, by embedding the Lustre dataflow synchronous language into the SSA-based MLIR/LLVM compiler infrastructure. This allows the modeling of signal processing and deep neural network inference in the (closed) loop of feedback-directed control systems. With only minor effort leveraging the MLIR infrastructure, the generated code matches or outperforms state-of-the-art synchronous language compilers on computationally-intensive ML applications.
Autotuning Convolutions is Easier Than You Think
Nicolas Tollenaere
Guillaume Iooss
Stéphane Pouget
Hugo Brunie
Christophe Guillon
P. Sadayappan
Fabrice Rastello
ACM TACO(2022)
Preview abstract
A wide range of scientific and machine learning applications depend on highly optimized implementations of tensor computations. Exploiting the full capacity of a given processor architecture remains a challenging task, due to the complexity of the microarchitectural features that come into play when seeking near-peak performance. Among the state-of-the-art techniques for loop transformations for performance optimization, AutoScheduler tends to outperform other systems. It often yields higher performance as compared to vendor libraries, but takes a large number of runs to converge, while also involving a complex training environment.
In this paper, we define a structured configuration space that enables much faster convergence to high-performance code versions, using only random sampling of candidates. We focus on two-dimensional convolutions on CPUs. Compared to state-of-the-art libraries, our structured search space enables higher performance for typical tensor shapes encountered in convolution stages in deep learning pipelines. Compared to autotuning code generators like AutoScheduler, it prunes the search space while increasing the density of efficient implementations. We analyze the impact on convergence speed and performance distribution, on two Intel x86 processors and one ARM AArch64 processor. We match or outperform the performance of the state-of-the-art oneDNN library and TVM’s AutoScheduler, while reducing the autotuning effort by at least an order of magnitude.
Preview abstract
This paper considers the correctness of domain-specific compilers for tensor programming languages through the study of Halide, a popular representative. It describes a translation validation algorithm for affine Halide specifications, independently of the scheduling language. The algorithm relies on “prophetic” annotations added by the compiler to the generated array assignments. The annotations provide a refinement mapping [Abadi and Lamport 1988] from assignments in the generated code to the tensor definitions from the specification. Our implementation leverages an affine solver and a general SMT solver, and scales to complete Halide benchmarks.
Preview abstract
We propose a novel solution for the Register Allocation problem, leveraging multi-agent hierarchical Reinforcement Learning. We formalize the constraints that precisely define the problem for a given instruction-set architecture, while ensuring that the generated code preserves semantic correctness. We also develop a gRPC-based framework providing a modular and efficient compiler interface for training and inference. Experimental results match or outperform the LLVM register allocators, targeting Intel x86 and ARM AArch64.
Progressive Raising in Multi-level IR
Lorenzo Chelini
Andi Drebes
Alex Zinenko
Nicolas Vasilache
Tobias Grosser
Henk Corporaal
International Conference on Code Generation and Optimization (CGO), ACM, February 27th - March 3rd, 2021, Virtual Conference(2021)
Preview abstract
Multi-level intermediate representation (IR) rewriting promises to lower the cost of designing domain-specific compilers by providing a non-opinionated IR, thus making it possible to model the right abstraction level for the problem at hand. High-level abstractions are then lowered to low-level IR using progressive lowering (i.e., from higher-level representations down to the lowest in small steps across the abstraction levels). But progressive lowering works in a single direction: high-level operations can be transformed into operations at a lower level of abstraction, but low-level operations are never raised to high-level ones. Thus, the entry point into the lowering pipeline defines the highest level of abstraction for all subsequent transformations, potentially limiting the set of applicable optimizations. This is especially true for general-purpose languages that are not semantically rich enough to enter the higher parts of the lowering pipeline, precluding aggressive domain-specific optimizations. To enable effective domain-specific compilation via progressive lowering in a multi-level IR compiler, we propose Multi-Level Tactics.
Multi-Level Tactics allows us to describe computational patterns and raise them to high-level abstractions declaratively. It enables a complementary path to progressive lowering, which we call progressive raising, hence extending the set of optimizations that can be performed on general-purpose languages in a multi-level IR compiler.
Preview abstract
Secure applications implement protections against side-channel and physical attacks. Such protections embed input/output side-effects preventing optimizing compilers from altering the protection. These side-effects are error-prone and compiler-dependent, and the current practice involves analyzing the generated machine code to make sure security or privacy properties are still enforced. Vu et al. recently demonstrated how to automate the insertion of volatile side-effects in a compiler [30], but these may be too expensive in fine-grained protections such as control-flow integrity. We introduce observations of the program state that are intrinsic to the correct execution of security protections, along with means to specify and preserve observations across the compilation flow. Such observations complement the traditional input/output-preservation contract of compilers. We show how to guarantee their preservation without modifying compilation passes and with as little performance impact as possible. We validate our approach on a range of benchmarks, expressing the secure compilation of these applications in terms of observations to be made at specific program points.
MLIR: Scaling Compiler Infrastructure for Domain Specific Computation
Chris Lattner
Mehdi Amini
Uday Bondhugula
River Riddle
Tatiana Shpeisman
Nicolas Vasilache
Oleksandr Zinenko
CGO 2021
Preview abstract
This work presents the MLIR compiler infrastructure, which is a novel approach to building reusable compiler infrastructure. MLIR aims to address software fragmentation, improve compilation for heterogeneous hardware, significantly reduce the cost of building domain-specific compilers, and aid in connecting existing compilers together. MLIR facilitates the design and implementation of code generators, translators and optimizers at different levels of abstraction and also across application domains, hardware targets and execution environments. The scientific perspective on these challenges is twofold: 1) evaluating MLIR as an infrastructure that enables new research and educational approaches on programming languages, compilers, code generators, execution environments, hardware acceleration and codesign; and 2) discussing MLIR as a research artifact built for extension and evolution, raising its own design, semantics, algorithmic, system, engineering, and multi-disciplinary challenges. The paper presents the rationale for MLIR, its original design principles, structures and semantics, and validates these by surveying some applications of it.
Reconciling Optimization With Secure Compilation
Son Tuan Vu
Arnaud De Grandmaison
Christophe Guillon
Karine Heydemann
Proceedings of the ACM (PACMPL)(2021)
Preview abstract
Software protections against side-channel and physical attacks are essential to the development of secure applications. Such protections are meaningful at machine code or micro-architectural level, but they typically do not carry observable semantics at source level. This renders them susceptible to miscompilation, and security engineers embed input/output side-effects to prevent optimizing compilers from altering them. Yet these side-effects are error-prone and compiler-dependent. The current practice involves analyzing the generated machine code to make sure security or privacy properties are still enforced. They may also be too expensive in fine-grained protections such as control-flow integrity. We introduce observations of the program state that are intrinsic to the correct execution of security protections, along with means to specify and preserve observations across the compilation flow. Such observations complement the input/output semantics-preservation contract of compilers. We introduce an opacification mechanism to preserve and enforce a partial ordering of observations. This approach is compatible with a production compiler and does not incur any modification to its optimization passes. We validate the effectiveness and performance of our approach on a range of benchmarks, expressing the secure compilation of these applications in terms of observations to be made at specific program points.
Preview abstract
Floating Point (FP) units in processors are generally limited to supporting a subset of formats defined by the IEEE 754 standard. As a result, high-efficiency languages and optimizing compilers for high-performance computing only support IEEE standard types, and applications needing higher precision involve cumbersome memory management and calls to external libraries. Furthermore, numerical computations often involve iterative solvers where the residual error is a function of the input data, or where dynamically adaptive precision can accelerate convergence; numerical analysts have to resort to explicit conversions and multi-versioning, resulting in code bloat and making the intent of the program even less clear. We present an extension of the C type system that can represent generic FP operations and formats, supporting both static and dynamically variable precision. We design and implement a compilation flow bridging the abstraction gap between this type system and low-level FP instructions or software libraries. This flow enables classical optimizations as well as multi-precision-specific ones associated with memory management and target-specific implementation. The effectiveness of our solution is demonstrated through an LLVM-based implementation, leveraging aggressive optimizations in LLVM including the Polly loop nest optimizer, and leveraging two alternative backend code generators: one that targets the ISA of a variable precision FP arithmetic coprocessor, and one targeting the MPFR multi-precision floating point library. Both targets support the statically and dynamically adaptable precision and size of our language extension. On the PolyBench suite, our optimizing compilation flow targeting MPFR outperforms the Boost programming interface for the MPFR library by a factor of 1.84x.
Efficient Convolution Optimisation by Composing Microkernels
Nicolas Tollenaere
Auguste Olivry
Guillaume Iooss
Hugo Brunie
P. Sadayappan
Fabrice Rastello
INRIA(2021)
Preview abstract
Optimizing the implementation of tensor computations is essential to exploiting the full capacity of a given processor architecture on a wide range of scientific and machine learning applications. However, the complexity of the microarchitectural features that come into play when approaching the peak performance of the processor makes it very hard. Focusing on 2D convolutions, we observe a common weakness in all tensor compilers and libraries related to efficiently covering the wide variety of problem sizes occurring in real-world applications.
We propose TTile, a domain-specific code generator and autotuner for implementing efficient convolutions. Similarly to BLIS, TTile nests multiple levels of tiling above a vectorized tensor contraction microkernel. But unlike traditional approaches, we explore a variety of microkernels and compose them to fit exactly the tensor shapes of a convolution. While this helps achieve consistently high performance on virtually all possible tensor sizes, our method also introduces more degrees of freedom in the optimization space, which makes it challenging for autotuning strategies. To address this, we leverage an analytical model of data movement, and combine it with feedback-directed autotuning. We evaluate TTile as a stand-alone compiler and also as a complement to TVM on recent Intel x86 microarchitectures.
VP Float: First Class Treatment for Variable Precision Floating Point Arithmetic - Poster
PACT 2020, ACM
Preview abstract
Optimizing compilers for high performance computing only support IEEE 754 floating-point (FP) types and applications needing higher precision involve cumbersome memory management and calls to external libraries. We introduce an extension of the C type system to represent variable-precision FP arithmetic, supporting both static and dynamically variable precision. We design and implement a compilation flow bridging the abstraction gap between this type system and hardware FP instructions or software libraries. We demonstrate the effectiveness of our solution by enabling the full range of LLVM optimizations and leveraging two backend code generators: one for the ISA of a variable precision FP arithmetic coprocessor, and one for the MPFR multi-precision FP library. Both targets support the static and dynamically adaptable precision of our type system. On the PolyBench suite, our optimizing compilation flow targeting MPFR is shown to outperform the Boost programming interface for the MPFR library.
TC-CIM: Empowering Tensor Comprehensions for Computing-In-Memory
Andi Drebes
Lorenzo Chelini
Oleksandr Zinenko
Henk Corporaal
Tobias Grosser
Kanishkan Vadivel
Nicolas Vasilache
IMPACT 2020 workshop (associated with HIPEAC 2020)
Preview abstract
Memristor-based, non-von-Neumann architectures performing tensor operations directly in memory are a promising approach to address the ever-increasing demand for energy-efficient, high-throughput hardware accelerators for Machine Learning (ML) inference. A major challenge for the programmability and exploitation of such Computing-In-Memory (CIM) architectures consists in the efficient mapping of tensor operations from high-level ML frameworks to fixed-function hardware blocks implementing in-memory computations. We demonstrate the programmability of memristor-based accelerators with TC-CIM, a fully-automatic, end-to-end compilation flow from Tensor Comprehensions, a mathematical notation for tensor operations, to fixed-function memristor-based hardware blocks. Operations suitable for acceleration are identified using Tactics, a declarative framework to describe computational patterns in a polyhedral representation. We evaluate our compilation flow on a system-level simulator based on Gem5, incorporating crossbar arrays of memristive devices. Our results show that TC-CIM reliably recognizes tensor operations commonly used in ML workloads across multiple benchmarks in order to offload these operations to the accelerator.
Preview abstract
Loop tiling to exploit data locality and parallelism plays an essential role in a variety of general-purpose and domain-specific compilers. Affine transformations in polyhedral frameworks implement classical forms of rectangular and parallelogram tiling, but these lead to pipelined start with rather inefficient wavefront parallelism. Multiple extensions to polyhedral compilers evaluated sophisticated shapes such as trapezoid or diamond tiles, enabling concurrent start along the axes of the iteration space; yet these resort to custom schedulers and code generators insufficiently integrated within the general framework. One of these modified shapes referred to as overlapped tiling also lacks a unifying framework to reason about its composition with affine transformations; this prevents its application in general-purpose loop-nest optimizers and the fair comparison with other techniques. We revisit overlapped tiling, recasting it as an affine transformation on schedule trees composable with any affine scheduling algorithm. We demonstrate how to derive tighter tile shapes with less redundant computations. Our method models the traditional "scalene trapezoid" shapes as well as novel "right-rectangle" variants. It goes beyond the state of the art by avoiding the restriction to a domain-specific language or introducing post-pass rescheduling and custom code generation. We conduct experiments on the PolyMage benchmarks and iterated stencils, validating the effectiveness and applicability of our technique on both general-purpose multicores and GPU accelerators.
Preview abstract
The development of lightweight polyhedral compilation algorithms opens polyhedral loop transformation, parallelization and code generation to a larger class of programs. The Pluto scheduling algorithm plays a major role in state-of-the-art polyhedral compilers, aiming for the simultaneous enhancement of locality and the exploitation of coarse-grain parallelism through loop tiling. Reducing the run time of affine scheduling algorithms like Pluto has a significant impact on the overall compilation time of polyhedral compilers. Several approaches have been proposed to reduce the run time of affine scheduling while preserving most of the optimization opportunities. Yet these works have taken separate rather than consolidated attempts at the problem. In an attempt to better characterize the potential and limitations of such approaches, we introduce and evaluate a family of techniques called offline statement clustering. Program statements are clustered into macro-statements and the dependence graph is projected onto these macro-statements before affine scheduling. Offline statement clustering integrates transparently into the flow of a state-of-the-art polyhedral compiler and can reduce the scheduling time by a factor of 6 (median) without inducing a significant loss in optimization opportunities. We also study the theoretical and experimental properties of statement clustering, shedding new light on the leading syntax-driven heuristic. Our work-in-progress study confirms the surprising finding that the simpler, apparently more fragile and syntax-dependent methods tend to perform well on a wide range of benchmarks.
Secure Delivery of Program Properties Through Optimizing Compilation
Son Tuan Vu
Karine Heydemann
Arnaud de Grandmaison
ACM International Conference on Compiler Construction (CC)(2020)
Preview abstract
Annotations and assertions capturing static program properties are ubiquitous, from robust software engineering to safety-critical or secure code. These may be functional or non-functional properties of control and data flow, memory usage, I/O and real time. We propose an approach to encode, translate, and preserve the semantics of both functional and non-functional properties along the optimizing compilation of C to machine code. The approach involves (1) capturing and translating source-level properties through lowering passes and intermediate representations, such that data and control flow optimizations will preserve their consistency with the transformed program, and (2) carrying properties and their translation as debug information down to machine code. Our experiments using LLVM validate the soundness, expressiveness and efficiency of the approach, considering a reference suite of functional properties as well as established security properties and applications hardened against side-channel attacks.
Preview abstract
The development of high-performance numerical libraries has long been an art reserved for experts. Domain-specific code generators, auto-tuners, and methodologies promise to raise the level of abstraction and improve productivity, or even fully automate the process of porting and tuning numerical kernels. Yet, the state of the art seems to be trapped in a productivity vs. performance trade-off. This is most unsatisfactory for hardware accelerators, where reaching near-peak performance provides much of the rationale for their deployment, and where performance is highly sensitive to decisions crossing multiple layers of abstraction, parallelism and data movement orchestration.
Focusing on Nvidia GPUs, we investigate the direct synthesis of loop nest and array-based optimizations. Rather than composing program transformations on semantics-preserving intermediate representations, optimization synthesis leverages principles of program synthesis to explore a search space tailored to a given algorithmic kernel specification and a target GPU architecture. This search space does not make any heuristic assumption on the profitability of individual code generation choices or on the semantic and performance interplay of these, nor does it involve any rewriting rule. Its exploration is driven by an original performance model providing a lower bound on the execution time of a set of candidate implementations. Unlike models for program transformation systems, our approach is unaffected by pending transformations clogging the performance estimation horizon. The exploration also uses feedback from running the generated code, albeit many orders of magnitude less often than querying the performance model. For semantics preservation, the search is filtered by the control and data flow constraints derived from the algorithmic specification. Candidate implementations are formally modeled as bounded sets of code generation choices whose instantiations commute, facilitating and accelerating the exploration. We evaluate our approach on matrix computations occurring in scientific computing and convolutional neural networks.
Preview abstract
This work proposes to extend RISC-V with Variable Precision (VP) Floating-Point (FP) capabilities to accelerate scientific computing applications. It adopts the UNUM type I FP format in main memory to overcome the limitation of the IEEE 754 standard. Our work comprises: (1) a VP FP RISC-V coprocessor; (2) a RISC-V ISA extension for the unit; and (3) a programming model to support VP floats in C/C++. Results have shown that our system can be more than 100x faster than the MPFR library when executing basic arithmetic operations.
Byte-Aware Floating-point Operations through a UNUM Computing Unit
Andrea Bocco
Tiago T. Jost
Florent de Dinechin
Yves Durand
Christian Fabre
27th IFIP/IEEE International Conference on Very Large Scale Integration (VLSI-SOC)(2019)
Preview abstract
Most floating-point (FP) hardware supports the IEEE 754 format, which defines fixed-size data types from 16 to 128 bits. However, a range of applications benefit from different formats, implementing different tradeoffs. This paper proposes a Variable Precision (VP) computing unit offering a finer granularity of high precision FP operations. The chosen memory format is derived from UNUM type I, where the size of a number is stored within the representation itself. The unit implements a fully pipelined architecture, and it supports up to 512 bits of precision for both interval and scalar computing. The user can configure the storage format up to 8-bit granularity, and the internal computing precision at 64-bit granularity. The system is integrated as a RISC-V coprocessor. Dedicated compiler support exposes the unit through a high level programming abstraction, covering all the operating features of UNUM type I. FPGA-based measurements show that the latency and the computation accuracy of this system scale linearly with the memory format length set by the user. Compared with the MPFR software library, the proposed unit achieves speedups between 3.5x and 18x, with comparable accuracy.
Optimization Space Pruning without Regrets
Ulysse Beaugnon
Antoine Pouille
Marc Pouzet
Proceedings of the 26th International Conference on Compiler Construction, ACM, Austin, TX, USA(2017)
Preview abstract
Many computationally-intensive algorithms benefit from the wide parallelism offered by Graphical Processing Units (GPUs). However, the search for a close-to-optimal implementation remains extremely tedious due to the specialization and complexity of GPU architectures.
We present a novel approach to automatically discover the best performing code from a given set of possible implementations. It involves a branch and bound algorithm with two distinctive features: (1) an analytic performance model of a lower bound on the execution time, and (2) the ability to estimate such bounds on a partially-specified implementation.
The unique features of this performance model allow us to aggressively prune the optimization space without eliminating the best performing implementation. While the space considered in this paper focuses on GPUs, the approach is generic enough to be applied to other architectures.
We implemented our algorithm in a tool called Telamon and demonstrate its effectiveness on a huge, architecture-specific and input-sensitive optimization space. The information provided by the performance model also helps to identify ways to enrich the search space to consider better candidates, or to highlight architectural bottlenecks.
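The pruning idea described above can be sketched as a generic branch-and-bound search in Python. This is not Telamon's code or API: the decision names, the cost tables, and the register-pressure penalty are all hypothetical, chosen only so that the lower bound is valid on partially specified implementations. A partial candidate is discarded when its lower bound is no better than the best complete implementation found so far, so an optimal implementation is never lost.

```python
# Hypothetical toy search space: (tile size, unroll factor, threads per block).
OPTIONS = [(1, 2, 4, 8), (1, 2, 4), (32, 64, 128)]

# Hypothetical per-decision time contributions, one table per decision.
TABLES = [
    {1: 8.0, 2: 5.0, 4: 3.0, 8: 4.0},
    {1: 3.0, 2: 2.0, 4: 2.5},
    {32: 4.0, 64: 2.0, 128: 3.0},
]


def cost(full):
    """Modeled time of a fully specified implementation (toy model)."""
    base = sum(TABLES[i][c] for i, c in enumerate(full))
    # Assumed interaction: large tile * unroll products spill registers.
    penalty = 6.0 if full[0] * full[1] > 8 else 0.0
    return base + penalty


def lower_bound(partial):
    """Bound valid for every completion of `partial`: fixed choices cost
    what they cost, open choices cost at least their cheapest option,
    and the penalty is never negative."""
    fixed = sum(TABLES[i][c] for i, c in enumerate(partial))
    remaining = sum(min(t.values()) for t in TABLES[len(partial):])
    return fixed + remaining


def branch_and_bound(options):
    """Depth-first search that prunes candidates using lower_bound."""
    best, best_cost, evaluated = None, float("inf"), 0
    stack = [()]
    while stack:
        partial = stack.pop()
        if lower_bound(partial) >= best_cost:
            continue  # no completion can beat the current best
        if len(partial) == len(options):
            evaluated += 1
            c = cost(partial)
            if c < best_cost:
                best, best_cost = partial, c
            continue
        for choice in options[len(partial)]:
            stack.append(partial + (choice,))
    return best, best_cost, evaluated
```

Calling `branch_and_bound(OPTIONS)` returns the best candidate, its modeled cost, and the number of complete implementations that had to be evaluated, which is smaller than the 36 points of the full space because whole subtrees are cut at the first decision.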
The Next 700 Accelerated Layers: From Mathematical Expressions of Network Computation Graphs to Accelerated GPU Kernels, Automatically
Nicolas Vasilache
Oleksandr Zinenko
Theodoros Theodoridis
Priya Goyal
Zachary DeVito
William S. Moses
Sven Verdoolaege
Andrew Adams
ACM Transactions on Architecture and Code Optimization (TACO)(2019)
Preview abstract
Deep learning frameworks automate the deployment, distribution, synchronization, memory allocation, and hardware acceleration of models represented as graphs of computational operators. These operators wrap high-performance libraries such as cuDNN or NNPACK. When the computation does not match any predefined library call, custom operators must be implemented, often at high engineering cost and performance penalty, limiting the pace of innovation. To address this productivity gap, we propose and evaluate: (1) a domain-specific language with a tensor notation close to the mathematics of deep learning; (2) a Just-In-Time optimizing compiler based on the polyhedral framework; (3) carefully coordinated linear optimization and evolutionary algorithms to synthesize high-performance CUDA kernels; (4) the transparent integration of our flow into PyTorch and Caffe2, providing the fully automatic synthesis of high-performance GPU kernels from simple tensor algebra. The performance is comparable to, and often exceeds, that of highly tuned libraries.
Tensor Comprehensions: Framework-Agnostic High-Performance Machine Learning Abstractions
Nicolas Vasilache
Alex Zinenko
Theodoros Theodoridis
Priya Goyal
Zachary DeVito
William S. Moses
Sven Verdoolaege
Andrew Adams
Facebook Artificial Intelligence Research(2018)
Preview abstract
Deep learning models with convolutional and recurrent networks are now ubiquitous and analyze massive amounts of audio, image, video, text and graph data, with applications in automatic translation, speech-to-text, scene understanding, ranking user preferences, ad placement, etc. Competing frameworks for building these networks such as TensorFlow, Chainer, CNTK, Torch/PyTorch, Caffe1/2, MXNet and Theano, explore different tradeoffs between usability and expressiveness, research or production orientation and supported hardware. They operate on a DAG of computational operators, wrapping high-performance libraries such as cuDNN (for NVIDIA GPUs) or NNPACK (for various CPUs), and automate memory allocation, synchronization, and distribution. Custom operators are needed where the computation does not fit existing high-performance library calls, usually at a high engineering cost. This is frequently required when new operators are invented by researchers: such operators suffer a severe performance penalty, which limits the pace of innovation. Furthermore, even if there is an existing runtime call these frameworks can use, it often doesn't offer optimal performance for a user's particular network architecture and dataset, missing optimizations between operators as well as optimizations that can be done knowing the size and shape of data. Our contributions include (1) a language close to the mathematics of deep learning called Tensor Comprehensions, (2) a polyhedral Just-In-Time compiler to convert a mathematical description of a deep learning DAG into a CUDA kernel with delegated memory management and synchronization, also providing optimizations such as operator fusion and specialization for specific sizes, (3) a compilation cache populated by an autotuner.