Ondrej Sykora
Ondrej is a software engineer on the Compiler Research and Operations Research teams in Paris. He's interested in applications of combinatorial optimization, linear and integer programming, CPU modeling, code generation, superoptimization and advanced code optimization algorithms.
Ondrej obtained his master's degree in Computer Science and Artificial Intelligence from Charles University in Prague in 2007, and worked on his (yet unfinished) PhD there before joining Google Paris full-time in 2012.
Authored Publications
GRANITE: A Graph Neural Network Model for Basic Block Throughput Estimation
Mangpo Phothilimthana
Thirimadura C. Yasendra Mendis
2022 IEEE International Symposium on Workload Characterization (2022) (to appear)
Abstract
Analytical hardware performance models yield swift estimation of desired hardware performance metrics. However, developing these analytical models for modern processors with sophisticated microarchitectures is an extremely laborious task and requires a firm understanding of the target microarchitecture's internal structure. In this paper, we introduce GRANITE, a new machine learning model that estimates the throughput of basic blocks across different microarchitectures. GRANITE uses a graph representation of basic blocks that captures both structural and data dependencies between instructions. This representation is processed using a graph neural network that takes advantage of the relational information captured in the graph and learns a rich neural representation of the basic block that allows more precise throughput estimation. Our results establish a new state-of-the-art for basic block performance estimation with an average test error of 6.9% across a wide range of basic blocks and microarchitectures for the x86-64 target. Compared to recent work, this reduces the error by 1.7% while improving training and inference throughput by approximately 3.0x. In addition, we propose the use of multi-task learning with independent multi-layer feed-forward decoder networks. Our results show that this technique further improves the precision of all learned models while significantly reducing per-microarchitecture training costs. We perform an extensive set of ablation studies and comparisons with prior work, concluding with a set of methods to achieve high accuracy for basic block performance estimation.
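To make the graph representation concrete, here is a minimal hypothetical sketch (not GRANITE's actual feature extraction; the Instruction layout, edge kinds, and function names are assumptions chosen for illustration). It builds a per-basic-block graph with one node per instruction, structural edges between consecutive instructions, and data-dependency edges from a register's most recent writer to its later readers.

#include <cstdio>
#include <map>
#include <string>
#include <vector>

// Hypothetical instruction record: opcode plus the registers it reads/writes.
struct Instruction {
  std::string opcode;
  std::vector<std::string> uses;  // registers read
  std::vector<std::string> defs;  // registers written
};

struct Edge { int src; int dst; const char* kind; };

// Builds structural edges between consecutive instructions and data-dependency
// edges from the most recent writer of a register to each later reader.
std::vector<Edge> BuildBasicBlockGraph(const std::vector<Instruction>& block) {
  std::vector<Edge> edges;
  std::map<std::string, int> last_writer;  // register -> defining instruction
  for (int i = 0; i < static_cast<int>(block.size()); ++i) {
    if (i > 0) edges.push_back({i - 1, i, "structural"});
    for (const std::string& reg : block[i].uses) {
      auto it = last_writer.find(reg);
      if (it != last_writer.end()) edges.push_back({it->second, i, "data"});
    }
    for (const std::string& reg : block[i].defs) last_writer[reg] = i;
  }
  return edges;
}

int main() {
  // mov rax, [rbx]; add rax, rcx; mov [rdx], rax
  std::vector<Instruction> block = {
      {"mov", {"rbx"}, {"rax"}},
      {"add", {"rax", "rcx"}, {"rax"}},
      {"mov", {"rax", "rdx"}, {}},
  };
  for (const Edge& e : BuildBasicBlockGraph(block))
    std::printf("%d -> %d (%s)\n", e.src, e.dst, e.kind);
  return 0;
}

In the full model, a graph neural network then propagates information along these edges to produce the block-level representation used for throughput prediction.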
automemcpy: A framework for automatic generation of fundamental memory operations
Sam Xi
Proceedings of the 2021 ACM SIGPLAN International Symposium on Memory Management (ISMM '21), June 22, 2021, Virtual, Canada (to appear)
Abstract
Memory manipulation primitives (memcpy, memset, memcmp) are used by virtually every application, from high performance computing to user interfaces. They often consume a significant portion of CPU cycles. Because they are so ubiquitous and critical, they are provided by language runtimes and in particular by libc, the C standard library. These implementations are heavily optimized, typically written in hand-tuned assembly for each target architecture.
In this article, we propose a principled alternative to hand-tuning these functions: (1) we profile the calls to these functions in their production environment and use this data to drive the important high-level algorithmic decisions, (2) we use a high-level language for the implementation, delegate the job of tuning the generated code to the compiler, and (3) we use constraint programming and automatic benchmarks to select the optimal high-level structure of the functions.
We compile our memory function implementations using the same compiler toolchain that we use for application code, which lets us leverage the compiler further through whole-program optimization. We have evaluated our approach by applying it to the fleet of one of the largest computing enterprises in the world. This work increased the performance of the fleet by 1%.
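As a rough, hypothetical illustration of what such a high-level implementation can look like, the sketch below dispatches on the copy size and leaves instruction selection to the compiler. The size regions, per-region strategies, and the MyMemcpy name are assumptions made for illustration, not the structure selected by automemcpy's constraint model and benchmarks.

#include <cstddef>
#include <cstdio>
#include <cstring>

// Copies `size` bytes with two possibly-overlapping fixed-size copies.
// Fixed-size memcpy calls are lowered by the compiler to plain loads/stores.
template <size_t kBlock>
static inline void CopyOverlapping(char* dst, const char* src, size_t size) {
  std::memcpy(dst, src, kBlock);                                  // head
  std::memcpy(dst + size - kBlock, src + size - kBlock, kBlock);  // tail
}

// A high-level, size-dispatched copy routine: small sizes are handled with
// overlapping fixed-size copies, larger sizes fall back to a simple loop that
// the compiler is free to vectorize.
void* MyMemcpy(void* dst_void, const void* src_void, size_t size) {
  char* dst = static_cast<char*>(dst_void);
  const char* src = static_cast<const char*>(src_void);
  if (size == 0) return dst_void;
  if (size == 1) { *dst = *src; return dst_void; }
  if (size <= 4) { CopyOverlapping<2>(dst, src, size); return dst_void; }
  if (size <= 8) { CopyOverlapping<4>(dst, src, size); return dst_void; }
  if (size <= 16) { CopyOverlapping<8>(dst, src, size); return dst_void; }
  for (size_t i = 0; i < size; ++i) dst[i] = src[i];  // large-size fallback
  return dst_void;
}

int main() {
  char buffer[32] = {};
  const char message[] = "automemcpy sketch";
  MyMemcpy(buffer, message, sizeof(message));
  std::puts(buffer);
  return 0;
}

The point of the high-level form is that the size thresholds and per-region strategies become explicit, searchable parameters, while the low-level tuning of each branch is left to the compiler.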
Abstract
Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for providing code security with variable-length instruction encoding. An aspect may include encoding instructions according to a set of rules. Such encoding may delimit the beginning of instructions using a specific bit, delimit the beginning of a basic block using another specific bit, enable handling of variable-length constants that reduce code size, and make code validation decidable and deterministic. Accordingly, code may be scanned once to verify that the code can be decoded into actual instructions, non-computed branches can be validated, and computed branches may trap if they branch to places other than the beginning of a basic block.
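The single-pass validation idea can be sketched as follows. The concrete marker bits (kInsnStart, kBlockStart) and helper functions are assumptions chosen purely for illustration, not the encoding defined in the patent.

#include <cstdint>
#include <cstdio>
#include <vector>

// Hypothetical marker bits: one marks the first byte of an instruction, the
// other marks the first byte of a basic block.
constexpr uint8_t kInsnStart = 0x80;
constexpr uint8_t kBlockStart = 0x40;

// One linear scan over the code: it must begin at an instruction boundary and
// every basic-block start must also be an instruction start.
bool ValidateMarkers(const std::vector<uint8_t>& code) {
  if (code.empty() || !(code[0] & kInsnStart)) return false;
  for (uint8_t byte : code)
    if ((byte & kBlockStart) && !(byte & kInsnStart)) return false;
  return true;
}

// A direct (non-computed) branch is valid if it targets an instruction start.
bool DirectBranchOk(const std::vector<uint8_t>& code, size_t target) {
  return target < code.size() && (code[target] & kInsnStart);
}

// A computed branch may only land at the beginning of a basic block;
// otherwise it traps at run time.
bool ComputedBranchOk(const std::vector<uint8_t>& code, size_t target) {
  return target < code.size() && (code[target] & kBlockStart);
}

int main() {
  // Three bytes: block+instruction start, payload byte, instruction start.
  std::vector<uint8_t> code = {0xC0, 0x01, 0x80};
  std::printf("markers ok: %d\n", ValidateMarkers(code));
  std::printf("direct branch to 2 ok: %d\n", DirectBranchOk(code, 2));
  std::printf("computed branch to 2 ok: %d\n", ComputedBranchOk(code, 2));
  return 0;
}

This separation is what makes validation deterministic in one pass: static checks cover decodability and direct branches, while the basic-block marker carries the run-time check for computed branches.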