Xinliang David Li

David is a Principal Engineer at Google. He manages the Compiler Optimization Team and leads the effort to generate the fastest possible code for Google's data center software. His research interests include highly scalable cross-module optimizations, scalable profile-guided optimizations (instrumentation, PMU sample, and trace-based), memory hierarchy optimizations (data and instruction), micro-architecture optimizations, post-link optimizations, and ML-based optimization frameworks.
Authored Publications
    PreFix: Optimizing the Performance of Heap-Intensive Applications
    Chaitanya Mamatha Ananda
    Rajiv Gupta
    Han Shen
    CGO 2025: International Symposium on Code Generation and Optimization, Las Vegas, NV, USA (to appear)
    Analyses of heap-intensive applications show that a small fraction of heap objects account for the majority of heap accesses and data cache misses. Prior works like HDS and HALO have shown that allocating hot objects in separate memory regions can improve spatial locality, leading to better application performance. However, these techniques are constrained in two primary ways, limiting their gains. First, these techniques have imperfect separation, polluting the hot memory region with several cold objects. Second, reordering of objects across allocations is not possible, as the original object allocation order is preserved. This paper presents a novel technique that achieves near-perfect separation of hot objects via a new context mechanism that efficiently identifies hot objects with high precision. This technique, named PreFix, is based upon Preallocating memory for a Fixed small number of hot objects. The program, guided by profiles, is instrumented to compute context information, derived from dynamic object identifiers, that precisely identifies hot object allocations, which are then placed at predetermined locations in the preallocated memory. The preallocated memory region for hot objects provides the flexibility to reorder objects across allocations and allows colocation of objects that are part of a hot data stream (HDS), improving spatial locality. The runtime overhead of identifying hot objects is not significant, as this optimization is focused on only a small number of static hot allocation sites and dynamic hot objects. While there is an increase in the program's memory footprint, it is manageable and can be controlled by limiting the size of the preallocated memory. In addition, PreFix incorporates an object recycling optimization that reuses the same preallocated space to store different objects whose lifetimes are not expected to overlap. Our experiments with 13 heap-intensive applications yield reductions in execution times ranging from 2.77% to 74%. On average, PreFix reduces execution time by 21.7%, compared to 7.3% for HDS and 14% for HALO. This is due to PreFix's precision in hot object identification, hot object colocation, and low runtime overhead.
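    The core mechanism described above can be pictured with a small, self-contained C++ sketch. This is not the paper's implementation: the names (HotArena, Reserve, Allocate), the one-slot-per-context policy, and the malloc fallback are illustrative assumptions, and object recycling is omitted.

        #include <cstdint>
        #include <cstdlib>
        #include <unordered_map>

        // Hypothetical fixed-capacity arena for a small number of hot objects.
        // Which dynamic contexts are "hot", and which slot each one gets,
        // would come from profile data in a real system.
        class HotArena {
         public:
          HotArena(std::size_t slot_size, std::size_t num_slots)
              : slot_size_(slot_size), num_slots_(num_slots),
                base_(static_cast<char*>(std::malloc(slot_size * num_slots))) {}
          ~HotArena() { std::free(base_); }

          // Profile-guided setup: reserve slot `slot` for allocations made
          // under dynamic allocation context `context_id`.
          void Reserve(std::uint64_t context_id, std::size_t slot) {
            if (slot < num_slots_) slot_of_context_[context_id] = slot;
          }

          // If the context identifies a hot object that fits, place it at its
          // predetermined location; otherwise fall back to the regular heap.
          void* Allocate(std::uint64_t context_id, std::size_t size) {
            auto it = slot_of_context_.find(context_id);
            if (it != slot_of_context_.end() && size <= slot_size_)
              return base_ + it->second * slot_size_;   // hot: preallocated region
            return std::malloc(size);                   // cold: normal allocation
          }

         private:
          std::size_t slot_size_;
          std::size_t num_slots_;
          char* base_;
          std::unordered_map<std::uint64_t, std::size_t> slot_of_context_;
        };

    A matching deallocation path, which would let objects with non-overlapping lifetimes reuse the same slot, is where the recycling optimization mentioned in the abstract would fit.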
    While profile guided optimizations (PGO) and link time optimizations (LTO) have been widely adopted, post link optimizations (PLO) have languished until recently, when researchers demonstrated that late injection of profiles can yield significant performance improvements. However, the disassembly-driven, monolithic design of post link optimizers faces scaling challenges with large binaries and is at odds with distributed build systems. To reconcile and enable post link optimizations within a distributed build environment, we propose Propeller, a relinking optimizer for warehouse-scale workloads. To enable flexible code layout optimizations, we introduce basic block sections, a novel linker abstraction. Propeller uses basic block sections to enable a new approach to PLO without disassembly. Propeller achieves scalability by relinking the binary using precise profiles instead of rewriting the binary. The overhead of relinking is lowered by caching and leveraging distributed compiler actions during code generation. Propeller has been deployed to production at Google, with tens of millions of cores executing Propeller-optimized code at any time. An evaluation of internal warehouse-scale applications shows Propeller improves performance by 1.1% to 8% beyond PGO and ThinLTO. Compiler tools such as Clang improve by 7%, while MySQL improves by 1%. Compared to the state-of-the-art binary optimizer, Propeller achieves comparable performance while lowering memory overheads by 30%-70% on large benchmarks.
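    To illustrate what basic block sections make possible, here is a small C++ sketch of a profile-driven block layout decision. It is a greedy simplification over assumed inputs (block ids and branch counts), not Propeller's actual layout algorithm; the point is that once every basic block is emitted in its own section, a linker can realize any such order without disassembling the binary.

        #include <algorithm>
        #include <cstdint>
        #include <vector>

        // One profiled branch edge: source block, destination block, taken count.
        struct Edge { int src; int dst; std::uint64_t count; };

        // Greedy chain layout: starting from the entry block, repeatedly append
        // the unplaced successor reached most often, so fallthroughs follow hot paths.
        std::vector<int> GreedyLayout(int num_blocks, std::vector<Edge> edges) {
          std::sort(edges.begin(), edges.end(),
                    [](const Edge& a, const Edge& b) { return a.count > b.count; });
          std::vector<bool> placed(num_blocks, false);
          std::vector<int> order = {0};      // block 0 is the function entry
          placed[0] = true;
          int current = 0;
          while (static_cast<int>(order.size()) < num_blocks) {
            int next = -1;
            for (const Edge& e : edges)      // hottest edge out of `current`
              if (e.src == current && !placed[e.dst]) { next = e.dst; break; }
            if (next == -1)                  // no hot successor left: take any block
              for (int b = 0; b < num_blocks; ++b)
                if (!placed[b]) { next = b; break; }
            placed[next] = true;
            order.push_back(next);
            current = next;
          }
          return order;
        }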
    Estimating the probability with which a conditional branch instruction is taken is an important analysis that enables many optimizations in modern compilers. When using Profile Guided Optimizations (PGO), compilers are able to make a good estimation of the branch probabilities. In the absence of profile information, compilers resort to using heuristics for this purpose. In this work, we propose learning branch probabilities from a large corpus of data obtained from datacenter workloads. Using metrics including Root Mean Squared Error, Mean Absolute Error, and cross-entropy, we show that the machine learning model improves branch probability estimation by 18-50% in comparison to compiler heuristics. This translates to a performance improvement of up to 8.1% on 24 out of a suite of 40 benchmarks, with a 1% geomean improvement on the suite. This also results in a greater than 1.2% performance improvement in an important search application.
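    For concreteness, the evaluation metrics named above can be computed as follows. This is a generic C++ sketch over assumed (predicted, measured) taken-probability pairs, not code from the study.

        #include <algorithm>
        #include <cmath>
        #include <vector>

        // Predicted vs. measured taken-probability for one conditional branch.
        struct BranchSample { double predicted; double measured; };

        // Root Mean Squared Error, Mean Absolute Error, and cross-entropy of the
        // predictions; predictions are clamped so the log terms stay finite.
        void EvaluateEstimates(const std::vector<BranchSample>& samples,
                               double* rmse, double* mae, double* cross_entropy) {
          if (samples.empty()) return;
          double se = 0.0, ae = 0.0, ce = 0.0;
          for (const BranchSample& s : samples) {
            double p = std::clamp(s.predicted, 1e-6, 1.0 - 1e-6);
            double q = s.measured;
            se += (p - q) * (p - q);
            ae += std::fabs(p - q);
            ce += -(q * std::log(p) + (1.0 - q) * std::log(1.0 - p));
          }
          double n = static_cast<double>(samples.size());
          *rmse = std::sqrt(se / n);
          *mae = ae / n;
          *cross_entropy = ce / n;
        }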
    automemcpy: A framework for automatic generation of fundamental memory operations
    Sam Xi
    Proceedings of the 2021 ACM SIGPLAN International Symposium on Memory Management (ISMM '21), June 22, 2021, Virtual, Canada (to appear)
    Memory manipulation primitives (memcpy, memset, memcmp) are used by virtually every application, from high performance computing to user interfaces. They often consume a significant portion of CPU cycles. Because they are so ubiquitous and critical, they are provided by language runtimes and in particular by libc, the C standard library. These implementations are heavily optimized, typically written in hand-tuned assembly for each target architecture. In this article, we propose a principled alternative to hand-tuning these functions: (1) we profile the calls to these functions in their production environment and use this data to drive the important high-level algorithmic decisions, (2) we use a high-level language for the implementation, delegating the job of tuning the generated code to the compiler, and (3) we use constraint programming and automatic benchmarks to select the optimal high-level structure of the functions. We compile our memory function implementations using the same compiler toolchain that we use for application code, which allows leveraging the compiler further through whole-program optimization. We have evaluated our approach by applying it to the fleet of one of the largest computing enterprises in the world. This work increased the performance of the fleet by 1%.
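    The high-level structure being selected is essentially a size-based dispatch. The C++ sketch below only illustrates that shape, with made-up thresholds; the published framework derives the size regions and copy strategies from fleet-wide profiles and constraint solving.

        #include <cstddef>
        #include <cstring>

        // Illustrative memcpy shape: a handful of size regions, each with its own
        // copy strategy. Real thresholds would come from the profiled size distribution.
        inline void* CopyBytes(void* dst, const void* src, std::size_t n) {
          char* d = static_cast<char*>(dst);
          const char* s = static_cast<const char*>(src);
          if (n == 0) return dst;
          if (n <= 16) {
            // Small copies dominate many workloads; a simple loop lets the
            // compiler unroll and vectorize for the target architecture.
            for (std::size_t i = 0; i < n; ++i) d[i] = s[i];
            return dst;
          }
          // Larger copies: defer to the tuned library routine.
          return std::memcpy(dst, src, n);
        }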
    Leveraging machine-learning (ML) techniques for compiler optimizations has been widely studied and explored in academia. However, the adoption of ML in general-purpose, industry-strength compilers has yet to happen. We propose MLGO, a framework for integrating ML techniques systematically in an industrial compiler -- LLVM. As a case study, we present the details and results of replacing the heuristics-based inlining-for-size optimization in LLVM with machine-learned models. To the best of our knowledge, this work is the first full integration of ML in a complex compiler pass in a real-world setting. It is available in the main LLVM repository. We use two different ML algorithms, Policy Gradient and Evolution Strategies, to train the inlining-for-size model, and achieve up to 7% size reduction compared to state-of-the-art LLVM -Oz. The same model, trained on one corpus, generalizes well to a diversity of real-world targets, as well as to the same set of targets after months of active development. This property of the trained models makes it practical to deploy ML techniques in real-world settings.
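    The integration pattern is easiest to see as a decision interface that a heuristic and a learned policy can both sit behind. The C++ below is a deliberately simplified sketch with invented feature names and weights; it is not the LLVM InlineAdvisor API or the trained MLGO model.

        // A few features an inlining-for-size policy might consume; real feature
        // vectors are larger and extracted from the compiler's IR.
        struct CallSiteFeatures {
          int callee_instruction_count;
          int call_site_height;
          int num_constant_args;
        };

        // The compiler asks an advisor at each call site; a hand-written heuristic
        // and a trained model can be swapped behind the same interface.
        class InlineAdvisorSketch {
         public:
          virtual ~InlineAdvisorSketch() = default;
          virtual bool ShouldInline(const CallSiteFeatures& f) = 0;
        };

        class HeuristicAdvisor : public InlineAdvisorSketch {
         public:
          bool ShouldInline(const CallSiteFeatures& f) override {
            return f.callee_instruction_count < 50;   // simple size threshold
          }
        };

        class LearnedAdvisor : public InlineAdvisorSketch {
         public:
          bool ShouldInline(const CallSiteFeatures& f) override {
            // Stand-in for model inference: a deployed policy would evaluate a
            // trained network over the feature vector instead of this linear score.
            double score = 0.02 * f.num_constant_args
                         - 0.01 * f.callee_instruction_count
                         - 0.005 * f.call_site_height;
            return score > -0.4;
          }
        };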
    Cross-Module Optimization (CMO) is an effective means for improving runtime performance, by extending the scope of optimizations across source module boundaries. Two CMO approaches are Link-Time Optimization (LTO) and Lightweight Inter-Procedural Optimization (LIPO). However, each of these solutions has limitations that prevent it from being enabled by default. ThinLTO is a new approach that attempts to address these limitations, with a goal of being enabled more broadly. ThinLTO aims to be as scalable as a regular non-LTO build, enabling CMO on large applications and machines without large memory configurations, while also integrating well with distributed and incremental build systems. This is achieved through fast, purely summary-based Whole-Program Analysis (WPA), the only serial step, without reading or writing the program's Intermediate Representation (IR). Instead, CMO is applied during fully parallel optimization backends. This paper describes the motivation behind ThinLTO, its overall design, and current implementation in LLVM. Results from SPEC cpu2006 benchmarks and several large real-world applications illustrate that ThinLTO can scale as well as a non-LTO build while enabling most of the CMO performed with a full LTO build.
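    The key idea, deciding cross-module imports from compact per-function summaries rather than from IR, can be sketched roughly as below. The structures and the size-threshold policy are simplified assumptions, not ThinLTO's actual summary format or importing heuristics.

        #include <cstdint>
        #include <string>
        #include <unordered_map>
        #include <vector>

        // Per-function summary: enough information to make import decisions
        // without loading any module's IR.
        struct FunctionSummary {
          std::string module;                  // module that defines the function
          unsigned instruction_count;          // rough size estimate
          std::vector<std::uint64_t> callees;  // GUIDs of functions it calls
        };

        // Whole-program summary index, keyed by a global unique id (GUID).
        using SummaryIndex = std::unordered_map<std::uint64_t, FunctionSummary>;

        // Decide which external functions a module's backend should import so they
        // can later be inlined; only summaries are consulted in this serial step.
        std::vector<std::uint64_t> ComputeImports(const SummaryIndex& index,
                                                  const std::string& importing_module,
                                                  unsigned size_threshold) {
          std::vector<std::uint64_t> imports;
          for (const auto& [guid, summary] : index) {
            if (summary.module != importing_module) continue;
            for (std::uint64_t callee : summary.callees) {
              auto it = index.find(callee);
              if (it == index.end()) continue;
              const FunctionSummary& cs = it->second;
              if (cs.module != importing_module && cs.instruction_count <= size_threshold)
                imports.push_back(callee);
            }
          }
          return imports;
        }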
    AutoFDO: Automatic Feedback-Directed Optimization for Warehouse-Scale Applications
    Dehao Chen
    CGO 2016 Proceedings of the 2016 International Symposium on Code Generation and Optimization, ACM, New York, NY, USA, pp. 12-23
    AutoFDO is a system to simplify real-world deployment of feedback-directed optimization (FDO). The system works by sampling hardware performance monitors on production machines and using those profiles to guide optimization. Profile data is stale by design, and we have implemented compiler features to deliver stable speedup across releases. The resulting performance is a geomean of 10.5% improvement on our benchmarks. AutoFDO achieves 85% of the gains of traditional FDO, despite imprecision due to sampling and information lost in the compilation pipeline. The system is deployed to hundreds of binaries at Google, and it is extremely easy to enable; users need only add some flags to their release build. To date, AutoFDO has increased the number of FDO users at Google by 8X and has doubled the number of cycles spent in FDO-optimized binaries. Over half of CPU cycles used are now spent in some flavor of FDO-optimized binaries.
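    The sampling-to-profile step can be pictured as aggregating address-level sample counts into source-level weights via debug information. The C++ sketch below makes that idea concrete, with a plain lookup table standing in for the real DWARF query; it is not the AutoFDO toolchain.

        #include <cstdint>
        #include <map>
        #include <string>
        #include <unordered_map>
        #include <utility>

        // Raw PMU samples: instruction address -> number of samples observed.
        using SampleCounts = std::unordered_map<std::uint64_t, std::uint64_t>;

        // Source position recovered from debug info for one instruction address.
        struct SourceLocation { std::string function; unsigned line; };

        // Aggregate address-level samples into per-source-line weights, a form the
        // compiler can still consume after the binary changes between releases.
        // `debug_map` stands in for a real debug-info lookup.
        std::map<std::pair<std::string, unsigned>, std::uint64_t> BuildLineProfile(
            const SampleCounts& samples,
            const std::unordered_map<std::uint64_t, SourceLocation>& debug_map) {
          std::map<std::pair<std::string, unsigned>, std::uint64_t> line_weights;
          for (const auto& [address, count] : samples) {
            auto it = debug_map.find(address);
            if (it == debug_map.end()) continue;     // sample outside known code
            line_weights[{it->second.function, it->second.line}] += count;
          }
          return line_weights;
        }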
    Automated locality optimization based on the reuse distance of string operations
    Silvius Rus
    Raksit Ashok
    CGO '11 Proceedings of the 9th Annual IEEE/ACM International Symposium on Code Generation and Optimization, IEEE Computer Society, Washington, DC, USA (2011), pp. 181-190
    We have found that large C++ applications and shared libraries tend to have many functions whose code is identical with another function. As much as 10% of the code could theoretically be eliminated by merging such identical functions into a single copy. This optimization, Identical Code Folding (ICF), has been implemented in the gold linker. At link time, ICF detects functions with identical object code and merges them into a single copy. ICF can be unsafe, however, as it can change the run-time behaviour of code that relies on each function having a unique address. To address this, ICF can be used in a safe mode where it identifies and folds functions whose addresses are guaranteed not to have been used in comparison operations. Further, profiling and debugging binaries with merged functions can be confusing, as the PC values of merged functions cannot always be disambiguated to point to the correct function. To address this, we propose a new call table format for the DWARF debugging information to allow tools like the debugger and profiler to disambiguate PC values of merged functions correctly by examining the call chain. Detailed experiments on the x86 platform show that ICF can reduce the text size of a selection of Google binaries, whose average text size is 64 MB, by about 6%. Also, the code size savings of ICF with the safe option is almost as good as the code savings obtained without the safe option. Further, experiments also show that the run-time performance of the optimized binaries on the x86 platform does not change.
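    The folding decision itself can be sketched as grouping function sections by their byte contents, skipping address-taken functions in safe mode. This single-pass C++ sketch is a simplification of what a linker does: a real ICF pass iterates to a fixed point, because folding a callee changes the relocated bytes of its callers.

        #include <cstdint>
        #include <string>
        #include <unordered_map>
        #include <vector>

        // What a linker would know about each function section (simplified).
        struct FunctionSection {
          std::string name;
          std::vector<std::uint8_t> bytes;   // object code, relocations normalized
          bool address_taken;                // true if the symbol's address may be compared
        };

        // Map each foldable function to an earlier function with identical bytes.
        // In "safe" mode, functions whose addresses may be observed are skipped.
        std::unordered_map<std::string, std::string> FoldIdentical(
            const std::vector<FunctionSection>& functions, bool safe_mode) {
          std::unordered_map<std::string, std::string> folded_into;     // name -> kept name
          std::unordered_map<std::string, std::string> first_with_bytes;
          for (const FunctionSection& f : functions) {
            if (safe_mode && f.address_taken) continue;
            std::string key(f.bytes.begin(), f.bytes.end());
            auto [it, inserted] = first_with_bytes.emplace(key, f.name);
            if (!inserted) folded_into[f.name] = it->second;  // fold into earlier copy
          }
          return folded_into;
        }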
    Lightweight Feedback-Directed Cross-Module Optimization
    Raksit Ashok
    Proceedings of International Symposium on Code Generation and Optimization (CGO), IEEE (2010)
    Cross-module inter-procedural compiler optimization (IPO) and Feedback-Directed Optimization (FDO) are two important compiler techniques delivering solid performance gains. The combination of IPO and FDO delivers peak performance, but also multiplies both techniques' usability problems. In this paper, we present LIPO, a novel static IPO framework, which integrates IPO and FDO. Compared to existing approaches, LIPO no longer requires writing of the compiler's intermediate representation, eliminates the link-time inter-procedural optimization phase entirely, and minimizes code re-generation overhead, thus improving scalability by an order of magnitude. Compared to an FDO baseline, and without further specific tuning, LIPO improves performance of SPEC2006 INT by 2.5%, and of SPEC2000 INT by 4.4%, with up to 23% for one benchmark. We confirm our scalability results on a set of large industrial applications, demonstrating 2.9% performance improvements on average. Compile-time overhead for full builds is less than 30%, incremental builds take a few seconds on average, and storage requirements increase by only 24%, all compared to the FDO baseline.