Publications

Our teams aspire to make discoveries that impact everyone, and core to our approach is sharing our research and tools to fuel progress in the field.

people standing in front of a screen with images and a chipboard

Publications

people standing in front of a screen with images and a chipboard

Sort By
  • Title
  • Title, descending
  • Year
  • Year, descending
1 - 15 of 691 publications
    Dynamic Inference of Likely Symbolic Tensor Shapes in Python Machine Learning Programs
    Koushik Sen
    International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP)(2024) (to appear)
    Preview abstract In machine learning programs, it is often tedious to annotate the dimensions of shapes of various tensors that get created during execution. We present a dynamic likely tensor shape inference analysis that annotates the dimensions of shapes of tensor expressions with symbolic dimension values. Such annotations can be used for understanding the machine learning code written in popular frameworks, such as TensorFlow, PyTorch, JAX, and for finding bugs related to tensor shape mismatch. View details
    Preview abstract 2022 marked the 50th anniversary of memory safety vulnerabilities, first reported by Anderson et al. Half a century later, we are still dealing with memory safety bugs despite substantial investments to improve memory unsafe languages. Like others', Google’s data and internal vulnerability research show that memory safety bugs are widespread and one of the leading causes of vulnerabilities in memory-unsafe codebases. Those vulnerabilities endanger end users, our industry, and the broader society. At Google, we have decades of experience addressing, at scale, large classes of vulnerabilities that were once similarly prevalent as memory safety issues. Based on this experience we expect that high assurance memory safety can only be achieved via a Secure-by-Design approach centered around comprehensive adoption of languages with rigorous memory safety guarantees. We see no realistic path for an evolution of C++ into a language with rigorous memory safety guarantees that include temporal safety. As a consequence, we are considering a gradual transition of C++ code at Google towards other languages that are memory safe. Given the large volume of pre-existing C++, we believe it is nonetheless necessary to improve the safety of C++ to the extent practicable. We are considering transitioning to a safer C++ subset, augmented with hardware security features like MTE. View details
    CodeQueries: A Dataset of Semantic Queries over Code
    Surya Prakash Sahu
    Madhurima Mandal
    Shikhar Bharadwaj
    Aditya Kanade
    Shirish Shevade
    Innovations in Software Engineering (ISEC), ACM, Bangalore, India(2024)
    Preview abstract Developers often have questions about semantic aspects of code they are working on, e.g., “Is there a class whose parent classes declare a conflicting attribute?”. Answering them requires understanding code semantics such as attributes and inheritance relation of classes. An answer to such a question should identify code spans constituting the answer (e.g., the declaration of the subclass) as well as supporting facts (e.g., the definitions of the conflicting attributes). The existing work on question-answering over code has considered yes/no questions or method-level context. We contribute a labeled dataset, called CodeQueries, of semantic queries over Python code. Compared to the existing datasets, in CodeQueries, the queries are about code semantics, the context is file level and the answers are code spans. We curate the dataset based on queries supported by a widely-used static analysis tool, CodeQL, and include both positive and negative examples, and queries requiring single-hop and multi-hop reasoning. To assess the value of our dataset, we evaluate baseline neural approaches. We study a large language model (GPT3.5-Turbo) in zero-shot and few-shot settings on a subset of CodeQueries. We also evaluate a BERT style model (CuBERT) with fine-tuning. We find that these models achieve limited success on CodeQueries. CodeQueries is thus a challenging dataset to test the ability of neural models, to understand code semantics, in the extractive question-answering setting View details
    Limoncello: Prefetchers for Scale
    Akanksha Jain
    Carlos Villavieja
    Baris Kasikci
    Parthasarathy Ranganathan
    Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Association for Computing Machinery, New York, NY, United States(2024)
    Preview abstract This paper presents Limoncello, a novel software system that dynamically configures data prefetching for high utilization systems. We demonstrate that in resource-constrained environments, such as large data centers, traditional methods of hardware prefetching can increase memory latency and decrease available memory bandwidth. To address this, Limoncello dynamically configures data prefetching, disabling hardware prefetchers when memory bandwidth utilization is high and leveraging targeted software prefetching to reduce cache misses when hardware prefetchers are disabled. Limoncello is software-centric and does not require any modifications to hardware. Our evaluation of the deployment on a real-world hyperscale system reveals that Limoncello unlocks significant performance gains for high-utilization systems: it improves application throughput by 10%, due to a 15% reduction in memory latency, while maintaining minimal change in cache miss rate for targeted library functions. View details
    Characterizing a Memory Allocator at Warehouse Scale
    Zhuangzhuang Zhou
    Nilay Vaish
    Patrick Xia
    Christina Delimitrou
    Parthasarathy Ranganathan
    Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, Association for Computing Machinery, La Jolla, CA, USA(2024), 192–206
    Preview abstract Memory allocation constitutes a substantial component of warehouse-scale computation. Optimizing the memory allocator not only reduces the datacenter tax, but also improves application performance, leading to significant cost savings. We present the first comprehensive characterization study of TCMalloc, a warehouse-scale memory allocator used in our production fleet. Our characterization reveals a profound diversity in the memory allocation patterns, allocated object sizes and lifetimes, for large-scale datacenter workloads, as well as in their performance on heterogeneous hardware platforms. Based on these insights, we redesign TCMalloc for warehouse-scale environments. Specifically, we propose optimizations for each level of its cache hierarchy that include usage-based dynamic sizing of allocator caches, leveraging hardware topology to mitigate inter-core communication overhead, and improving allocation packing algorithms based on statistical data. We evaluate these design choices using benchmarks and fleet-wide A/B experiments in our production fleet, resulting in a 1.4% improvement in throughput and a 3.4% reduction in RAM usage for the entire fleet. At our scale, even a single percent CPU or memory improvement translates to significant savings in server costs. View details
    Thesios: Synthesizing Accurate Counterfactual I/O Traces from I/O Samples
    Soroush Ghodrati
    Selene Moon
    Martin Maas
    ASPLOS 2024, Association for Computing Machinery
    Preview abstract Representative modeling of I/O activity is crucial when designing large-scale distributed storage systems. Particularly important use cases are counterfactual “what-if” analyses that assess the impact of anticipated or hypothetical new storage policies or hardware prior to deployment. We propose Thesios, a methodology to accurately synthesize such hypothetical full-resolution I/O traces by carefully combining down-sampled I/O traces collected from multiple disks attached to multiple storage servers. Applying this approach to real-world traces that a real ready routinely sampled at Google, we show that our synthesized traces achieve 95–99.5% accuracy in read/write request numbers, 90–97% accuracy in utilization, and 80–99.8% accuracy in read latency compared to metrics collected from actual disks. We demonstrate how The-sios enables diverse counterfactual I/O trace synthesis and analyses of hypothetical policy, hardware, and server changes through four case studies: (1) studying the effects of changing disk’s utilization, fullness, and capacity, (2) evaluating new data placement policy, (3) analyzing the impact on power and performance of deploying disks with reduced rotations-per-minute (RPM), and (4) understanding the impact of increased buffer cache size on a storage server. Without Thesios, such counterfactual analyses would require costly and potentially risky A/B experiments in production. View details
    PROMPT: A Fast and Extensible Memory Profiling Framework
    Ziyang Xu
    Yebin Chon
    Yian Su
    Zujun Tan
    Simone Campanoni
    David I. August
    Proceedings of the ACM on Programming Languages, 8, Issue OOPSLA(2024)
    Preview abstract Memory profiling captures programs' dynamic memory behavior, assisting programmers in debugging, tuning, and enabling advanced compiler optimizations like speculation-based automatic parallelization. As each use case demands its unique program trace summary, various memory profiler types have been developed. Yet, designing practical memory profilers often requires extensive compiler expertise, adeptness in program optimization, and significant implementation effort. This often results in a void where aspirations for fast and robust profilers remain unfulfilled. To bridge this gap, this paper presents PROMPT, a framework for streamlined development of fast memory profilers. With PROMPT, developers need only specify profiling events and define the core profiling logic, bypassing the complexities of custom instrumentation and intricate memory profiling components and optimizations. Two state-of-the-art memory profilers were ported with PROMPT where all features preserved. By focusing on the core profiling logic, the code was reduced by more than 65% and the profiling overhead was improved by 5.3× and 7.1× respectively. To further underscore PROMPT's impact, a tailored memory profiling workflow was constructed for a sophisticated compiler optimization client. In 570 lines of code, this redesigned workflow satisfies the client’s memory profiling needs while achieving more than 90% reduction in profiling overhead and improved robustness compared to the original profilers. View details
    Preview abstract Large Language Models have been able to replicate their success from text generation to coding tasks. While a lot of work has made it clear that they have remarkable performance on tasks such as code completion and editing, it is still unclear as to why. We help bridge this gap by exploring to what degree do auto-regressive models understand the logical constructs of the underlying programs. We propose CAPP, a counterfactual testing framework to evaluate whether large code models understand programming concepts. With only black-box access to the model, we use CAPP to evaluate 10 popular large code models for 5 different programming concepts. Our findings suggest that current models lack understanding of concepts such as data flow and control flow. View details
    You Only Linearize Once: Tangents Transpose to Gradients
    Alexey Radul
    Adam Paszke
    Roy Frostig
    Matthew Johnson
    Dougal Maclaurin
    POPL(2023), pp. 1246-1274
    Preview abstract Automatic differentiation (AD) is conventionally understood as a family of distinct algorithms, rooted in two "modes" -- forward and reverse -- which are typically presented (and implemented) separately. Can there be only one? Following up on the AD systems developed in the JAX and Dex projects, we formalize a decomposition of reverse-mode AD into (i) forward-mode AD followed by (ii) unzipping the linear and non-linear parts and then (iii) transposition of the linear part. To that end, we define a (substructurally) linear type system that can prove a class of functions are (algebraically) linear. Our main results are that forward-mode AD produces such linear functions, and that we can unzip and transpose any such linear function, conserving cost, size, and linearity. Composing these three transformations recovers reverse-mode AD. This decomposition also sheds light on checkpointing, which emerges naturally from a free choice in unzipping `let` expressions. As a corollary, checkpointing techniques are applicable to general-purpose partial evaluation, not just AD. We hope that our formalization will lead to a deeper understanding of automatic differentiation and that it will simplify implementations, by separating the concerns of differentiation proper from the concerns of gaining efficiency (namely, separating the derivative computation from the act of running it backward). View details
    Preview abstract Identifying invariants in programs is an important program analysis task with applications towards program understanding, vulnerability analysis, and formal verification. Existing tools for identifying invariants rely on dynamic analysis, requiring traces collected from multiple executions in order to produce reliable invariants. We study the application of large language models to invariant prediction, finding that models training on source code and fine-tuned to invariant prediction can perform invariant prediction as static rather than dynamic analysis. Using a scratchpad approach gives the best performance, finding invariants statically of quality comparable to those obtained by a dynamic analysis tool with access to five program traces. View details
    Grisette: Symbolic Compilation as a Functional Programming Library
    Sirui Lu
    Grisette: Symbolic Compilation as a Functional Programming Library, ACM(2023) (to appear)
    Preview abstract The development of constraint solvers simplified automated reasoning about programs and shifted the engineering burden to implementing symbolic compilation tools that translate programs into efficiently solvable constraints. We describe Grisette, a reusable symbolic evaluation framework for implementing domain-specific symbolic compilers. Grisette evaluates all execution paths and merges their states into a normal form that avoids making guards mutually exclusive. This ordered-guards representation reduces the constraint size 5-fold and the solving time more than 2-fold. Grisette is designed entirely as a library, which sidesteps the complications of lifting the host language into the symbolic domain. Grisette is purely functional, enabling memoization of symbolic compilation as well as monadic integration with host libraries. Grisette is statically typed, which allows catching programming errors at compile time rather than delaying their detection to the constraint solver. We implemented Grisette in Haskell and evaluated it on benchmarks that stress both the symbolic evaluation and constraint solving. View details
    Predicting Dynamic Properties of Heap Allocations Using Neural Networks Trained on Static Code
    Christian Navasca
    Martin Maas
    Guoqing Harry Xu
    2023 ACM SIGPLAN International Symposium on Memory Management (ISMM 2023)
    Preview abstract Memory allocators and runtime systems can leverage dynamic properties of heap allocations – such as object lifetimes, hotness or access correlations – to improve performance and resource consumption. A significant amount of work has focused on approaches that collect this information in performance profiles and then use it in new memory allocator or runtime designs, both offline (in ahead-of-time compilers) and online (in JIT compilers). This is a special instance of profile-guided optimization. This approach has significant disadvantages: 1) The profiling oftentimes introduces substantial overheads, which are prohibitive in many production scenarios, 2) Creating a representative profiling run adds significant engineering complexity and reduces deployment velocity, and 3) Profiles gathered ahead of time or during the warm-up phase of a server are often not representative of all workload behavior and may miss important corner cases. In this paper, we investigate a fundamentally different approach. Instead of deriving heap allocation properties from profiles, we explore the ability of neural network models to predict them from the statically available code. As an intellectual abstract, we do not offer a conclusive answer but describe the trade-off space of this approach, investigate promising directions, motivate these directions with data analysis and experiments, and highlight challenges that future work needs to overcome. View details
    Snowcat: Efficient Kernel Concurrency Testing using a Learned Coverage Predictor
    Sishuai Gong
    Dinglan Peng
    Pedro Fonseca
    Symposium on Operating Systems Principles (SOSP)(2023)
    Preview abstract Random-based approaches and heuristics are commonly used in kernel concurrency testing due to the massive scale of modern kernels and corresponding interleaving space. The lack of accurate and scalable approaches to analyze concurrent kernel executions makes existing testing approaches heavily rely on expensive dynamic executions to measure the effectiveness of a new test. Unfortunately, the high cost incurred by dynamic executions limits the breadth of the exploration and puts latency pressure on finding effective concurrent test inputs and schedules, hindering the overall testing effectiveness. This paper proposes Snowcat, a kernel concurrency testing framework that generates effective test inputs and schedules using a learned kernel block-coverage predictor. Using a graph neural network, the coverage predictor takes a concurrent test input and scheduling hints and outputs a prediction on whether certain important code blocks will be executed. Using this predictor, Snowcat can skip concurrent tests that are likely to be fruitless and prioritize the promising ones for actual dynamic execution. After testing the Linux kernel for over a week, Snowcat finds ∼17% more potential data races, by prioritizing tests of more fruitful schedules than existing work would have chosen. Snowcat can also find effective test inputs that expose new concurrency bugs with higher probability (1.4×∼2.6×), or reproduce known bugs more quickly (15×) than state-ofart testing tools. More importantly, Snowcat is shown to be more efficient at reaching a desirable level of race coverage in the continuous setting, as the Linux kernel evolves from version to version. In total, Snowcat discovered 17 new concurrency bugs in Linux kernel 6.1, of which 13 are confirmed and 6 are fixed. View details
    Preview abstract While profile guided optimizations (PGO) and link time optimiza-tions (LTO) have been widely adopted, post link optimizations (PLO)have languished until recently when researchers demonstrated that late injection of profiles can yield significant performance improvements. However, the disassembly-driven, monolithic design of post link optimizers face scaling challenges with large binaries andis at odds with distributed build systems. To reconcile and enable post link optimizations within a distributed build environment, we propose Propeller, a relinking optimizer for warehouse scale work-loads. To enable flexible code layout optimizations, we introduce basic block sections, a novel linker abstraction. Propeller uses basic block sections to enable a new approach to PLO without disassembly. Propeller achieves scalability by relinking the binary using precise profiles instead of rewriting the binary. The overhead of relinking is lowered by caching and leveraging distributed compiler actions during code generation. Propeller has been deployed to production at Google with over tens of millions of cores executing Propeller optimized code at any time. An evaluation of internal warehouse-scale applications show Propeller improves performance by 1.1% to 8% beyond PGO and ThinLTO. Compiler tools such as Clang improve by 7% while MySQL improves by 1%. Compared to the state of the art binary optimizer, Propeller achieves comparable performance while lowering memory overheads by 30%-70% on large benchmarks. View details
    Code Generation for Data-Dependent Stencils
    Mohammed Essadki
    Bertrand Michel
    Bruno Maugars
    Oleksandr Zinenko
    Nicolas Vasilache
    CGO, IEEE(2023)
    Preview abstract Numerical simulation often resorts to iterative in-place stencils such as the Gauss-Seidel or Successive Overrelaxation (SOR) methods. Writing high performance implementations of such stencils requires significant effort and time; it also involves non-local transformations beyond the stencil kernel itself. While automated code generation is a mature technology for image processing stencils, convolutions and out-of place iterative stencils (such as the Jacobi method), the optimization of in-place stencils requires manual craftsmanship. Building on recent advances in tensor compiler construction, we propose the first domain-specific code generator for iterative in-place stencils. Starting from a generic tensor compiler implemented in the MLIR framework, tensor abstractions are incrementally refined and lowered down to parallel, tiled, fused and vectorized code. We used our generator to implement a realistic, implicit solver for structured meshes, and demonstrate results competitive with an industrial computational fluid dynamics framework. We also compare with stand-alone stencil kernels for dense tensors. View details