Kathryn S. McKinley

Kathryn S. McKinley is a Distinguished Engineer at Google, where she designs engineering systems customized to GCE customer workloads for excellent performance and a transparent capacity experience. She leads teams that focus on infrastructure for industry-leading price-performance products that use Google’s and the world’s resources wisely. Her expertise spans cloud and parallel systems, with a focus on memory technologies.

Prior to joining Google, she was a Principal Researcher at Microsoft and an Endowed Professor at the University of Texas at Austin, where her research groups produced technologies that influenced industry and academia. For instance, they produced the industry-leading DaCapo Java Benchmarks and benchmarking methodologies; Hoard, the first scalable and provably memory-efficient memory manager, adopted by IBM and Apple’s OS X; and Immix, the first of a novel mark-region high-performance garbage collection family, in use by Jikes RVM, Haxe, Rubinius, Scala, and others. Her research excellence has been recognized by numerous ACM Research Highlights, test-of-time, and best paper awards. She is a recipient of the ACM SIGPLAN Programming Languages Achievement Award and the ACM SIGPLAN Programming Language Software Award. She is an IEEE Fellow, ACM Fellow, and a member of the American Academy of Arts & Sciences.

Kathryn is passionate about inclusion and equity in computing. In 2018, she co-founded ACM CARES committees, a new type of resource to combat sexual harassment and discrimination in the computing research community. As CRA Widening Participation (CRA-WP) co-chair, she founded the CRA Center for Evaluating the Research Pipeline. This community and other service has been recognized with the CRA Distinguished Service Award, ACM SIGPLAN Distinguished Service Award, and the ACM SIGARCH Alan D. Berenbaum Distinguished Service Award. As a former Computing Research Association (CRA) board member and CRA-WP board member and co-chair, she continues to participate in and lead programs to increase the participation of women and under-represented groups in computing.

Authored Publications
    Modern C++ server workloads rely on 2 MB huge pages to improve memory system performance via higher TLB hit rates. Huge pages have traditionally been supported at the kernel level, but recent work has shown that user-level, huge page-aware memory allocators can achieve higher huge page coverage and thus performance. These memory allocators deal with a trade-off: 1) allocate memory from the operating system (OS) at the granularity of a huge page, achieve high performance, but potentially waste memory due to fragmentation, or 2) limit fragmentation by breaking up huge pages into smaller 4 KB pages and returning them to the OS, but reduce performance due to lower huge page coverage. For example, the state-of-the-art TCMalloc allocator handles this trade-off by releasing memory to the OS at a configurable release rate, breaking up huge pages as necessary. This approach balances performance and fragmentation well for machines running one workload. For multiple applications on the same machine however, the reduction in memory usage is only useful to overall performance if another workload uses this memory. In warehouse-scale computers, when an application releases and then reacquires the same amount or more memory quickly, but no other application uses the memory in the meantime, the release causes poorer huge page coverage without any system-wide benefit. We introduce a metric, realized fragmentation, to capture this effect. We then present an adaptive release policy that dynamically determines when to break up huge pages and return them to the OS to optimize system-wide performance. We built this policy into TCMalloc and deployed it fleet-wide in our data centers, leading to an estimated 1% fleet-wide throughput improvement at negligible memory overhead.
    Learning-based Memory Allocation for C++ Server Workloads
    David G. Andersen
    Mohammad Mahdi Javanmard
    Colin Raffel
    25th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS) (2020) (to appear)
    Modern C++ servers have memory footprints that vary widely over time, causing persistent heap fragmentation of up to 2x from long-lived objects allocated during peak memory usage. This fragmentation is exacerbated by the use of huge (2MB) pages, a requirement for high performance on large heap sizes. Reducing fragmentation automatically is challenging because C++ memory managers cannot move objects. This paper presents a new approach to huge page fragmentation. It combines modern machine learning techniques with a novel memory manager (LLAMA) that manages the heap based on object lifetimes and huge pages (divided into blocks and lines). A neural network-based language model predicts lifetime classes using symbolized calling contexts. The model learns context-sensitive per-allocation site lifetimes from previous runs, generalizes over different binary versions, and extrapolates from samples to unobserved calling contexts. Instead of size classes, LLAMA's heap is organized by lifetime classes that are dynamically adjusted based on observed behavior at a block granularity. LLAMA reduces memory fragmentation by up to 78% while only using huge pages on several production servers. We address ML-specific questions such as tolerating mispredictions and amortizing expensive predictions across application execution. Although our results focus on memory allocation, the questions we identify apply to other system-level problems with strict latency and resource requirements where machine learning could be applied.
    StreamBox: Modern Stream Processing on a Multicore Machine
    Felix Xiaozhu Lin
    Gennady Pekhimenko
    Heejin Park
    Hongyi Xin
    Myeongjae Jeon
    The USENIX Annual Technical Conference, San Jose, CA. (2017)
    To monitor and respond to events in real time, stream analytics have a soaring demand for high throughput and low latency. Central to meeting demand, even in a distributed system, is the performance of a single machine. This paper presents StreamBox, a novel stream processing engine that exploits the parallelism and memory hierarchies in modern multicore hardware. StreamBox executes a pipeline of transforms over records that may arrive out-of-order. For each transform, it groups records in ordered epochs based on watermark timestamps that guarantee no subsequent record timestamp will precede it. The key contribution of this work is the generalization of out-of-order record processing to out-of-order epoch processing per transform to produce abundant parallelism. We introduce a data structure called cascading containers that manages dependences and concurrency among multiple concurrent epochs in each transform and in the pipeline, maximizing available parallelism while minimizing synchronization overheads. StreamBox creates sequential memory layout of records based on temporal windows and steers record flows to optimize NUMA locality. On a 56-core machine, StreamBox processes up to 38M records per second (38 GB/s), which is comparable to a cluster of 100 – 200 CPU cores, while reducing the pipeline delay by 20× to 50 ms.
    The DaCapo Benchmarks: Java Benchmarking Development and Analysis
    Stephen M. Blackburn
    Robin Garner
    Chris Hoffmann
    Asjad M. Khan
    Rotem Bentzur
    Daniel Feinberg
    Daniel Frampton
    Samuel Z. Guyer
    Martin Hirzel
    Antony Hosking
    Maria Jump
    Han Lee
    J. Elliot B. Moss
    Aashish Phansalkar
    Darko Stefanovic
    Thomas VanDrunen
    Ben Wiedermann
    Proceedings of OOPSLA, ACM (2006)