Hardware and Architecture

The machinery that powers many of our interactions today — Web search, social networking, email, online video, shopping, game playing — is made of the smallest and the most massive computers. The smallest part is your smartphone, a machine that is over ten times faster than the iconic Cray-1 supercomputer. The capabilities of these remarkable mobile devices are amplified by orders of magnitude through their connection to Web services running on building-sized computing systems that we call Warehouse-scale computers (WSCs).

Google’s engineers and researchers have been pioneering both WSC and mobile hardware technology with the goal of providing Google programmers and our Cloud developers with a unique computing infrastructure in terms of scale, cost-efficiency, energy-efficiency, resiliency and speed. The tight collaboration among software, hardware, mechanical, electrical, environmental, thermal and civil engineers result in some of the most impressive and efficient computers in the world.

Recent Publications

Pathfinder: High-Resolution Control-Flow Attacks with Conditional Branch Predictor
Andrew Kwong
Archit Agarwal
Christina Garman
Daniel Genkin
Dean Tullsen
Deian Stefan
Hosein Yavarzadeh
Max Christman
Mohammadkazem Taram
International Conference on Architectural Support for Programming Languages and Operating Systems, ACM(2024)
Preview abstract This paper presents novel attack primitives that provide adversaries with the ability to read and write the path history register (PHR) and the prediction history tables (PHTs) of the conditional branch predictor in modern Intel CPUs. These primitives enable us to recover the recent control flow (the last 194 taken branches) and, in most cases, a nearly unlimited control flow history of any victim program. Additionally, we present a tool that transforms the PHR into an unambiguous control flow graph, encompassing the complete history of every branch. This work provides case studies demonstrating the practical impact of novel reading and writing/poisoning primitives. It includes examples of poisoning AES to obtain intermediate values and consequently recover the secret AES key, as well as recovering a secret image by capturing the complete control flow of libjpeg routines. Furthermore, we demonstrate that these attack primitives are effective across virtually all protection boundaries and remain functional in the presence of all recent control-flow mitigations from Intel. View details
Limoncello: Prefetchers for Scale
Carlos Villavieja
Baris Kasikci
Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Association for Computing Machinery, New York, NY, United States(2024)
Preview abstract This paper presents Limoncello, a novel software system that dynamically configures data prefetching for high utilization systems. We demonstrate that in resource-constrained environments, such as large data centers, traditional methods of hardware prefetching can increase memory latency and decrease available memory bandwidth. To address this, Limoncello dynamically configures data prefetching, disabling hardware prefetchers when memory bandwidth utilization is high and leveraging targeted software prefetching to reduce cache misses when hardware prefetchers are disabled. Limoncello is software-centric and does not require any modifications to hardware. Our evaluation of the deployment on a real-world hyperscale system reveals that Limoncello unlocks significant performance gains for high-utilization systems: it improves application throughput by 10%, due to a 15% reduction in memory latency, while maintaining minimal change in cache miss rate for targeted library functions. View details
ConSmax: Hardware-Friendly Alternative Softmax with Learnable Parameters
Shiwei Liu
Guanchen Tao
Yifei Zou
Derek Chow
Zichen Fan
Kauna Lei
Bangfei Pan
Dennis Sylvester
Mehdi Saligane
Preview abstract The self-attention mechanism sets transformer-based large language model (LLM) apart from the convolutional and recurrent neural networks. Despite the performance improvement, achieving real-time LLM inference on silicon is challenging due to the extensively used Softmax in self-attention. Apart from the non-linearity, the low arithmetic intensity greatly reduces the processing parallelism, which becomes the bottleneck especially when dealing with a longer context. To address this challenge, we propose Constant Softmax (ConSmax), a software-hardware co-design as an efficient Softmax alternative. ConSmax employs differentiable normalization parameters to remove the maximum searching and denominator summation in Softmax. It allows for massive parallelization while performing the critical tasks of Softmax. In addition, a scalable ConSmax hardware utilizing a bitwidth-split look-up table (LUT) can produce lossless non-linear operation and support mix-precision computing. It further facilitates efficient LLM inference. Experimental results show that ConSmax achieves a minuscule power consumption of 0.2 mW and area of 0.0008 mm^2 at 1250-MHz working frequency and 16-nm CMOS technology. Compared to state-of-the-art Softmax hardware, ConSmax results in 3.35x power and 2.75x area savings with a comparable accuracy on a GPT-2 model and the WikiText103 dataset. View details
CDPU: Co-designing Compression and Decompression Processing Units for Hyperscale Systems
Ani Udipi
JunSun Choi
Joonho Whangbo
Jerry Zhao
Edwin Lim
Vrishab Madduri
Yakun Sophia Shao
Borivoje Nikolic
Krste Asanovic
Proceedings of the 50th Annual International Symposium on Computer Architecture, Association for Computing Machinery, New York, NY, USA(2023)
Preview abstract General-purpose lossless data compression and decompression ("(de)compression") are used widely in hyperscale systems and are key "datacenter taxes". However, designing optimal hardware compression and decompression processing units ("CDPUs") is challenging due to the variety of algorithms deployed, input data characteristics, and evolving costs of CPU cycles, network bandwidth, and memory/storage capacities. To navigate this vast design space, we present the first large-scale data-driven analysis of (de)compression usage at a major cloud provider by profiling Google's datacenter fleet. We find that (de)compression consumes 2.9% of fleet CPU cycles and 10-50% of cycles in key services. Demand is also artificially limited; 95% of bytes compressed in the fleet use less capable algorithms to reduce compute, motivating a CDPU that changes cost vs. size tradeoffs. Prior work has improved the microarchitectural state-of-the-art for CDPUs supporting various algorithms in fixed contexts. However, we find that higher-level design parameters like CDPU placement, hash table sizing, history window sizes, and more have as significant of an impact on the viability of CDPU integration, but are not well-studied. Thus, we present the first end-to-end design/evaluation framework for CDPUs, including: 1. An open-source RTL-based CDPU generator that supports many run-time and compile-time parameters. 2. Integration into an open-source RISC-V SoC for rapid performance and silicon area evaluation across CDPU placements and parameters. 3. An open-source (de)compression benchmark, HyperCompressBench, that is representative of (de)compression usage in Google's fleet. Using our framework, we perform an extensive design space exploration running HyperCompressBench. Our exploration spans a 46× range in CDPU speedup, 3× range in silicon area (for a single pipeline), and evaluates a variety of CDPU integration techniques to optimize CDPU designs for hyperscale contexts. Our final hyperscale-optimized CDPU instances are up to 10× to 16× faster than a single Xeon core, while consuming a small fraction (as little as 2.4% to 4.7%) of the area. View details
PTStore: Lightweight Architectural Support for Page Table Isolation
Wende Tan
Yangyu Chen
Yuan Li
Ying Liu
Jianping Wu
Chao Zhang
2023 60th ACM/IEEE Design Automation Conference (DAC), IEEE, pp. 1-6
Preview abstract Page tables are critical data structures in kernels, serving as the trust base of most mitigation solutions. Their integrity is thus crucial but is often taken for granted. Existing page table protection solutions usually provide insufficient security guarantees, require heavy hardware, or introduce high overheads. In this paper, we present a novel lightweight hardware-software co-design solution, PTStore, consisting of a secure region storing page tables and tokens verifying page table pointers. Evaluation results on FPGA-based prototypes show that PTStore only introduces <0.92% hardware overheads and <0.86% performance overheads, but provides strong security guarantees, showing that PTStore is efficient and effective. View details
Preview abstract Computing demand continues to grow exponentially, largely driven by "big data" processing on hyperscale data stores. At the same time, the slowdown in Moore's law is leading the industry to embrace custom computing in large-scale systems. Taken together, these trends motivate the need to characterize live production traffic on these large data processing platforms and understand the opportunity of acceleration at scale. This paper addresses this key need. We characterize three important production distributed database and data analytics platforms at Google to identify key hardware acceleration opportunities and perform a comprehensive limits study to understand the trade-offs among various hardware acceleration strategies. We observe that hyperscale data processing platforms spend significant time on distributed storage and other remote work across distributed workers. Therefore, optimizing storage and remote work in addition to compute acceleration is critical for these platforms. We present a detailed breakdown of the compute-intensive functions in these platforms and identify dominant key data operations related to datacenter and systems taxes. We observe that no single accelerator can provide a significant benefit but collectively, a sea of accelerators, can accelerate many of these smaller platform-specific functions. We demonstrate the potential gains of the sea of accelerators proposal in a limits study and analytical model. We perform a comprehensive study to understand the trade-offs between accelerator location (on-chip/off-chip) and invocation model (synchronous/asynchronous). We propose and evaluate a chained accelerator execution model where identified compute-intensive functions are accelerated and pipelined to avoid invocation from the core, achieving a 3x improvement over the baseline system while nearly matching identical performance to an ideal fully asynchronous execution model. View details