CDPU: Co-designing Compression and Decompression Processing Units for Hyperscale Systems

Sagar Karandikar; Ani Udipi; JunSun Choi; Joonho Whangbo; Jerry Zhao; Svilen Kanev; Edwin Lim; Jyrki Antero Alakuijala; Vrishab Madduri; Yakun Sophia Shao; Borivoje Nikolic; Krste Asanovic; Parthasarathy Ranganathan

Proceedings of the 50th Annual International Symposium on Computer Architecture, Association for Computing Machinery, New York, NY, USA (2023)

Abstract

General-purpose lossless data compression and decompression ("(de)compression") are used widely in hyperscale systems and are key "datacenter taxes". However, designing optimal hardware compression and decompression processing units ("CDPUs") is challenging due to the variety of algorithms deployed, input data characteristics, and evolving costs of CPU cycles, network bandwidth, and memory/storage capacities.

To navigate this vast design space, we present the first large-scale data-driven analysis of (de)compression usage at a major cloud provider by profiling Google's datacenter fleet. We find that (de)compression consumes 2.9% of fleet CPU cycles and 10-50% of cycles in key services. Demand is also artificially limited; 95% of bytes compressed in the fleet use less capable algorithms to reduce compute, motivating a CDPU that changes cost vs. size tradeoffs.

Prior work has improved the microarchitectural state of the art for CDPUs supporting various algorithms in fixed contexts. However, we find that higher-level design parameters, such as CDPU placement, hash table sizing, and history window sizes, have as significant an impact on the viability of CDPU integration, yet are not well studied. Thus, we present the first end-to-end design/evaluation framework for CDPUs, including:
1. An open-source RTL-based CDPU generator that supports many run-time and compile-time parameters.
2. Integration into an open-source RISC-V SoC for rapid performance and silicon area evaluation across CDPU placements and parameters.
3. An open-source (de)compression benchmark, HyperCompressBench, that is representative of (de)compression usage in Google's fleet.
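To make the history window parameter concrete, here is a minimal software sketch of how window size can change the achieved compression ratio. It is only an illustration: it uses Python's standard `zlib` module (DEFLATE) as a stand-in, not the paper's CDPU generator or HyperCompressBench, and the helper names `compressed_size` and `make_sample` as well as the synthetic input are assumptions made for this example.

```python
import random
import zlib


def compressed_size(data: bytes, window_bits: int, level: int = 6) -> int:
    """Compress data with zlib using a history window of 2**window_bits bytes
    and return the size of the compressed output in bytes."""
    compressor = zlib.compressobj(level, zlib.DEFLATED, window_bits)
    return len(compressor.compress(data) + compressor.flush())


def make_sample(block_size: int = 8192, repeats: int = 8, seed: int = 0) -> bytes:
    """Build a synthetic input (hypothetical, not fleet data): one block of
    pseudo-random bytes repeated several times, so every match lies exactly
    block_size bytes back in the history."""
    rng = random.Random(seed)
    block = bytes(rng.randrange(256) for _ in range(block_size))
    return block * repeats


if __name__ == "__main__":
    data = make_sample()
    # zlib accepts window_bits from 9 (512 B) to 15 (32 KiB).
    for window_bits in (9, 12, 15):
        size = compressed_size(data, window_bits)
        print(f"window = {2 ** window_bits:>6} B -> {size} compressed bytes")
```

In this synthetic case the repeats sit 8 KiB apart, so only a window larger than 8 KiB can find them; smaller windows see effectively incompressible data. In hardware, a larger window generally implies more on-chip history storage, which is the kind of area versus compression-ratio tradeoff a design space exploration has to weigh.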

Using our framework, we perform an extensive design space exploration running HyperCompressBench. Our exploration spans a 46× range in CDPU speedup, 3× range in silicon area (for a single pipeline), and evaluates a variety of CDPU integration techniques to optimize CDPU designs for hyperscale contexts. Our final hyperscale-optimized CDPU instances are up to 10× to 16× faster than a single Xeon core, while consuming a small fraction (as little as 2.4% to 4.7%) of the area.

Research Areas

Algorithms and theory
