Snap & Replay: A new way to analyze uarch-scale performance bottlenecks for ML accelerators

Amanda Tomlinson
Asaf Cidon
Baris Kasikci
Ofir Weisse
Proceedings of the 2025 ACM Symposium on Cloud Computing, Association for Computing Machinery, 283–298

Abstract

As models become larger, ML accelerators are a scarce resource whose performance must be continually optimized to improve efficiency. Existing performance analysis tools are coarse-grained and fail to capture model performance at the machine-code level. In addition, these tools often do not provide specific recommendations for optimizations. We present SnR, a fine-grained methodology for analyzing ML models at the machine-code level that provides actionable optimization suggestions. Our core insight is to use a hardware-level simulator, an artifact of the hardware design process that we can repurpose for performance analysis. SnR captures traces from production deployments running on accelerators and replays them in a modified microarchitecture simulator to gain low-level insights into the model's performance. We implement SnR for our in-house accelerator and use it to analyze the performance of several of our production LLMs, revealing several previously unknown microarchitecture inefficiencies. Based on these findings, we implement optimizations that have decreased token generation latency for our already highly optimized production LLMs by up to 4.1%.

(*) Ioannis Zarkadas and Amanda Tomlinson are equal-contribution co-authors of this work.