Kelp: QoS for Accelerators in Machine Learning Platforms

Haishan Zhu; David Lo; Liqun Cheng; Rama Govindaraju; Parthasarathy Ranganathan; Mattan Erez

International Symposium on High Performance Computer Architecture (2019)

Abstract

Development and deployment of machine learning (ML) accelerators in Warehouse Scale Computers (WSCs) demand significant capital investments and engineering efforts. However, even though heavy computation can be offloaded to the accelerators, applications often depend on the host system for various supporting tasks. As a result, contention on host resources, such as memory bandwidth, can significantly discount the performance and efficiency gains of accelerators. The impact of performance interference is further amplified in distributed learning for large models.

In this work, we study the performance of four production machine learning workloads on three accelerator platforms. Our experiments show that these workloads are highly sensitive to host memory bandwidth contention, which can cause 40% average performance degradation when left unmanaged. To tackle this problem, we design and implement Kelp, a software runtime that isolates high-priority accelerated ML tasks from memory resource interference. We evaluate Kelp with both production and artificial aggressor workloads, and compare its effectiveness with previously proposed solutions. Our evaluation shows that Kelp is effective in mitigating performance degradation of the accelerated tasks, and improves performance by 24% on average. Compared to previous work, Kelp reduces performance degradation of ML tasks by 7% and improves system efficiency by 17%. Our results further expose opportunities in future architecture designs.
