GASS: GPU Automated Sharing at Scale
Abstract
General-purpose GPUs, with their powerful numerical computing capacity, are popular platforms for accelerating machine-learning workloads. However, our experience with a large-scale production deployment shows that typical GPU workloads often fail to keep the GPU pipeline fully occupied, resulting in low overall resource utilization. To address this inefficiency, we have designed and implemented GPU Automated Sharing at Scale (GASS). GASS relies on fine-grained time-multiplexing to share GPU compute resources among tasks, and on on-demand paging to share GPU memory among them. GASS mitigates sharing-induced performance anomalies by using real-time performance monitoring to drive adaptive rescheduling. Our cluster-level evaluation shows that GASS increases aggregate GPU throughput by 50% and that sharing enables the cluster to support 19% more GPU jobs.