Steven Hand
Authored Publications
GASS: GPU Automated Sharing at Scale
Dragos Sbirlea
Jiafan Zhu
Konstantinos Menychtas
Yuang Liu
Zhijing Gene Qin
The IEEE International Conference on Cloud Computing (CLOUD) 2024 (2024)
General-purpose GPUs, with their powerful numerical computing capacity, are popular platforms for accelerating machine-learning workloads. However, our experience with a large-scale production deployment shows that typical GPU workloads often fail to keep the GPU pipeline fully occupied, resulting in low overall resource utilization. To address this inefficiency, we have designed and implemented GPU Automated Sharing at Scale (GASS). GASS relies on fine-grained time-multiplexing to share GPU compute resources among different tasks, and on on-demand paging to share GPU memory among them. GASS mitigates sharing-induced performance anomalies by using real-time performance monitoring to drive adaptive rescheduling. Our cluster-level evaluation shows that aggregate GPU throughput increases by 50% under GASS and that sharing enables the cluster to support 19% more GPU jobs.
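As a rough illustration of the adaptive-rescheduling idea, the sketch below monitors per-task throughput under sharing and flags tasks whose slowdown exceeds a threshold. The task structure, threshold, and sampling scheme are assumptions for illustration, not the GASS implementation.

```python
# Hypothetical sketch of adaptive rescheduling driven by performance monitoring:
# compare each task's throughput on a shared GPU against its solo baseline and
# evict tasks that degrade too much. All names and thresholds are illustrative.
from dataclasses import dataclass

@dataclass
class TaskSample:
    task_id: str
    solo_throughput: float    # steps/sec measured before sharing (baseline)
    shared_throughput: float  # steps/sec observed while sharing a GPU

def tasks_to_reschedule(samples, max_slowdown=0.2):
    """Return ids of tasks whose slowdown under sharing exceeds the threshold."""
    victims = []
    for s in samples:
        slowdown = 1.0 - s.shared_throughput / s.solo_throughput
        if slowdown > max_slowdown:
            victims.append(s.task_id)
    return victims

if __name__ == "__main__":
    window = [
        TaskSample("train-a", solo_throughput=100.0, shared_throughput=92.0),
        TaskSample("train-b", solo_throughput=80.0, shared_throughput=55.0),
    ]
    print(tasks_to_reschedule(window))  # ['train-b'] -> move back to a dedicated GPU
```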
Pathways: Asynchronous Distributed Dataflow for ML
Aakanksha Chowdhery
Ruoming Pang
Sudip Roy
Brennan Saeta
Parker Edward Schuh
Ryan Sepassi
MLSys 2022 (2022)
We present the design of a new large-scale orchestration layer for accelerators. Our system, Pathways, is explicitly designed to enable exploration of new systems and ML research ideas, while retaining state-of-the-art performance for current models. Pathways uses a sharded dataflow graph of asynchronous operators that consume and produce futures, and efficiently gang-schedules heterogeneous parallel computations on thousands of accelerators while coordinating data transfers over their dedicated interconnects. Pathways makes use of a novel asynchronous distributed dataflow design that lets the control plane execute in parallel despite dependencies in the data plane. This design, with careful engineering, allows Pathways to adopt a single-controller model that makes it easier to express complex new parallelism patterns. We demonstrate that Pathways can achieve performance parity (~100% accelerator utilization) with state-of-the-art systems when running SPMD computations over 2048 TPUs, while also delivering throughput comparable to the SPMD case for Transformer models that are pipelined across 16 stages, or sharded across two islands of accelerators connected over a data center network.
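The following minimal sketch illustrates the "operators consume and produce futures" pattern with a single controller that enqueues dependent work without blocking on upstream results. The names and thread-pool mechanics are illustrative assumptions, not the Pathways API.

```python
# Sketch of asynchronous operators that consume and produce futures, assuming a
# single controller loop that builds the dependency chain without waiting for
# data-plane results. Structure and names are illustrative only.
from concurrent.futures import ThreadPoolExecutor

pool = ThreadPoolExecutor(max_workers=4)

def run_op(name, *input_futures):
    """Schedule an operator whose inputs are futures; returns a future output."""
    def _body():
        inputs = [f.result() for f in input_futures]  # data-plane dependency resolved here
        return f"{name}({', '.join(inputs)})"
    return pool.submit(_body)                         # control plane returns immediately

if __name__ == "__main__":
    a = run_op("load_shard")
    b = run_op("forward", a)     # enqueued before `a` has produced a value
    c = run_op("all_reduce", b)  # the controller never blocks while building the graph
    print(c.result())            # only the final consumer waits: all_reduce(forward(load_shard()))
```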
Borg: the Next Generation
Muhammad Tirmazi
Adam Barker
Md Ehtesam Haque
Zhijing Gene Qin
Mor Harchol-Balter
EuroSys'20, ACM, Heraklion, Crete (2020)
This paper analyzes a newly-published trace that covers 8 different Borg clusters for the month of May 2019. The trace enables researchers to explore how scheduling works in large-scale production compute clusters. We highlight how Borg has evolved and perform a longitudinal comparison of the newly-published 2019 trace against the 2011 trace, which has been highly cited within the research community. Our findings show that Borg features such as alloc sets are used for resource-heavy workloads; automatic vertical scaling is effective; job dependencies account for much of the high failure rates reported by prior studies; the workload arrival rate has increased, as has the use of resource over-commitment; the workload mix has changed, with jobs migrating from the free tier into the best-effort batch tier; the workload exhibits an extremely heavy-tailed distribution in which the top 1% of jobs consume over 99% of resources; and there is a great deal of variation between different clusters.
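To make the heavy-tail finding concrete, the short sketch below computes the share of total resource usage attributable to the top 1% of jobs; the synthetic data and job representation are assumptions, and only the arithmetic mirrors the kind of measurement described.

```python
# Sketch of the concentration measurement behind the "top 1% of jobs consume
# over 99% of resources" finding. The synthetic heavy-tailed data is illustrative,
# not the Borg trace.
def top_share(job_usage, top_fraction=0.01):
    """Fraction of total usage attributable to the top `top_fraction` of jobs."""
    usage = sorted(job_usage, reverse=True)
    k = max(1, int(len(usage) * top_fraction))
    return sum(usage[:k]) / sum(usage)

if __name__ == "__main__":
    import random
    random.seed(0)
    # 10,000 synthetic jobs with a heavy-tailed (Pareto-like) usage distribution.
    usage = [random.paretovariate(1.1) for _ in range(10_000)]
    print(f"top 1% of jobs use {top_share(usage):.1%} of resources")
```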
Autopilot: Workload Autoscaling at Google Scale
Paweł Findeisen
Jacek Świderski
Przemyslaw Broniek
Beata Strack
Piotr Witusowski
Proceedings of the Fifteenth European Conference on Computer Systems, Association for Computing Machinery (2020)
In many public and private Cloud systems, users need to specify a limit for the amount of resources (CPU cores and RAM) to provision for their workloads. A job that exceeds its limits might be throttled or killed, resulting in delaying or dropping end-user requests, so human operators naturally err on the side of caution and request a larger limit than the job needs. At scale, this results in massive aggregate resource wastage.
To address this, Google uses Autopilot to configure resources automatically, adjusting both the number of concurrent tasks in a job (horizontal scaling) and the CPU/memory limits for individual tasks (vertical scaling). Autopilot walks the same fine line as human operators: its primary goal is to reduce slack – the difference between the limit and the actual resource usage – while minimizing the risk that a task is killed with an out-of-memory (OOM) error or its performance degraded because of CPU throttling. Autopilot uses machine learning algorithms applied to historical data about prior executions of a job, plus a set of finely-tuned heuristics, to walk this line. In practice, Autopiloted jobs have a slack of just 23%, compared with 46% for manually-managed jobs. Additionally, Autopilot reduces the number of jobs severely impacted by OOMs by a factor of 10.
Despite its advantages, ensuring that Autopilot was widely adopted took significant effort, including making potential recommendations easily visible to customers who had yet to opt in, automatically migrating certain categories of jobs, and adding support for custom recommenders. At the time of writing, Autopiloted jobs account for over 48% of Google's fleet-wide resource usage.
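A simplified sketch of the vertical-scaling idea follows: derive a limit from a task's recent usage history and measure the resulting slack (limit minus usage). The percentile rule and safety margin below are illustrative stand-ins, not Autopilot's actual ML recommenders or heuristics.

```python
# Illustrative vertical-scaling sketch: recommend a resource limit from recent
# usage samples and compare the slack against a manually chosen limit.
# The percentile rule and margin are assumptions, not Autopilot's algorithm.
def recommend_limit(usage_history, percentile=95, safety_margin=1.1):
    """Recommend a resource limit from historical usage samples."""
    ranked = sorted(usage_history)
    idx = min(len(ranked) - 1, int(len(ranked) * percentile / 100))
    return ranked[idx] * safety_margin

def average_slack(usage_history, limit):
    """Mean relative slack: how much of the limit goes unused on average."""
    return sum((limit - u) / limit for u in usage_history) / len(usage_history)

if __name__ == "__main__":
    samples_mib = [510, 530, 640, 700, 560, 820, 760, 690, 605, 580]
    manual_limit = 2048                        # operator errs on the side of caution
    auto_limit = recommend_limit(samples_mib)  # ~902 MiB with these samples
    print(f"manual slack: {average_slack(samples_mib, manual_limit):.0%}")
    print(f"autopilot-style slack: {average_slack(samples_mib, auto_limit):.0%}")
```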
Firmament: Fast, Centralized Cluster Scheduling at Scale
Ionel Gog
Malte Schwarzkopf
Adam Gleave
Robert N. M. Watson
12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), USENIX Association (2016), pp. 99-115
Centralized datacenter schedulers can make high-quality placement decisions when scheduling tasks in a cluster. Today, however, high-quality placements come at the cost of high latency at scale, which degrades response time for interactive tasks and reduces cluster utilization. This paper describes Firmament, a centralized scheduler that scales to over ten thousand machines at sub-second placement latency even though it continuously reschedules all tasks via a min-cost max-flow (MCMF) optimization. Firmament achieves low latency by using multiple MCMF algorithms, by solving the problem incrementally, and via problem-specific optimizations. Experiments with a Google workload trace from a 12,500-machine cluster show that Firmament improves placement latency by 20× over Quincy [22], a prior centralized scheduler using the same MCMF optimization. Moreover, even though Firmament is centralized, it matches the placement latency of distributed schedulers for workloads of short tasks. Finally, Firmament exceeds the placement quality of four widely-used centralized and distributed schedulers on a real-world cluster, and hence improves batch task response time by 6×.
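The toy formulation below shows how task placement can be expressed as a min-cost max-flow problem in the spirit of the abstract: tasks supply flow, machines forward it to a sink, and edge costs encode placement preference. The cost model, slot capacities, and the off-the-shelf networkx solver are assumptions; Firmament relies on its own incremental MCMF solvers and a richer flow network.

```python
# Toy min-cost max-flow task placement. Each task supplies one unit of flow,
# machines forward flow to a sink with limited slots, and edge weights encode
# placement cost; the solver picks the cheapest feasible assignment.
import networkx as nx

def place(tasks, machines, cost, slots_per_machine=2):
    g = nx.DiGraph()
    for t in tasks:
        g.add_node(t, demand=-1)               # each task supplies one unit of flow
    g.add_node("sink", demand=len(tasks))      # all flow must reach the sink
    for m in machines:
        g.add_edge(m, "sink", capacity=slots_per_machine, weight=0)
        for t in tasks:
            g.add_edge(t, m, capacity=1, weight=cost[(t, m)])
    flow = nx.min_cost_flow(g)
    return {t: m for t in tasks for m in machines if flow[t][m] == 1}

if __name__ == "__main__":
    tasks, machines = ["t1", "t2", "t3"], ["m1", "m2"]
    cost = {("t1", "m1"): 1, ("t1", "m2"): 5,
            ("t2", "m1"): 4, ("t2", "m2"): 2,
            ("t3", "m1"): 3, ("t3", "m2"): 3}
    print(place(tasks, machines, cost))  # e.g. {'t1': 'm1', 't2': 'm2', 't3': 'm1'}
```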