Borg: the Next Generation

Muhammad Tirmazi
Adam Barker
Md Ehtesam Haque
Zhijing Gene Qin
Mor Harchol-Balter
EuroSys'20, ACM, Heraklion, Crete (2020)

Abstract

This paper analyzes a newly published trace that covers 8
different Borg clusters for the month of May 2019. The
trace enables researchers to explore how scheduling works in
large-scale production compute clusters. We highlight how
Borg has evolved and perform a longitudinal comparison of
the newly published 2019 trace against the 2011 trace, which
has been highly cited within the research community.
Our findings show that Borg features such as alloc sets
are used for resource-heavy workloads; automatic vertical
scaling is effective; job dependencies account for a large
part of the high failure rates reported by prior studies;
the workload arrival rate has increased, as has the use of
resource over-commitment; the workload mix has changed, with
jobs migrating from the free tier into the best-effort batch
tier; the workload exhibits an extremely heavy-tailed
distribution where the top 1% of jobs consume over 99% of
resources; and there is a great deal of variation between
different clusters.
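The heavy-tail finding rests on a simple statistic: rank jobs by resource consumption and measure what fraction of the total the top 1% account for. The following is a minimal sketch in Python. It assumes per-job usage has already been aggregated into one number per job (for example, total CPU usage over the trace). The function name top_share and the toy input are hypothetical and are not taken from the trace or from the paper's analysis.

```python
from typing import Sequence


def top_share(usage: Sequence[float], top_percent: int = 1) -> float:
    """Fraction of total usage consumed by the top `top_percent`% of jobs.

    Hypothetical helper: `usage` holds one pre-aggregated, non-negative
    resource figure per job (an assumption, not a trace field).
    """
    if not usage:
        raise ValueError("usage must be non-empty")
    if not 0 < top_percent <= 100:
        raise ValueError("top_percent must be in (0, 100]")
    # Rank jobs from heaviest to lightest consumer.
    ranked = sorted(usage, reverse=True)
    n = len(ranked)
    # Integer ceiling division, so the top slice always holds at least one job.
    k = max(1, (n * top_percent + 99) // 100)
    total = sum(ranked)
    if total == 0:
        return 0.0
    return sum(ranked[:k]) / total


if __name__ == "__main__":
    # Toy data (hypothetical): 99 light jobs and one very heavy job.
    toy = [1.0] * 99 + [9999.0]
    print(f"top 1% share: {top_share(toy, 1):.4f}")
```

With this toy input, one job out of 100 uses 9,999 units and the other 99 use one unit each, so the top 1% (a single job) accounts for roughly 99.0% of the total. A uniform workload would give exactly 1%. The full sort costs O(n log n). For very large traces, `heapq.nlargest(k, usage)` would avoid sorting every job.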