Abhishek Verma

Abhishek Verma received his PhD in Computer Science from the University of Illinois at Urbana-Champaign in 2012. His thesis focused on performance modeling of MapReduce environments. He then worked on Borg at Google until 2014, and subsequently designed and implemented a scheduler for running Cassandra on Mesos at Uber until 2017. His current focus is on using machine learning to optimize resource and infrastructure allocation.
Authored Publications
Large-scale cluster management at Google with Borg
Luis Pedrosa, Madhukar R. Korupolu, David Oppenheimer
Proceedings of the European Conference on Computer Systems (EuroSys), ACM, Bordeaux, France (2015)
Google's Borg system is a cluster manager that runs hundreds of thousands of jobs, from many thousands of different applications, across a number of clusters each with up to tens of thousands of machines. It achieves high utilization by combining admission control, efficient task-packing, over-commitment, and machine sharing with process-level performance isolation. It supports high-availability applications with runtime features that minimize fault-recovery time, and scheduling policies that reduce the probability of correlated failures. Borg simplifies life for its users by offering a declarative job specification language, name service integration, real-time job monitoring, and tools to analyze and simulate system behavior. We present a summary of the Borg system architecture and features, important design decisions, a quantitative analysis of some of its policy decisions, and a qualitative examination of lessons learned from a decade of operational experience with it.
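
The abstract mentions a declarative job specification language but does not reproduce it. As a rough aid to intuition, the sketch below models in plain Python the kind of information such a declaration carries: per-task resource requests, a replica count, and a priority the scheduler can use when admitting and packing tasks. All field names and values are hypothetical assumptions for illustration, not Borg's actual specification language.

```python
# Illustrative sketch only: a hypothetical, declarative description of a
# replicated job, loosely modeled on the ideas in the Borg abstract
# (per-task resource requests, replica count, priority-based admission).
# Field names and values are assumptions, not Borg's job language.

from dataclasses import dataclass


@dataclass
class JobSpec:
    name: str
    binary: str        # program to run in each task
    cpu_cores: float   # per-task CPU request
    ram_mib: int       # per-task RAM request
    disk_mib: int      # per-task disk request
    replicas: int      # number of identical tasks to run
    priority: int      # higher-priority work is admitted and packed first


hello_world = JobSpec(
    name="hello_world",
    binary="/path/to/hello_world_webserver",
    cpu_cores=0.1,
    ram_mib=100,
    disk_mib=100,
    replicas=10_000,
    priority=100,
)

if __name__ == "__main__":
    # Total resources a scheduler would need to pack across machines.
    print(f"{hello_world.name}: {hello_world.replicas} tasks, "
          f"{hello_world.replicas * hello_world.cpu_cores:.0f} cores, "
          f"{hello_world.replicas * hello_world.ram_mib / 1024:.0f} GiB RAM")
```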
Evaluating job packing in warehouse-scale computing
Madhukar R. Korupolu, John Wilkes
IEEE International Conference on Cluster Computing (CLUSTER), Madrid, Spain (2014)
One of the key factors in selecting a good scheduling algorithm is using an appropriate metric for comparing schedulers. But which metric should be used when evaluating schedulers for warehouse-scale (cloud) clusters, which have machines of different types and sizes, heterogeneous workloads with dependencies and constraints on task placement, and long-running services that consume a large fraction of the total resources? Traditional scheduler evaluations that focus on metrics such as queuing delay, makespan, and running time fail to capture important behaviors, and evaluations that rely on workload synthesis and scaling often ignore important factors such as constraints. This paper explains some of the complexities and issues in evaluating warehouse-scale schedulers, focusing on what we find to be the single most important aspect in practice: how well they pack long-running services into a cluster. We describe and compare four metrics for evaluating the packing efficiency of schedulers, in increasing order of sophistication: aggregate utilization, hole filling, workload inflation, and cluster compaction.
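
Since the abstract only names the four metrics, the following toy sketch shows how the two simplest ones might be computed over a cluster snapshot. The Machine model, the slot size, and the exact formulas are assumptions made here for illustration; they are not the paper's definitions or code.

```python
# Illustrative sketch only: two of the simpler packing metrics named in the
# abstract, computed over a toy two-resource cluster model. The definitions
# below are assumptions for illustration, not the paper's.

from dataclasses import dataclass


@dataclass
class Machine:
    cpu_capacity: float
    ram_capacity: float
    cpu_used: float = 0.0
    ram_used: float = 0.0


def aggregate_utilization(machines: list[Machine]) -> tuple[float, float]:
    """Fraction of total CPU and RAM capacity that is currently allocated."""
    cpu = sum(m.cpu_used for m in machines) / sum(m.cpu_capacity for m in machines)
    ram = sum(m.ram_used for m in machines) / sum(m.ram_capacity for m in machines)
    return cpu, ram


def hole_filling(machines: list[Machine], slot_cpu: float, slot_ram: float) -> int:
    """How many fixed-size 'slots' fit into the leftover space on each machine."""
    slots = 0
    for m in machines:
        cpu_free = m.cpu_capacity - m.cpu_used
        ram_free = m.ram_capacity - m.ram_used
        slots += int(min(cpu_free / slot_cpu, ram_free / slot_ram))
    return slots


if __name__ == "__main__":
    cluster = [
        Machine(cpu_capacity=32, ram_capacity=128, cpu_used=20, ram_used=100),
        Machine(cpu_capacity=32, ram_capacity=128, cpu_used=8, ram_used=32),
    ]
    cpu_util, ram_util = aggregate_utilization(cluster)
    print(f"CPU {cpu_util:.0%}, RAM {ram_util:.0%}, "
          f"free 1-core/4-GiB slots: {hole_filling(cluster, 1, 4)}")
```

Aggregate utilization ignores where free resources sit, while hole filling at least checks whether the leftover space is usable by a standard-sized task; the paper's more sophisticated metrics (workload inflation and cluster compaction) go further than this sketch does.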