Making “Push On Green” a Reality: Issues & Actions Involved in Maintaining a Production Service

Dina M. Betser
Mathew G. Monroe
;login:, vol. 39, no. 5 (2014), pp. 26-32

Abstract

Updating production software is a process that may require dozens, if not hundreds, of steps. These include creating and testing the new code, building new binaries and packages, associating the packages with a versioned release, updating the jobs in production datacenters, possibly modifying database schemata, and testing and verifying the results. There are boxes to check and approvals to seek, and the more automated the process, the easier it becomes. When releases can be made faster, it is possible to release more often, and organizationally, one becomes less afraid to “release early, release often”. This is the fundamental driving force behind the work described in this paper – making rollouts as easy and as automated as possible, so that when a “green” condition (defined below) is detected, we can more quickly perform a new rollout. Humans may still be needed somewhere in the loop, but we strive to reduce the purely mechanical toil they need to perform.

This paper describes how we, as Site Reliability Engineers working on several different Ads and Commerce services at Google, do this, and shares information to help other organizations do the same. We define Push On Green and describe the development and deployment of best practices that serve as a foundation for this kind of undertaking. Using a “sample service” at Google as an example, we look at the historical development of the mechanization of the rollout process, and discuss the steps taken to further automate it. We then examine the steps remaining, both near- and long-term, as we continue to gain experience and advance the process towards full automation. We conclude with a set of concrete recommendations for other groups wishing to implement a Push On Green system that keeps production systems not only up and running, but also updated with as little engineer involvement and user-visible downtime as possible.
