Avoid global outages by partitioning cloud applications to reduce blast radius
Abstract
Cloud application development faces the inherent challenge of balancing rapid innovation with high availability. This blog post details how Google Workspace's Site Reliability Engineering team addresses this conflict by implementing vertical partitioning of serving stacks. By isolating application servers and storage into distinct partitions, the "blast radius" of code changes and updates is significantly reduced, minimizing the risk of global outages. This approach, which complements canary deployments, enhances service availability, provides flexibility for experimentation, and facilitates data localization. While challenges such as data model complexities and inter-service partition misalignment exist, the benefits of improved reliability and controlled deployments make partitioning a crucial strategy for maintaining robust cloud applications