
Karanveer Anand
Karanveer Anand is a technical program manager with expertise in software infrastructure and reliability. He leverages deep technical understanding to drive complex projects, mitigating risks and ensuring system stability and scalability.
Authored Publications
Avoid global outages by partitioning cloud applications to reduce blast radius
https://cloud.google.com/ (2025)
Cloud application development faces the inherent challenge of balancing rapid innovation with high availability. This blog post details how Google Workspace's Site Reliability Engineering team addresses this conflict by implementing vertical partitioning of serving stacks. By isolating application servers and storage into distinct partitions, the "blast radius" of code changes and updates is significantly reduced, minimizing the risk of global outages. This approach, which complements canary deployments, enhances service availability, provides flexibility for experimentation, and facilitates data localization. While challenges such as data model complexities and inter-service partition misalignment exist, the benefits of improved reliability and controlled deployments make partitioning a crucial strategy for maintaining robust cloud applications.
MidMortem should not be Optional
DZone (2024)
Incorporating a MidMortem is essential to project success. It helps keep a project organized by eliminating potential risks and implementing the changes needed to reach project milestones and objectives.
Project management à la SRE: How to juggle the needs of your project and production
https://cloud.google.com/ (2024)
Site Reliability Engineering (SRE) teams face unique project management challenges due to their dual responsibilities of supporting production environments and executing infrastructure projects. This paper explores the common issue of project delays caused by unexpected production incidents that divert SRE resources. Through a case study of a regionalization project, the author highlights the difficulties of adhering to timelines when engineers are frequently reassigned to address operational crises. To mitigate these challenges, the paper advocates for enhanced planning strategies, specifically reserving a percentage of engineering time for production work. Based on historical data, the author's team implemented a 25% buffer, significantly improving project delivery while maintaining focus on critical production incidents. Furthermore, the paper outlines best practices for Technical Program Managers (TPMs) in SRE, including proactive staffing, cross-service collaboration, early engagement, management of external dependencies, and consistent performance evaluation. By adopting these strategies, SRE teams can effectively balance project execution and production support, ensuring timely delivery and operational stability.
Evolution of Governance Framework With AI
DZone (2024) (to appear)