
Karanveer Anand
Karanveer Anand is a technical program manager with expertise in software infrastructure and reliability. He leverages deep technical understanding to drive complex projects, mitigating risks and ensuring system stability and scalability.
Authored Publications
Avoid global outages by partitioning cloud applications to reduce blast radius
https://cloud.google.com/ (2025)
Cloud application development faces the inherent challenge of balancing rapid innovation with high availability. This blog post details how Google Workspace's Site Reliability Engineering team addresses this conflict by implementing vertical partitioning of serving stacks. By isolating application servers and storage into distinct partitions, the "blast radius" of code changes and updates is significantly reduced, minimizing the risk of global outages. This approach, which complements canary deployments, enhances service availability, provides flexibility for experimentation, and facilitates data localization. While challenges such as data model complexities and inter-service partition misalignment exist, the benefits of improved reliability and controlled deployments make partitioning a crucial strategy for maintaining robust cloud applications.
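The blast-radius idea in this abstract can be illustrated with a small sketch. It is not the Workspace team's implementation: the partition count, the hash-based assignment, and the names `partition_for` and `affected_fraction` are all assumptions made for illustration. It shows only that when each user is pinned to one partition, a bad change rolled out to a single partition can reach only that partition's users.

```python
import hashlib

NUM_PARTITIONS = 4  # assumed partition count, purely illustrative


def partition_for(user_id: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Pin a user to one partition using a stable hash (hypothetical scheme)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def affected_fraction(users, bad_partitions, num_partitions: int = NUM_PARTITIONS) -> float:
    """Fraction of users served by partitions that received a bad change."""
    users = list(users)
    if not users:
        return 0.0
    hit = sum(1 for u in users if partition_for(u, num_partitions) in bad_partitions)
    return hit / len(users)


if __name__ == "__main__":
    users = [f"user-{i}" for i in range(10_000)]
    # A bad change confined to partition 0 reaches only that partition's users.
    print(affected_fraction(users, {0}))
    # The same change pushed to every partition reaches everyone.
    print(affected_fraction(users, set(range(NUM_PARTITIONS))))
```

With a roughly uniform hash, a single-partition rollout touches about one user in `NUM_PARTITIONS`, which is the sense in which partitioning bounds the impact of a faulty update.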
Project management à la SRE: How to juggle the needs of your project and production
https://cloud.google.com/ (2024)
Site Reliability Engineering (SRE) teams face unique project management challenges due to their dual responsibilities of supporting production environments and executing infrastructure projects. This paper explores the common issue of project delays caused by unexpected production incidents that divert SRE resources. Through a case study of a regionalization project, the author highlights the difficulties of adhering to timelines when engineers are frequently reassigned to address operational crises. To mitigate these challenges, the paper advocates for enhanced planning strategies, specifically reserving a percentage of engineering time for production work. Based on historical data, the author's team implemented a 25% buffer, significantly improving project delivery while maintaining focus on critical production incidents. Furthermore, the paper outlines best practices for Technical Program Managers (TPMs) in SRE, including proactive staffing, cross-service collaboration, early engagement, management of external dependencies, and consistent performance evaluation. By adopting these strategies, SRE teams can effectively balance project execution and production support, ensuring timely delivery and operational stability.
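The planning arithmetic behind a reserved production buffer can be sketched as follows. Only the 25% figure comes from the abstract; the team size, the durations, and the function names `project_capacity` and `weeks_needed` are hypothetical.

```python
def project_capacity(engineers: int, weeks: float, production_buffer: float = 0.25) -> float:
    """Engineer-weeks left for project work after reserving the production buffer."""
    if not 0.0 <= production_buffer < 1.0:
        raise ValueError("production_buffer must be in [0, 1)")
    return engineers * weeks * (1.0 - production_buffer)


def weeks_needed(project_effort: float, engineers: int, production_buffer: float = 0.25) -> float:
    """Calendar weeks needed to deliver `project_effort` engineer-weeks of work."""
    if engineers <= 0:
        raise ValueError("engineers must be positive")
    if not 0.0 <= production_buffer < 1.0:
        raise ValueError("production_buffer must be in [0, 1)")
    return project_effort / (engineers * (1.0 - production_buffer))


if __name__ == "__main__":
    # Hypothetical team: 4 engineers over a 12-week quarter.
    print(project_capacity(4, 12))  # engineer-weeks available for the project
    print(weeks_needed(36, 4))      # weeks to deliver 36 engineer-weeks of work
```

Planning against the reduced capacity, rather than the nominal headcount, is what lets the timeline absorb production interruptions without slipping.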
Evolution of Governance Framework With AI
DZone (2024) (to appear)
MidMortem should not be Optional
DZone (2024)
Incorporating a MidMortem is essential to project success. It helps an organization eliminate potential risks and implement the changes needed to reach project milestones and objectives.
Reliability Engineering in Cloud Computing: Strategies, Metrics, and Performance Assessment
International Journal of Multidisciplinary Research in Science, Engineering and Technology (2023)
Cloud computing has transformed computation, the sharing of information resources, and storage, including the flexibility to scale these resources for corporate use. Nevertheless, maintaining high reliability in cloud environments remains an unsolved problem because of factors such as hardware failures, network interruptions and slowdowns, and software vulnerabilities. This paper discusses several methods that can be employed in the reliability engineering of cloud computing, including fault tolerance, redundancy, monitoring, and predictive maintenance. It also elaborates on basic reliability measures such as Mean Time Between Failures (MTBF), Mean Time To Repair (MTTR), Service Availability, and Failure Rate, which quantify system reliability and effectiveness. Moreover, the paper considers performance assessment methodologies based on real-time monitoring, machine learning, and reliability assessment methods. It also addresses the nature and advancement of AI-powered automation and self-healing applications for improved cloud dependability. The work aims to identify the state of the art in cloud service dependability and to propose recommendations for minimizing costs, improving dependability levels, and reducing undesired downtime. The information is valuable for cloud service providers (CSPs), IT designers and architects, and system engineers who wish to create fault-tolerant and optimal cloud environments.
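The metrics named in this abstract are commonly related by the textbook steady-state formulas: availability = MTBF / (MTBF + MTTR) and, assuming a constant failure rate, failure rate = 1 / MTBF. The sketch below uses those standard relations; the paper's own definitions may differ, and the function names and input values are illustrative assumptions.

```python
HOURS_PER_YEAR = 8760.0  # 365 days; an assumption for the downtime estimate


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: MTBF / (MTBF + MTTR)."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("mtbf_hours must be > 0 and mttr_hours must be >= 0")
    return mtbf_hours / (mtbf_hours + mttr_hours)


def failure_rate(mtbf_hours: float) -> float:
    """Failures per hour, assuming a constant failure rate (1 / MTBF)."""
    if mtbf_hours <= 0:
        raise ValueError("mtbf_hours must be > 0")
    return 1.0 / mtbf_hours


def annual_downtime_hours(avail: float) -> float:
    """Expected hours of downtime per year at a given availability."""
    if not 0.0 <= avail <= 1.0:
        raise ValueError("avail must be in [0, 1]")
    return (1.0 - avail) * HOURS_PER_YEAR


if __name__ == "__main__":
    a = availability(mtbf_hours=999.0, mttr_hours=1.0)  # illustrative inputs
    print(a, failure_rate(999.0), annual_downtime_hours(a))
```

The formulas make the trade-off explicit: availability improves either by failing less often (higher MTBF) or by recovering faster (lower MTTR).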