Project management à la SRE: How to juggle the needs of your project and production

https://cloud.google.com/ (2024)

Abstract



Site Reliability Engineering (SRE) teams face distinctive project management challenges because they carry two responsibilities at once: supporting production environments and executing infrastructure projects. This paper examines a common problem: project delays caused by unexpected production incidents that pull SRE engineers away from planned work. Through a case study of a regionalization project, the author shows how difficult it is to hold to a timeline when engineers are frequently reassigned to operational crises. To mitigate this, the paper advocates planning that explicitly reserves a percentage of engineering time for production work. Based on historical data, the author's team adopted a 25% buffer, which significantly improved project delivery while keeping the team focused on critical production incidents. The paper also outlines best practices for Technical Program Managers (TPMs) in SRE: proactive staffing, cross-service collaboration, early engagement, management of external dependencies, and consistent performance evaluation. By adopting these strategies, SRE teams can balance project execution with production support, delivering on time while maintaining operational stability.
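The effect of a reserved production buffer on a project timeline can be illustrated with simple arithmetic. The following is a minimal sketch, not the author's planning tool: the function name `buffered_duration_weeks`, its parameters, and the sample figures (24 engineer-weeks of work, 4 engineers) are hypothetical and chosen only for illustration. The 25% default is the buffer reported in the abstract.

```python
def buffered_duration_weeks(
    effort_engineer_weeks: float,
    engineers: int,
    production_buffer: float = 0.25,
) -> float:
    """Estimate calendar weeks for a project when a fraction of each
    engineer's time is reserved for production work.

    All names and inputs here are illustrative assumptions.
    """
    if engineers <= 0:
        raise ValueError("engineers must be positive")
    if not 0 <= production_buffer < 1:
        raise ValueError("production_buffer must be in [0, 1)")

    # Only the unreserved share of each engineer's week goes to the project.
    project_capacity_per_week = engineers * (1 - production_buffer)
    return effort_engineer_weeks / project_capacity_per_week


# Hypothetical project: 24 engineer-weeks of work, staffed by 4 engineers.
naive = buffered_duration_weeks(24, 4, production_buffer=0.0)
planned = buffered_duration_weeks(24, 4)  # 25% reserved for production

print(f"No buffer:  {naive:.1f} weeks")    # 6.0 weeks
print(f"25% buffer: {planned:.1f} weeks")  # 8.0 weeks
```

Under these assumed inputs, the plan with the buffer commits to a longer schedule up front, but one that already accounts for production interruptions instead of absorbing them as unplanned delay.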