Forecasting Extreme Production Outages in Agile, Big Data and Machine Learning Services: Simple, Two-Parameter Software Reliability Models for Root Cause Insights

2025 IEEE International Conference on Big Data (BigData), IEEE (2025), pp. 3914-3923

Abstract

Time series forecasting models have diverse real-world applications, yet forecasting sporadic or spiky production outages of cloud computing services remains challenging. Traditional one-parameter Software Reliability Growth Models (SRGMs) are inadequate for accurately estimating outages in modern agile software environments for big data computing. This inadequacy stems from the continuous introduction and removal of defects, constantly evolving total defect counts, and non-constant defect detection rates in agile software, further complicated by operational issues such as release and deployment challenges that contribute to outages. In this paper, we address these limitations by optimizing a fundamental reliability model to estimate aggregated time series of sporadic, spiky production outages of big data machine learning (ML) services. Our analysis uses three years of production incident statistics from planet-scale services with billions of users. We conduct a comprehensive curve fitting study that compares daily, weekly, and monthly aggregated outage counts against 55 standard distribution functions. We empirically demonstrate that two-parameter distributions, specifically beta and wrapped Cauchy, consistently provide the best fit for total production outages across all granularities, highlighting the necessity of multi-parameter models for agile software reliability. Furthermore, by classifying outages by their root cause type (e.g., experiments, ML, and migration) based on manual post-mortem analyses, we find that root cause-specific outages often represent even more extreme events than total outage counts, requiring two- or multi-parameter models for accurate forecasting. This granular understanding is crucial for big data service operators (e.g., on-call engineers) to identify root causes and apply mitigation techniques in a timely manner.
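The curve fitting study described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the outage counts are synthetic, the 28-day "monthly" bucket is a simplification, the candidate list is a small hypothetical subset of the 55 distributions, and ranking by the Kolmogorov-Smirnov statistic is an assumed goodness-of-fit criterion. The helper name `rank_distributions` is likewise hypothetical. The sketch aggregates daily counts to coarser granularities, fits each candidate with SciPy's maximum likelihood `fit`, and orders the candidates by fit quality.

```python
import warnings

import numpy as np
from scipy import stats


def rank_distributions(counts, candidates):
    """Fit each named scipy.stats distribution and rank by KS statistic.

    Hypothetical helper: the KS statistic is an assumed criterion,
    not necessarily the one used in the paper.
    """
    data = np.asarray(counts, dtype=float)
    results = []
    for name in candidates:
        dist = getattr(stats, name)
        try:
            with warnings.catch_warnings():
                warnings.simplefilter("ignore")
                params = dist.fit(data)  # MLE over shape(s), loc, scale
                ks = stats.kstest(data, dist.cdf, args=params).statistic
        except Exception:
            continue  # skip candidates whose fit does not converge
        if np.isfinite(ks):
            results.append((name, float(ks), params))
    return sorted(results, key=lambda r: r[1])  # smaller KS = closer fit


# Synthetic, illustrative data only: mostly quiet days with rare spikes.
rng = np.random.default_rng(0)
n_days = 3 * 364  # three 52-week years, so weeks divide evenly
daily = rng.poisson(0.2, size=n_days)
spike_days = rng.random(n_days) < 0.02
daily = daily + spike_days * rng.integers(3, 10, size=n_days)

# Aggregate to coarser granularities (28-day blocks stand in for months).
weekly = daily.reshape(-1, 7).sum(axis=1)
monthly = daily.reshape(-1, 28).sum(axis=1)

candidates = ["expon", "norm", "gamma", "beta", "wrapcauchy"]
ranking = rank_distributions(weekly, candidates)
for name, ks, _ in ranking:
    print(f"{name:12s} KS={ks:.3f}")
```

The same call can be repeated on `daily` and `monthly`, or on counts filtered by root cause type, to compare which family fits best at each granularity. Note that SciPy's `fit` also estimates location and scale, so the number of fitted parameters here is larger than the shape-parameter count of each family.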