Meaningful availability

Dan Ardelean
Philipp Emanuel Hoffmann
Tamás Hauer
17th USENIX Symposium on Networked Systems Design and Implementation (NSDI'20) (2020)

Abstract

Accurate measurement of service availability is the cornerstone of good service management: it quantifies the gap between user expectation and system performance, and provides actionable data to prioritize development and operational tasks. We propose a novel metric, user-uptime, which is event-based but is time-sensitive and which approximates aggregated user-perceived reliability better than current metrics. For a holistic view of availability across timescales from minutes to months or quarters, we augment user-uptime with a novel aggregation and visualization paradigm: windowed uptime. Using an example from G Suite we demonstrate its effectiveness in differentiating between unreliability caused by flakiness and an extended outage.
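
The abstract does not spell out how the two metrics are computed. As a rough sketch only, assume user-uptime is the ratio of summed per-user up time to summed up plus down time, and windowed uptime is the worst such ratio over every window of a given length. The function names, the fixed-size time buckets, and the sample data below are all hypothetical and chosen purely for illustration.

```python
from typing import Optional, Sequence


def uptime_ratio(up: Sequence[float], down: Sequence[float]) -> Optional[float]:
    """Assumed user-uptime: total up time / (total up time + total down time)."""
    total = sum(up) + sum(down)
    # With no user activity at all, availability is undefined.
    return sum(up) / total if total > 0 else None


def windowed_uptime(up: Sequence[float], down: Sequence[float],
                    window: int) -> Optional[float]:
    """Worst uptime ratio over all contiguous windows of `window` buckets."""
    if len(up) != len(down):
        raise ValueError("up and down must have the same number of buckets")
    if not 1 <= window <= len(up):
        raise ValueError("window must be between 1 and the number of buckets")
    ratios = []
    for start in range(len(up) - window + 1):
        ratio = uptime_ratio(up[start:start + window], down[start:start + window])
        if ratio is not None:  # skip windows with no activity
            ratios.append(ratio)
    return min(ratios) if ratios else None


# Hypothetical data: ten buckets, each holding ten units of user time.
# "Flaky": every bucket loses one unit. "Outage": one bucket is fully down.
flaky_up, flaky_down = [9] * 10, [1] * 10
outage_up, outage_down = [10] * 9 + [0], [0] * 9 + [10]

for size in (1, 2, 10):
    print(size,
          windowed_uptime(flaky_up, flaky_down, size),
          windowed_uptime(outage_up, outage_down, size))
```

Both series have the same overall ratio, so a single aggregate number cannot tell them apart. Under these assumptions the short windows expose the difference: the flaky series stays at the same level for every window size, while the outage series drops to zero for the one-bucket window and recovers as the window grows.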