Evaluation Metrics of Service-Level Reliability Monitoring Rules of a Big Data Service

Keun Soo Yim

In Proceedings of the IEEE International Symposium on Software Reliability Engineering (ISSRE) (2016), pp. 376-387

Abstract

This paper presents new metrics for evaluating the reliability monitoring rules of a large-scale big data service. Our target service uses manually tuned, service-level reliability monitoring rules. Using measurement data, we identify two key technical challenges in operating the target monitoring system. To improve operational efficiency, we characterize how domain experts manually tuned those rules. The characterization results provide useful information to operators who are expected to tune such rules regularly. Using actual production failure data, we evaluate the same monitoring rules with both standard metrics and the presented metrics. The evaluation results show the strengths and weaknesses of each metric, and show that the presented metrics can further help operators recognize when rules need to be re-tuned and which ones.

Research Areas

Software systems
