Optimal Probing with Statistical Guarantees for Network Monitoring at Scale

Branislav Kveton
Jehangir Amjad
Dimitris Konomis
Augustin Soule
Shawn Yang
Computer Communication, 192 (2022), pp. 119-131 (to appear)

Abstract

Monitoring large-scale cloud networks is a complex task because their scale is prohibitively large, monitoring budgets are limited, network topologies are not entirely regular and the estimates produced are a function of traffic patterns. In this work, we take a statistical approach to estimating a network metric, such as the latency of a set of paths, with guarantees on the estimation error. We aim to do so in an intelligent and scalable manner, without observing all existing traffic, and minimizing the estimation error at a fixed probing budget per unit of time. Our proposed algorithms produce a distribution of probes/samples across network paths which can be used in conjunction with existing probers (or samplers). These algorithms are based on A- and E-optimal experimental designs in statistics, which guarantee a bounded estimation error for any monitoring budget. Unfortunately, these designs are too computationally intensive to be used in production at scale. We propose a scalable and near-optimal approximate implementations based on the Frank-Wolfe algorithm. We validate our approaches with two metrics (latency and loss) in simulations on real network topologies, and also using a production probing system in a real cloud network. We show major gains in reducing the probing budget compared to both production and academic baselines, while maintaining low errors in estimates, even with very low probing budgets.

Research Areas