Network Error Logging: Client-side measurement of end-to-end web service reliability

Ben Jones
Brian Rogan
Charles Stahl
Douglas Creager
Harsha V. Madhyastha
Ilya Grigorik
Julia Elizabeth Tuttle
Lily Chen
Misha Efimov
17th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2020

Abstract

We present NEL (Network Error Logging), Google’s planet-scale, client-side network reliability measurement system. NEL is implemented in Chrome and has been proposed as a new W3C standard, letting any web site operator collect reports of clients’ successful and failed requests to their sites. These reports are similar to web server logs, but they also include information about failed requests that never reach the serving infrastructure. Reports are uploaded via redundant failover paths, reducing the likelihood that a report upload shares fate with the failure it describes. We have used NEL to monitor all of Google’s domains since 2014, allowing us to detect and investigate instances of DNS hijacking, BGP route leaks, protocol deployment bugs, and other problems where packets might never reach our servers. This paper presents the design of NEL, case studies of real outages, and deployment lessons for other operators who choose to use NEL to monitor their traffic.
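As a rough illustration of how an operator might opt in, the sketch below builds the two response headers described in the W3C Network Error Logging and Reporting API drafts: `Report-To`, which names a group of collector endpoints, and `NEL`, which points at that group and sets sampling rates. The function name `build_nel_headers`, the group name, and the example URLs are hypothetical, and listing endpoints with increasing `priority` values is only one way to express failover; it is not a description of Google's deployment.

```python
import json


def build_nel_headers(group, endpoint_urls, max_age=2592000,
                      success_fraction=0.0, failure_fraction=1.0):
    """Return Report-To and NEL header values as a dict (illustrative sketch).

    Endpoints are given increasing priority numbers, so a client tries the
    first URL and falls back to later ones only if earlier ones fail.
    """
    report_to = {
        "group": group,
        "max_age": max_age,  # seconds the client should remember this policy
        "endpoints": [
            {"url": url, "priority": index + 1}
            for index, url in enumerate(endpoint_urls)
        ],
    }
    nel = {
        "report_to": group,  # must match the Report-To group name
        "max_age": max_age,
        "success_fraction": success_fraction,  # sample rate for successes
        "failure_fraction": failure_fraction,  # sample rate for failures
    }
    return {"Report-To": json.dumps(report_to), "NEL": json.dumps(nel)}


if __name__ == "__main__":
    # Hypothetical collectors hosted on separate infrastructure.
    headers = build_nel_headers(
        "network-errors",
        ["https://collector-a.example/upload", "https://collector-b.example/upload"],
    )
    for name, value in headers.items():
        print(f"{name}: {value}")
```

Hosting the collectors on infrastructure separate from the monitored site is what keeps a report upload from failing for the same reason as the request it describes.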