
Peter H. Hochschild
Authored Publications
Silent Data Corruption by 10× Test Escapes Threatens Reliable Computing
Rama Govindaraju
Eric Liu
Subhasish Mitra
Mike Fuller
IEEE (2025) (to appear)
Summary: "Silent Data Corruption by 10× Test Escapes Threatens Reliable Computing" highlights a critical issue: manufacturing defects, dubbed "test escapes," are evading current testing methods at a rate at least ten times higher than industry targets. These defects lead to silent data corruption (SDC), where applications produce incorrect outputs without any error indication, costing companies significantly in debugging, data recovery, and service disruptions. The paper proposes a three-pronged approach: quick diagnosis of defective chips directly from system-level behaviors, in-field detection using advanced testing and error-detection techniques such as CASP, and new, rigorous test experiments to validate these solutions and improve manufacturing testing practices.
Silent Data Corruption by 10× Test Escapes Threatens Reliable Computing
Subhasish Mitra
Rama Govindaraju
Eric Liu
IEEE Design & Test (2025) (to appear)
Too many defective compute chips are escaping today’s manufacturing tests – at least an order of magnitude more than industrial targets across all compute chip types in data centers. Silent data corruptions (SDCs) caused by test escapes, when left unaddressed, pose a major threat to reliable computing. We present a three-pronged approach outlining future directions for overcoming test escapes: (a) Quick diagnosis of defective chips directly from system-level incorrect behaviors. Such diagnosis is critical for gaining insights into why so many defective chips escape existing manufacturing testing. (b) In-field detection of defective chips. (c) New test experiments to understand the effectiveness of new techniques for detecting defective chips. These experiments must overcome the drawbacks and pitfalls of previous industrial test experiments and case studies.
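A minimal sketch of the in-field detection prong (b), assuming hypothetical run_self_test and quarantine_core hooks; in practice, techniques such as CASP would supply the actual stored test patterns and golden outputs. This illustrates only the periodic re-test loop, not the authors' implementation.

    import os
    import time

    def run_self_test(core_id: int) -> bool:
        # Placeholder: a real test would pin itself to core_id, execute
        # stored test patterns, and compare results against golden outputs.
        return True

    def quarantine_core(core_id: int) -> None:
        # Placeholder: remove the core from the scheduler's usable set.
        print(f"core {core_id} failed in-field self-test; removing from service")

    def in_field_screening(period_s: float = 3600.0) -> None:
        cores = range(os.cpu_count() or 1)
        while True:  # re-test for the machine's lifetime: defects can develop with age
            for core in cores:
                if not run_self_test(core):
                    quarantine_core(core)
            time.sleep(period_s)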
SiliFuzz: Fuzzing CPUs by Proxy
arXiv (2021)
CPUs are getting more complex with every generation, on both the logical and the physical levels. Unsurprisingly, this leads to more bugs and defects in CPUs being overlooked during testing, which causes data corruption or other undesirable effects when these CPUs are used in production. Some defects may also be caused by aging.
If the RTL (“source code”) of a CPU were available, we could apply greybox fuzzing to the CPU model almost like any other software [Tri21]. However, our targets are general-purpose x86_64 CPUs produced by third parties, where we do not have the source, so in our case CPU implementations are opaque. Moreover, we are more interested in CPU defects (manufacturing problems that affect just one or several cores) as opposed to bugs (design problems that affect all cores of a given family of CPUs).
In this paper we present SiliFuzz, a work-in-progress system that finds CPU defects by fuzzing proxies, like CPU simulators or disassemblers, and then executing the accumulated test vectors (“corpus”) on actual CPUs on a large scale. The major difference between this work and traditional software fuzzing is that a software bug, once fixed, is fixed for all installations of that software, while with CPU defects we have to test every individual core repeatedly over its lifetime due to wear and tear. We also analyze four groups of CPU defects that SiliFuzz has uncovered, and the patterns shared by our other findings.
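A toy sketch of the corpus-replay half of such a pipeline, assuming a Linux host: each test in the corpus is a deterministic computation with an end state recorded on a known-good machine, and every core is exercised individually via CPU affinity. The Test abstraction and function names are illustrative; the real system replays register/memory snapshots rather than Python callables.

    import os
    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class Test:
        name: str
        body: Callable[[], int]   # deterministic computation under test
        expected: int             # end state recorded on a known-good machine

    def run_pinned(core_id: int, test: Test) -> bool:
        # Pin this process to one core (Linux-only), then check the result.
        os.sched_setaffinity(0, {core_id})
        return test.body() == test.expected

    def scan_all_cores(corpus: list[Test]) -> dict[int, list[str]]:
        # Map each suspect core to the tests whose results disagreed.
        failures: dict[int, list[str]] = {}
        for core in sorted(os.sched_getaffinity(0)):
            bad = [t.name for t in corpus if not run_pinned(core, t)]
            if bad:
                failures[core] = bad
        return failures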
Cores that don't count
Rama Krishna Govindaraju
Proc. 18th Workshop on Hot Topics in Operating Systems (HotOS 2021)
We are accustomed to thinking of computers as fail-stop, especially the cores that execute instructions, and most system software implicitly relies on that assumption. During most of the VLSI era, processors that passed manufacturing tests and were operated within specifications have insulated us from this fiction. As fabrication pushes towards smaller feature sizes and more elaborate computational structures, and as increasingly specialized instruction-silicon pairings are introduced to improve performance, we have observed ephemeral computational errors that were not detected during manufacturing tests. These defects cannot always be mitigated by techniques such as microcode updates, and may be correlated to specific components within the processor, allowing small code changes to effect large shifts in reliability. Worse, these failures are often "silent": the only symptom is an erroneous computation.
We refer to a core that develops such behavior as "mercurial." Mercurial cores are extremely rare, but in a large fleet of servers we can observe the correlated disruption they cause, often enough to see them as a distinct problem, one that will require collaboration between hardware designers, processor vendors, and systems software architects.
This paper is a call-to-action for a new focus in systems research; we speculate about several software-based approaches to mercurial cores, ranging from better detection and isolation mechanisms to methods for tolerating the silent data corruption they cause.
Please watch our short video summarizing the paper.
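One of the software-based approaches the paper speculates about, sketched under the assumption that duplicating work is affordable: run a pure computation on two different cores and treat any disagreement as possible silent corruption. run_on_core and checked are hypothetical names, not an API from the paper, and the affinity calls are Linux-only.

    import os
    from typing import Callable, TypeVar

    T = TypeVar("T")

    def run_on_core(core_id: int, fn: Callable[[], T]) -> T:
        # Pin the current process to core_id, run fn, then restore affinity.
        old = os.sched_getaffinity(0)
        os.sched_setaffinity(0, {core_id})
        try:
            return fn()
        finally:
            os.sched_setaffinity(0, old)

    def checked(fn: Callable[[], T], core_a: int, core_b: int) -> T:
        # Duplicate the computation; a mismatch suggests a mercurial core.
        a = run_on_core(core_a, fn)
        b = run_on_core(core_b, fn)
        if a != b:
            raise RuntimeError(
                f"cores {core_a} and {core_b} disagree: {a!r} != {b!r}")
        return a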
Sundial: Fault-tolerant Clock Synchronization for Datacenters
Minlan Yu
Prashant Chandra
Hema Hariharan
Dave Platt
Simon Sabato
14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20), USENIX Association (2020), pp. 1171-1186
Clock synchronization is critical for many datacenter applications such as distributed transactional databases, consistent snapshots, and network telemetry. As applications' performance requirements grow and datacenter networks move into the ultra-low-latency regime, we need a sub-microsecond bound on time-uncertainty to reduce transaction delay and enable new network-management applications (e.g., measuring one-way delay for congestion control). State-of-the-art clock-synchronization solutions focus on improving clock precision but may incur a significant time-uncertainty bound in the presence of failures. This significantly affects applications because, in large-scale datacenters, temperature-related, link, device, and domain failures are common. We present Sundial, a fault-tolerant clock-synchronization system for datacenters that achieves a ~100ns time-uncertainty bound under various types of failures. Sundial provides fast failure detection based on frequent synchronization messages in hardware. Sundial enables fast failure recovery using a novel graph-based algorithm to precompute a backup plan that is generic across failures. Through experiments on a >500-machine testbed and in large-scale simulations, we show that Sundial achieves a ~100ns time-uncertainty bound under different types of failures, more than two orders of magnitude lower than state-of-the-art solutions. We also demonstrate the benefit of Sundial on applications such as Spanner and Swift congestion control.
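A toy model (not Sundial's implementation) of why frequent hardware synchronization shrinks the time-uncertainty bound: between sync messages the bound must grow at the assumed worst-case oscillator drift rate, so it is dominated by how recently the clock was synchronized. The 200 ppm drift figure and the function below are assumptions for illustration only.

    MAX_DRIFT_PPM = 200.0  # assumed worst-case oscillator drift (illustrative)

    def uncertainty_bound_ns(residual_offset_ns: float,
                             ns_since_last_sync: float) -> float:
        # Upper bound on clock error: residual offset plus accumulated drift.
        return residual_offset_ns + ns_since_last_sync * MAX_DRIFT_PPM * 1e-6

    # Hardware sync every ~100 microseconds adds only ~20 ns of drift per
    # interval, keeping the bound near ~100 ns; software sync every ~100
    # milliseconds would add ~20 microseconds between messages.
    print(uncertainty_bound_ns(80.0, 100_000))      # ~100 ns
    print(uncertainty_bound_ns(80.0, 100_000_000))  # ~20,080 ns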
Spanner: Google's Globally Distributed Database
Michael Epstein
Andrew Fikes
Christopher Frost
J. J. Furman
Andrey Gubarev
Christopher Heiser
Sebastian Kanthak
Eugene Kogan
Hongyi Li
Sergey Melnik
David Mwaura
David Nagle
Rajesh Rao
Lindsay Rolig
Yasushi Saito
Michal Szymaniak
Christopher Taylor
Ruth Wang
Dale Woodford
ACM Trans. Comput. Syst., 31(3) (2013), Article 8