Peter H. Hochschild
Authored Publications
CPUs are getting more complex with every generation, on both the logical and the physical levels. Unsurprisingly, this leads to more bugs and defects in CPUs being overlooked during testing, which causes data corruption or other undesirable effects when these CPUs are used in production. Some defects may also be caused by aging.
If the RTL (“source code”) of a CPU is available, we could apply greybox fuzzing to the CPU model almost like any other software [Tri21]. However, our targets are general-purpose x86_64 CPUs produced by third parties, for which we do not have the source, so in our case CPU implementations are opaque. Moreover, we are more interested in CPU defects (manufacturing problems that affect just one or several cores) as opposed to bugs (design problems that affect all cores of a given family of CPUs).
In this paper we present SiliFuzz, a work-in-progress system that finds CPU defects by fuzzing software proxies, such as CPU simulators or disassemblers, and then executing the accumulated test vectors (“corpus”) on actual CPUs at large scale. The major difference between this work and traditional software fuzzing is that a software bug, once fixed, is fixed for all installations of that software, whereas with CPU defects we have to test every individual core repeatedly over its lifetime due to wear and tear. We also analyze four groups of CPU defects that SiliFuzz has uncovered and the patterns shared by our other findings.
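As a rough illustration of the replay step (a minimal sketch, not the actual SiliFuzz implementation), the following Python pins a runner to each logical core in turn and re-executes a corpus of test vectors, flagging cores whose results diverge from a known-good reference; the TestVector type, its checksum fields, and the toy corpus are hypothetical stand-ins for real machine-code inputs.

# Sketch only: replay a corpus on every logical core and flag cores whose
# results diverge from a known-good reference (Linux-only affinity calls).
import os
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class TestVector:               # hypothetical stand-in for a real test vector
    name: str
    run: Callable[[], int]      # executes the instruction sequence, returns a checksum
    expected: int               # checksum observed on known-good hardware

def scan_cores(corpus: List[TestVector], repetitions: int = 3) -> Dict[int, List[str]]:
    """Map core id -> names of test vectors that produced wrong results."""
    suspects: Dict[int, List[str]] = {}
    original_affinity = os.sched_getaffinity(0)
    for core in sorted(original_affinity):
        os.sched_setaffinity(0, {core})            # pin the runner to one core
        for tv in corpus:
            for _ in range(repetitions):           # defects are often intermittent
                if tv.run() != tv.expected:
                    suspects.setdefault(core, []).append(tv.name)
                    break
    os.sched_setaffinity(0, original_affinity)     # restore the original affinity
    return suspects

if __name__ == "__main__":
    corpus = [TestVector("sum_1_to_1000", lambda: sum(range(1001)), 500500)]
    print(scan_cores(corpus))

In a fleet setting the same corpus would be rerun periodically on every core, since, unlike a software bug, a defect can appear later in a core's lifetime.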
Cores that don't count
Rama Krishna Govindaraju
Proc. 18th Workshop on Hot Topics in Operating Systems (HotOS 2021)
We are accustomed to thinking of computers as fail-stop, especially the cores that execute instructions, and most system software implicitly relies on that assumption. During most of the VLSI era, processors that passed manufacturing tests and were operated within specifications have insulated us from this fiction. As fabrication pushes towards smaller feature sizes and more elaborate computational structures, and as increasingly specialized instruction-silicon pairings are introduced to improve performance, we have observed ephemeral computational errors that were not detected during manufacturing tests. These defects cannot always be mitigated by techniques such as microcode updates, and may be correlated to specific components within the processor, allowing small code changes to effect large shifts in reliability. Worse, these failures are often "silent": the only symptom is an erroneous computation.
We refer to a core that develops such behavior as "mercurial." Mercurial cores are extremely rare, but in a large fleet of servers we can observe the correlated disruption they cause, often enough to see them as a distinct problem -- one that will require collaboration between hardware designers, processor vendors, and systems software architects.
This paper is a call-to-action for a new focus in systems research; we speculate about several software-based approaches to mercurial cores, ranging from better detection and isolation mechanisms to methods for tolerating the silent data corruption they cause.
Please watch our short video summarizing the paper.
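As one concrete (and deliberately naive) example of the detection side of that space, the Python sketch below duplicates a computation on two different cores and compares the results, one crude way silent corruption from a single mercurial core could surface; the core ids, workload, and helper names are illustrative assumptions, not mechanisms from the paper.

# Illustrative sketch only: run the same computation pinned to two different
# cores and compare the results (Linux-only affinity calls; assumes cores 0
# and 1 exist). A mismatch suggests silent data corruption on one of them.
import os

def run_pinned(core: int, fn, *args):
    """Run fn(*args) with the process temporarily pinned to a single core."""
    previous = os.sched_getaffinity(0)
    os.sched_setaffinity(0, {core})
    try:
        return fn(*args)
    finally:
        os.sched_setaffinity(0, previous)

def checked(fn, *args, cores=(0, 1)):
    """Execute the same computation on two cores; raise if the results diverge."""
    results = [run_pinned(core, fn, *args) for core in cores]
    if results[0] != results[1]:
        raise RuntimeError(f"silent corruption suspected: {results!r} on cores {cores}")
    return results[0]

if __name__ == "__main__":
    print(checked(sum, range(10_000)))

Real mitigations would need far cheaper checks than full duplication, which is part of why the abstract frames this as an open problem for systems research.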
Sundial: Fault-tolerant Clock Synchronization for Datacenters
Hema Hariharan
Dave Platt
Simon Sabato
Minlan Yu
Prashant Chandra
14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20), USENIX Association (2020), pp. 1171-1186
Clock synchronization is critical for many datacenter applications such as distributed transactional databases, consistent snapshots, and network telemetry. As applications have increasing performance requirements and datacenter networks move toward ultra-low latency, we need a submicrosecond-level bound on time uncertainty to reduce transaction delay and enable new network management applications (e.g., measuring one-way delay for congestion control). State-of-the-art clock synchronization solutions focus on improving clock precision but may incur a significant time-uncertainty bound in the presence of failures. This significantly affects applications because in large-scale datacenters, temperature-related, link, device, and domain failures are common. We present Sundial, a fault-tolerant clock-synchronization system for datacenters that achieves a ~100ns time-uncertainty bound under various types of failures. Sundial provides fast failure detection based on frequent synchronization messages in hardware. Sundial enables fast failure recovery using a novel graph-based algorithm to precompute a backup plan that is generic to failures. Through experiments in a >500-machine testbed and large-scale simulations, we show that Sundial can achieve a ~100ns time-uncertainty bound under different types of failures, which is more than two orders of magnitude lower than state-of-the-art solutions. We also demonstrate the benefit of Sundial on applications such as Spanner and Swift congestion control.
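A back-of-the-envelope calculation helps show why frequent hardware synchronization messages matter for the bound: between successful messages, a node's time uncertainty grows roughly linearly with the local oscillator's maximum drift rate, so the achievable bound is dominated by how long a node can go without hearing from its parent (including the time to detect a failure and recover). The drift rate, base error, and intervals below are illustrative assumptions, not Sundial's actual parameters.

# Illustrative arithmetic with assumed parameters (not Sundial's real numbers):
#   uncertainty ≈ base_error + max_drift_rate * time_since_last_successful_sync
# so keeping the interval between successful syncs small keeps the bound small.

def uncertainty_ns(sync_interval_us: float,
                   max_drift_ppm: float = 200.0,   # assumed oscillator drift bound
                   base_error_ns: float = 50.0) -> float:
    """Rough upper bound on time uncertainty (ns) just before the next sync."""
    interval_ns = sync_interval_us * 1e3
    return base_error_ns + max_drift_ppm * 1e-6 * interval_ns

if __name__ == "__main__":
    for interval_us in (100, 1_000, 125_000):      # frequent hardware sync vs. slow software sync
        print(f"{interval_us:>8} us between syncs -> ~{uncertainty_ns(interval_us):.0f} ns bound")

With these assumed numbers, a ~100 us message interval keeps the bound near 100 ns, while a 125 ms interval pushes it into the tens of microseconds, roughly the two-orders-of-magnitude gap the abstract describes.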
Spanner: Google's Globally Distributed Database
Michael Epstein
Andrew Fikes
Christopher Frost
J. J. Furman
Andrey Gubarev
Christopher Heiser
Sebastian Kanthak
Eugene Kogan
Hongyi Li
Sergey Melnik
David Mwaura
David Nagle
Rajesh Rao
Lindsay Rolig
Yasushi Saito
Michal Szymaniak
Christopher Taylor
Ruth Wang
Dale Woodford
ACM Trans. Comput. Syst., 31(3) (2013), Article 8