Jump to Content

Peter H. Hochschild

Authored Publications
Google Publications
Other Publications
Sort By
  • Title
  • Title, descending
  • Year
  • Year, descending
    Preview abstract We are accustomed to thinking of computers as fail-stop, especially the cores that execute instructions, and most system software implicitly relies on that assumption. During most of the VLSI era, processors that passed manufacturing tests and were operated within specifications have insulated us from this fiction. As fabrication pushes towards smaller feature sizes and more elaborate computational structures, and as increasingly specialized instruction-silicon pairings are introduced to improve performance, we have observed ephemeral computational errors that were not detected during manufacturing tests. These defects cannot always be mitigated by techniques such as microcode updates, and may be correlated to specific components within the processor, allowing small code changes to effect large shifts in reliability. Worse, these failures are often "silent'': the only symptom is an erroneous computation. We refer to a core that develops such behavior as "mercurial.'' Mercurial cores are extremely rare, but in a large fleet of servers we can observe the correlated disruption they cause, often enough to see them as a distinct problem -- one that will require collaboration between hardware designers, processor vendors, and systems software architects. This paper is a call-to-action for a new focus in systems research; we speculate about several software-based approaches to mercurial cores, ranging from better detection and isolating mechanisms, to methods for tolerating the silent data corruption they cause. Please watch our short video summarizing the paper. View details
    Preview abstract CPUs are getting more complex with every generation, on both the logical and the physical levels. Unsurprisingly, this leads to more bugs and defects in CPUs being overlooked during testing, which causes data corruption or other undesirable effects when these CPUs are used in production. Some defects may also be caused by aging. If the RTL (“source code”) of a CPU is available, we could apply greybox fuzzing to the CPU model almost like any other software [Tri21]. However our targets are general purpose x86_64 CPUs produced by third parties, where we do not have the source, so in our case CPU implementations are opaque. Moreover, we are more interested in CPU defects (manufacturing problems that affect just one or several cores) as opposed to bugs (design problems that affect all cores of a given family of CPUs). In this paper we present SiliFuzz, a work-in-progress system that finds CPU defects by fuzzing proxies, like CPU simulators or disassemblers, and then executing the accumulated test vectors (“corpus”) on actual CPUs on a large scale. The major difference between this work and traditional software fuzzing is that a software bug fixed once will be fixed for all installations of this software, while with CPU defects we have to test every individual core repeatedly over its lifetime due to wear and tear. We also analyze four groups of CPU defects that SiliFuzz has uncovered and the patterns shared by other findings. View details
    Sundial: Fault-tolerant Clock Synchronization for Datacenters
    Hema Hariharan
    Dave Platt
    Simon Sabato
    Minlan Yu
    Prashant Chandra
    14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20), USENIX Association (2020), pp. 1171-1186
    Preview abstract Clock synchronization is critical for many datacenter applications such as distributed transactional databases, consistent snapshots, and network telemetry. As applications have increasing performance requirements and datacenter networks get into ultra-low latency, we need submicrosecond-level bound on time-uncertainty to reduce transaction delay and enable new network management applications (e.g., measuring one-way delay for congestion control). The state-of-the-art clock synchronization solutions focus on improving clock precision but may incur significant time-uncertainty bound due to the presence of failures. This significantly affects applications because in large-scale datacenters, temperature-related, link, device, and domain failures are common. We present Sundial, a fault-tolerant clock-synchronization system for datacenters that achieves ~100ns time-uncertainty bound under various types of failures. Sundial provides fast failure detection based on frequent synchronization messages in hardware. Sundial enables fast failure recovery using a novel graph-based algorithm to precompute a backup plan that is generic to failures. Through experiments in a >500-machine testbed and large-scale simulations, we show that Sundial can achieve ~100ns time-uncertainty bound under different types of failures, which is more than two orders of magnitude lower than the state-of-the-art solutions. We also demonstrate the benefit of Sundial on applications such as Spanner and Swift congestion control. View details
    Spanner: Google's Globally Distributed Database
    Michael Epstein
    Andrew Fikes
    Christopher Frost
    J. J. Furman
    Andrey Gubarev
    Christopher Heiser
    Sebastian Kanthak
    Eugene Kogan
    Hongyi Li
    Sergey Melnik
    David Mwaura
    David Nagle
    Rajesh Rao
    Lindsay Rolig
    Yasushi Saito
    Michal Szymaniak
    Christopher Taylor
    Ruth Wang
    Dale Woodford
    ACM Trans. Comput. Syst., vol. 31 (2013), pp. 8
    No Results Found