Detection and Prevention of Silent Data Corruption in an Exabyte-scale Database System
Abstract
Google’s Spanner database serves multiple exabytes of data at well over a billion queries per second, distributed over a significant fraction of Google’s fleet. Silent data corruption events due to hardware error are detected/prevented by Spanner several times per week.
For every detected error there are some number of undetected errors that in rare (but not black swan) events cause corruption either transiently for reads or durably for writes, potentially violating the most fundamental contract that a database system makes with its users: to store and retrieve data with absolute reliability and availability.
We describe the work we have done to detect and prevent silent data corruptions and (equally importantly) to remove faulty machines from the fleet, both manually and automatically. We present a simplified analytic model of corruption that provides some insights into the most effective ways to prevent end-user corruption events.
We have made qualitative gains in detection and prevention of SDC events, but quantitative analysis remains difficult. We discuss various potential trajectories in hardware (un)reliability and how they will affect our ability to build reliable database systems on commodity hardware.
For every detected error there are some number of undetected errors that in rare (but not black swan) events cause corruption either transiently for reads or durably for writes, potentially violating the most fundamental contract that a database system makes with its users: to store and retrieve data with absolute reliability and availability.
We describe the work we have done to detect and prevent silent data corruptions and (equally importantly) to remove faulty machines from the fleet, both manually and automatically. We present a simplified analytic model of corruption that provides some insights into the most effective ways to prevent end-user corruption events.
We have made qualitative gains in detection and prevention of SDC events, but quantitative analysis remains difficult. We discuss various potential trajectories in hardware (un)reliability and how they will affect our ability to build reliable database systems on commodity hardware.