Carbink: Fault-tolerant Far Memory

Yang Zhou
Sihang Liu
Jiaqi Gao
James Mickens
Minlan Yu
Hank Levy
Proceedings of the 16th USENIX Symposium on Operating Systems Design and Implementation, Usenix(2022)

Abstract

Memory-intensive applications would benefit from using available memory from other machines (ie, remote memory or far memory). However, there is a missing piece in recent far memory proposals -- cost-efficient fault tolerance for far memory. In this paper, we motivate the strong need for fault tolerance for far memory using machine/task failure statistics from a major internet service provider. Then we describe the design and implementation off a Fault-Tolerant application-integrated Far Memory (i.e., FTFM) framework. We compare several candidate fault tolerance schemes, and discuss their pros and cons. Finally, we test FTFM using several X-internal applications, including graph processing, globally-distributed database, and in-memory database. Our results show that FTFM has little impact on application performance (~x.x%), while achieving xx% performance of running applications purely in local memory.