David Culler
David Culler joined Google after 31 years at the University of California, Berkeley, pioneering extreme networked systems, from laying the foundations of clusters, Internet services, and planetary-scale systems to making low-power embedded wireless sensor networks a reality. His work, represented in over 300 publications, 10 test-of-time awards, numerous best papers, 34 patents, and the seminal textbook on parallel computer architecture, is reflected in his role in the National Academy of Engineering, where he serves on the Computer Science and Telecommunications Board and on several national studies. His academic career is punctuated by industrial phases, including work at Sun Microsystems, serving as founding director of Intel Research Berkeley, and co-founding Arch Rock (now part of Cisco), and by administrative roles, including Chair of EECS and founding Dean of the Berkeley Division of Data Sciences. His recent work brings network systems to the building environment to address sustainability and resilience. David is an ACM Fellow, an IEEE Fellow, and a recipient of the SIGMOBILE Outstanding Contribution Award and the Okawa Prize.
Authored Publications
Carbink: Fault-tolerant Far Memory
Yang Zhou
Sihang Liu
Jiaqi Gao
James Mickens
Minlan Yu
Hank Levy
Proceedings of the 16th USENIX Symposium on Operating Systems Design and Implementation, USENIX (2022)
Memory-intensive applications would benefit from using available memory on other machines (i.e., remote memory or far memory). However, recent far memory proposals are missing a piece: cost-efficient fault tolerance. In this paper, we motivate the strong need for fault tolerance for far memory using machine/task failure statistics from a major internet service provider. We then describe the design and implementation of a Fault-Tolerant application-integrated Far Memory (FTFM) framework. We compare several candidate fault tolerance schemes and discuss their pros and cons. Finally, we test FTFM using several X-internal applications, including graph processing, a globally distributed database, and an in-memory database. Our results show that FTFM has little impact on application performance (~x.x%), while achieving xx% of the performance of running applications purely in local memory.
Understanding Host Interconnect Congestion
Khaled Elmeleegy
Masoud Moshref
Rachit Agarwal
Saksham Agarwal
Sylvia Ratnasamy
Association for Computing Machinery, New York, NY, USA (2022), 198–204
We present evidence and a characterization of host congestion in production clusters: the adoption of high-bandwidth access links is leading to the emergence of bottlenecks within the host interconnect (the NIC-to-CPU data path). We demonstrate that contention on existing IO memory management units and/or the memory subsystem can significantly reduce the available NIC-to-CPU bandwidth, resulting in hundreds of microseconds of queueing delay and eventual packet drops at hosts (even when running a state-of-the-art congestion control protocol that accounts for CPU-induced host congestion). We also discuss the implications of host interconnect congestion for the design of future host architectures, network stacks, and network protocols.
Cores that don't count
Rama Krishna Govindaraju
Proc. 18th Workshop on Hot Topics in Operating Systems (HotOS 2021)
We are accustomed to thinking of computers as fail-stop, especially the cores that execute instructions, and most system software implicitly relies on that assumption. During most of the VLSI era, processors that passed manufacturing tests and were operated within specifications have insulated us from this fiction. As fabrication pushes towards smaller feature sizes and more elaborate computational structures, and as increasingly specialized instruction-silicon pairings are introduced to improve performance, we have observed ephemeral computational errors that were not detected during manufacturing tests. These defects cannot always be mitigated by techniques such as microcode updates, and may be correlated to specific components within the processor, allowing small code changes to effect large shifts in reliability. Worse, these failures are often "silent": the only symptom is an erroneous computation.
We refer to a core that develops such behavior as "mercurial." Mercurial cores are extremely rare, but in a large fleet of servers we can observe the correlated disruption they cause, often enough to see them as a distinct problem, one that will require collaboration between hardware designers, processor vendors, and systems software architects.
This paper is a call-to-action for a new focus in systems research; we speculate about several software-based approaches to mercurial cores, ranging from better detection and isolating mechanisms, to methods for tolerating the silent data corruption they cause.
Please watch our short video summarizing the paper.