Steve Gribble

Steve Gribble is a Distinguished Software Engineer and TLM at Google, where he builds host-side networking software and SDN systems that make Google’s planetary-scale networks high-performance, available, debuggable, and safe to operate. Before Google, Steve was a computer scientist and full professor in the University of Washington’s Department of Computer Science & Engineering. He joined UW in November 2000, after receiving his Ph.D. from UC Berkeley under Professor Eric Brewer.

Steve has co-founded two companies. In 2006 he co-founded SkyTap, which provides cloud-based software development, test, and deployment platforms. Earlier, in 1996, he co-founded ProxiNet, Inc., which built graphical web browsers for wireless Palm Pilot PDAs, using scalable cloud infrastructure to optimize and render web content. ProxiNet was acquired by Pumatech in 1999.

Authored Publications
    Jupiter Evolving: Transforming Google's Datacenter Network via Optical Circuit Switches and Software-Defined Networking
    Joon Ong
    Arjun Singh
    Mukarram Tariq
    Rui Wang
    Jianan Zhang
    Virginia Beauregard
    Patrick Conner
    Rishi Kapoor
    Stephen Kratzer
    Nanfang Li
    Hong Liu
    Karthik Nagaraj
    Jason Ornstein
    Samir Sawhney
    Ryohei Urata
    Lorenzo Vicisano
    Kevin Yasumura
    Shidong Zhang
    Junlan Zhou
    Proceedings of ACM SIGCOMM 2022
    Abstract: We present a decade of evolution and production experience with Jupiter datacenter network fabrics. In this period Jupiter has delivered 5x higher speed and capacity, a 30% reduction in capex, and a 41% reduction in power, with incremental deployment and technology refresh, all while serving live production traffic. A key enabler for these improvements is evolving Jupiter from a Clos to a direct-connect topology among the machine aggregation blocks. Critical architectural changes include: a datacenter interconnection layer employing Micro-ElectroMechanical Systems (MEMS) based Optical Circuit Switches (OCSes) to enable dynamic topology reconfiguration, centralized Software-Defined Networking (SDN) control for traffic engineering, and automated network operations for incremental capacity delivery and topology engineering. We show that the combination of traffic and topology engineering on direct-connect fabrics achieves throughput similar to Clos fabrics for our production traffic patterns. We also optimize for path lengths: 60% of the traffic takes a direct path from source to destination aggregation blocks, while the remainder transits one additional block, achieving an average block-level path length of 1.4 in our fleet today. OCS also achieves 3x faster fabric reconfiguration compared to pre-evolution Clos fabrics that used a patch-panel-based interconnect.
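    The 1.4 figure follows directly from the reported traffic split. A quick check in Python, using only the numbers in the abstract above:

        # Average block-level path length under Jupiter's combined traffic
        # and topology engineering, using the split reported in the abstract:
        # 60% of traffic takes a direct path (1 block-level hop) and the
        # remaining 40% transits one intermediate aggregation block (2 hops).
        direct_fraction = 0.60
        transit_fraction = 0.40

        avg_hops = direct_fraction * 1 + transit_fraction * 2
        print(f"average block-level path length: {avg_hops:.1f}")  # -> 1.4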
    Orion: Google’s Software-Defined Networking Control Plane
    Amr Sabaa
    Henrik Muehe
    Joon Suan Ong
    Karthik Swaminathan Nagaraj
    KondapaNaidu Bollineni
    Lorenzo Vicisano
    Mike Conley
    Min Zhu
    Rich Alimi
    Shawn Chen
    Shidong Zhang
    Waqar Mohsin
    (2021)
    Abstract: We present Orion, a distributed Software-Defined Networking platform deployed globally in Google’s datacenter (Jupiter) and Wide Area (B4) networks. Orion was designed around a modular, micro-service architecture with a central publish-subscribe database to enable a distributed, yet tightly coupled, software-defined network control system. Orion enables intent-based management and control, is highly scalable, and is amenable to global control hierarchies. Over the years, Orion has matured with continuously improving performance in convergence (up to 40x faster), throughput (handling up to 1.16 million network updates per second), system scalability (supporting 16x larger networks), and data plane availability (50x and 100x reductions in unavailable time in Jupiter and B4, respectively), all while maintaining high development velocity with a bi-weekly release cadence. Today, Orion robustly enables all of Google’s Software-Defined Networks, defending against failure modes that are generic to large-scale production networks as well as those unique to SDN systems.
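    The pattern at the heart of the abstract, independent control applications coordinating through a central publish-subscribe database, can be sketched in a few lines. The class, table, and callback names below are invented for illustration and are not Orion’s actual API:

        # Toy publish-subscribe intent store (not Orion's real API): control
        # apps never call each other directly; they write intended state to
        # a central database, and interested apps react to notifications.
        from collections import defaultdict

        class PubSubDB:
            def __init__(self):
                self.tables = defaultdict(dict)       # table -> {key: value}
                self.subscribers = defaultdict(list)  # table -> callbacks

            def subscribe(self, table, callback):
                self.subscribers[table].append(callback)

            def write(self, table, key, value):
                self.tables[table][key] = value
                for notify in self.subscribers[table]:
                    notify(key, value)

        db = PubSubDB()
        # A routing app publishes intent; a switch-programming app reacts.
        db.subscribe("intended_routes",
                     lambda prefix, port: print(f"program {prefix} -> {port}"))
        db.write("intended_routes", "10.0.0.0/24", "port-7")

    Decoupling writers from readers this way is what lets each micro-service be developed, restarted, and upgraded independently while the database remains the single source of intended state.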
    Snap: a Microkernel Approach to Host Networking
    Jacob Adriaens
    Sean Bauer
    Carlo Contavalli
    Mike Dalton
    William C. Evans
    Nicholas Kidd
    Roman Kononov
    Carl Mauer
    Emily Musick
    Lena Olson
    Mike Ryan
    Erik Rubow
    Kevin Springborn
    Valas Valancius
    In ACM SIGOPS 27th Symposium on Operating Systems Principles, ACM, New York, NY, USA (2019)
    Abstract: This paper presents our design and experience with a microkernel-inspired approach to host networking called Snap. Snap is a userspace networking system that supports Google’s rapidly evolving needs with flexible modules that implement a range of network functions, including edge packet switching, virtualization for our cloud platform, traffic shaping policy enforcement, and a high-performance reliable messaging and RDMA-like service. Snap has been running in production for over three years, supporting the extensible communication needs of several large and critical systems. Snap enables fast development and deployment of new networking features, leveraging the benefits of address space isolation and the productivity of userspace software development, together with support for transparently upgrading networking services without migrating applications off of a machine. At the same time, Snap achieves compelling performance through a modular architecture that promotes principled synchronization with minimal state sharing, and supports real-time scheduling with dynamic scaling of CPU resources through a novel kernel/userspace CPU scheduler co-design. Our evaluation demonstrates an over 3x Gbps/core improvement over a kernel networking stack for RPC workloads, software-based RDMA-like performance of up to 5M IOPS/core, and transparent upgrades that are largely imperceptible to user applications. Snap is deployed to over half of our fleet of machines and supports the needs of numerous teams.
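    The transparent-upgrade property the abstract describes amounts to handing in-memory state from an old module version to its replacement without disturbing the applications it serves. A minimal sketch under that reading; the engine classes and state layout are invented, not Snap’s real interfaces:

        # Sketch of a transparently upgradable userspace packet-processing
        # module (invented names, not Snap's interfaces): v1 exports its
        # per-flow state and v2 resumes from it, so applications on the
        # machine never observe the version swap.
        class PacketEngineV1:
            def __init__(self):
                self.flows = {}  # flow id -> list of packets seen

            def process(self, flow_id, packet):
                self.flows.setdefault(flow_id, []).append(packet)

            def export_state(self):
                return {"version": 1, "flows": self.flows}

        class PacketEngineV2(PacketEngineV1):
            @classmethod
            def from_state(cls, state):
                engine = cls()
                engine.flows = state["flows"]  # pick up where v1 left off
                return engine

        old = PacketEngineV1()
        old.process("flow-a", "pkt-0")
        new = PacketEngineV2.from_state(old.export_state())  # live upgrade
        assert "flow-a" in new.flows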
    Abstract: Network virtualization stacks such as Andromeda and Virtual Filtering Platform are the linchpins of public clouds hosting Virtual Machines (VMs). The dataplane is based on a combination of high-performance OS-bypass software and hardware packet-processing paths. A key goal is to provide network performance isolation, such that the workloads of one VM do not adversely impact the network experience of another VM. In this work, we characterize how isolation breakages occur in virtualization stacks and motivate predictable VM performance, as if each VM were operating on dedicated hardware. We formulate an abstraction of a Predictable Virtualized NIC for bandwidth, latency, and packet loss. We propose three constructs to achieve predictability: egress traffic shaping, and a combination of congestion control and CPU-fair weighted fair queueing for ingress isolation. Using these constructs in concert, we provide the illusion of a dedicated NIC to VMs, all while maintaining the raw performance of the fastpath dataplane.
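    Of the proposed constructs, weighted fair queueing is the most self-contained. The sketch below is a textbook-style simplification of WFQ, not the paper’s algorithm; the per-VM weights and packet sizes are illustrative:

        # Simplified weighted fair queueing across VMs: each VM accrues
        # virtual finish time at a rate inversely proportional to its
        # weight, and the scheduler always releases the packet with the
        # smallest finish time (seq breaks ties deterministically).
        import heapq

        class WeightedFairQueue:
            def __init__(self, weights):              # vm -> relative weight
                self.weights = weights
                self.finish = {vm: 0.0 for vm in weights}
                self.heap = []                        # (finish, seq, vm, pkt)
                self.seq = 0

            def enqueue(self, vm, packet, size):
                self.finish[vm] += size / self.weights[vm]
                heapq.heappush(self.heap,
                               (self.finish[vm], self.seq, vm, packet))
                self.seq += 1

            def dequeue(self):
                _, _, vm, packet = heapq.heappop(self.heap)
                return vm, packet

        q = WeightedFairQueue({"vm-a": 2.0, "vm-b": 1.0})
        for i in range(2):
            q.enqueue("vm-a", f"a{i}", size=1500)
            q.enqueue("vm-b", f"b{i}", size=1500)
        # -> ['a0', 'b0', 'a1', 'b1']; under a sustained backlog, vm-a
        # (twice the weight) receives two of every three service slots.
        print([q.dequeue()[1] for _ in range(4)])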
    Abstract: Many of today's web sites contain substantial amounts of client-side code, and consequently, they act more like programs than simple documents. This creates robustness and performance challenges for web browsers. To give users a robust and responsive platform, the browser must identify program boundaries and provide isolation between them. We provide three contributions in this paper. First, we present abstractions of web programs and program instances, and we show that these abstractions clarify how browser components interact and how appropriate program boundaries can be identified. Second, we identify backwards compatibility tradeoffs that constrain how web content can be divided into programs without disrupting existing web sites. Third, we present a multi-process browser architecture that isolates these web program instances from each other, improving fault tolerance, resource management, and performance. We discuss how this architecture is implemented in Google Chrome, and we provide a quantitative performance evaluation examining its benefits and costs.
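    At its core, the architecture maps program instances to OS processes. A toy sketch of that mapping; the site-based boundary rule below is a simplification of the program boundaries the paper defines:

        # Toy process-per-program browser model: each web program instance
        # gets its own OS process (simulated here as an integer pid), so a
        # crash or stall in one program cannot take down the others.
        from urllib.parse import urlparse

        class Browser:
            def __init__(self):
                self.processes = {}   # program key -> simulated process id
                self.next_pid = 1

            def program_key(self, url):
                # Simplified boundary rule: group pages by scheme + host.
                u = urlparse(url)
                return (u.scheme, u.hostname)

            def open_tab(self, url):
                key = self.program_key(url)
                if key not in self.processes:  # new program -> new process
                    self.processes[key] = self.next_pid
                    self.next_pid += 1
                return self.processes[key]

        b = Browser()
        pid1 = b.open_tab("https://mail.example.com/inbox")
        pid2 = b.open_tab("https://mail.example.com/compose")  # same program
        pid3 = b.open_tab("https://news.example.org/")         # isolated
        assert pid1 == pid2 and pid1 != pid3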