
Jeff Shute
Authored Publications
SQL Has Problems. We Can Fix Them: Pipe Syntax In SQL
Matthew Brown
Xi Wu
Lulan Yu
Romit Kudtarkar
Jean-Daniel Browne
Andrey Litvinov
Shannon Bales
Michael Shen
Brandon Dolphin
Jingchi Ma
David Wilhite
John Morcos
Proc. VLDB Endow. (2024), pp. 4051-4063
SQL has been extremely successful as the de facto standard language for working with data. Virtually all mainstream database-like systems use SQL as their primary query language. But SQL is an old language with significant design problems, making it difficult to learn, difficult to use, and difficult to extend. Many have observed these challenges with SQL, and proposed solutions involving new languages. New language adoption is a significant obstacle for users, and none of the potential replacements have been successful enough to displace SQL.
In GoogleSQL, we’ve taken a different approach: solving SQL’s problems by extending SQL. Inspired by a pattern that works well in other modern data languages, we added piped data flow syntax to SQL. The results are transformative: SQL becomes a flexible language that’s easier to learn, use, and extend, while still leveraging the existing SQL ecosystem and existing userbase. Improving SQL from within allows incrementally adopting new features, without migrations and without learning a new language, making this a more productive approach to improve on standard SQL.
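To make the pattern concrete, here is a minimal sketch of the contrast, written in the pipe syntax the paper describes; the tickets table and its columns are hypothetical.

    -- Standard SQL: the written clause order (SELECT ... FROM ... WHERE ...)
    -- differs from the logical order of evaluation.
    SELECT component, COUNT(*) AS open_tickets
    FROM tickets
    WHERE status = 'open'
    GROUP BY component
    ORDER BY open_tickets DESC;

    -- Pipe syntax: each |> operator consumes a table and produces a table,
    -- so the query reads top to bottom in evaluation order.
    FROM tickets
    |> WHERE status = 'open'
    |> AGGREGATE COUNT(*) AS open_tickets
       GROUP BY component
    |> ORDER BY open_tickets DESC;

Because every operator has the same table-in, table-out shape, new operators can be appended at any point without restructuring the query, which is the extensibility argument the abstract makes.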
Dremel: A Decade of Interactive SQL Analysis at Web Scale
Matt Tolton
Hossein Ahmadi
Narayanan Shivakumar
Theo Vassilakis
Dan Delorey
Mosha Pasumansky
Geoffrey Michael Romer
Sergey Melnik
Jing Jing Long
Andrey Gubarev
Slava Min
PVLDB (2020), pp. 3461-3472
Google's Dremel was one of the first systems to combine a set of architectural principles that have become a common practice in today's cloud-native analytical systems, such as disaggregated storage and compute, in situ analysis, and columnar storage for semistructured data. In this paper, we discuss how these ideas evolved in the past decade and became the foundation for Google BigQuery.
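As a small illustration of in situ analysis over columnar, semistructured data, a GoogleSQL query can flatten a nested repeated field at query time; the documents table and its repeated links field are hypothetical.

    -- One output row per element of the repeated field doc.links; only the
    -- referenced columns need to be read from columnar storage.
    SELECT doc.url, link.anchor_text
    FROM documents AS doc
    CROSS JOIN UNNEST(doc.links) AS link
    WHERE doc.crawl_date >= '2020-01-01';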
F1 Lightning: HTAP as a Service
Kelvin Lau
Jiacheng Yang
Zhan Yuan
Jeff Naughton
Ziyang Chen
Jeremy David Wood
Yuan Gao
Junxiong Zhou
Qiang Zeng
Xi Zhao
Jun Xu
Jun Ma
Ian James Rae
VLDB (2020)
The ongoing and increasing interest in HTAP (Hybrid Transactional and Analytical Processing) systems documents the intense interest from data owners in simultaneously running transactional and analytical workloads over the same data set. Much of the reported work on HTAP has arisen in the context of “green field” systems, answering the question “if we could design a system for HTAP from scratch, what would it look like?” While there is great merit in such an approach, and a lot of valuable technology has been developed with it, we found ourselves facing a different challenge: one in which there is a great deal of transactional data already existing in several transactional systems, heavily queried by an existing federated engine that does not “own” the transactional systems, supporting both new and legacy applications that demand transparent fast queries and transactions from this combination. This paper reports on our design and experiences with F1 Lightning, a system we built and deployed to meet this challenge. We describe our design decisions, some details of our implementation, and our experience with the system in production for some of Google's most demanding applications.
F1 Query: Declarative Querying at Scale
Bart Samwel
Ahmed Aly
Thanh Do
Somayeh Sardashti
Jiexing Li
Jiacheng Yang
Chanjun Yang
Jason Govig
Andrew Harn
Zhan Yuan
Daniel Tenedorio
Colin Zheng
Allen Yan
Orri Erling
Yang Xia
Qiang Zeng
Divy Agrawal
Jun Xu
Mohan Yang
Andrey Gubichev
Felix Weigel
Yiqun Wei
Ben Handy
Anurag Biyani
Ian Rae
Amr El-Helw
Shivakumar Venkataraman
David G Wilhite
PVLDB (2018), pp. 1835-1848
F1 Query is a stand-alone, federated query processing platform that executes SQL queries against data stored in different file-based formats as well as different storage systems (e.g., BigTable, Spanner, Google Spreadsheets, etc.). F1 Query eliminates the need to maintain the traditional distinction between different types of data processing workloads by simultaneously supporting: (i) OLTP-style point queries that affect only a few records; (ii) low-latency OLAP querying of large amounts of data; and (iii) large ETL pipelines transforming data from multiple data sources into formats more suitable for analysis and reporting. F1 Query has also significantly reduced the need for developing hard-coded data processing pipelines by enabling declarative queries integrated with custom business logic. F1 Query satisfies key requirements that are highly desirable within Google: (i) it provides a unified view over data that is fragmented and distributed over multiple data sources; (ii) it leverages datacenter resources for performant query processing with high throughput and low latency; (iii) it provides high scalability for large data sizes by increasing computational parallelism; and (iv) it is extensible and uses innovative approaches to integrate complex business logic in declarative query processing. This paper presents the end-to-end design of F1 Query. Evolved out of F1, the distributed database that Google uses to manage its advertising data, F1 Query has been in production for multiple years at Google and serves the querying needs of a large number of users and systems.
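As a sketch of what a unified view over fragmented data buys, consider one declarative query joining a transactional table with a file-backed log. The names are hypothetical, and the way each source is registered with the engine is elided; the abstract does not show F1 Query's catalog syntax.

    -- One query spans two stores: Campaign lives in a transactional
    -- database (e.g., Spanner), ClickEvents in files; the engine reads each
    -- source and parallelizes the join and aggregation.
    SELECT c.CampaignId, c.Name, SUM(e.Clicks) AS TotalClicks
    FROM Campaign AS c
    JOIN ClickEvents AS e ON e.CampaignId = c.CampaignId
    GROUP BY c.CampaignId, c.Name;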
High-Availability at Massive Scale: Building Google’s Data Infrastructure for Ads
Workshop on Business Intelligence for the Real Time Enterprise (BIRTE), Springer (2015) (to appear)
Google’s Ads Data Infrastructure systems run the multi-billion dollar ads business at Google. High availability and strong consistency are critical for these systems. While most distributed systems handle machine-level failures well, handling datacenter-level failures is less common. In our experience, handling datacenter-level failures is critical for running true high availability systems. Most of our systems (e.g., Photon, F1, Mesa) now support multi-homing as a fundamental design property. Multi-homed systems run live in multiple datacenters all the time, adaptively moving load between datacenters, with the ability to handle outages of any scale completely transparently. This paper focuses primarily on stream processing systems, and describes our general approaches for building high availability multi-homed systems, discusses common challenges and solutions, and shares what we have learned in building and running these large-scale systems for over ten years.
Mesa: Geo-Replicated, Near Real-Time, Scalable Data Warehousing
Shuo Wu
Fan Yang
Sandeep Dhoot
Adam Kirsch
David Jones
Jason Govig
Kevin Lai
Masood Siddiqi
Jamie Cameron
Kelvin Chan
Divyakant Agrawal
Abhilash Kumar
Mingsheng Hong
Andrey Gubarev
Shivakumar Venkataraman
VLDB (2014)
Mesa is a highly scalable analytic data warehousing system that stores critical measurement data related to Google's Internet advertising business. Mesa is designed to satisfy a complex and challenging set of user and systems requirements, including near real-time data ingestion and queryability, as well as high availability, reliability, fault tolerance, and scalability for large data and query volumes. Specifically, Mesa handles petabytes of data, processes millions of row updates per second, and serves billions of queries that fetch trillions of rows per day. Mesa is geo-replicated across multiple datacenters and provides consistent and repeatable query answers at low latency, even when an entire datacenter fails. This paper presents the Mesa system and reports the performance and scale that it achieves.
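One detail worth making concrete: in Mesa's data model each value column carries an aggregation function, and updates arrive as versioned deltas that are merged into the base data. A conceptual sketch of that merge in standard SQL, with a hypothetical schema whose clicks column aggregates by SUM:

    -- Deltas do not overwrite rows; they aggregate into them. Key columns
    -- here are (event_date, campaign_id); clicks aggregates with SUM.
    MERGE INTO campaign_stats AS base
    USING incoming_delta AS delta
    ON base.event_date = delta.event_date
       AND base.campaign_id = delta.campaign_id
    WHEN MATCHED THEN
      UPDATE SET clicks = base.clicks + delta.clicks
    WHEN NOT MATCHED THEN
      INSERT (event_date, campaign_id, clicks)
      VALUES (delta.event_date, delta.campaign_id, delta.clicks);

Mesa itself applies deltas in versioned batches rather than through SQL MERGE; the statement above only illustrates the aggregate-on-merge semantics.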
Online, Asynchronous Schema Change in F1
VLDB (2013)
We introduce a protocol for schema evolution in a globally distributed database management system with shared data, stateless servers, and no global membership. Our protocol is asynchronous (it allows different servers in the database system to transition to a new schema at different times) and online (all servers can access and update all data during a schema change). We provide a formal model for determining the correctness of schema changes under these conditions, and we demonstrate that many common schema changes can cause anomalies and database corruption. We avoid these problems by replacing corruption-causing schema changes with a sequence of schema changes that is guaranteed to avoid corrupting the database so long as all servers are no more than one schema version behind at any time. Finally, we discuss a practical implementation of our protocol in F1, the database management system that stores data for Google AdWords.
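For example, adding an index in one step is a corruption-causing change: a server still on the old schema would insert rows without creating index entries. The paper's remedy routes the change through intermediate states (delete-only, then write-only, then a backfill) before the index becomes public. A sketch with hypothetical staged DDL, since standard SQL has no syntax for these intermediate states:

    -- Unsafe as a single step:
    --   CREATE INDEX UsersByName ON Users (name);

    -- Safe staged sequence; the STATE clause is hypothetical notation for
    -- the intermediate states defined in the paper. Between steps, wait
    -- until no server is more than one schema version behind.
    ALTER TABLE Users ADD INDEX UsersByName (name) STATE 'delete_only';
    ALTER INDEX UsersByName SET STATE 'write_only';
    -- backfill index entries for all pre-existing rows, then:
    ALTER INDEX UsersByName SET STATE 'public';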
F1: A Distributed SQL Database That Scales
Ben Handy
David Menestrina
Traian Stancescu
Mircea Oancea
Ian Rae
Kyle Littlefield
Stephan Ellner
Bart Samwel
Chad Whipkey
VLDB (2013)
F1 is a distributed relational database system built at Google to support the AdWords business. F1 is a hybrid database that combines high availability, the scalability of NoSQL systems like Bigtable, and the consistency and usability of traditional SQL databases. F1 is built on Spanner, which provides synchronous cross-datacenter replication and strong consistency. Synchronous replication implies higher commit latency, but we mitigate that latency by using a hierarchical schema model with structured data types and through smart application design. F1 also includes a fully functional distributed SQL query engine and automatic change tracking and publishing.
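The hierarchical schema model mentioned above clusters child rows physically under their parent row, so common parent-child joins and multi-row transactions touch co-located data. A sketch using Cloud Spanner's interleaved-table DDL as an approximation of the idea (the Customer/Campaign hierarchy mirrors the paper's AdWords schema; the columns are hypothetical):

    CREATE TABLE Customer (
      CustomerId INT64 NOT NULL,
      Name       STRING(MAX)
    ) PRIMARY KEY (CustomerId);

    -- Campaign rows are stored interleaved under their Customer row, so a
    -- join on CustomerId reads one contiguous key range.
    CREATE TABLE Campaign (
      CustomerId INT64 NOT NULL,
      CampaignId INT64 NOT NULL,
      Budget     INT64
    ) PRIMARY KEY (CustomerId, CampaignId),
      INTERLEAVE IN PARENT Customer ON DELETE CASCADE;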
F1 - The Fault-Tolerant Distributed RDBMS Supporting Google's Ad Business
Ben Handy
Beat Jegerlehner
Phoenix Tong
Xin Chen
Mircea Oancea
Kyle Littlefield
Stephan Ellner
Bart Samwel
Chad Whipkey
SIGMOD (2012)
Many of the services that are critical to Google’s ad business have historically been backed by MySQL. We have recently migrated several of these services to F1, a new RDBMS developed at Google. F1 implements rich relational database features, including a strictly enforced schema, a powerful parallel SQL query engine, general transactions, change tracking and notification, and indexing, and is built on top of a highly distributed storage system that scales on standard hardware in Google data centers. The store is dynamically sharded, supports transactionally-consistent replication across data centers, and is able to handle data center outages without data loss.
The strong consistency properties of F1 and its storage system come at the cost of higher write latencies compared to MySQL. Having successfully migrated a rich customer-facing application suite at the heart of Google’s ad business to F1, with no downtime, we will describe how we restructured schema and applications to largely hide this increased latency from external users. The distributed nature of F1 also allows it to scale easily and to support significantly higher throughput for batch workloads than a traditional RDBMS.
With F1, we have built a novel hybrid system that combines the scalability, fault tolerance, transparent sharding, and cost benefits so far available only in “NoSQL” systems with the usability, familiarity, and transactional guarantees expected from an RDBMS.