Sam McVeety
Research Areas
Authored Publications
Sort By
BigLake: BigQuery’s Evolution toward a Multi-Cloud Lakehouse
Justin Levandoski
Garrett Casto
Mingge Deng
Rushabh Desai
Thibaud Hottelier
Amir Hormati
Jeff Johnson
Dawid Kurzyniec
Prem Ramanathan
Gaurav Saxena
Vidya Shanmugam
Yuri Volobuev
SIGMOD (2024)
Preview abstract
BigQuery’s cloud-native disaggregated architecture has allowed Google Cloud to evolve the system to meet several customer needs across the analytics and AI/ML workload spectrum. A key customer requirement for BigQuery centers around the unification of data lake and enterprise data warehousing workloads. This approach combines: (1) the need for core data management primitives, e.g., security, governance, common runtime metadata, performance acceleration, ACID transactions, provided by an enterprise data warehouses coupled with (2) harnessing the flexibility of the open source format and analytics ecosystem along with new workload types such as AI/ML over unstructured data on object storage. In addition, there is a strong requirement to support BigQuery as a multi-cloud offering given cloud customers are opting for a multi-cloud footprint by default.
This paper describes BigLake, an evolution of BigQuery toward a multi-cloud lakehouse to address these customer requirements in novel ways. We describe three main innovations in this space. We first present BigLake tables, making open-source table formats (e.g., Apache Parquet, Iceberg) first class citizens, providing fine-grained governance enforcement and performance acceleration over these formats to BigQuery and other open-source analytics engines. Next, we cover the design and implementation of BigLake Object tables that allow BigQuery to integrate AI/ML for inferencing and processing over unstructured data. Finally, we present Omni, a platform for deploying BigQuery on non-GCP clouds, focusing on the infrastructure and operational innovations we made to provide an enterprise lakehouse product regardless of the cloud provider hosting the data.
View details
The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing
Tyler Akidau
Craig Chambers
Reuven Lax
Daniel Mills
Frances Perry
Eric Schmidt
Proceedings of the VLDB Endowment, 8 (2015), pp. 1792-1803
Preview abstract
Unbounded, unordered, global-scale datasets are increasingly
common in day-to-day business (e.g. Web logs, mobile
usage statistics, and sensor networks). At the same time,
consumers of these datasets have evolved sophisticated requirements,
such as event-time ordering and windowing by
features of the data themselves, in addition to an insatiable
hunger for faster answers. Meanwhile, practicality dictates
that one can never fully optimize along all dimensions of correctness,
latency, and cost for these types of input. As a result,
data processing practitioners are left with the quandary
of how to reconcile the tensions between these seemingly
competing propositions, often resulting in disparate implementations
and systems.
We propose that a fundamental shift of approach is necessary
to deal with these evolved requirements in modern
data processing. We as a field must stop trying to groom unbounded
datasets into finite pools of information that eventually
become complete, and instead live and breathe under
the assumption that we will never know if or when we have
seen all of our data, only that new data will arrive, old data
may be retracted, and the only way to make this problem
tractable is via principled abstractions that allow the practitioner
the choice of appropriate tradeoffs along the axes of
interest: correctness, latency, and cost.
In this paper, we present one such approach, the Dataflow
Model, along with a detailed examination of the semantics
it enables, an overview of the core principles that guided its
design, and a validation of the model itself via the real-world
experiences that led to its development.
View details
MillWheel: Fault-Tolerant Stream Processing at Internet Scale
Tyler Akidau
Alex Balikov
Kaya Bekiroglu
Josh Haberman
Reuven Lax
Daniel Mills
Paul Nordstrom
Very Large Data Bases (2013), pp. 734-746
Preview abstract
MillWheel is a framework for building low-latency data-processing applications that is widely used at Google. Users specify a directed computation graph and application code for individual nodes, and the system manages persistent state and the continuous flow of records, all within the envelope of the framework's fault-tolerance guarantees. This paper describes MillWheel's programming model as well as its implementation. The case study of a continuous anomaly detector in use at Google serves to motivate how many of MillWheel's features are used. MillWheel's programming model provides a notion of logical time, making it simple to write time-based aggregations. MillWheel was designed from the outset with fault tolerance and scalability in mind. In practice, we find that MillWheel's unique combination of scalability, fault tolerance, and a versatile programming model lends itself to a wide variety of problems at Google.
View details