BigLake: BigQuery’s Evolution toward a Multi-Cloud Lakehouse

Justin Levandoski; Garrett Casto; Mingge Deng; Rushabh Desai; Pavan Edara; Thibaud Hottelier; Amir Hormati; Anoop Johnson; Jeff Johnson; Dawid Kurzyniec; Sam McVeety; Prem Ramanathan; Gaurav Saxena; Vidya Shanmugam; Yuri Volobuev

SIGMOD (2024)

Abstract

BigQuery’s cloud-native disaggregated architecture has allowed Google Cloud to evolve the system to meet several customer needs across the analytics and AI/ML workload spectrum. A key customer requirement for BigQuery centers around the unification of data lake and enterprise data warehousing workloads. This unification combines (1) the core data management primitives provided by an enterprise data warehouse, e.g., security, governance, common runtime metadata, performance acceleration, and ACID transactions, with (2) the flexibility of the open-source format and analytics ecosystem, along with new workload types such as AI/ML over unstructured data on object storage. In addition, there is a strong requirement to support BigQuery as a multi-cloud offering, given that cloud customers are opting for a multi-cloud footprint by default.

This paper describes BigLake, an evolution of BigQuery toward a multi-cloud lakehouse to address these customer requirements in novel ways. We describe three main innovations in this space. We first present BigLake tables, which make open-source table formats (e.g., Apache Parquet, Iceberg) first-class citizens, providing fine-grained governance enforcement and performance acceleration over these formats to BigQuery and other open-source analytics engines. Next, we cover the design and implementation of BigLake Object tables, which allow BigQuery to integrate AI/ML for inference and processing over unstructured data. Finally, we present Omni, a platform for deploying BigQuery on non-GCP clouds, focusing on the infrastructure and operational innovations we made to provide an enterprise lakehouse product regardless of the cloud provider hosting the data.
