Pavan Edara
Research Areas
Authored Publications
Sort By
Preview abstract
Vortex is an exabyte scale structured storage system built for streaming and batch analytics. It supports high-throughput batch and stream ingestion. For the user, it supports both batch-oriented and stream-based processing on the ingested data.
View details
BigLake: BigQuery’s Evolution toward a Multi-Cloud Lakehouse
Justin Levandoski
Garrett Casto
Mingge Deng
Rushabh Desai
Thibaud Hottelier
Amir Hormati
Jeff Johnson
Dawid Kurzyniec
Prem Ramanathan
Gaurav Saxena
Vidya Shanmugam
Yuri Volobuev
SIGMOD (2024)
Preview abstract
BigQuery’s cloud-native disaggregated architecture has allowed Google Cloud to evolve the system to meet several customer needs across the analytics and AI/ML workload spectrum. A key customer requirement for BigQuery centers around the unification of data lake and enterprise data warehousing workloads. This approach combines: (1) the need for core data management primitives, e.g., security, governance, common runtime metadata, performance acceleration, ACID transactions, provided by an enterprise data warehouses coupled with (2) harnessing the flexibility of the open source format and analytics ecosystem along with new workload types such as AI/ML over unstructured data on object storage. In addition, there is a strong requirement to support BigQuery as a multi-cloud offering given cloud customers are opting for a multi-cloud footprint by default.
This paper describes BigLake, an evolution of BigQuery toward a multi-cloud lakehouse to address these customer requirements in novel ways. We describe three main innovations in this space. We first present BigLake tables, making open-source table formats (e.g., Apache Parquet, Iceberg) first class citizens, providing fine-grained governance enforcement and performance acceleration over these formats to BigQuery and other open-source analytics engines. Next, we cover the design and implementation of BigLake Object tables that allow BigQuery to integrate AI/ML for inferencing and processing over unstructured data. Finally, we present Omni, a platform for deploying BigQuery on non-GCP clouds, focusing on the infrastructure and operational innovations we made to provide an enterprise lakehouse product regardless of the cloud provider hosting the data.
View details
Preview abstract
The rapid emergence of cloud data warehouses like Google BigQuery has redefined the landscape of data analytics. With the growth of data volumes, such systems need to scale to tens to hundreds of EiB of data in the near future. This growth is accompanied by an increase in the number of objects stored and the amount of metadata such systems need to manage. Traditionally, Big Data systems have tried to reduce the amount of metadata in order to scale the system, often trading off with performance. In Google BigQuery, we built a metadata management system that demonstrates that massive scale can be achieved without such tradeoffs. We use the same distributed query processing and data management techniques that we use for managing data to handle Big metadata. Today, BigQuery uses these techniques to support queries over billions of objects and their metadata.
View details