Ashish Gupta
Authored Publications
F1 Query: Declarative Querying at Scale
Bart Samwel
Ben Handy
Jason Govig
Chanjun Yang
Daniel Tenedorio
Felix Weigel
David G Wilhite
Jiacheng Yang
Jun Xu
Jiexing Li
Zhan Yuan
Qiang Zeng
Ian Rae
Anurag Biyani
Andrew Harn
Yang Xia
Andrey Gubichev
Amr El-Helw
Orri Erling
Allen Yan
Mohan Yang
Yiqun Wei
Thanh Do
Colin Zheng
Somayeh Sardashti
Ahmed Aly
Divy Agrawal
Shivakumar Venkataraman
PVLDB (2018), pp. 1835-1848
F1 Query is a stand-alone, federated query processing platform that executes SQL queries against data stored in different file-based formats as well as different storage systems (e.g., BigTable, Spanner, Google Spreadsheets, etc.). F1 Query eliminates the need to maintain the traditional distinction between different types of data processing workloads by simultaneously supporting: (i) OLTP-style point queries that affect only a few records; (ii) low-latency OLAP querying of large amounts of data; and (iii) large ETL pipelines transforming data from multiple data sources into formats more suitable for analysis and reporting. F1 Query has also significantly reduced the need for developing hard-coded data processing pipelines by enabling declarative queries integrated with custom business logic. F1 Query satisfies key requirements that are highly desirable within Google: (i) it provides a unified view over data that is fragmented and distributed over multiple data sources; (ii) it leverages datacenter resources for performant query processing with high throughput and low latency; (iii) it provides high scalability for large data sizes by increasing computational parallelism; and (iv) it is extensible and uses innovative approaches to integrate complex business logic in declarative query processing. This paper presents the end-to-end design of F1 Query. Evolved out of F1, the distributed database that Google uses to manage its advertising data, F1 Query has been in production for multiple years at Google and serves the querying needs of a large number of users and systems.
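To picture the federated execution model the abstract describes: conceptually, a thin engine parses one declarative request and fans per-source scans out to connectors for each underlying storage system. The Python sketch below illustrates only that idea; the class and function names are invented for illustration and are not part of F1 Query's actual API.

```python
# Minimal sketch of federated execution: one declarative request,
# many storage backends. All names here are hypothetical, not F1 APIs.
from typing import Callable, Iterable

class DataSource:
    """A per-backend connector (stand-in for e.g. a BigTable or Spanner scan)."""
    def __init__(self, name: str, rows: list[dict]):
        self.name = name
        self.rows = rows

    def scan(self, predicate: Callable[[dict], bool]) -> Iterable[dict]:
        # A real engine would push the predicate down to the backend;
        # this sketch just filters locally.
        return (r for r in self.rows if predicate(r))

def federated_query(sources: list[DataSource],
                    predicate: Callable[[dict], bool]) -> list[dict]:
    """Union the matching rows from every registered source."""
    out: list[dict] = []
    for src in sources:
        out.extend(src.scan(predicate))
    return out

if __name__ == "__main__":
    ads = DataSource("spanner_ads", [{"campaign": "c1", "clicks": 10}])
    logs = DataSource("bigtable_logs", [{"campaign": "c2", "clicks": 7}])
    print(federated_query([ads, logs], lambda r: r["clicks"] > 5))
```

In the real system, execution is distributed across datacenter resources and as much work as possible is pushed into the sources; the sketch keeps everything local to stay minimal.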
Ubiq: A Scalable and Fault-tolerant Log Processing Infrastructure
Alexander Smolyanov
Divy Agrawal
Haifeng Jiang
Manish Bhatia
Monica Chawathe Lenart
Namit Sikka
Navin Melville
Scott Holzer
Shan He
Shivakumar Venkataraman
Tianhao Qiu
Venkatesh Basker
Vinny Ganeshan
Yuri Vasilevski
Workshop on Business Intelligence for the Real Time Enterprise (BIRTE), Springer (2016)
Most of today's Internet applications are data-centric and generate vast amounts of data (typically, in the form of event logs) that need to be processed and analyzed for detailed reporting, enhancing user experience and increasing monetization. In this paper, we describe the architecture of Ubiq, a geographically distributed framework for processing continuously growing log files in real time with high scalability, high availability and low latency. The Ubiq framework fully tolerates infrastructure degradation and datacenter-level outages without any manual intervention. It also guarantees exactly-once semantics for application pipelines to process logs in the form of event bundles. Ubiq has been in production for Google's advertising system for many years and has served as a critical log processing framework for hundreds of pipelines. Our production deployment demonstrates linear scalability with machine resources, extremely high availability even with underlying infrastructure failures, and an end-to-end latency of under a minute.
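The exactly-once guarantee over event bundles can be pictured as an idempotent commit: a bundle's effects are installed at most once, so redeliveries after failures are harmless. Below is a minimal Python sketch of that idea, with an in-memory set standing in for Ubiq's replicated persistent state (all names are invented for illustration).

```python
# Sketch of exactly-once processing of log "event bundles" via an
# idempotent commit marker. Storage and names are illustrative only.
committed: set[str] = set()   # stand-in for replicated, persistent state

def handle(event: str) -> None:
    print("processed", event)  # application-specific processing

def process_bundle(bundle_id: str, events: list[str]) -> bool:
    """Apply a bundle at most once, even if it is redelivered."""
    if bundle_id in committed:
        return False              # duplicate delivery: safely skipped
    for event in events:
        handle(event)
    # NOTE: a real system must make the processing and this commit
    # atomic (or make handle() idempotent) to truly get exactly-once.
    committed.add(bundle_id)
    return True

# Redelivery (e.g. after a worker crash) does not double-process:
process_bundle("bundle-42", ["click", "impression"])
process_bundle("bundle-42", ["click", "impression"])  # no-op
```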
High-Availability at Massive Scale: Building Google’s Data Infrastructure for Ads
Workshop on Business Intelligence for the Real Time Enterprise (BIRTE), Springer (2015)
Google's Ads Data Infrastructure systems run the multi-billion dollar ads business at Google. High availability and strong consistency are critical for these systems. While most distributed systems handle machine-level failures well, handling datacenter-level failures is less common. In our experience, handling datacenter-level failures is critical for running true high availability systems. Most of our systems (e.g. Photon, F1, Mesa) now support multi-homing as a fundamental design property. Multi-homed systems run live in multiple datacenters all the time, adaptively moving load between datacenters, with the ability to handle outages of any scale completely transparently.
This paper focuses primarily on stream processing systems, and describes our general approaches for building high availability multi-homed systems, discusses common challenges and solutions, and shares what we have learned in building and running these large-scale systems for over ten years.
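As a rough illustration of the multi-homing property described above, the toy router below keeps live replicas in several datacenters and shifts work away from an unhealthy one with no operator action. Everything here (datacenter names, the health flag) is invented for the sketch; real multi-homed systems must also keep state consistent across sites.

```python
# Toy illustration of multi-homed routing: work flows to healthy
# datacenters, and an outage shifts load transparently.
import itertools

class Datacenter:
    def __init__(self, name: str):
        self.name = name
        self.healthy = True

    def process(self, work: str) -> str:
        return f"{work} handled in {self.name}"

class MultiHomedRouter:
    def __init__(self, datacenters: list[Datacenter]):
        self.dcs = datacenters
        self._rr = itertools.cycle(range(len(datacenters)))

    def submit(self, work: str) -> str:
        # Try datacenters round-robin, skipping unhealthy ones.
        for _ in range(len(self.dcs)):
            dc = self.dcs[next(self._rr)]
            if dc.healthy:
                return dc.process(work)
        raise RuntimeError("no healthy datacenter available")

router = MultiHomedRouter([Datacenter("dc-east"), Datacenter("dc-west")])
print(router.submit("join-clicks-batch-1"))
router.dcs[0].healthy = False                # simulated datacenter outage
print(router.submit("join-clicks-batch-2"))  # transparently served by dc-west
```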
Mesa: Geo-Replicated, Near Real-Time, Scalable Data Warehousing
Fan Yang
Jason Govig
Adam Kirsch
Kelvin Chan
Kevin Lai
Shuo Wu
Sandeep Dhoot
Abhilash Kumar
Mingsheng Hong
Jamie Cameron
Masood Siddiqi
David Jones
Andrey Gubarev
Shivakumar Venkataraman
Divyakant Agrawal
VLDB (2014)
Mesa is a highly scalable analytic data warehousing system that stores critical measurement data related to Google's Internet advertising business. Mesa is designed to satisfy a complex and challenging set of user and systems requirements, including near real-time data ingestion and queryability, as well as high availability, reliability, fault tolerance, and scalability for large data and query volumes. Specifically, Mesa handles petabytes of data, processes millions of row updates per second, and serves billions of queries that fetch trillions of rows per day. Mesa is geo-replicated across multiple datacenters and provides consistent and repeatable query answers at low latency, even when an entire datacenter fails. This paper presents the Mesa system and reports the performance and scale that it achieves.
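One detail from the Mesa paper helps explain the "consistent and repeatable query answers" claim: updates are applied atomically in batches, each creating a new table version, and a query pinned to a version always aggregates the same set of deltas. The Python sketch below is a deliberately simplified take on that versioned-delta idea; the layout and names are illustrative, not Mesa's actual format.

```python
# Simplified sketch of versioned, aggregatable deltas: a query at a
# fixed version always merges the same updates, so answers repeat.

# Each delta is one atomically-applied update batch: key -> additive metric.
deltas: list[dict[str, int]] = []

def apply_update(batch: dict[str, int]) -> int:
    """Install a new update batch; returns the version it creates."""
    deltas.append(batch)
    return len(deltas)            # version n == first n deltas

def query(key: str, version: int) -> int:
    """Aggregate a key across all deltas up to `version` (SUM here)."""
    return sum(d.get(key, 0) for d in deltas[:version])

v1 = apply_update({"campaign:c1": 10})
v2 = apply_update({"campaign:c1": 5, "campaign:c2": 3})
print(query("campaign:c1", v1))   # 10: repeatable at version 1
print(query("campaign:c1", v2))   # 15: includes the second batch
```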
Photon: Fault-tolerant and Scalable Joining of Continuous Data Streams
Rajagopal Ananthanarayanan
Venkatesh Basker
Sumit Das
Haifeng Jiang
Tianhao Qiu
Alexey Reznichenko
Deomid Ryabkov
Shivakumar Venkataraman
SIGMOD '13: Proceedings of the 2013 International Conference on Management of Data, ACM, New York, NY, USA, pp. 577-588
Web-based enterprises process events generated by millions of users interacting with their websites. Rich statistical data distilled from combining such interactions in near real-time generates enormous business value. In this paper, we describe the architecture of Photon, a geographically distributed system for joining multiple continuously flowing streams of data in real-time with high scalability and low latency, where the streams may be unordered or delayed. The system fully tolerates infrastructure degradation and datacenter-level outages without any manual intervention. Photon guarantees that there will be no duplicates in the joined output (at-most-once semantics) at any point in time, that most joinable events will be present in the output in real-time (near-exact semantics), and exactly-once semantics eventually.
Photon is deployed within Google Advertising System to join data streams such as web search queries and user clicks on advertisements. It produces joined logs that are used to derive key business metrics, including billing for advertisers. Our production deployment processes millions of events per minute at peak with an average end-to-end latency of less than 10 seconds. We also present challenges and solutions in maintaining large persistent state across geographically distant locations, and highlight the design principles that emerged from our experience.
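The core join can be pictured as pairing each click with its previously seen query event while guaranteeing that a click is emitted at most once. In the toy Python sketch below, an in-memory set stands in for Photon's replicated, persistent ID registry, and all names are illustrative rather than Photon's actual interfaces.

```python
# Toy stream join with at-most-once output: a click joins its query
# event once; replays are dropped. The real registry is replicated
# across datacenters; this in-memory set is only a stand-in.
queries: dict[str, dict] = {}   # query_id -> query event (from search logs)
emitted: set[str] = set()       # click ids already joined (dedup state)

def on_query(query_id: str, event: dict) -> None:
    queries[query_id] = event

def on_click(click_id: str, query_id: str) -> dict | None:
    if click_id in emitted:
        return None                   # duplicate delivery: drop
    q = queries.get(query_id)
    if q is None:
        return None                   # query not seen yet; a real system retries
    emitted.add(click_id)             # commit before emitting downstream
    return {"click_id": click_id, **q}

on_query("q1", {"terms": "running shoes"})
print(on_click("c1", "q1"))   # joined record
print(on_click("c1", "q1"))   # None: replay is deduplicated
```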