
Ashish Gupta
Authored Publications
F1 Query: Declarative Querying at Scale
Bart Samwel
Ahmed Aly
Thanh Do
Somayeh Sardashti
Jiexing Li
Jiacheng Yang
Chanjun Yang
Jason Govig
Andrew Harn
Zhan Yuan
Daniel Tenedorio
Colin Zheng
Allen Yan
Orri Erling
Yang Xia
Qiang Zeng
Divy Agrawal
Jun Xu
Mohan Yang
Andrey Gubichev
Felix Weigel
Yiqun Wei
Ben Handy
Anurag Biyani
Ian Rae
Amr El-Helw
Shivakumar Venkataraman
David G Wilhite
PVLDB (2018), pp. 1835-1848
F1 Query is a stand-alone, federated query processing platform that executes SQL queries against data stored in different file-based formats as well as different storage systems (e.g., BigTable, Spanner, Google Spreadsheets, etc.). F1 Query eliminates the need to maintain the traditional distinction between different types of data processing workloads by simultaneously supporting: (i) OLTP-style point queries that affect only a few records; (ii) low-latency OLAP querying of large amounts of data; and (iii) large ETL pipelines transforming data from multiple data sources into formats more suitable for analysis and reporting. F1 Query has also significantly reduced the need for developing hard-coded data processing pipelines by enabling declarative queries integrated with custom business logic. F1 Query satisfies key requirements that are highly desirable within Google: (i) it provides a unified view over data that is fragmented and distributed over multiple data sources; (ii) it leverages datacenter resources for performant query processing with high throughput and low latency; (iii) it provides high scalability for large data sizes by increasing computational parallelism; and (iv) it is extensible and uses innovative approaches to integrate complex business logic in declarative query processing. This paper presents the end-to-end design of F1 Query. Evolved out of F1, the distributed database that Google uses to manage its advertising data, F1 Query has been in production for multiple years at Google and serves the querying needs of a large number of users and systems.
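To make the federated design concrete, here is a minimal Python sketch of the core idea: a single query layer that routes table scans to per-source connectors and composes the results. All identifiers below (the connectors, tables, and federated_join helper) are hypothetical illustrations, not F1 Query's actual API.

```python
# Minimal sketch of federated query execution in the style F1 Query
# describes: one query layer, many storage backends. All names here
# (the connectors, sample tables, federated_join) are hypothetical.

from typing import Callable, Dict, Iterable, List, Tuple

Row = Dict[str, object]
Connector = Callable[[str], Iterable[Row]]  # table name -> rows

def bigtable_scan(table: str) -> Iterable[Row]:
    # Stand-in for a BigTable read; a real system plugs in real clients.
    yield {"campaign_id": 1, "clicks": 42}
    yield {"campaign_id": 2, "clicks": 7}

def spanner_scan(table: str) -> Iterable[Row]:
    # Stand-in for a Spanner read.
    yield {"campaign_id": 1, "name": "spring_sale"}
    yield {"campaign_id": 2, "name": "brand"}

CONNECTORS: Dict[str, Connector] = {
    "bigtable": bigtable_scan,
    "spanner": spanner_scan,
}

def federated_join(left: Tuple[str, str], right: Tuple[str, str],
                   key: str) -> List[Row]:
    """Scan two tables from (possibly different) backends, hash-join on key."""
    lsrc, ltab = left
    rsrc, rtab = right
    build = {row[key]: row for row in CONNECTORS[rsrc](rtab)}
    out = []
    for row in CONNECTORS[lsrc](ltab):
        match = build.get(row[key])
        if match:
            out.append({**row, **match})
    return out

if __name__ == "__main__":
    for row in federated_join(("bigtable", "clicks"),
                              ("spanner", "campaigns"),
                              key="campaign_id"):
        print(row)
```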
Ubiq: A Scalable and Fault-tolerant Log Processing Infrastructure
Vinny Ganeshan
Venkatesh Basker
Tianhao Qiu
Namit Sikka
Navin Melville
Scott Holzer
Divy Agrawal
Alexander Smolyanov
Haifeng Jiang
Manish Bhatia
Shan He
Yuri Vasilevski
Monica Chawathe Lenart
Shivakumar Venkataraman
Workshop on Business Intelligence for the Real Time Enterprise (BIRTE), Springer (2016)
Most of today’s Internet applications are data-centric and generate vast amounts of data (typically in the form of event logs) that need to be processed and analyzed for detailed reporting, enhanced user experience, and increased monetization. In this paper, we describe the architecture of Ubiq, a geographically distributed framework for processing continuously growing log files in real time with high scalability, high availability, and low latency. The Ubiq framework fully tolerates infrastructure degradation and datacenter-level outages without any manual intervention. It also guarantees exactly-once semantics for application pipelines to process logs in the form of event bundles. Ubiq has been in production for Google’s advertising system for many years and has served as a critical log processing framework for hundreds of pipelines. Our production deployment demonstrates linear scalability with machine resources, extremely high availability even with underlying infrastructure failures, and an end-to-end latency of under a minute.
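The exactly-once guarantee for event bundles can be sketched as at-least-once delivery plus idempotent commits keyed by a bundle id, so a redelivered bundle has no visible effect. The Python sketch below assumes that framing; the Bundle type and ExactlyOnceProcessor are hypothetical, and the committed-id set stands in for state that Ubiq replicates across datacenters.

```python
# Sketch of the exactly-once property Ubiq guarantees for event bundles:
# bundles may be delivered more than once, but each bundle id is committed
# exactly once, so reprocessing after a retry or failover is a no-op.
# These names are illustrative, not Ubiq's API.

from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class Bundle:
    bundle_id: str      # stable id assigned when the bundle is cut from the log
    events: List[str]

class ExactlyOnceProcessor:
    def __init__(self):
        # In Ubiq this state is replicated across datacenters; here it is local.
        self.committed = set()

    def handle(self, bundle: Bundle) -> bool:
        """Process a bundle; return False if it was already applied."""
        if bundle.bundle_id in self.committed:
            return False                # duplicate delivery: drop it
        for event in bundle.events:
            pass                        # application pipeline logic goes here
        self.committed.add(bundle.bundle_id)  # commit atomically with the output
        return True

if __name__ == "__main__":
    p = ExactlyOnceProcessor()
    b = Bundle("log-000042", ["click", "impression"])
    assert p.handle(b) is True    # first delivery is applied
    assert p.handle(b) is False   # redelivery after a retry is a no-op
```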
High-Availability at Massive Scale: Building Google’s Data Infrastructure for Ads
Workshop on Business Intelligence for the Real Time Enterprise (BIRTE), Springer (2015)
Google’s Ads Data Infrastructure systems run the multi-billion dollar ads business at Google. High availability and strong consistency are critical for these systems. While most distributed systems handle machine-level failures well, handling datacenter-level failures is less common. In our experience, handling datacenter-level failures is critical for running true high availability systems. Most of our systems (e.g. Photon, F1, Mesa) now support multi-homing as a fundamental design property. Multi-homed systems run live in multiple datacenters all the time, adaptively moving load between datacenters, with the ability to handle outages of any scale completely transparently. This paper focuses primarily on stream processing systems, and describes our general approaches for building high availability multi-homed systems, discusses common challenges and solutions, and shares what we have learned in building and running these large-scale systems for over ten years.
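A minimal sketch of the multi-homing idea, under the simplifying assumption that the replicated state is a single shared checkpoint: the pipeline runs live in every datacenter, and any healthy site can claim the next unit of work, so an outage at one site is invisible to the output. All names below are hypothetical.

```python
# Sketch of multi-homing: the same pipeline runs in several datacenters
# against shared, replicated state, so work moves transparently when one
# site degrades. The health model here is deliberately simplistic.

import random
from typing import List

class SharedCheckpoint:
    """Stand-in for globally replicated state (e.g., a Paxos-backed store)."""
    def __init__(self):
        self.next_offset = 0

def healthy(datacenter: str, outages: set) -> bool:
    return datacenter not in outages

def run_step(datacenters: List[str], outages: set,
             checkpoint: SharedCheckpoint) -> str:
    """Any healthy site may claim the next unit of work; progress continues
    as long as at least one datacenter is up."""
    candidates = [dc for dc in datacenters if healthy(dc, outages)]
    if not candidates:
        raise RuntimeError("total outage")
    worker = random.choice(candidates)   # adaptive load movement, much simplified
    offset = checkpoint.next_offset
    checkpoint.next_offset += 1          # atomic in the real replicated store
    return f"{worker} processed work unit {offset}"

if __name__ == "__main__":
    ckpt = SharedCheckpoint()
    dcs = ["dc-a", "dc-b", "dc-c"]
    print(run_step(dcs, outages=set(), checkpoint=ckpt))
    print(run_step(dcs, outages={"dc-a"}, checkpoint=ckpt))  # outage is transparent
```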
Mesa: Geo-Replicated, Near Real-Time, Scalable Data Warehousing
Shuo Wu
Fan Yang
Sandeep Dhoot
Adam Kirsch
David Jones
Jason Govig
Kevin Lai
Masood Siddiqi
Jamie Cameron
Kelvin Chan
Divyakant Agrawal
Abhilash Kumar
Mingsheng Hong
Andrey Gubarev
Shivakumar Venkataraman
VLDB (2014)
Mesa is a highly scalable analytic data warehousing system that stores critical measurement data related to Google's Internet advertising business. Mesa is designed to satisfy a complex and challenging set of user and systems requirements, including near real-time data ingestion and queryability, as well as high availability, reliability, fault tolerance, and scalability for large data and query volumes. Specifically, Mesa handles petabytes of data, processes millions of row updates per second, and serves billions of queries that fetch trillions of rows per day. Mesa is geo-replicated across multiple datacenters and provides consistent and repeatable query answers at low latency, even when an entire datacenter fails. This paper presents the Mesa system and reports the performance and scale that it achieves.
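One way to picture Mesa's update model is a table that installs updates as atomically versioned batches and answers queries by aggregating all deltas up to a committed version, which is what makes answers consistent and repeatable. The Python sketch below illustrates that idea only; the class is hypothetical, and Mesa's actual delta-compaction storage is far more involved.

```python
# Sketch of a versioned, aggregating update model in the spirit of Mesa:
# update batches arrive stamped with increasing version numbers, rows with
# the same key aggregate (here: SUM), and a query reads a committed version.

from typing import Dict, List, Tuple

Key = Tuple[str, str]   # e.g., (campaign, country)

class MesaTable:
    def __init__(self):
        self.deltas: List[Dict[Key, int]] = []   # one delta batch per version
        self.committed_version = -1

    def apply_update(self, batch: Dict[Key, int]) -> int:
        """Install a new version atomically; readers never see a partial batch."""
        self.deltas.append(batch)
        self.committed_version += 1
        return self.committed_version

    def query(self, key: Key, version: int) -> int:
        """Aggregate (SUM) the key across all deltas up to the given version."""
        return sum(d.get(key, 0) for d in self.deltas[: version + 1])

if __name__ == "__main__":
    t = MesaTable()
    v0 = t.apply_update({("spring_sale", "US"): 10})
    v1 = t.apply_update({("spring_sale", "US"): 5})
    print(t.query(("spring_sale", "US"), v0))  # 10: repeatable answer at v0
    print(t.query(("spring_sale", "US"), v1))  # 15: includes the newer delta
```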
Photon: Fault-tolerant and Scalable Joining of Continuous Data Streams
Alexey Reznichenko
Deomid Ryabkov
Venkatesh Basker
Rajagopal Ananthanarayanan
Tianhao Qiu
Haifeng Jiang
Sumit Das
Shivakumar Venkataraman
SIGMOD '13: Proceedings of the 2013 International Conference on Management of Data, ACM, New York, NY, USA, pp. 577-588
Web-based enterprises process events generated by millions of users interacting with their websites. Rich statistical data distilled from combining such interactions in near real-time generates enormous business value. In this paper, we describe the architecture of Photon, a geographically distributed system for joining multiple continuously flowing streams of data in real time with high scalability and low latency, where the streams may be unordered or delayed. The system fully tolerates infrastructure degradation and datacenter-level outages without any manual intervention. Photon guarantees that there will be no duplicates in the joined output (at-most-once semantics) at any point in time, that most joinable events will be present in the output in real time (near-exact semantics), and that exactly-once semantics are achieved eventually.
Photon is deployed within Google's advertising system to join data streams such as web search queries and user clicks on advertisements. It produces joined logs that are used to derive key business metrics, including billing for advertisers. Our production deployment processes millions of events per minute at peak with an average end-to-end latency of less than 10 seconds. We also present challenges and solutions in maintaining large persistent state across geographically distant locations, and highlight the design principles that emerged from our experience.
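At its core, the join Photon performs can be sketched as a lookup of each click against a store of query events, with a registry of already-joined ids suppressing duplicates. The Python sketch below assumes that framing; the class and method names are hypothetical, and the joined_ids set stands in for Photon's Paxos-backed IdRegistry.

```python
# Sketch of a Photon-style stream join: click events join against stored
# query events on a shared id, and a registry of already-joined ids keeps
# the output duplicate-free even if a click is delivered twice.

from typing import Dict, Optional

class PhotonJoiner:
    def __init__(self):
        self.query_events: Dict[str, dict] = {}   # query_id -> query event
        self.joined_ids = set()                   # stand-in for the IdRegistry

    def on_query(self, query_id: str, event: dict) -> None:
        self.query_events[query_id] = event

    def on_click(self, query_id: str, click: dict) -> Optional[dict]:
        """Return the joined event once, or None (unjoinable yet, or duplicate)."""
        if query_id in self.joined_ids:
            return None                    # duplicate click: at-most-once output
        query = self.query_events.get(query_id)
        if query is None:
            return None                    # query log delayed; retry this click later
        self.joined_ids.add(query_id)      # registered before emitting, as in Photon
        return {**query, **click}

if __name__ == "__main__":
    j = PhotonJoiner()
    j.on_query("q1", {"query": "running shoes"})
    print(j.on_click("q1", {"ad": "ad-17"}))   # joined event
    print(j.on_click("q1", {"ad": "ad-17"}))   # None: duplicate suppressed
```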