Gokulnath Babu Manoharan

Authored Publications
    Progressive Partitioning for Parallelized Query Execution in Google’s Napa
    Junichi Tatemura
    Yanlai Huang
    Jim Chen
    Yupu Zhang
    Kevin Lai
    Divyakant Agrawal
    Brad Adelberg
    Shilpa Kolhar
    Indrajit Roy
    49th International Conference on Very Large Data Bases, VLDB (2023), pp. 3475-3487
    Napa powers Google's critical data warehouse needs. It utilizes a Log-Structured Merge Tree (LSM) for real-time data ingestion and achieves sub-second query latency for billions of queries per day. Napa handles a wide variety of query workloads, from full-table scans to range scans and multi-key lookups. Our design challenge is to handle this diverse query workload that runs concurrently. In particular, a large percentage of our query volume consists of external reporting queries characterized by multi-key lookups with strict sub-second query latency targets. Query parallelization, which is achieved by processing a query in parallel by partitioning the input data (i.e., the SIMD model of computation), is an important technique to meet the low latency targets. Traditionally, the effectiveness of parallelization of a query is highly dependent on the alignment with the data partitioning established at write time. Unfortunately, such a write-time partitioning scheme cannot handle the highly variable parallelization requirements that are needed on a per-query basis. The key to Napa’s success is its ability to adapt its query parallelization requirements on a per-query basis. This paper describes an index-based approach to perform data partitioning for queries that have sub-second latency requirements. Napa’s approach is progressive in that it can provide good partitioning within the time budgeted for partitioning. Since the end-to-end query time also includes the time to perform partitioning, there is a tradeoff between the time spent on partitioning and the resulting evenness of the partitioning. Our approach balances these opposing considerations to provide sub-second querying for billions of queries each day. We use production data to establish the effectiveness of Napa’s approach across conditions ranging from easy-to-handle workloads to the most pathological ones.
    Napa: Powering Scalable Data Warehousing with Robust Query Performance at Google
    Kevin Lai
    Indrajit Roy
    Min Chen
    Jim Chen
    Ming Dai
    Thanh Do
    Haoyu Gao
    Haoyan Geng
    Raman Grover
    Bo Huang
    Yanlai Huang
    Adam Li
    Jianyi Liang
    Tao Lin
    Li Liu
    Yao Liu
    Xi Mao
    Maya Meng
    Prashant Mishra
    Jay Patel
    Vijayshankar Raman
    Sourashis Roy
    Mayank Singh Shishodia
    Tianhang Sun
    Justin Tang
    Junichi Tatemura
    Sagar Trehan
    Ramkumar Vadali
    Prasanna Venkatasubramanian
    Joey Zhang
    Kefei Zhang
    Yupu Zhang
    Zeleng Zhuang
    Divyakant Agrawal
    Jeff Naughton
    Sujata Sunil Kosalge
    Hakan Hacıgümüş
    Proceedings of the VLDB Endowment (PVLDB), 14 (12) (2021), pp. 2986-2998
    There are numerous Google services that continuously generate vast amounts of log data that are used to provide valuable insights to internal and external business users. We need to store and serve these planet-scale data sets under extremely demanding requirements: scalability, sub-second query response times, availability even in the case of entire data center failures, strong consistency guarantees, and ingestion of a massive stream of updates coming from applications used around the globe. We have developed and deployed in production an analytical data management system, called Napa, to meet these requirements. Napa is the backend for multiple internal and external clients in Google, so there is a strong expectation of variance-free robust query performance. At its core, Napa’s principal technologies for robust query performance include the aggressive use of materialized views that are maintained consistently as new data is ingested across multiple data centers. Our clients also demand flexibility in being able to adjust their query performance, data freshness, and costs to suit their unique needs. Robust query processing and flexible configuration of client databases are the hallmarks of Napa’s design. Most of the related work in this area takes advantage of full flexibility to design the whole system without the need to support a diverse set of preexisting use cases, whereas Napa needs to deal with the hard constraints of applications that differ on which characteristics of the system are most important to optimize. Those constraints led us to make particular design decisions and also devise new techniques to meet the challenges. In this paper, we share our experiences in designing, implementing, deploying, and running Napa in production with some of Google’s most demanding applications.
    Shasta: Interactive Reporting at Scale
    Stephan Ellner
    Stephan Gudmundson
    Apurv Gupta
    Ben Handy
    Bart Samwel
    Chad Whipkey
    Larysa Aharkava
    Jun Xu
    Shivakumar Venkataraman
    Divy Agrawal
    Jeffrey D. Ullman
    SIGMOD, San Francisco, CA (2016) (to appear)
    We describe Shasta, a middleware system built at Google to support interactive reporting in complex user-facing applications related to Google’s Internet advertising business. Shasta targets applications with challenging requirements: First, user query latencies must be low. Second, underlying transactional data stores have complex “read-unfriendly” schemas, placing significant transformation logic between stored data and the read-only views that Shasta exposes to its clients. This transformation logic must be expressed in a way that scales to large and agile engineering teams. Finally, Shasta targets applications with strong data freshness requirements, making it challenging to precompute query results using common techniques such as ETL pipelines or materialized views. Instead, online queries must go all the way from primary storage to user-facing views, resulting in complex queries joining 50 or more tables. Designed as a layer on top of Google’s F1 RDBMS and Mesa data warehouse, Shasta combines language and system techniques to meet these requirements. To help with expressing complex view specifications, we developed a query language called RVL, with support for modularized view templates that can be dynamically compiled into SQL. To execute these SQL queries with low latency at scale, we leveraged and extended F1’s distributed query engine with facilities such as safe execution of C++ and Java UDFs. To reduce latency and increase read parallelism, we extended F1 storage with a distributed read-only in-memory cache. The system we describe is in production at Google, powering critical applications used by advertisers and internal sales teams. Shasta has significantly improved system scalability and software engineering efficiency compared to the middleware solutions it replaced.