Jagan Sankaranarayanan
I work at Google on their data infrastructure team. In my previous life, I was the department head of the data management group at NEC Labs America. I got my doctoral degree in Computer Science from the University of Maryland in 2008 where my doctoral advisor was Prof. Hanan Samet.
Research Areas
Authored Publications
Sort By
Cross-lingual text clustering in a large system
Nicole R. Schneider
Hanan Samet
2023 7th International Conference on Natural Language Processing and Information Retrieval (NLPIR 2023) (2023) (to appear)
Preview abstract
The multilingual world needs systems that can cluster text written
in multiple languages into the same thread or topic. A practical
approach for clustering text in different languages is to first translate
into a common language, such as English, and then cluster it post-
translation. While this approach seems sensible, the performance
and pitfalls of this approach have not been well studied. The
reference architecture used for the study is a news system that
has continuously indexed news articles over many years in over
19 languages. Through the analysis of these documents and their
clusters, the clustering quality is shown to be dependent on the
translator’s ability to normalize proper noun usage, the geographic
focus of the text, and the document topic.
View details
Progressive Partitioning for Parallelized Query Execution in Google’s Napa
Junichi Tatemura
Yanlai Huang
Jim Chen
Yupu Zhang
Kevin Lai
Divyakant Agrawal
Brad Adelberg
Shilpa Kolhar
Indrajit Roy
49th International Conference on Very Large Data Bases, VLDB (2023), pp. 3475-3487
Preview abstract
Napa powers Google's critical data warehouse needs. It utilizes Log-Structured Merge Tree (LSM) for real-time data ingestion and achieves sub-second query latency for billions of queries per day. Napa handles a wide variety of query workloads: from full-table scans, to range scans, and multi-key lookups. Our design challenge is to handle this diverse query workload that runs concurrently. In particular, a large percentage of our query volume consists of external reporting queries characterized by multi-key lookups with strict sub-second query latency targets.
Query parallelization, which is achieved by processing a query in parallel by partitioning the input data (i.e., the SIMD model of computation), is an important technique to meet the low latency targets. Traditionally, the effectiveness of parallelization of a query is highly dependent on the alignment with the data partitioning established at write time. Unfortunately, such a write-time partitioning scheme cannot handle the highly variable parallelization requirements that are needed on a per-query basis.
The key to Napa’s success is its ability to adapt its query parallelization requirements on a per-query basis. This paper describes an index-based approach to perform data partitioning for queries that have sub-second latency requirements. Napa’s approach is progressive in that it can provide good partitioning within the time budgeted for partitioning. Since the end-to-end query time also includes the time to perform partitioning there is a tradeoff in terms of the time spent for partitioning and the resulting evenness of the partitioning. Our approach balances these opposing considerations to provide sub-second querying for billions of queries each day. We use production data to establish the effectiveness of Napa’s approach across easy to handle workloads to the most pathological conditions.
View details
In-path Oracles for Road Networks
Debajyoti Ghosh
Kiran Khatter
Hanan Samet
International Journal of Geo-Information, 12(7) (2023), pp. 277
Preview abstract
Many spatial applications benefit from the fast answering of a seemingly simple spatial query --- ``Is a point of interest (POI) `in-path’ to the shortest path between a source and a destination?’’ In-path in this case refers to POI that are either on the shortest path or can be reached within a bounded yet small detour of the shortest path. The fast answering of the in-path queries is contingent on being able to determine if a POI is in-path or not without having to compute the shortest paths during run-time. Thus, this requires a precomputation solution.
The key technical solution is the development of an in-path oracle that is based on precomputation which records pairs of sources and destinations that are in-path with respect to the given POI location. For a given road network with $n$ nodes and $m$ POIs, a $O(m \times n)$-sized oracle is envisioned based on the reduction of Well-Separated pair decomposition of the road network. Furthermore, the oracle can be indexed in a database using a B-tree and hundreds of thousands of in-path queries per second can be answered. Experimental results on real road network POI dataset showcase the superiority of this technique compared to a suitable baseline. The proposed approach can answer 1.5 million in-path queries/second compared to the few hundreds per second with existing approaches.
View details
Preview abstract
In the new “gig” economy, a user plays the role of a consumer as well as a service provider. As a service provider, drivers travelling from a source to a destination may opportunistically pickup and drop-off packages along the way if that does not add significantly to their trip distance or time. This gives rise to a new business offering called Package Delivery as a Service (PDaaS) that brokers package pickups and deliveries at one end and connects them to drivers on the other end, thus creating an ecosystem of supply and demand. The dramatic cost savings of such a service model come from the fact that the driver is already en-route to their destination and the package delivery adds a small overhead to an already pre-planned trip. From a technical perspective, this problem introduces new technical challenges that are uncommon in the literature. The driver may want to optimise for distance or time. Furthermore, new packages arrive for delivery all the time and are assigned to various drivers continuously. This means that the algorithm has to work in an environment that is dynamic, thereby precluding most standard road network precomputation efforts. Furthermore, the number of packages that are available for delivery could be in the hundreds of thousands, which has to be quickly pruned down for the algorithm to scale. The paper proposes a variation called dual Dijkstra’s that combines a forward and a backward scan in order to find delivery options that satisfy the constraints specified by the driver. The new dual heuristic improves the standard single Dijkstra’s approach by narrowing down the search space, thus resulting in significant speed-ups over the standard algorithms. Furthermore, a combination of dual Dijkstra’s with a heuristic landmark approach results in a dramatic speed-up compared to the baseline algorithms. Experimental results show that the proposed approach can offer drivers a choice of packages to deliver under specified constraints of time or distance, and providing sub-second response time despite the complexity of the problem involved. As the number of packages in the system increases, the matchmaking process becomes easier resulting in faster response times. The scalability of the PDaaS infrastructure is demonstrated using extensive experimental results.
View details
Napa: Powering Scalable Data Warehousing with Robust Query Performance at Google
Kevin Lai
Indrajit Roy
Min Chen
Jim Chen
Ming Dai
Thanh Do
Haoyu Gao
Haoyan Geng
Raman Grover
Bo Huang
Yanlai Huang
Adam Li
Jianyi Liang
Tao Lin
Li Liu
Yao Liu
Xi Mao
Maya Meng
Prashant Mishra
Jay Patel
Vijayshankar Raman
Sourashis Roy
Mayank Singh Shishodia
Tianhang Sun
Justin Tang
Junichi Tatemura
Sagar Trehan
Ramkumar Vadali
Prasanna Venkatasubramanian
Joey Zhang
Kefei Zhang
Yupu Zhang
Zeleng Zhuang
Divyakanth Agrawal
Jeff Naughton
Sujata Sunil Kosalge
Hakan Hacıgümüş
Proceedings of the VLDB Endowment (PVLDB), 14 (12) (2021), pp. 2986-2998
Preview abstract
There are numerous Google services that continuously generate vast amounts of log data that are used to provide valuable insights to internal and external business users. We need to store and serve these planet-scale data sets under extremely demanding requirements of scalability, sub-second query response times, availability even in the case of entire data center failures, strong consistency guarantees, ingesting a massive stream of updates coming from the applications used around the globe. We have developed and deployed in production an analytical data management system, called Napa, to meet these requirements. Napa is the backend for multiple internal and external clients in Google so there is a strong expectation of variance-free robust query performance. At its core, Napa’s principal technologies for robust query performance include the aggressive use of materialized views that are maintained consistently as new data is ingested across multiple data centers. Our clients also demand flexibility in being able to adjust their query performance, data freshness, and costs to suit their unique needs. Robust query processing and flexible configuration of client databases are the hallmark of Napa design. Most of the related work in this area takes advantage of full flexibility to design the whole system without the need to support a diverse set of preexisting use cases, whereas Napa needs to deal with the hard constraints of applications that differ on which characteristics of the system are most important to optimize. Those constraints led us to make particular design decisions and also devise new techniques to meet the challenges. In this paper, we share our experiences in designing, implementing, deploying, and running Napa in production with some of Google’s most demanding applications.
View details