Jump to Content

Data Management

Google is deeply engaged in Data Management research across a variety of topics with deep connections to Google products. We are building intelligent systems to discover, annotate, and explore structured data from the Web, and to surface them creatively through Google products, such as Search (e.g., structured snippets, Docs, and many others). The overarching goal is to create a plethora of structured data on the Web that maximally help Google users consume, interact and explore information. Through those projects, we study various cutting-edge data management research issues including information extraction and integration, large scale data analysis, effective data exploration, etc., using a variety of techniques, such as information retrieval, data mining and machine learning.

A major research effort involves the management of structured data within the enterprise. The goal is to discover, index, monitor, and organize this type of data in order to make it easier to access high-quality datasets. This type of data carries different, and often richer, semantics than structured data on the Web, which in turn raises new opportunities and technical challenges in their management.

Furthermore, Data Management research across Google allows us to build technologies that power Google's largest businesses through scalable, reliable, fast, and general-purpose infrastructure for large-scale data processing as a service. Some examples of such technologies include F1, the database serving our ads infrastructure; Mesa, a petabyte-scale analytic data warehousing system; and Dremel, for petabyte-scale data processing with interactive response times. Dremel is available for external customers to use as part of Google Cloud’s BigQuery.

Recent Publications

Preview abstract Table-based reasoning with large language models (LLMs) is a promising direction to tackle many table understanding tasks, such as table-based question answering and fact verification. Compared with generic reasoning, table-based reasoning requires the extraction of underlying semantics from both free-form questions and semi-structured tabular data. Chain-of-Thought and its similar approaches incorporate the reasoning chain in the form of textual context, but it is still an open question how to effectively leverage tabular data in the reasoning chain. We propose the Chain-of-Table framework, where tabular data is explicitly used in the reasoning chain as a proxy for intermediate thoughts. Specifically, we guide LLMs using in-context learning to iteratively generate operations and update the table to represent a tabular reasoning chain. LLMs can therefore dynamically plan the next operation based on the results of the previous ones. This continuous evolution of the table forms a chain, showing the reasoning process for a given tabular problem. The chain carries structured information of the intermediate results, enabling more accurate and reliable predictions. Chain-of-Table achieves new state-of-the-art performance on WikiTQ, FeTaQA, and TabFact benchmarks across multiple LLM choices. View details
Firestore: The NoSQL Serverless Database for the Application Developer
Ram Kesavan
David Gay
Daniel Thevessen
Jimit Shah
C. Mohan
2023 IEEE 39th International Conference on Data Engineering (ICDE), pp. 3367-3379
Preview abstract The recent years have seen an explosive growth in web and mobile application development. Such applications typically have rapid development cycles, expect mobile-friendly features and serverless characteristics such as rapid deployment (with minimal provisioning) capabilities, scalability to handle workload spikes, and convenient pay-as-you-go billing. Google’s Firestore is a NoSQL serverless database with real-time notification capability, and together with the Firebase ecosystem it greatly simplifies common app development challenges while letting the application developer focus primarily on their business logic and user experience. This paper presents the Firestore architecture, how it satisfies the aforementioned requirements, and how its real-time notification system works in tandem with Firebase client libraries to allow mobile applications to provide a smooth user experience even across network connectivity issues. View details
Are we cobblers without shoes? Making Computer Science data FAIR
Carole Goble
Communications of ACM, vol. 66 (1) (2023)
Preview abstract n/a (a Viewpoint article) View details
Preview abstract Building on the simplicity and power of declarative queries combined with strongly consistent transactional semantics has allowed the Spanner database to scale to many thousands of machines running an aggregate of over 2 billion queries per second on over 8 exabytes of data. This includes some of the largest applications in the world, serving well over a billion users each. The appetite for database storage continues to grow, potentially reaching zettabyte scale (1 billion terrabytes) by 2030. However, the end of Moore and Dennard scaling mean that the cost of the infrastructure to run those databases could grow much faster than it has in the past. In this talk I will give my perspective on the challenges to reaching zettabyte scale, and the hardware technologies and approaches most (and least) likely to be successful. View details
In-path Oracles for Road Networks
Debajyoti Ghosh
Kiran Khatter
Hanan Samet
International Journal of Geo-Information, vol. 12(7) (2023), pp. 277
Preview abstract Many spatial applications benefit from the fast answering of a seemingly simple spatial query --- ``Is a point of interest (POI) `in-path’ to the shortest path between a source and a destination?’’ In-path in this case refers to POI that are either on the shortest path or can be reached within a bounded yet small detour of the shortest path. The fast answering of the in-path queries is contingent on being able to determine if a POI is in-path or not without having to compute the shortest paths during run-time. Thus, this requires a precomputation solution. The key technical solution is the development of an in-path oracle that is based on precomputation which records pairs of sources and destinations that are in-path with respect to the given POI location. For a given road network with $n$ nodes and $m$ POIs, a $O(m \times n)$-sized oracle is envisioned based on the reduction of Well-Separated pair decomposition of the road network. Furthermore, the oracle can be indexed in a database using a B-tree and hundreds of thousands of in-path queries per second can be answered. Experimental results on real road network POI dataset showcase the superiority of this technique compared to a suitable baseline. The proposed approach can answer 1.5 million in-path queries/second compared to the few hundreds per second with existing approaches. View details
Progressive Partitioning for Parallelized Query Execution in Google’s Napa
Junichi Tatemura
Yanlai Huang
Jim Chen
Yupu Zhang
Kevin Lai
Divyakant Agrawal
Brad Adelberg
Shilpa Kolhar
49th International Conference on Very Large Data Bases, VLDB (2023), pp. 3475-3487
Preview abstract Napa powers Google's critical data warehouse needs. It utilizes Log-Structured Merge Tree (LSM) for real-time data ingestion and achieves sub-second query latency for billions of queries per day. Napa handles a wide variety of query workloads: from full-table scans, to range scans, and multi-key lookups. Our design challenge is to handle this diverse query workload that runs concurrently. In particular, a large percentage of our query volume consists of external reporting queries characterized by multi-key lookups with strict sub-second query latency targets. Query parallelization, which is achieved by processing a query in parallel by partitioning the input data (i.e., the SIMD model of computation), is an important technique to meet the low latency targets. Traditionally, the effectiveness of parallelization of a query is highly dependent on the alignment with the data partitioning established at write time. Unfortunately, such a write-time partitioning scheme cannot handle the highly variable parallelization requirements that are needed on a per-query basis. The key to Napa’s success is its ability to adapt its query parallelization requirements on a per-query basis. This paper describes an index-based approach to perform data partitioning for queries that have sub-second latency requirements. Napa’s approach is progressive in that it can provide good partitioning within the time budgeted for partitioning. Since the end-to-end query time also includes the time to perform partitioning there is a tradeoff in terms of the time spent for partitioning and the resulting evenness of the partitioning. Our approach balances these opposing considerations to provide sub-second querying for billions of queries each day. We use production data to establish the effectiveness of Napa’s approach across easy to handle workloads to the most pathological conditions. View details