Data Management

Google is deeply engaged in Data Management research across a variety of topics with deep connections to Google products. We are building intelligent systems to discover, annotate, and explore structured data from the Web, and to surface it creatively through Google products such as Search (e.g., structured snippets), Docs, and many others. The overarching goal is to create a plethora of structured data on the Web that maximally helps Google users consume, interact with, and explore information. Through these projects, we study various cutting-edge data management research issues, including information extraction and integration, large-scale data analysis, and effective data exploration, using a variety of techniques such as information retrieval, data mining, and machine learning.

A major research effort involves the management of structured data within the enterprise. The goal is to discover, index, monitor, and organize this type of data in order to make it easier to access high-quality datasets. This type of data carries different, and often richer, semantics than structured data on the Web, which in turn raises new opportunities and technical challenges in its management.

Furthermore, Data Management research across Google allows us to build technologies that power Google's largest businesses through scalable, reliable, fast, and general-purpose infrastructure for large-scale data processing as a service. Some examples of such technologies include F1, the database serving our ads infrastructure; Mesa, a petabyte-scale analytic data warehousing system; and Dremel, for petabyte-scale data processing with interactive response times. Dremel is available for external customers to use as part of Google Cloud’s BigQuery.

Recent Publications

Semantic data models express high-level business concepts and metrics, capturing the business logic needed to query a database correctly. Most data modeling solutions are built as layers above SQL query engines, with bespoke query languages or APIs. The layered approach means that semantic models can’t be used directly in SQL queries. This paper focuses on an open problem in this space – can we define semantic models in SQL, and make them naturally queryable in SQL? In parallel, graph query is becoming increasingly popular, including in SQL. SQL/PGQ extends SQL with an embedded subset of the GQL graph query language, adding property graph views and making graph traversal queries easy. We explore a surprising connection: semantic data models are graphs, and defining graphs is a data modeling problem. In both domains, users start by defining a graph model, and need query language support to easily traverse edges in the graph, which means doing joins in the underlying data. We propose some useful SQL extensions that make it easier to use higher-level data model abstractions in queries. Users can define a “semantic data graph” view of their data, encapsulating the complex business logic required to query the underlying tables correctly. Then they can query that semantic graph model easily with SQL. Our SQL extensions are useful independently, simplifying many queries – particularly, queries with joins. We make declared foreign key relationships usable for joins at query time – a feature that seems obvious but is notably missing in standard SQL. In combination, these extensions provide a practical approach to extend SQL incrementally, bringing semantic modeling and graph query together with the relational model and SQL.
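To make the idea of joining along declared foreign keys concrete, here is a minimal Python sketch using the standard-library sqlite3 module. It is not the SQL extension proposed in the paper; it only illustrates the underlying observation that a declared foreign key already contains the information needed to write the join condition. The helper name fk_join_clause and the customers/orders tables are hypothetical, and the helper assumes simple, trusted table names and a foreign key that names its target column explicitly.

```python
import sqlite3


def fk_join_clause(conn, child, parent):
    """Build a JOIN clause from a foreign key declared on `child` that references `parent`.

    Hypothetical helper for illustration only; table names are assumed to be
    simple, trusted identifiers.
    """
    # Each row: (id, seq, table, from, to, on_update, on_delete, match)
    fks = conn.execute(f"PRAGMA foreign_key_list({child})").fetchall()
    matching = sorted((r for r in fks if r[2] == parent), key=lambda r: (r[0], r[1]))
    if not matching:
        raise ValueError(f"{child} declares no foreign key to {parent}")
    # Use the first matching constraint; composite keys contribute one condition per column.
    first_id = matching[0][0]
    conditions = [
        f"{child}.{r[3]} = {parent}.{r[4]}" for r in matching if r[0] == first_id
    ]
    return f"JOIN {parent} ON " + " AND ".join(conditions)


# Hypothetical schema and data.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(customer_id),
        amount REAL
    );
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO orders VALUES (10, 1, 25.0), (11, 1, 5.0), (12, 2, 40.0);
    """
)

# The join condition comes from the schema, not from the query author.
join = fk_join_clause(conn, "orders", "customers")
totals = conn.execute(
    f"SELECT customers.name, SUM(orders.amount) FROM orders {join} "
    "GROUP BY customers.name ORDER BY customers.name"
).fetchall()
```

The query author names only the two tables; the join columns are read from the catalog, which is the kind of convenience the abstract describes as missing from standard SQL.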
Unifying query languages is key in reducing toil for app developers and end users to query and analyze observability data. A common query language can leverage all observability data, such as metrics, traces, profiles, events, and logs, to facilitate correlation, support trend analytics, and provide end-to-end observability for AI applications. The Observability TAG QLS workgroup is finalizing a semantic query language spec in 2025 and is recommending SQL as a basis with further experimentation on syntaxes. This talk will explore the design principles, user research and challenges of creating a query language to support observability goals. It will delve into the core concepts, syntax, and semantics of SQL operators and its needed syntactic sugar, while addressing the unique requirements of observability data. It will also explore the trade-offs between simplicity, expressiveness, and performance. This query language convergence for end-to-end analytics could enhance reliability and operational efficiency for SREs and your app developers. A win-win for all.
The integration of vector search into databases, driven by advancements in embedding models, semantic search, and Retrieval-Augmented Generation (RAG), enables powerful combined querying of structured and unstructured data. This paper focuses on filtered vector search (FVS), a core operation where relational predicates restrict the dataset before or during the vector similarity search (top-k). While approximate nearest neighbor (ANN) indices are commonly used to accelerate vector search by trading recall for lower latency, the addition of filters complicates performance optimization and makes achieving stable, declarative recall guarantees challenging. Filters alter the effective dataset size and distribution, impacting the search effort required. We discuss the primary FVS execution strategies – pre-filtering, post-filtering, and inline-filtering – whose efficiencies depend on factors like filter selectivity, cardinality, and data correlation. We review existing approaches that modify index structures and search algorithms (e.g., iterative post-filtering, filter-aware index traversal) to enhance FVS performance. This tutorial provides a comprehensive overview of filtered vector search, discussing its use cases, classifying current solutions and their trade-offs, and highlighting crucial research challenges and future directions for developing efficient and accurate FVS systems.
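The difference between the pre-filtering and post-filtering strategies named above can be sketched in a few lines of Python. This is an illustrative toy, not the method of any system discussed in the tutorial: an exact brute-force scan stands in for the ANN index, and the record layout, function names, and the fetch parameter are assumptions made for the example.

```python
import math


def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pre_filter_top_k(rows, query, predicate, k):
    """Apply the relational predicate first, then rank only the survivors."""
    candidates = [r for r in rows if predicate(r)]
    return sorted(candidates, key=lambda r: l2(r["vec"], query))[:k]


def post_filter_top_k(rows, query, predicate, k, fetch):
    """Take the `fetch` nearest rows first, then drop those failing the predicate.

    May return fewer than k rows when the filter is selective.
    """
    nearest = sorted(rows, key=lambda r: l2(r["vec"], query))[:fetch]
    return [r for r in nearest if predicate(r)][:k]


def iterative_post_filter_top_k(rows, query, predicate, k, fetch):
    """Retry post-filtering with a doubled fetch size until k rows pass or data runs out."""
    fetch = max(fetch, 1)
    while True:
        hits = post_filter_top_k(rows, query, predicate, k, fetch)
        if len(hits) >= k or fetch >= len(rows):
            return hits
        fetch *= 2


# Hypothetical records: an id, one filterable attribute, and an embedding.
ROWS = [
    {"id": 0, "category": "a", "vec": [0.0, 0.0]},
    {"id": 1, "category": "b", "vec": [0.1, 0.0]},
    {"id": 2, "category": "b", "vec": [0.2, 0.0]},
    {"id": 3, "category": "a", "vec": [0.3, 0.0]},
    {"id": 4, "category": "b", "vec": [0.4, 0.0]},
    {"id": 5, "category": "a", "vec": [0.5, 0.0]},
]


def is_a(row):
    return row["category"] == "a"
```

With a query at the origin and k = 2, pre-filtering returns ids 0 and 3, while post-filtering with fetch = 2 returns only id 0, because one of the two nearest rows fails the filter. The iterative variant widens the fetch to 4 and recovers both rows, which illustrates why search effort depends on filter selectivity.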
Traditionally, quality management relies on siloed systems of record such as quality management system (QMS), application lifecycle management (ALM), and manufacturing execution system (MES) platforms. These systems are often static, passive repositories that require significant manual effort to connect disparate data and derive actionable insights. Fragmentation and lack of proactive intelligence can lead to delays in identifying quality issues, ensuring compliance, and accelerating innovation. This disclosure describes a quality management framework to provide collaboration between human experts and specialized artificial intelligence (AI) agents for proactive and semi-autonomous quality management. The framework provides a distributed, intelligent ecosystem where a central AI engine can delegate specific, complex quality workflows to specialized AI agents that operate continually and autonomously, with a human-in-the-loop for final approval. The framework is built on a three-layer architecture that can be powered by a cloud computing platform.
SQL Has Problems. We Can Fix Them: Pipe Syntax In SQL
Shannon Bales
Matthew Brown
Jean-Daniel Browne
Brandon Dolphin
Romit Kudtarkar
Andrey Litvinov
Jingchi Ma
John Morcos
Michael Shen
David Wilhite
Xi Wu
Lulan Yu
Proc. VLDB Endow. (2024), pp. 4051-4063 (to appear)
SQL has been extremely successful as the de facto standard language for working with data. Virtually all mainstream database-like systems use SQL as their primary query language. But SQL is an old language with significant design problems, making it difficult to learn, difficult to use, and difficult to extend. Many have observed these challenges with SQL, and proposed solutions involving new languages. New language adoption is a significant obstacle for users, and none of the potential replacements have been successful enough to displace SQL. In GoogleSQL, we’ve taken a different approach - solving SQL’s problems by extending SQL. Inspired by a pattern that works well in other modern data languages, we added piped data flow syntax to SQL. The results are transformative - SQL becomes a flexible language that’s easier to learn, use and extend, while still leveraging the existing SQL ecosystem and existing userbase. Improving SQL from within allows incrementally adopting new features, without migrations and without learning a new language, making this a more productive approach to improve on standard SQL.
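The general piped data flow pattern that the abstract refers to can be illustrated with a small Python sketch: each operator takes a table and returns a table, and operators are applied in the order they are written. This is not GoogleSQL pipe syntax and makes no claim about how it is implemented; the operator names (where, sum_by, order_by), the pipe helper, and the sample rows are all hypothetical.

```python
from functools import reduce


def pipe(rows, *steps):
    """Feed `rows` through each step in order; every step maps a table to a table."""
    return reduce(lambda table, step: step(table), steps, rows)


def where(predicate):
    """Keep only rows that satisfy the predicate."""
    return lambda rows: [r for r in rows if predicate(r)]


def sum_by(key, value):
    """Group rows by `key` and sum `value` within each group."""
    def step(rows):
        totals = {}
        for r in rows:
            totals[r[key]] = totals.get(r[key], 0) + r[value]
        return [{key: k, value: v} for k, v in totals.items()]
    return step


def order_by(key, descending=False):
    """Sort rows by `key`."""
    return lambda rows: sorted(rows, key=lambda r: r[key], reverse=descending)


# Hypothetical input table.
ORDERS = [
    {"region": "east", "amount": 10},
    {"region": "west", "amount": 5},
    {"region": "east", "amount": 7},
    {"region": "west", "amount": 1},
]

# Steps read top to bottom in the order they execute: filter, aggregate, sort.
result = pipe(
    ORDERS,
    where(lambda r: r["amount"] > 2),
    sum_by("region", "amount"),
    order_by("amount", descending=True),
)
```

Because every step has the same shape, a new operator can be added or an existing one reordered without restructuring the rest of the query, which is the flexibility the abstract attributes to the piped style.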
Vortex is an exabyte scale structured storage system built for streaming and batch analytics. It supports high-throughput batch and stream ingestion. For the user, it supports both batch-oriented and stream-based processing on the ingested data.