Publications

Our teams aspire to make discoveries that impact everyone, and core to our approach is sharing our research and tools to fuel progress in the field.

    Table-based reasoning with large language models (LLMs) is a promising direction to tackle many table understanding tasks, such as table-based question answering and fact verification. Compared with generic reasoning, table-based reasoning requires the extraction of underlying semantics from both free-form questions and semi-structured tabular data. Chain-of-Thought and similar approaches incorporate the reasoning chain in the form of textual context, but it is still an open question how to effectively leverage tabular data in the reasoning chain. We propose the Chain-of-Table framework, where tabular data is explicitly used in the reasoning chain as a proxy for intermediate thoughts. Specifically, we guide LLMs using in-context learning to iteratively generate operations and update the table to represent a tabular reasoning chain. LLMs can therefore dynamically plan the next operation based on the results of the previous ones. This continuous evolution of the table forms a chain, showing the reasoning process for a given tabular problem. The chain carries structured information about the intermediate results, enabling more accurate and reliable predictions. Chain-of-Table achieves new state-of-the-art performance on the WikiTQ, FeTaQA, and TabFact benchmarks across multiple LLM choices.
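    As a reading aid, here is a minimal sketch of the iterative loop the abstract describes, assuming a hypothetical llm() call and an illustrative operation set; it is not the paper's released code.

        # Minimal sketch of the Chain-of-Table loop described above. The llm()
        # call and the operation set are hypothetical stand-ins, not the
        # paper's actual prompt format or released code.
        def llm(prompt: str) -> str:
            """Hypothetical LLM call; returns the model's text completion."""
            raise NotImplementedError

        def parse_operation(text: str):
            """Parse 'op_name: arg1, arg2' into (name, [args]); format is illustrative."""
            name, _, rest = text.partition(":")
            return name.strip(), [a.strip() for a in rest.split(",") if a.strip()]

        def apply_operation(table, op):
            """Apply one tabular operation to a list-of-dicts table."""
            name, args = op
            if name == "select_columns":
                return [{k: row[k] for k in args} for row in table]
            if name == "sort_by":
                return sorted(table, key=lambda row: row[args[0]])
            raise ValueError(f"unknown operation: {name}")

        def chain_of_table(table, question, max_steps=5):
            """The evolving table itself is the reasoning chain."""
            for _ in range(max_steps):
                decision = llm(f"Table: {table}\nQuestion: {question}\n"
                               "Next operation, or FINAL?")
                if decision.startswith("FINAL"):
                    break
                table = apply_operation(table, parse_operation(decision))
            return llm(f"Table: {table}\nQuestion: {question}\nAnswer:")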
    Automatic Histograms: Leveraging Language Models for Text Dataset Exploration
    Extended Abstracts of the CHI Conference on Human Factors in Computing Systems (CHI EA '24), ACM, Honolulu, HI, USA (2024), pp. 9
    Making sense of unstructured text datasets is perennially difficult, yet increasingly relevant with Large Language Models. Data practitioners often rely on dataset summaries, especially distributions of various derived features. Some features, like toxicity or topics, are relevant to many datasets, but many interesting features are domain-specific, e.g., instruments and genres for a music dataset, or diseases and symptoms for a medical dataset. Accordingly, data practitioners often run custom analyses for each dataset, which is cumbersome and difficult, or use unsupervised methods. We present AutoHistograms, a visualization tool leveraging LLMs. AutoHistograms automatically identifies relevant entity-based features, visualizes their distributions, and allows the user to interactively query the dataset for new categories of entities. In a user study with data practitioners (n=10), we observe that participants were able to quickly onboard to AutoHistograms, use the tool to identify actionable insights, and conceptualize a broad range of applicable use cases. We also describe a variety of usage scenarios from different types of users to highlight how this tool can provide value in many different contexts. Finally, we present a quantitative evaluation of the tool. Together, this tool and user study contribute to the growing field of LLM-assisted sensemaking tools.
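    A rough sketch of the pipeline the abstract outlines, assuming a hypothetical llm_extract_entities() call; the tool's real implementation is not shown here.

        # Rough sketch of the entity-histogram idea above. The LLM call is a
        # hypothetical placeholder; Counter does the actual tallying.
        from collections import Counter

        def llm_extract_entities(text: str, category: str) -> list:
            """Hypothetical: ask an LLM for entities of `category` found in `text`."""
            raise NotImplementedError

        def auto_histogram(docs, category):
            """Frequency histogram of one entity category over a text dataset."""
            counts = Counter()
            for doc in docs:
                counts.update(llm_extract_entities(doc, category))
            return counts

        # e.g., auto_histogram(song_reviews, "instrument").most_common(20)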
    BigLake: BigQuery’s Evolution toward a Multi-Cloud Lakehouse
    Garrett Casto
    Mingge Deng
    Rushabh Desai
    Thibaud Hottelier
    Amir Hormati
    Jeff Johnson
    Dawid Kurzyniec
    Prem Ramanathan
    Gaurav Saxena
    Vidya Shanmugam
    Yuri Volobuev
    SIGMOD (2024)
    BigQuery’s cloud-native disaggregated architecture has allowed Google Cloud to evolve the system to meet several customer needs across the analytics and AI/ML workload spectrum. A key customer requirement for BigQuery centers on the unification of data lake and enterprise data warehousing workloads. This approach combines (1) core data management primitives, e.g., security, governance, common runtime metadata, performance acceleration, and ACID transactions, as provided by enterprise data warehouses, with (2) the flexibility of the open-source format and analytics ecosystem, along with new workload types such as AI/ML over unstructured data on object storage. In addition, there is a strong requirement to support BigQuery as a multi-cloud offering, given that cloud customers are opting for a multi-cloud footprint by default. This paper describes BigLake, an evolution of BigQuery toward a multi-cloud lakehouse that addresses these customer requirements in novel ways. We describe three main innovations in this space. We first present BigLake tables, which make open-source table formats (e.g., Apache Parquet, Iceberg) first-class citizens, providing fine-grained governance enforcement and performance acceleration over these formats to BigQuery and other open-source analytics engines. Next, we cover the design and implementation of BigLake Object tables, which allow BigQuery to integrate AI/ML for inferencing and processing over unstructured data. Finally, we present Omni, a platform for deploying BigQuery on non-GCP clouds, focusing on the infrastructure and operational innovations we made to provide an enterprise lakehouse product regardless of the cloud provider hosting the data.
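    For flavor, a hedged sketch of the external-table pattern the paper describes, using the google-cloud-bigquery Python client; the project, dataset, connection, and bucket names are placeholders, and the exact DDL options should be checked against the product documentation.

        # Hedged sketch: define and query an external table over Parquet files
        # on object storage. All resource names below are hypothetical.
        from google.cloud import bigquery

        client = bigquery.Client(project="my-project")

        ddl = """
        CREATE EXTERNAL TABLE my_dataset.events
        WITH CONNECTION `my-project.us.my-connection`
        OPTIONS (format = 'PARQUET',
                 uris = ['gs://my-bucket/events/*.parquet'])
        """
        client.query(ddl).result()   # governance is enforced server-side

        rows = client.query("SELECT COUNT(*) AS n FROM my_dataset.events").result()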
    Vortex is an exabyte-scale structured storage system built for streaming and batch analytics. It supports high-throughput batch and stream ingestion and offers the user both batch-oriented and stream-based processing of the ingested data.
    With the rise of open data in the last two decades, more datasets are online and more people are using them for projects and research. But how do people find datasets? We present the first user study of Google Dataset Search, a dataset-discovery tool that uses a web crawl and an open ecosystem to find datasets. Google Dataset Search contains a superset of the datasets in other dataset-discovery tools: a total of 45 million datasets from 13,000 sources. We found that the tool addresses a previously identified need: a search engine for datasets across the entire web, including datasets in other tools. However, the tool introduced new challenges due to its open approach: building a mental model of the tool, making sense of heterogeneous datasets, and learning how to search for datasets. We discuss recommendations for dataset-discovery tools and open research questions.
    Progressive Partitioning for Parallelized Query Execution in Google’s Napa
    Junichi Tatemura
    Yanlai Huang
    Jim Chen
    Yupu Zhang
    Kevin Lai
    Hao Zhang
    Goetz Graefe
    Divyakant Agrawal
    Brad Adelberg
    Shilpa Kolhar
    49th International Conference on Very Large Data Bases, VLDB (2023), pp. 3475-3487
    Napa powers Google's critical data warehouse needs. It utilizes Log-Structured Merge Trees (LSM) for real-time data ingestion and achieves sub-second query latency for billions of queries per day. Napa handles a wide variety of query workloads, from full-table scans to range scans and multi-key lookups, and our design challenge is to handle this diverse query workload running concurrently. In particular, a large percentage of our query volume consists of external reporting queries characterized by multi-key lookups with strict sub-second latency targets. Query parallelization, achieved by processing a query in parallel over partitions of the input data (i.e., the SIMD model of computation), is an important technique for meeting these low latency targets. Traditionally, the effectiveness of parallelizing a query depends heavily on alignment with the data partitioning established at write time. Unfortunately, such a write-time partitioning scheme cannot handle the highly variable parallelization requirements that arise on a per-query basis. The key to Napa's success is its ability to adapt its query parallelization on a per-query basis. This paper describes an index-based approach to partitioning data for queries with sub-second latency requirements. Napa's approach is progressive in that it provides good partitioning within whatever time is budgeted for partitioning. Since the end-to-end query time includes the time spent partitioning, there is a tradeoff between partitioning time and the evenness of the resulting partitions; our approach balances these opposing considerations to provide sub-second querying for billions of queries each day. We use production data to establish the effectiveness of Napa's approach, from easy workloads to the most pathological conditions.
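    To make the time/evenness tradeoff concrete, here is a toy sketch of budgeted, index-based split refinement; it illustrates the progressive idea only and is not Napa's actual algorithm.

        # Toy illustration: refine split points from a sorted key index until
        # a time budget expires, trading partitioning time for evenness.
        import time
        from bisect import bisect_left

        def progressive_partition(sorted_keys, num_parts, budget_s):
            deadline = time.monotonic() + budget_s
            lo, hi = sorted_keys[0], sorted_keys[-1]
            # Coarse first cut: equal-width splits over the key range.
            splits = [lo + (hi - lo) * i / num_parts for i in range(1, num_parts)]
            target = len(sorted_keys) / num_parts
            while time.monotonic() < deadline:
                new_splits = []
                for i, s in enumerate(splits):
                    rank = bisect_left(sorted_keys, s)    # index lookup, no scan
                    ideal = int(round(target * (i + 1)))  # rank for even parts
                    step = (rank + ideal) // 2            # move halfway there
                    new_splits.append(sorted_keys[min(max(step, 0),
                                                      len(sorted_keys) - 1)])
                if new_splits == splits:
                    break          # converged before the budget ran out
                splits = new_splits
            return splits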
    Building on the simplicity and power of declarative queries combined with strongly consistent transactional semantics has allowed the Spanner database to scale to many thousands of machines running an aggregate of over 2 billion queries per second on over 8 exabytes of data. This includes some of the largest applications in the world, serving well over a billion users each. The appetite for database storage continues to grow, potentially reaching zettabyte scale (1 billion terabytes) by 2030. However, the end of Moore and Dennard scaling means that the cost of the infrastructure to run those databases could grow much faster than it has in the past. In this talk I will give my perspective on the challenges to reaching zettabyte scale, and the hardware technologies and approaches most (and least) likely to be successful.
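    For scale, the unit arithmetic behind the abstract's figures:

        1\,\text{ZB} = 10^{21}\,\text{B} = 10^{9}\,\text{TB} = 1000\,\text{EB},
        \qquad \frac{1\,\text{ZB}}{8\,\text{EB}} = 125\times\ \text{today's footprint}.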
    Data Commons
    Prashanth Radhakrishnan
    Bo Xu
    Carolyn Au
    Wei Sun
    Jehangir Amjad
    Ajai Tirumali
    Jennifer Chen
    Julia Wu
    Natalie Diaz
    Samantha Piekos
    Prem Ramaswami
    James Manyika
    (2023)
    Publicly available data from open sources (e.g., Census [1], BLS [2], WHO [3], IPCC [4]) are vital resources for policy makers, students, and researchers across different disciplines. Combining data from different sources requires the user to reconcile differences in schemas, formats, assumptions, and more. This data wrangling is time-consuming, tedious, and must be repeated by every user of the data. Our goal with Data Commons is to address this problem by doing this work once and making the processed data widely available via standard schemas and Cloud APIs. Data Commons is a distributed network of sites that publish data in a common schema and interoperate using the Data Commons APIs. Data from different Data Commons can be ‘joined’ easily. The aggregate of these Data Commons can be viewed as a single Knowledge Graph. This paper describes the architecture of Data Commons, some of the major deployments, and highlights directions for future work.
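    A hedged sketch of calling the public Data Commons REST API from Python; the endpoint path and parameter names are assumptions for illustration, so check the live API documentation for the exact surface.

        # Hedged sketch: fetch one statistic from the public Data Commons
        # instance. Endpoint path and parameters are assumed for illustration.
        import requests

        BASE = "https://api.datacommons.org"

        def get_stat(place_dcid: str, stat_var: str):
            """One statistical value for a place in the knowledge graph."""
            resp = requests.get(f"{BASE}/stat/value",      # assumed endpoint
                                params={"place": place_dcid, "stat_var": stat_var},
                                timeout=10)
            resp.raise_for_status()
            return resp.json()["value"]

        # e.g., get_stat("geoId/06", "Count_Person")  # population of California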
    Firestore: The NoSQL Serverless Database for the Application Developer
    Ram Kesavan
    David Gay
    Daniel Thevessen
    Jimit Shah
    C. Mohan
    2023 IEEE 39th International Conference on Data Engineering (ICDE), pp. 3367-3379
    Recent years have seen explosive growth in web and mobile application development. Such applications typically have rapid development cycles and demand mobile-friendly features and serverless characteristics such as rapid deployment (with minimal provisioning), scalability to handle workload spikes, and convenient pay-as-you-go billing. Google’s Firestore is a NoSQL serverless database with real-time notification capability, and together with the Firebase ecosystem it greatly simplifies common app development challenges while letting the application developer focus primarily on their business logic and user experience. This paper presents the Firestore architecture, how it satisfies the aforementioned requirements, and how its real-time notification system works in tandem with Firebase client libraries to let mobile applications provide a smooth user experience even across network connectivity issues.
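    A minimal sketch of the real-time notification path using the google-cloud-firestore Python client; the collection and field names are illustrative.

        # Minimal sketch: listen for server-pushed changes to one document.
        from google.cloud import firestore

        db = firestore.Client()
        doc_ref = db.collection("rooms").document("lobby")

        def on_change(doc_snapshots, changes, read_time):
            # Called by the client library whenever the watched document changes.
            for snap in doc_snapshots:
                print(snap.id, "->", snap.to_dict())

        watch = doc_ref.on_snapshot(on_change)          # register the listener
        doc_ref.set({"topic": "launch", "users": 3})    # triggers on_change
        # later: watch.unsubscribe()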
    In-path Oracles for Road Networks
    Debajyoti Ghosh
    Kiran Khatter
    Hanan Samet
    International Journal of Geo-Information, 12(7) (2023), 277
    Many spatial applications benefit from the fast answering of a seemingly simple spatial query: "Is a point of interest (POI) 'in-path' to the shortest path between a source and a destination?" In-path here refers to POIs that are either on the shortest path or reachable within a bounded yet small detour from it. Fast answering of in-path queries is contingent on determining whether a POI is in-path without computing shortest paths at run-time, which calls for a precomputation solution. The key technical contribution is an in-path oracle, based on precomputation, that records the pairs of sources and destinations that are in-path with respect to a given POI location. For a given road network with $n$ nodes and $m$ POIs, an $O(m \times n)$-sized oracle is envisioned, based on a reduction from the well-separated pair decomposition of the road network. Furthermore, the oracle can be indexed in a database using a B-tree, and hundreds of thousands of in-path queries per second can be answered. Experimental results on a real road-network POI dataset showcase the superiority of this technique over a suitable baseline: the proposed approach answers 1.5 million in-path queries per second, compared to a few hundred per second with existing approaches.
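    A toy sketch of the oracle contract (precompute in-path pairs, answer queries by lookup), assuming a connected networkx graph with "length" edge weights; the paper's actual construction uses well-separated pair decomposition and a B-tree index, which this sketch omits.

        # Toy sketch of the in-path oracle contract: precompute the (source,
        # destination) pairs for which `poi` is in-path, then answer queries
        # by lookup instead of run-time shortest-path computation.
        import networkx as nx

        def build_oracle(graph, poi, detour=1.1):
            dist = dict(nx.all_pairs_dijkstra_path_length(graph, weight="length"))
            oracle = set()
            for s in graph:
                for d in graph:
                    if s != d and dist[s][poi] + dist[poi][d] <= detour * dist[s][d]:
                        oracle.add((s, d))
            return oracle

        def in_path(oracle, s, d):
            return (s, d) in oracle   # O(1) per query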
    Detection and Prevention of Silent Data Corruption in an Exabyte-scale Database System
    The 18th IEEE Workshop on Silicon Errors in Logic – System Effects, IEEE (2022)
    Google’s Spanner database serves multiple exabytes of data at well over a billion queries per second, distributed over a significant fraction of Google’s fleet. Silent data corruption events due to hardware error are detected/prevented by Spanner several times per week. For every detected error there are some number of undetected errors that in rare (but not black swan) events cause corruption, either transiently for reads or durably for writes, potentially violating the most fundamental contract that a database system makes with its users: to store and retrieve data with absolute reliability and availability. We describe the work we have done to detect and prevent silent data corruptions and (equally importantly) to remove faulty machines from the fleet, both manually and automatically. We present a simplified analytic model of corruption that provides some insights into the most effective ways to prevent end-user corruption events. We have made qualitative gains in detection and prevention of SDC events, but quantitative analysis remains difficult. We discuss various potential trajectories in hardware (un)reliability and how they will affect our ability to build reliable database systems on commodity hardware.
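    The general detect-on-read pattern the abstract alludes to, as a minimal sketch; this illustrates the idea only and is not Spanner's actual mechanism.

        # Minimal sketch of detect-on-read: store a checksum with each block
        # and verify it on every read. Not Spanner's actual mechanism.
        import zlib

        def write_block(store: dict, key: str, payload: bytes) -> None:
            store[key] = (payload, zlib.crc32(payload))

        def read_block(store: dict, key: str) -> bytes:
            payload, checksum = store[key]
            if zlib.crc32(payload) != checksum:
                # A mismatch is a silent-corruption signal: fail loudly and
                # re-read from a replica rather than return bad data.
                raise IOError(f"checksum mismatch on block {key!r}")
            return payload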
    The Open Reaction Database
    Abigail G. Doyle
    Connor W. Coley
    Joel M. Hawkins
    Klavs F. Jensen
    Michael R. Maser
    Michael Wleklinski
    Spencer D. Dreher
    (2021)
    Chemical reaction data in journal articles, patents, and even electronic laboratory notebooks are currently stored in various formats, often unstructured, which presents a significant barrier to downstream applications, including the training of machine learning models. We present the Open Reaction Database (ORD), an open access schema and infrastructure for structuring and sharing organic reaction data, including a centralized data repository. The ORD schema supports conventional and emerging technologies, from benchtop reactions to automated high-throughput experiments and flow chemistry. The data, schema, supporting code, and web-based user interfaces are all publicly available on GitHub. Our vision is that a consistent data representation and infrastructure to support data sharing will enable downstream applications that will greatly improve the state of the art with respect to computer-aided synthesis planning, reaction prediction, and other predictive chemistry tasks.
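    A loose, hypothetical rendering of the kind of structured record such a schema enables; the real schema is protocol-buffer based and lives in the ORD GitHub repositories, and every field name and value below is invented for illustration.

        # Hypothetical illustration only: the real ORD schema is protobuf-based.
        reaction = {
            "inputs": {
                "aryl_halide": {"smiles": "BrC1=CC=CC=C1", "amount_mmol": 1.0},
                "amine": {"smiles": "NC1CC1", "amount_mmol": 1.2},
            },
            "conditions": {"temperature_c": 80, "stirring": True},
            "outcomes": [{"product_smiles": "C1CC1NC1=CC=CC=C1",
                          "yield_percent": 87}],   # invented example values
        }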
    The Covid Tracking Project was the most reliable source for COVID-19 data with race/ethnicity at the state level until it stopped collecting data on March 7, 2021. The CDC's Case Surveillance Restricted Access and Public Use with Geography datasets are the only available replacements for the Covid Tracking Project's dataset, and they additionally include county-level data and age along with race/ethnicity. This paper evaluates the completeness of the CDC datasets at the state and county levels in terms of (1) the total number of cases included compared to the New York Times, and (2) the number of cases included with race/ethnicity data compared to the Covid Tracking Project. The CDC's Restricted Access dataset contains 78% of the cases in the New York Times up to April 15, 2021, and 65% of cases have race/ethnicity information vs. 67% in the Covid Tracking Project. The dataset's completeness has steadily and gradually improved over time; e.g., the first available version from May 2020 had race/ethnicity information for only 43% of cases. At the state and county levels, the dataset's completeness has also improved with a state-level average of 62% of cases with race/ethnicity in April 2021 vs. 46% in June 2020. However, the dataset's completeness at the state level is highly variable; for example, Minnesota has 102% of the cases included in the New York Times, while Louisiana has only 4% of the cases in the New York Times. Minnesota has 91% of cases with race/ethnicity, while Louisiana has only 19% with race/ethnicity (vs. 94% in the Covid Tracking Project). Texas alone is missing 2.8M cases, accounting for more than a third of the total 7.1M missing cases. New York is missing race/ethnicity for 1.3M cases and California for 1.1M cases, accounting for more than a quarter of the 8.6M cases missing race/ethnicity when combined. The CDC's Public Use with Geography dataset is similar to the Restricted Access dataset for total case counts, but is less complete due to more privacy suppression; e.g., only 49% of cases have race/ethnicity information.
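    A quick check of the abstract's own arithmetic on the missing-case counts:

        # Reproduce the abstract's fractions from its own figures.
        missing_total = 7.1e6            # cases missing vs. the New York Times
        texas_missing = 2.8e6
        print(texas_missing / missing_total)   # ~0.39, "more than a third"

        missing_race = 8.6e6             # cases missing race/ethnicity
        ny_plus_ca = 1.3e6 + 1.1e6
        print(ny_plus_ca / missing_race)       # ~0.28, "more than a quarter"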
    Validating Data and Models in Continuous ML pipelines
    Evan Rosen
    Gene Huang
    Mike Dreves
    Neoklis Polyzotis
    Zhuo Peng
    IEEE TCDE (2021)
    Production ML is more than writing the code for the trainer. It requires processes and tooling that enable a larger team to share, track, analyze, and monitor not only the code for ML but also the artifacts (datasets, models, ...) that are manipulated and generated in these production ML pipelines. In this paper we describe the tools we developed at Google for the analysis and validation of two of the most important types of artifacts: datasets and models. These tools are currently deployed in production at Google and other large organizations. Our approach is heavily inspired by well-known principles of data-management systems. Ultimately, we want to enable users to trust their data and models, and to understand how data properties affect the quality of the generated ML models.
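    A hedged sketch of dataset validation in the spirit of the paper, using the publicly released TensorFlow Data Validation library; treating that library as representative of the paper's tooling is an assumption.

        # Hedged sketch: infer a schema from training data, then validate new
        # data against it.
        import pandas as pd
        import tensorflow_data_validation as tfdv

        train_df = pd.DataFrame({"age": [25, 32, 41], "country": ["US", "DE", "IN"]})
        serve_df = pd.DataFrame({"age": [29, -7, 35], "country": ["US", "BR", "??"]})

        train_stats = tfdv.generate_statistics_from_dataframe(train_df)
        schema = tfdv.infer_schema(train_stats)    # expected types and domains

        serve_stats = tfdv.generate_statistics_from_dataframe(serve_df)
        anomalies = tfdv.validate_statistics(serve_stats, schema)
        tfdv.display_anomalies(anomalies)          # surfaces unexpected values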