
Publications

Our teams aspire to make discoveries that impact everyone, and core to our approach is sharing our research and tools to fuel progress in the field.



Showing 1–15 of 167 publications
    Table-based reasoning with large language models (LLMs) is a promising direction to tackle many table understanding tasks, such as table-based question answering and fact verification. Compared with generic reasoning, table-based reasoning requires the extraction of underlying semantics from both free-form questions and semi-structured tabular data. Chain-of-Thought and similar approaches incorporate the reasoning chain in the form of textual context, but it is still an open question how to effectively leverage tabular data in the reasoning chain. We propose the Chain-of-Table framework, where tabular data is explicitly used in the reasoning chain as a proxy for intermediate thoughts. Specifically, we guide LLMs using in-context learning to iteratively generate operations and update the table to represent a tabular reasoning chain. LLMs can therefore dynamically plan the next operation based on the results of the previous ones. This continuous evolution of the table forms a chain, showing the reasoning process for a given tabular problem. The chain carries structured information about the intermediate results, enabling more accurate and reliable predictions. Chain-of-Table achieves new state-of-the-art performance on the WikiTQ, FeTaQA, and TabFact benchmarks across multiple LLM choices.
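    To make the loop concrete, here is a minimal Python sketch of a Chain-of-Table style iteration as the abstract describes it: the LLM repeatedly picks a table operation, the table is updated, and the evolving table is fed back until the model answers. The operation set, prompt wording, and the call_llm / apply_operation helpers are illustrative assumptions, not the paper's actual prompts or code.

```python
# A minimal sketch of a Chain-of-Table style loop (hypothetical helpers,
# not the paper's implementation).
import json
import pandas as pd

def call_llm(prompt: str) -> str:
    """Placeholder for any chat/completion API call."""
    raise NotImplementedError("plug in an LLM client here")

def apply_operation(table: pd.DataFrame, op: str, args: dict) -> pd.DataFrame:
    """Apply one atomic table operation, producing the next table in the chain."""
    if op == "select_columns":
        return table[args["columns"]]
    if op == "select_rows":
        return table.query(args["condition"])
    if op == "sort_by":
        return table.sort_values(args["column"], ascending=args.get("ascending", True))
    raise ValueError(f"unknown operation: {op}")

def chain_of_table(table: pd.DataFrame, question: str, max_steps: int = 5) -> str:
    history = []
    for _ in range(max_steps):
        prompt = (
            f"Question: {question}\n"
            f"Current table:\n{table.to_csv(index=False)}\n"
            f"Operations so far: {history}\n"
            'Reply with JSON like {"op": "select_rows", "args": {...}} '
            'or {"op": "answer", "args": {"text": "..."}}.'
        )
        step = json.loads(call_llm(prompt))        # LLM plans the next operation
        if step["op"] == "answer":
            return step["args"]["text"]
        table = apply_operation(table, step["op"], step["args"])  # evolve the table
        history.append(step["op"])
    return call_llm(f"Answer the question from this table:\n{table.to_csv(index=False)}")
```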
    Automatic Histograms: Leveraging Language Models for Text Dataset Exploration
    Extended Abstracts of the CHI Conference on Human Factors in Computing Systems (CHI EA '24), ACM, Honolulu, HI, USA (2024), pp. 9
    Making sense of unstructured text datasets is perennially difficult, yet increasingly relevant with the rise of large language models. Data practitioners often rely on dataset summaries, especially distributions of various derived features. Some features, like toxicity or topics, are relevant to many datasets, but many interesting features are domain specific, e.g., instruments and genres for a music dataset, or diseases and symptoms for a medical dataset. Accordingly, data practitioners often run custom analyses for each dataset, which is cumbersome and difficult, or use unsupervised methods. We present AutoHistograms, a visualization tool that leverages LLMs. AutoHistograms automatically identifies relevant entity-based features, visualizes their distributions, and allows the user to interactively query the dataset for new categories of entities. In a user study with data practitioners (n=10), we observe that participants were able to quickly onboard to AutoHistograms, use the tool to identify actionable insights, and conceptualize a broad range of applicable use cases. We also describe a variety of usage scenarios from different types of users to highlight how this app can provide value in many different contexts. Finally, we present a quantitative evaluation of the tool. Together, this tool and user study contribute to the growing field of LLM-assisted sensemaking tools.
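    As a rough illustration of the workflow the abstract describes, the sketch below tags each document with LLM-extracted entities for a user-chosen category and histograms the results; the prompt and extract_entities helper are hypothetical, not the tool's implementation.

```python
# A rough sketch of the AutoHistograms idea: tag documents with entities for a
# chosen category, then histogram the tags across the dataset.
from collections import Counter

def extract_entities(text: str, category: str) -> list[str]:
    """Placeholder: prompt an LLM, e.g.
    f"List every {category} mentioned in the text below, one per line:\n{text}"
    and split its reply into entity strings."""
    raise NotImplementedError

def auto_histogram(documents: list[str], category: str) -> Counter:
    """Distribution of a user-chosen entity category (e.g. 'instrument',
    'disease') over an unstructured text dataset."""
    counts = Counter()
    for doc in documents:
        counts.update(set(extract_entities(doc, category)))  # count once per doc
    return counts

# Usage: auto_histogram(song_descriptions, "instrument").most_common(20)
```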
    Progressive Partitioning for Parallelized Query Execution in Google’s Napa
    Junichi Tatemura
    Yanlai Huang
    Jim Chen
    Yupu Zhang
    Kevin Lai
    Divyakant Agrawal
    Brad Adelberg
    Shilpa Kolhar
    49th International Conference on Very Large Data Bases, VLDB (2023), pp. 3475-3487
    Napa powers Google's critical data warehouse needs. It uses a Log-Structured Merge (LSM) tree for real-time data ingestion and achieves sub-second query latency for billions of queries per day. Napa handles a wide variety of query workloads, from full-table scans to range scans and multi-key lookups. Our design challenge is to handle this diverse query workload running concurrently. In particular, a large percentage of our query volume consists of external reporting queries characterized by multi-key lookups with strict sub-second latency targets. Query parallelization, achieved by processing a query in parallel over partitions of the input data (i.e., the SIMD model of computation), is an important technique for meeting these low latency targets. Traditionally, the effectiveness of parallelizing a query is highly dependent on its alignment with the data partitioning established at write time. Unfortunately, such a write-time partitioning scheme cannot handle the highly variable parallelization requirements that arise on a per-query basis. The key to Napa's success is its ability to adapt its query parallelization on a per-query basis. This paper describes an index-based approach to partitioning data for queries with sub-second latency requirements. Napa's approach is progressive in that it provides good partitioning within the time budgeted for partitioning. Since the end-to-end query time also includes the time spent partitioning, there is a tradeoff between the time spent partitioning and the evenness of the resulting partitions. Our approach balances these opposing considerations to provide sub-second querying for billions of queries each day. We use production data to establish the effectiveness of Napa's approach across workloads ranging from easy to handle to the most pathological.
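    A toy sketch of budget-bounded, progressively refined partitioning in the spirit of the abstract (not Napa's actual index-based algorithm): a larger time budget yields more, and more even, partitions.

```python
# Toy progressive partitioning: keep splitting the largest partition at its
# median key until the time budget runs out; each resulting partition can then
# be scanned by one worker in parallel.
import time

def progressive_partition(sorted_keys: list[int], budget_s: float, max_parts: int = 64):
    parts = [sorted_keys]                          # start with a single partition
    deadline = time.monotonic() + budget_s
    while len(parts) < max_parts and time.monotonic() < deadline:
        largest = max(parts, key=len)
        if len(largest) < 2:
            break
        mid = len(largest) // 2                    # split at the median key
        parts.remove(largest)
        parts.extend([largest[:mid], largest[mid:]])
    return parts

# Usage: parts = progressive_partition(list(range(1_000_000)), budget_s=0.005)
```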
    Building on the simplicity and power of declarative queries combined with strongly consistent transactional semantics has allowed the Spanner database to scale to many thousands of machines running an aggregate of over 2 billion queries per second on over 8 exabytes of data. This includes some of the largest applications in the world, serving well over a billion users each. The appetite for database storage continues to grow, potentially reaching zettabyte scale (1 billion terabytes) by 2030. However, the end of Moore and Dennard scaling means that the cost of the infrastructure to run those databases could grow much faster than it has in the past. In this talk I will give my perspective on the challenges to reaching zettabyte scale, and the hardware technologies and approaches most (and least) likely to be successful.
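    A quick back-of-the-envelope on the growth these figures imply, assuming 1 zettabyte = 1000 exabytes and taking 8 EB as the 2023 starting point:

```python
# Growth implied by the abstract's figures (assumed: 1 ZB = 1000 EB, 2023 start).
current_eb, target_eb, years = 8, 1000, 2030 - 2023
annual_growth = (target_eb / current_eb) ** (1 / years)   # roughly 2x per year
print(f"required growth: ~{annual_growth:.1f}x per year over {years} years")
```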
    Are we cobblers without shoes? Making Computer Science data FAIR
    Carole Goble
    Communications of the ACM, vol. 66 (1) (2023)
    No abstract (Viewpoint article).
    Data Commons
    Prashanth Radhakrishnan
    Bo Xu
    Carolyn Au
    Wei Sun
    Jehangir Amjad
    Ajai Tirumali
    Jennifer Chen
    Julia Wu
    Natalie Diaz
    Samantha Piekos
    Prem Ramaswami
    James Manyika
    (2023)
    Publicly available data from open sources (e.g., Census [1], BLS [2], WHO [3], IPCC [4]) are vital resources for policymakers, students, and researchers across different disciplines. Combining data from different sources requires the user to reconcile differences in schemas, formats, assumptions, and more. This data wrangling is time-consuming and tedious, and it must be repeated by every user of the data. Our goal with Data Commons is to address this problem by doing the wrangling once and making the processed data widely available via standard schemas and Cloud APIs. Data Commons is a distributed network of sites that publish data in a common schema and interoperate using the Data Commons APIs. Data from different Data Commons can be 'joined' easily. The aggregate of these Data Commons can be viewed as a single Knowledge Graph. This paper describes the architecture of Data Commons, some of the major deployments, and highlights directions for future work.
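    The "join once the schemas agree" idea can be illustrated with a tiny sketch; the place identifiers and column names below are made up, not the actual Data Commons schema or API.

```python
# Two sources keyed on the same place identifier merge trivially once they are
# mapped into a shared schema; the wrangling happens once, upstream.
import pandas as pd

census_population = pd.DataFrame({
    "place_id": ["geo/06", "geo/48"],              # shared entity IDs (illustrative)
    "population": [39_500_000, 29_100_000],
})
bls_unemployment = pd.DataFrame({
    "place_id": ["geo/06", "geo/48"],
    "unemployment_rate": [4.7, 4.0],
})

combined = census_population.merge(bls_unemployment, on="place_id")
print(combined)
```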
    Firestore: The NoSQL Serverless Database for the Application Developer
    Ram Kesavan
    David Gay
    Daniel Thevessen
    Jimit Shah
    C. Mohan
    2023 IEEE 39th International Conference on Data Engineering (ICDE), pp. 3367-3379
    Recent years have seen explosive growth in web and mobile application development. Such applications typically have rapid development cycles and expect mobile-friendly features and serverless characteristics such as rapid deployment (with minimal provisioning), scalability to handle workload spikes, and convenient pay-as-you-go billing. Google's Firestore is a NoSQL serverless database with real-time notification capability, and together with the Firebase ecosystem it greatly simplifies common app development challenges while letting the application developer focus primarily on their business logic and user experience. This paper presents the Firestore architecture, how it satisfies the aforementioned requirements, and how its real-time notification system works in tandem with Firebase client libraries to allow mobile applications to provide a smooth user experience even across network connectivity issues.
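    A small sketch of the write-plus-real-time-listen pattern using the public google-cloud-firestore Python client; the collection, document, and field names are placeholder examples.

```python
# Write a document and register a real-time listener so the app is notified of
# server-side changes without polling (placeholder collection/field names).
from google.cloud import firestore

db = firestore.Client()                            # uses default credentials
doc_ref = db.collection("chat_rooms").document("room42")

def on_change(doc_snapshots, changes, read_time):
    # Invoked by the client library whenever the document changes.
    for snap in doc_snapshots:
        print("latest state:", snap.to_dict())

watch = doc_ref.on_snapshot(on_change)             # real-time notification channel
doc_ref.set({"topic": "launch plans", "members": 3})
# ... later: watch.unsubscribe()
```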
    In-path Oracles for Road Networks
    Debajyoti Ghosh
    Kiran Khatter
    Hanan Samet
    International Journal of Geo-Information, vol. 12(7) (2023), pp. 277
    Many spatial applications benefit from fast answers to a seemingly simple spatial query: is a point of interest (POI) 'in-path' to the shortest path between a source and a destination? In-path here refers to POIs that are either on the shortest path or reachable within a bounded yet small detour from it. Answering in-path queries quickly is contingent on being able to determine whether a POI is in-path without computing shortest paths at run time, which calls for a precomputation-based solution. The key technical contribution is an in-path oracle, built by precomputation, that records the pairs of sources and destinations that are in-path with respect to a given POI location. For a road network with $n$ nodes and $m$ POIs, an $O(m \times n)$-sized oracle is envisioned based on a reduction using the well-separated pair decomposition of the road network. Furthermore, the oracle can be indexed in a database using a B-tree, and hundreds of thousands of in-path queries per second can be answered. Experimental results on a real road-network POI dataset showcase the superiority of this technique compared to a suitable baseline: the proposed approach answers 1.5 million in-path queries per second, compared to a few hundred per second with existing approaches.
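    A brute-force toy version of the oracle idea (not the paper's WSPD-based construction) makes the precompute-then-lookup structure clear; it assumes a small, connected weighted graph.

```python
# Precompute, for each POI, the (source, destination) pairs for which the POI
# lies within a bounded detour of the shortest path; queries become set lookups.
import itertools
import networkx as nx

def build_oracle(G: nx.Graph, pois: list, detour: float) -> dict:
    # Assumes a connected graph with 'weight' edge attributes.
    dist = dict(nx.all_pairs_dijkstra_path_length(G, weight="weight"))
    oracle = {p: set() for p in pois}
    for s, t in itertools.permutations(G.nodes, 2):
        for p in pois:
            # POI is "in-path" if going s -> p -> t adds at most `detour`.
            if dist[s][p] + dist[p][t] <= dist[s][t] + detour:
                oracle[p].add((s, t))
    return oracle

def in_path(oracle: dict, poi, source, dest) -> bool:
    return (source, dest) in oracle[poi]           # no run-time shortest path needed
```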
    Detection and Prevention of Silent Data Corruption in an Exabyte-scale Database System
    The 18th IEEE Workshop on Silicon Errors in Logic – System Effects, IEEE (2022)
    Google's Spanner database serves multiple exabytes of data at well over a billion queries per second, distributed over a significant fraction of Google's fleet. Silent data corruption events due to hardware error are detected and prevented by Spanner several times per week. For every detected error there is some number of undetected errors that in rare (but not black swan) events cause corruption, either transiently for reads or durably for writes, potentially violating the most fundamental contract that a database system makes with its users: to store and retrieve data with absolute reliability and availability. We describe the work we have done to detect and prevent silent data corruptions and (equally importantly) to remove faulty machines from the fleet, both manually and automatically. We present a simplified analytic model of corruption that provides some insights into the most effective ways to prevent end-user corruption events. We have made qualitative gains in the detection and prevention of SDC events, but quantitative analysis remains difficult. We discuss various potential trajectories in hardware (un)reliability and how they will affect our ability to build reliable database systems on commodity hardware.
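    A toy corruption model, with made-up rates, of the kind of analysis the abstract alludes to: the silent-corruption exposure scales with the fraction of errors that detection misses.

```python
# All numbers below are assumed for illustration; they are not the paper's.
p_corrupt_per_query = 1e-9      # assumed hardware corruption rate per query
detection_coverage = 0.99       # assumed fraction caught by checks/verification
queries_per_day = 1e9 * 86400   # "over a billion queries per second"

expected_silent_per_day = queries_per_day * p_corrupt_per_query * (1 - detection_coverage)
print(f"expected undetected corruptions/day: {expected_silent_per_day:,.0f}")
```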
    The Open Reaction Database
    Abigail G. Doyle
    Connor W. Coley
    Joel M. Hawkins
    Klavs F. Jensen
    Michael R. Maser
    Michael Wleklinski
    Spencer D. Dreher
    (2021)
    Chemical reaction data in journal articles, patents, and even electronic laboratory notebooks are currently stored in various formats, often unstructured, which presents a significant barrier to downstream applications, including the training of machine learning models. We present the Open Reaction Database (ORD), an open access schema and infrastructure for structuring and sharing organic reaction data, including a centralized data repository. The ORD schema supports conventional and emerging technologies, from benchtop reactions to automated high-throughput experiments and flow chemistry. The data, schema, supporting code, and web-based user interfaces are all publicly available on GitHub. Our vision is that a consistent data representation and infrastructure to support data sharing will enable downstream applications that will greatly improve the state of the art with respect to computer-aided synthesis planning, reaction prediction, and other predictive chemistry tasks.
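    For intuition, here is an illustrative structured record of the kind such a schema imposes on otherwise free-form reaction reports; the field names and values are made up and are not the actual ORD schema.

```python
# Illustrative (not actual ORD-schema) reaction record.
reaction = {
    "inputs": [
        {"smiles": "CC(=O)Cl", "role": "reactant", "amount_mmol": 1.0},
        {"smiles": "c1ccccc1N", "role": "reactant", "amount_mmol": 1.1},
    ],
    "conditions": {"temperature_c": 25, "solvent": "DCM"},
    "outcomes": [{"smiles": "CC(=O)Nc1ccccc1", "yield_percent": 87}],
    "provenance": {"source": "example notebook entry", "doi": None},
}
```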
    The Covid Tracking Project was the most reliable source for COVID-19 data with race/ethnicity at the state level until it stopped collecting data on March 7, 2021. The CDC's Case Surveillance Restricted Access and Public Use with Geography datasets are the only available replacements for the Covid Tracking Project's dataset, and they additionally include county-level data and age along with race/ethnicity. This paper evaluates the completeness of the CDC datasets at the state and county levels in terms of (1) the total number of cases included compared to the New York Times, and (2) the number of cases included with race/ethnicity data compared to the Covid Tracking Project. The CDC's Restricted Access dataset contains 78% of the cases in the New York Times up to April 15, 2021, and 65% of cases have race/ethnicity information vs. 67% in the Covid Tracking Project. The dataset's completeness has steadily and gradually improved over time; e.g., the first available version from May 2020 had race/ethnicity information for only 43% of cases. At the state and county levels, the dataset's completeness has also improved with a state-level average of 62% of cases with race/ethnicity in April 2021 vs. 46% in June 2020. However, the dataset's completeness at the state level is highly variable; for example, Minnesota has 102% of the cases included in the New York Times, while Louisiana has only 4% of the cases in the New York Times. Minnesota has 91% of cases with race/ethnicity, while Louisiana has only 19% with race/ethnicity (vs. 94% in the Covid Tracking Project). Texas alone is missing 2.8M cases, accounting for more than a third of the total 7.1M missing cases. New York is missing race/ethnicity for 1.3M cases and California for 1.1M cases, accounting for more than a quarter of the 8.6M cases missing race/ethnicity when combined. The CDC's Public Use with Geography dataset is similar to the Restricted Access dataset for total case counts, but is less complete due to more privacy suppression; e.g., only 49% of cases have race/ethnicity information.
    Validating Data and Models in Continuous ML pipelines
    Evan Rosen
    Gene Huang
    Mike Dreves
    Neoklis Polyzotis
    Zhuo Peng
    IEEE TCDE (2021)
    Production ML is more than writing the code for the trainer. It requires processes and tooling that enable a larger team to share, track, analyze, and monitor not only the code for ML but also the artifacts (Datasets, Models, ...) that are manipulated and generated in these production ML pipelines. In this paper we describe the tools we developed at Google for the analysis and validation of two of the most important types of artifacts: Datasets and Models. These tools are currently deployed in production at Google and other large organizations. Our approach is heavily inspired by well-known principles of data-management systems. Ultimately, we want to enable users to trust their data and models, and understand how data properties affect the quality of the generated ML models.
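    As a sketch of this style of schema-driven artifact validation, the open-source TensorFlow Data Validation library can be used as follows; the file paths are placeholders, and the snippet is an illustration rather than the exact pipeline described in the paper.

```python
# Schema-driven dataset validation: summarize a trusted dataset, infer a
# schema, then validate new batches against it before (re)training.
import tensorflow_data_validation as tfdv

train_stats = tfdv.generate_statistics_from_csv(data_location="train.csv")
schema = tfdv.infer_schema(statistics=train_stats)   # expected types, domains, presence

new_stats = tfdv.generate_statistics_from_csv(data_location="new_batch.csv")
anomalies = tfdv.validate_statistics(statistics=new_stats, schema=schema)
tfdv.display_anomalies(anomalies)   # e.g. missing features, out-of-range values
```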
    Automating information extraction from form-like documents at scale is a pressing need due to its potential impact on automating business workflows across many industries like financial services, insurance, and healthcare. The key challenge is that form-like documents in these business workflows can be laid out in virtually infinitely many ways; hence, a good solution to this problem should generalize to documents with unseen layouts and languages. A solution requires a holistic understanding of both the textual segments and the visual cues within a document, which is non-trivial. While the natural language processing and computer vision communities are starting to tackle this problem, there has not been much focus on (1) data efficiency, and (2) the ability to generalize across different document types and languages. In this paper, we show that when we have only a small number of labeled documents for training (~50), a straightforward transfer learning approach from a larger labeled corpus with considerably different structure yields up to a 27 F1 point improvement over simply training on the small corpus in the target domain. We improve on this with a simple multi-domain transfer learning approach, currently in production use, and show that it yields up to a further 8 F1 point improvement. We argue that data efficiency is critical to enable information extraction systems to scale to handle hundreds of different document types, and that learning good representations is critical to accomplishing this.
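    A generic sketch of the recipe being evaluated (pretrain on a large labeled corpus, then fine-tune on roughly 50 target-domain documents); the model and data handling are placeholders, not the paper's extraction architecture.

```python
# Fine-tune a source-domain model on a small target-domain corpus.
import torch
from torch import nn, optim

def finetune(model: nn.Module, target_docs, epochs: int = 10, lr: float = 1e-5):
    """Continue training a pretrained model on ~50 labeled target documents."""
    opt = optim.AdamW(model.parameters(), lr=lr)   # small LR to avoid forgetting
    loss_fn = nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for features, labels in target_docs:       # (tensor, tensor) pairs
            opt.zero_grad()
            loss = loss_fn(model(features), labels)
            loss.backward()
            opt.step()
    return model

# source_model = train_on_large_corpus(...)   # structurally different domain
# target_model = finetune(source_model, small_target_corpus)
```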
    Napa: Powering Scalable Data Warehousing with Robust Query Performance at Google
    Kevin Lai
    Min Chen
    Jim Chen
    Ming Dai
    Thanh Do
    Haoyu Gao
    Haoyan Geng
    Raman Grover
    Bo Huang
    Yanlai Huang
    Adam Li
    Jianyi Liang
    Tao Lin
    Li Liu
    Yao Liu
    Xi Mao
    Maya Meng
    Prashant Mishra
    Jay Patel
    Vijayshankar Raman
    Sourashis Roy
    Mayank Singh Shishodia
    Tianhang Sun
    Justin Tang
    Junichi Tatemura
    Sagar Trehan
    Ramkumar Vadali
    Prasanna Venkatasubramanian
    Joey Zhang
    Kefei Zhang
    Yupu Zhang
    Zeleng Zhuang
    Divyakant Agrawal
    Jeff Naughton
    Sujata Sunil Kosalge
    Hakan Hacıgümüş
    Proceedings of the VLDB Endowment (PVLDB), vol. 14 (12) (2021), pp. 2986-2998
    There are numerous Google services that continuously generate vast amounts of log data, which are used to provide valuable insights to internal and external business users. We need to store and serve these planet-scale data sets under extremely demanding requirements: scalability, sub-second query response times, availability even in the case of entire data center failures, strong consistency guarantees, and the ability to ingest a massive stream of updates coming from applications used around the globe. We have developed and deployed in production an analytical data management system, called Napa, to meet these requirements. Napa is the backend for multiple internal and external clients at Google, so there is a strong expectation of variance-free, robust query performance. At its core, Napa's principal technology for robust query performance is the aggressive use of materialized views, which are maintained consistently as new data is ingested across multiple data centers. Our clients also demand flexibility in being able to adjust their query performance, data freshness, and costs to suit their unique needs. Robust query processing and flexible configuration of client databases are the hallmarks of Napa's design. Most of the related work in this area takes advantage of full flexibility to design the whole system without the need to support a diverse set of preexisting use cases, whereas Napa must deal with the hard constraints of applications that differ on which characteristics of the system are most important to optimize. Those constraints led us to make particular design decisions and also to devise new techniques to meet the challenges. In this paper, we share our experiences in designing, implementing, deploying, and running Napa in production with some of Google's most demanding applications.
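    A toy sketch of the materialized-view idea at Napa's core: aggregates are updated as data is ingested so queries read the precomputed view instead of scanning logs; Napa's actual LSM and multi-datacenter machinery is, of course, far more involved.

```python
# Incrementally maintained aggregate view over ingested log batches.
from collections import defaultdict

class CountView:
    """Materialized view: event count per (country, day)."""
    def __init__(self):
        self.counts = defaultdict(int)

    def apply_batch(self, rows):
        # Called at ingestion time, so the view stays consistent with the base data.
        for row in rows:
            self.counts[(row["country"], row["day"])] += 1

    def query(self, country, day):
        return self.counts[(country, day)]         # no log scan at query time

view = CountView()
view.apply_batch([{"country": "US", "day": "2021-07-01"},
                  {"country": "JP", "day": "2021-07-01"}])
print(view.query("US", "2021-07-01"))
```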
    Dataset or Not? A study on the veracity of semantic markup for dataset pages
    Tarfah Alrashed
    Omar Benjelloun
    20th International Semantic Web Conference (ISWC 2021) (to appear)
    Semantic markup, such as Schema.org, allows providers on the Web to describe content using a shared controlled vocabulary. This markup is invaluable in enabling a broad range of applications, from vertical search engines, to rich snippets in search results, to actions on emails, to many others. In this paper, we focus on semantic markup for datasets, specifically in the context of developing a vertical search engine for datasets on the Web, Google's Dataset Search. Dataset Search relies on Schema.org to identify pages that describe datasets. While Schema.org was the core enabling technology for this vertical search, we also discovered that we need to address the following problem: pages from 61% of internet hosts that provide Schema.org/Dataset markup do not actually describe datasets. We analyze the veracity of dataset markup for Dataset Search's Web-scale corpus and categorize pages where this markup is not reliable. We then propose a way to drastically increase the quality of the dataset metadata corpus by developing a deep neural-network classifier that identifies whether or not a page with Schema.org/Dataset markup is a dataset page. Our classifier achieves 96.7% recall at the 95% precision point. This level of precision enables Dataset Search to circumvent the noise in semantic markup and to use the metadata to provide high quality results to users.
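    As a stand-in for the classifier described, a minimal text classifier over page content illustrates the task; the features and model here are simplifications, not the paper's deep neural network.

```python
# Classify whether a page carrying Schema.org/Dataset markup really describes
# a dataset (toy features and model; training data shown is illustrative).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

pages = ["Download the 2019 air quality measurements as CSV ...",
         "Blog post about our new product launch ..."]
labels = [1, 0]                      # 1 = real dataset page, 0 = not

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(pages, labels)               # in practice: a large labeled Web corpus
print(clf.predict(["Census microdata files available for download"]))
```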