Datasets

In order to contribute to the broader research community, Google periodically releases data of interest to researchers in a wide range of computer science disciplines.

1 - 15 of 158 datasets
    ScreenQA
    The ScreenQA dataset was introduced in the paper "ScreenQA: Large-Scale Question-Answer Pairs over Mobile App Screenshots". It contains ~86K question-answer pairs collected by human annotators for ~35K screenshots from Rico. It should be used to train and evaluate models capable of screen content understanding via question answering.
    ScreenQA Short
    The dataset is a modification of the original ScreenQA dataset. It contains the same ~86K questions for ~35K screenshots from Rico, but the ground truth is a list of short answers. It should be used to train and evaluate models capable of screen content understanding via question answering.
    Google Data Center Power Trace 2019
    Power utilization of power domains in Google data centers during May 2019.
    SCIN Crowdsourced Dermatology Dataset
    The SCIN dataset contains 10,000 images of dermatology conditions, crowdsourced with informed consent from US internet users. Contributions include self-reported demographic and symptom information and dermatologist labels, as well as estimated Fitzpatrick skin type and Monk Skin Tone.
    Screen Annotation
    The Screen Annotation dataset consists of pairs of mobile screenshots and their annotations. The annotations describe the UI elements present on the screen: their type, location, OCR text and a short description.
    LibriTTS-R
    LibriTTS-R is a version of the LibriTTS corpus (http://www.openslr.org/60/) with improved sound quality. LibriTTS is a multi-speaker corpus of approximately 585 hours of read English speech at a 24 kHz sampling rate. To improve sound quality, a speech restoration model, Miipher, was used.
    MD3: Multi-dialect dataset of dialogues
    The MD3 dataset features audio and transcripts of thousands of conversational dialogues in English from India, Nigeria, and the United States. In each dialogue, speakers are prompted with an information-sharing intent, which is an image or phrase.
    AIS: Attributable to Identified Sources
    AIS is an evaluation framework for assessing whether the output of natural language models only contains information about the external world that is verifiable in source documents, or "Attributable to Identified Sources".
    MusicCaps
    The MusicCaps dataset contains 5.5k high-quality music captions written by musicians. Each caption describes a 10-second clip of music from YouTube.
    DiffQG: Generating questions on paired sentences
    DiffQG is a dataset about summarizing the difference between two passages using a question and answer pair.
    QUEST
    QUEST is a dataset of 3357 natural language queries with implicit set operations that map to a set of entities corresponding to Wikipedia documents.
    Voice Assistant Failures Dataset
    This is a dataset of 199 failures that 107 users have encountered when interacting with commercial voice assistants.
    Google Cloud Public Datasets
    The Google Cloud Public Datasets Program hosts copies of structured and unstructured data to make it easier for users to discover, access, and utilize public data in the cloud. These datasets are hosted for free.
    Upwelling irradiance from GOES-16
    Machine-learned models that estimate wideband irradiance from 2 km narrow-band radiances, using co-aligned satellite imagery as training data. They can be used to make satellite-driven estimates of contrail warming.
    C4RepSet
    C4RepSet is a representative subset of C4 (Colossal Clean Crawled Corpus). It supports efficient training of large language models despite being significantly smaller than C4.