In order to contribute to the broader research community, Google periodically releases data of interest to researchers in a wide range of computer science disciplines.

    The ScreenQA dataset was introduced in the paper "ScreenQA: Large-Scale Question-Answer Pairs over Mobile App Screenshots". It contains ~86K question-answer pairs collected by human annotators for ~35K screenshots from Rico. It should be used to train and evaluate models capable of screen content understanding via question answering.
    Google Data Center Power Trace 2019
    Power utilization of power domains in Google data centers during May 2019.
    Screen Annotation
    The Screen Annotation dataset consists of pairs of mobile screenshots and their annotations. The annotations describe the UI elements present on the screen: their type, location, OCR text and a short description.
    Adversarial Nibbler Round 1 Dataset
    This dataset contains results from round 1 of the Adversarial Nibbler challenge. The data includes adversarial prompts fed into public generative text-to-image models and validations for unsafe images. It also includes all prompts submitted and all prompts attempted.
    SCIN Crowdsourced Dermatology Dataset
    The SCIN dataset contains 10,000 images of dermatology conditions, crowdsourced with informed consent from US internet users. Contributions include self-reported demographic and symptom information and dermatologist labels, as well as estimated Fitzpatrick skin type and Monk Skin Tone.
    ScreenQA Short
    The dataset is a modification of the original ScreenQA dataset. It contains the same ~86K questions for ~35K screenshots from Rico, but the ground truth is a list of short answers. It should be used to train and evaluate models capable of screen content understanding via question answering.
    LibriTTS-R is a sound-quality-improved version of the LibriTTS corpus (http://www.openslr.org/60/), a multi-speaker English corpus of approximately 585 hours of read English speech at a 24kHz sampling rate. To improve sound quality, a speech restoration model, Miipher, was used.
    MD3: Multi-dialect dataset of dialogues
    The MD3 dataset features audio and transcripts of thousands of conversational dialogues in English from India, Nigeria, and the United States. In each dialogue, speakers are prompted with an information-sharing intent, which is an image or phrase.
    AIS: Attributable to Identified Sources
    AIS is an evaluation framework for assessing whether the output of natural language models only contains information about the external world that is verifiable in source documents, or "Attributable to Identified Sources".
    The MusicCaps dataset contains 5.5k high-quality music captions written by musicians. Each caption describes a 10-second clip of music from YouTube.
    DiffQG: Generating questions on paired sentences
    DiffQG is a dataset about summarizing the difference between two passages using a question and answer pair.
    Voice Assistant Failures Dataset
    This is a dataset of 199 failures that 107 users have encountered when interacting with commercial voice assistants.
    QUEST is a dataset of 3357 natural language queries with implicit set operations that map to a set of entities corresponding to Wikipedia documents.
    Upwelling irradiance from GOES-16
    Machine-learned models that estimate wideband irradiance from 2km narrow-band radiances, using co-aligned satellite imagery as training data. The models can be used to make satellite-driven estimates of contrail warming.
    Google Cloud Public Datasets
    The Google Cloud Public Datasets Program hosts copies of structured and unstructured data to make it easier for users to discover, access, and utilize public data in the cloud. These datasets are hosted for free.