Datasets
In order to contribute to the broader research community, Google periodically releases data of interest to researchers in a wide range of computer science disciplines.
1 - 15 of 162 datasets
ScreenQA
The ScreenQA dataset was introduced in the paper "ScreenQA: Large-Scale Question-Answer Pairs over Mobile App Screenshots". It contains ~86K question-answer pairs collected by human annotators for ~35K screenshots from Rico. It is intended for training and evaluating models capable of screen content understanding via question answering.
Adversarial Nibbler Round 1 Dataset
This dataset contains results from round 1 of the Adversarial Nibbler challenge. The data includes adversarial prompts fed into public generative text-to-image models and validations for unsafe images. It also includes all prompts submitted and all prompts attempted.
Screen Annotation
The Screen Annotation dataset consists of pairs of mobile screenshots and their annotations. The annotations describe the UI elements present on the screen: their type, location, OCR text and a short description.
CF-TriviaQA
The CF-TriviaQA dataset accompanies the paper "Hallucination Augmented Recitations for Language Models" (https://arxiv.org/abs/2311.07424). It is a counterfactual open-book QA dataset generated from the TriviaQA dataset using the HAR approach, with the purpose of improving attribution in LLMs.
SCIN Crowdsourced Dermatology Dataset
The SCIN dataset contains 10,000 images of dermatology conditions, crowdsourced with informed consent from US internet users. Contributions include self-reported demographic and symptom information and dermatologist labels, as well as estimated Fitzpatrick skin type and Monk Skin Tone.
BamTwoogle
The BamTwoogle dataset accompanies the paper "ReST meets ReAct: Self-Improvement for Multi-Step Reasoning LLM Agent" (https://arxiv.org/abs/2312.10003). It was written to be a complementary, slightly more challenging sequel to the Bamboogle dataset.
Google Data Center Power Trace 2019
Power utilization of power domains in Google data centers during May 2019.
ScreenQA Short
This dataset is a modification of the original ScreenQA dataset. It contains the same ~86K questions for ~35K screenshots from Rico, but the ground truth is a list of short answers. It is intended for training and evaluating models capable of screen content understanding via question answering.
LibriTTS-R
LibriTTS-R is a version of the LibriTTS corpus (http://www.openslr.org/60/) with improved sound quality. LibriTTS is a multi-speaker English corpus of approximately 585 hours of read English speech at a 24 kHz sampling rate. To improve sound quality, a speech restoration model, "Miipher", was used.
MD3: Multi-dialect dataset of dialogues
The MD3 dataset features audio and transcripts of thousands of conversational dialogues in English from India, Nigeria, and the United States. In each dialogue, speakers are prompted with an information-sharing intent, which is an image or phrase.
AIS: Attributable to Identified Sources
AIS is an evaluation framework for assessing whether the output of natural language models only contains information about the external world that is verifiable in source documents, or "Attributable to Identified Sources".
Voice Assistant Failures Dataset
This is a dataset of 199 failures that 107 users have encountered when interacting with commercial voice assistants.
MusicCaps
The MusicCaps dataset contains 5.5k high-quality music captions written by musicians. Each caption describes a 10-second clip of music from YouTube.
Upwelling irradiance from GOES-16
Machine-learned models that estimate wideband irradiance from 2 km narrow-band radiances (using co-aligned satellite imagery as training data) and can therefore be used to make satellite-driven estimates of contrail warming.
DiffQG: Generating questions on paired sentences
DiffQG is a dataset about summarizing the difference between two passages using a question and answer pair.