Datasets

In order to contribute to the broader research community, Google periodically releases data of interest to researchers in a wide range of computer science disciplines.

Browse our datasets

Search for datasets on the web with Dataset Search.

11 Billion Clues in 800 Million Documents; ClueWeb12
We took the ClueWeb corpora and automatically labeled concepts and entities with Freebase concept IDs, an example of entity resolution. This dataset is huge: nearly 800 million web pages.
50,000 Lessons on How to Read: a Relation Extraction Corpus
A human-judged dataset of two relations involving public figures on Wikipedia: about 10,000 examples of "place of birth" and 40,000 examples of "attended or graduated from an institution."
AIS: Attributable to Identified Sources
AIS is an evaluation framework for assessing whether the output of natural language models only contains information about the external world that is verifiable in source documents, or "Attributable to Identified Sources".
Android smartphones high accuracy GNSS datasets
This dataset contains raw GNSS measurements collected from Android smartphones, and their precise ground truth trajectories. With carrier phase and dual frequency measurements, this dataset aims to facilitate the research and development of sub-meter smartphone positioning.
Argentinian Spanish [es-ar] multi-speaker speech
Speech dataset containing about 5,900 transcribed high-quality audio from Argentinian Spanish [es-ar] sentences recorded by volunteers.
Attributed QA
Attributed Question Answering (QA) is a key first step in the development of attributed LLMs. This release consists of human-rated system outputs for Attributed Question Answering.
AudioSet
The AudioSet dataset is a large-scale collection of human-labeled 10-second sound clips drawn from YouTube videos.
Auto-Arborist
The Auto Arborist dataset is a multiview fine-grained visual categorization dataset that contains over 2 million trees belonging to over 300 genus-level categories in 23 cities across the US and Canada, built to foster the development of robust methods for large-scale urban forest monitoring.
AutoFlow
We present AutoFlow, a simple and effective method to render training data for optical flow that optimizes the performance of a model on a target dataset. Experimental results show that AutoFlow achieves state-of-the-art accuracy in pre-training both PWC-Net and RAFT.
AVA Dataset
Spatio-temporal annotations of human actions in movies, suitable for training localized action recognition systems.
Basque [eu-es] multi-speaker speech
Speech dataset containing about 7,100 transcribed high-quality audio of Basque [eu-es] sentences recorded by volunteers.
BC-Z Demonstration Dataset
Episodes of a robotic arm performing 100 different manipulation tasks. Data for each episode includes the RGB video, the robot's end-effector positions, and the natural language embedding. Episodes were gathered using teleoperation via a VR controller.
Bengali [bn-bd] ASR
Speech dataset containing about 218,000 transcribed audio of Bangladesh Bengali [bn-bd] sentences recorded by volunteers.
Bengali [bn-bd/bn-in] multi-speaker speech dataset
Speech dataset containing about 1,850 transcribed high-quality audio of Bangladesh Bengali [bn-bd] sentences recorded by volunteers, and about 1,350 transcribed high-quality audio of Indian Bengali [bn-in] sentences recorded by volunteers.
Bike Video Dataset
This dataset contains 91,866 frames extracted from 11 videos. The videos were recorded using a hand-held phone camera while riding a bicycle. The dataset was primarily created to test the ability of self-supervised depth-learning methods to learn from videos with complex and non-smooth ego-motion.
Burmese [my-mm] multi-speaker speech dataset
Speech dataset containing about 2,500 transcribed high-quality audio of Burmese [my-mm] sentences recorded by volunteers.
C4_200M Synthetic Dataset for Grammatical Error Correction
This dataset contains synthetic training data for grammatical error correction. The corpus is generated by corrupting clean sentences from C4 using a tagged corruption model.
C4RepSet
C4RepSet is a representative subset of C4 (Colossal Clean Crawled Corpus). It offers efficient training of large language models even though the size is significantly smaller than C4.
Cartoon Set
Cartoon Set is a collection of random, 2D cartoon avatar images. The cartoons vary in 10 artwork categories, 4 color categories, and 4 proportion categories, with a total of ~10^13 possible combinations. We provide sets of 100k randomly chosen cartoons and labeled attributes.
Catalan [ca-es] multi-speaker speech dataset
Speech dataset containing about 4,200 transcribed high-quality audio of Catalan [ca-es] sentences recorded by volunteers.
Chilean Spanish [es-cl] multi-speaker speech
Speech dataset containing about 4,350 transcribed high-quality audio of Chilean Spanish [es-cl] sentences recorded by volunteers.
Chrome User Experience Report
The Chrome User Experience Report (also known as the Chrome UX Report, or CrUX for short) is a dataset that reflects how real-world Chrome users experience popular destinations on the web. CrUX is the official dataset of the Web Vitals program.
Circa - Indirect yes/no answers in dialog
Circa is a dataset for the problem of interpreting indirect answers to polar (yes/no) questions. It contains 34,269 pairs of yes/no questions and indirect answers, together with the interpretation of the answer. E.g., "Are you vegan?" "I love steak too much." [Interpretation=No]
CLSE: Corpus of Linguistically Significant Entities
A dataset of named entities annotated by linguist experts. It includes 34 languages and covers 74 different semantic types to support various applications from airline ticketing to video games. The aim of the corpus is to facilitate the creation of more linguistically diverse NLG datasets.
Coached Conversational Preference Elicitation
Wizard-of-Oz preference elicitation conversations in English between a user and an assistant about movie preferences, with annotated preference statements.
CocoChorales
CocoChorales consists of over 1400 hours of audio mixtures containing four-part chorales performed by 13 instruments, all synthesized with realistic-sounding generative models. CocoChorales contains mixes, sources, and MIDI data, as well as annotations for note expression and synthesis parameters.
Colombian Spanish [es-co] multi-speaker speech
Speech dataset containing about 4,900 transcribed high-quality audio of Colombian Spanish [es-co] sentences recorded by volunteers.
Conceptual Captions
A dataset consisting of ~3.3M images annotated with captions harvested from the web, representing a wide variety of styles.
Contrail attributions
A number of academic and industry groups have produced methods for predicting which flights should be diverted to mitigate contrails, but these predictions have not been assessed to see how well they actually perform. This validation dataset allows such an assessment to be performed.
Conversational English audio annotations
This dataset was used to test the performance of our Audio De-id pipeline in our NAACL 2019 paper 'Audio De-identification: A New Entity Recognition Task'. We evaluated our pipeline using a random subset of conversations from the Switchboard and Fisher datasets.
COVID-19 Open Data
This repository contains datasets of daily time-series data related to COVID-19 for 50+ countries around the world.
COVID-19 Vaccination Search Insights
The dataset can help public health stakeholders explore vaccine-related concerns and the information needs of communities. It shows aggregate and anonymized trends representing the relative search interests across multiple search categories at the county and postal code level, and is updated weekly.
Crossmodal-3600
Crossmodal-3600 is a geographically diverse dataset of 3600 images, each annotated with human-generated reference captions in 36 languages.
CVSS: A Massively Multilingual Speech-to-Speech Translation Corpus
CVSS is a massively multilingual-to-English speech-to-speech translation corpus, covering sentence-level parallel speech-to-speech translation pairs from 21 languages into English.
Dataset Metadata for CORD-19
Paper-dataset pairs for datasets mentioned or referenced in CORD-19 papers, an open research dataset of papers relevant to COVID-19. Specifically, the content contributes the metadata for these datasets collected from their descriptions in schema.org across data repositories on the Web.
Dataset Search: metadata for datasets
Dataset Search collects the metadata from schema.org markup on data provider pages. We then reconcile, clean and aggregate this information to show you the search results in Dataset Search. In this subset of the corpus, we include metadata for datasets that have DOIs or compact identifiers.
Demographic Traits Annotations in Clinical Notes
The data contains sentence tagging for MIMIC-III and I2b2 2006 datasets that was used in the paper ‘Interactive Deep Learning to Detect Demographic Traits in Free-Form Clinical Notes’. Every sentence is tagged with its own demographic trait tag (as defined in the "Annotations Guide" file).
Dictionaries for linking Text, Entities, and Ideas
Database of pairs of 175 million strings associated with 7.5 million concepts, annotated with counts. The concepts are Wikipedia articles; the strings are anchor text spans that link to the concepts.
DiffQG: Generating questions on paired sentences
DiffQG is a dataset about summarizing the difference between two passages using a question and answer pair.
DiscoFuse
A dataset of 60 million examples for training sentence fusion models. The data has been collected from Wikipedia and from Sports articles.
Disfl-QA
~12k contextual disfluent questions based on SQuAD-v2.
Document similarity triplets data
This is a dataset for evaluating document similarity models. In each file, each line consists of a triplet of URLs, either all from Wikipedia or all from arXiv.org. The content of URLs one and two should be more similar than the content of URLs two and three.
DQN Replay dataset
An offline RL dataset on Atari 2600 games based on the logged replay data of a DQN agent comprising 50 million (observation, action, reward, next observation) tuples per game.
EditBench
EditBench is a comprehensive diagnostic and evaluation dataset for text-guided image editing.
English Syntactic Ngrams
This corpus contains dependency tree fragments from automatically parsed English text. The dependency trees follow the Stanford basic-dependencies scheme. Each syntactic-ngram is accompanied by a corpus-level occurrence count, as well as a time-series of counts over the years.
ETA Exploration Traces
ETA (Exploratory Testing Architecture) is a testing framework that explores the execution of a distributed application, looking for bugs that are provoked by particular sequences of events caused by non-determinism such as timing and asynchrony.
Evoked Expressions in Video
The Evoked Expressions in Video dataset contains videos paired with the expected facial expressions over time exhibited by people reacting to the video content.
Features Extracted From YouTube Videos for Multiview Learning
Multiple feature families from a set of public YouTube videos of games. The videos are labeled with one of 30 categories, and each has an associated set of visual, auditory, and textual features.
Few-shot Regional Machine Translation
FRMT is a few-shot evaluation dataset containing en-pt and en-zh bitexts translated from Wikipedia, in two regional varieties for each non-English language (pt-BR and pt-PT; zh-CN and zh-TW). Sentences are grouped into three buckets designed to measure different aspects of controllable translation.
Galician [gl-es] multi-speaker speech
Speech dataset containing about 5,550 transcribed high-quality audio of Galician [gl-es] sentences recorded by volunteers.
Gap-Coreference
GAP is a gender-balanced dataset containing 8,908 coreference-labeled pairs of (ambiguous pronoun, antecedent name), sampled from Wikipedia for the evaluation of coreference resolution in practical applications.
GoEmotions
GoEmotions is a human-annotated dataset of 58k Reddit comments. It is labeled with 27 emotion categories (12 positive, 11 negative, 4 ambiguous, and “neutral”), making it widely suitable for conversation understanding tasks that require a subtle differentiation between emotion expressions.
Google Cloud Public Datasets
The Google Cloud Public Datasets Program hosts copies of structured and unstructured data to make it easier for users to discover, access, and utilize public data in the cloud. These datasets are hosted for free.
Google cluster data 2011
29 days' worth of Borg scheduler information from May 2011, on a Google compute cluster of about 12.5k machines.
Google cluster data 2019
A 1-month trace of every job submission, task assignment, and resource-usage data from eight Borg cells (clusters) in May 2019.
Google cluster data 2021
This repository describes various traces from parts of the Google cluster management software and systems.
Google facial expression comparison dataset
This dataset consists of face image triplets along with annotations (by multiple human raters) that specify which two faces in each triplet form the most similar pair in terms of facial expression.
Google Hardware Accelerator Exploration Data
A dataset which contains the latency of running a variety of neural network models on a Google template-based hardware accelerator across different architecture configurations.
Google Landmarks Dataset v2
The Google Landmarks dataset (GLDv2) is a large-scale benchmark for fine-grained instance-level recognition. It contains over 5M images of natural or human-made landmarks and has protocols for evaluating object recognition and image retrieval.
Google Open Images Mutual Gaze dataset
This dataset consists of images along with annotations that specify whether two faces in the photo are looking at each other. This dataset is intended to aid researchers working on topics related to social behavior, visual attention, etc.
Google Patent Phrase Similarity Dataset
This is a human rated contextual phrase to phrase matching dataset focused on technical terms from patents. In addition to similarity scores we include granular rating classes similar to WordNet, such as synonym, antonym, hypernym, hyponym, holonym, meronym, domain related.
Google Workload Traces 2022
The Google Workload Traces capture the addresses of instruction and memory accesses during execution. These traces aim to help system designers better understand Warehouse-Scale Computing workloads and develop new solutions for front-end and data-access bottlenecks.
Gujarati [gu-in] multi-speaker speech
Speech dataset containing about 4,250 transcribed high-quality audio of Gujarati [gu-in] sentences recorded by volunteers.
HDR+ Burst Photography Dataset
An archive of full-resolution raw image bursts over a wide range of scenes, along with the results from Google's HDR+ camera software for comparison.
Hinglish-TOP
Hinglish-TOP consists of the largest (10K) human-annotated code-switched semantic parsing dataset and 170K utterances generated using the CST5 augmentation technique introduced in our paper.
Human Pouring Videos
Videos of people pouring a variety of liquids from and into a variety of receptacles, used for research on unsupervised imitation learning.
InFormal Dataset
InFormal is a formality style transfer dataset for four Indic languages. The dataset is made up of pairs of sentences, each with a corresponding gold label identifying the more formal sentence as well as the semantic similarity of the pair. This dataset can be used as an evaluation set for style transfer tasks in Indic languages.
Javanese [jv-id] ASR
Speech dataset containing about 185,000 transcribed audio of Javanese [jv-id] sentences recorded by volunteers.
Javanese [jv-id] multi-speaker speech dataset
Speech dataset containing about 5,800 transcribed high-quality audio of Javanese [jv-id] sentences recorded by volunteers.
Kannada [kn-in] multi-speaker speech
Speech dataset containing about 4,400 transcribed high-quality audio of Kannada [kn-in] sentences recorded by volunteers.
Khmer [km-kh] multi-speaker speech dataset
Speech dataset containing about 2,900 transcribed high-quality audio of Khmer [km-kh] sentences recorded by volunteers.
KIP Distilled Datasets
These are distilled datasets derived from MNIST, Fashion-MNIST, CIFAR-10, CIFAR-100, and SVHN using infinitely wide convolutional networks. Sample result: Over 64% test accuracy on CIFAR-10 achieved using only 10 images.
Lens Flare
High-quality RGB images of typical lens flare against a black background. Among them, ~2k are captured with a typical smartphone camera, and ~3k are simulated computationally.
LibriTTS
Large-scale corpus of English speech for TTS research
LibriTTS-R
LibriTTS-R is a sound-quality-improved version of the LibriTTS corpus (http://www.openslr.org/60/), which is a multi-speaker English corpus of approximately 585 hours of read English speech at a 24kHz sampling rate. To improve sound quality, a speech restoration model, 'Miipher', was used.
MAESTRO
MAESTRO (MIDI and Audio Edited for Synchronous TRacks and Organization) is a dataset composed of about 200 hours of virtuosic piano performances captured with fine alignment (~3 ms) between note labels and audio waveforms.
Malayalam [ml-in] multi-speaker speech
Speech dataset containing about 4,100 transcribed high-quality audio from Malayalam [ml-in] sentences recorded by volunteers.
Marathi [mr-in] multi-speaker speech
Speech dataset containing about 1,500 transcribed high-quality audio from Marathi [mr-in] sentences recorded by volunteers.
MD3: Multi-dialect dataset of dialogues
The MD3 dataset features audio and transcripts of thousands of conversational dialogues in English from India, Nigeria, and the United States. In each dialogue, speakers are prompted with an information-sharing intent, which is an image or phrase.
Metaphorical Inference Questions and Answers
MiQA assesses the capability of language models to reason with conventional metaphors. It combines the previously isolated topics of metaphor detection and commonsense reasoning into a single task that requires a model to make inferences by selecting between the literal and metaphorical register.
Mostly Basic Python Problems (MBPP)
The benchmark consists of around 1,000 crowd-sourced Python programming problems, designed to be solvable by entry-level programmers, covering programming fundamentals, standard library functionality, and so on. Each problem consists of a task description, a code solution, and 3 automated test cases.
MSR2022 OCEAN mailing list dataset
The MSR2022 OCEAN mailing list dataset is a normalized version of the dataset aggregated by Project OCEAN - Project Datasets (https://github.com/google/project-OCEAN) created for submission into the Data and Tool Showcase track for the Mining Software Repositories Conference in 2022.
MT-Opt dataset
Datasets for the MT-Opt paper
MultiBERTs Predictions on Winogender
Predictions of BERT on Winogender before and after several different types of interventions. This is extra material to support the publication "The MultiBERTs, BERT Reproductions for Robustness Analysis", ICLR'22 (Section 4: "Application: Gender Bias in Coreference Systems").
MultilingualOpenRelations15
Relation extraction is the task of assigning a semantic relationship between a pair of arguments. This dataset provides automatically extracted relations obtained using the algorithm in Faruqui and Kumar (2015) and the human annotations for evaluating the algorithm in French, Russian and Hindi.
Multi-view Human Pouring Videos
A variety of people pouring liquids into containers, taken from multiple angles, which can be used to learn representations of the abstract task of pouring for robot learning.
MusicCaps
The MusicCaps dataset contains 5.5k high-quality music captions written by musicians. Each describes a 10-second clip of music from YouTube.
Natural Language Understanding Uncertainty Evaluation
NaLUE is a relabelled and aggregated version of three large NLU corpora: CLINC150, Banking77, and HWU64. It contains 50k+ utterances spanning 18 verticals, 77 domains, and ~260 intents. In this task, the model needs to map each user utterance to a 3-token sequence of (vertical, domain, intent).
Nepali [ne-np] ASR
Speech dataset containing about 157,000 transcribed audio of Nepali [ne-np] sentences recorded by volunteers.
Nepali [ne-np] multi-speaker speech
Speech dataset containing about 2,000 transcribed high-quality audio of Nepali [ne-np] sentences recorded by volunteers.
Nigerian English [en-ng] multi-speaker speech
Speech dataset containing about 3,350 transcribed high-quality audio of Nigerian English [en-ng] sentences recorded by volunteers.
Noun Verb
This dataset contains naturally-occurring English sentences that feature non-trivial noun-verb ambiguity. English part-of-speech taggers regularly make egregious errors related to noun-verb ambiguity, despite having achieved 97%+ accuracy on the WSJ Penn Treebank since 2002. These mistakes have been difficult to quantify and make taggers less useful to downstream tasks such as translation and text-to-speech synthesis.
NSynth
NSynth is an audio dataset containing 305,979 musical notes, each with a unique pitch, timbre, and envelope. For 1,006 instruments from audio libraries, we generated four second, monophonic 16kHz audio snippets.
NY Times Annotated Corpus Dataset
The data included in this release accompanies the paper, entitled "A New Entity Salience Task with Millions of Training Examples" by Jesse Dunietz and Dan Gillick (EACL 2014).
Objectron
Object-centric short videos with pose annotation
OpenContrails: Public Contrails Detection
OpenContrails dataset containing twenty thousand GOES-16 scenes with pixel-level labels
Open Images
A dataset consisting of ~9 million URLs to images that have been annotated with labels spanning over 6000 categories.
Open Images Extended - Crowdsourced
Additional imagery sets to the main Open Images dataset, to improve its diversity (geographic, cultural, demographic, subject matter, etc). Currently composed of ~478K images contributed by users of the Crowdsource app.
Optical polarization data from Curie
This dataset contains measurements of the state of polarization of optical signals traversing the Curie submarine cable.
Peruvian Spanish [es-pe] multi-speaker speech
Speech dataset containing about 5,450 transcribed high-quality audio of Peruvian Spanish [es-pe] sentences recorded by volunteers.
Procedurally Generated Random Objects
A large collection of procedurally-generated simulated 3D objects for robotic manipulation experiments.
Puerto Rico Spanish [es-pr] multi-speaker speech dataset
Speech dataset containing about 600 transcribed high-quality audio of Puerto Rico Spanish [es-pr] sentences recorded by volunteers.
Query-wellformedness
25,100 queries from the Paralex corpus (Fader et al., 2013) annotated with human ratings of whether they are well-formed natural language questions.
QUEST
QUEST is a dataset of 3357 natural language queries with implicit set operations, that map to a set of entities corresponding to Wikipedia documents.
QuickDraw
A collection of 50 million drawings across 345 categories, captured as timestamped vectors and tagged with metadata.
RealEstate 10K
A dataset of camera trajectories derived from YouTube video, intended to aid researchers working in 3D computer vision, graphics, and view synthesis.
Re-contextualizing Fairness in NLP for India Data
This is a dataset of societal stereotypes in India along the Region and Religion axes, along with a list of identity terms and templates intended to be used for reproducing the results from the paper "Re-contextualizing Fairness in NLP: The Case of India" (https://arxiv.org/abs/2209.12226).
Robot Arm Grasping
A collection of 650k attempts by a robot arm to grasp a variety of objects. The dataset contains RGB-D views of the arm, gripper and objects, along with actuation and position parameters.
Robot Arm Pushing
A collection of 95k examples of a robot arm pushing a variety of objects. The dataset contains RGB-D views of the arm, gripper and objects, along with actuation and position parameters.
Room-Across-Room
A dataset of 126,069 indoor navigation instructions in English, Hindi and Telugu. Each instruction describes a trajectory through a realistic 3D building capture from the Matterport3D dataset.
Scanned Objects by Google Research
Scanned Objects by Google Research is a dataset of common household objects that have been 3D scanned for use in robotic simulation and synthetic perception research.
Schema-Guided Dialogue
A dataset consisting of over 20k annotated multi-domain, task-oriented conversations spanning 45 services across 20 domains. Schemas describing service APIs are provided to enable and evaluate zero-shot transfer to entirely unseen services and domains.
ScreenQA
The ScreenQA dataset was introduced in the "ScreenQA: Large-Scale Question-Answer Pairs over Mobile App Screenshots" paper. It contains ~86K question-answer pairs collected by human annotators for ~35K screenshots from Rico. It is intended for training and evaluating models capable of screen content understanding via question answering.
SeeGULL
Stereotype benchmark datasets are crucial to detect and mitigate social stereotypes in NLP models. SeeGULL is a broad-coverage stereotype dataset in English, and contains stereotypes about identity groups spanning 178 countries across 8 different geo-political regions across 6 continents.
SemCor and Masc documents annotated with NOAD word senses
Word sense annotations on the popular MASC and SemCor datasets, manually annotated with senses from the New Oxford American Dictionary, along with mappings from the NOAD identifiers to the popular English Wordnet dictionary.
Sinhala [si-lk] ASR
Speech dataset containing about 185,000 transcribed audio of Sinhala [si-lk] sentences recorded by volunteers.
Sinhala [si-lk] multi-speaker speech
Speech dataset containing about 2,000 transcribed high-quality audio of Sinhala [si-lk] sentences recorded by volunteers.
Soft Attributes
The dataset consists of sets of movies, annotated with a single English soft attribute (subjective descriptive property, such as 'confusing' or 'romantic') and a reference movie. For each set, a crowd worker has placed the movies into three sets: more, equally, and less than the reference movie.
Specialized Rater Pools data 2022
This dataset comes from a study designed to understand whether annotators with different self-described identities interpret toxicity differently. It contains the unaggregated toxicity annotations of 25,500 comments from pools of raters who self-identify as African American, LGBTQ, or neither.
Sundanese [su-id] ASR
Speech dataset containing about 219,000 transcribed audio of Sundanese [su-id] sentences recorded by volunteers.
Sundanese [su-id] multi-speaker speech
Speech dataset containing about 4,200 transcribed high-quality audio of Sundanese [su-id] sentences recorded by volunteers.
Symptom Search Dataset
The Symptom Search Dataset shows aggregated, anonymized trends in Google searches for 420 health symptoms, signs, and conditions. It provides daily and weekly time series for each region showing the relative volume of searches for each symptom. It is available in the US, UK, AU, IE, NZ, and SG.
Tamil [ta-in] multi-speaker speech dataset
Speech dataset containing about 4,250 transcribed high-quality audio from Tamil [ta-in] sentences recorded by volunteers.
Taskmaster-1
13,215 English task-based, annotated dialogs in six domains: ordering pizza, creating auto repair appointments, setting up ride service, ordering movie tickets, ordering coffee drinks and making restaurant reservations.
Taskmaster-2
Over 17,000 spoken, annotated dialogs in seven domains collected using the "Wizard of Oz" (human-in-the-loop) platform.
Telugu [te-in] multi-speaker speech
Speech dataset containing about 4,450 transcribed high-quality audio from Telugu [te-in] sentences recorded by volunteers.
The Universal Dependency Treebank Project
A set of treebanks for multiple languages annotated in basic Stanford-style dependencies.
TimeDial
~1.5k annotated dialogs with multiple choices or a masked (temporal) span.
Translated Wikipedia Biographies
The Translated Wikipedia Biographies dataset has been designed to evaluate gender accuracy in long text translations (multiple sentences or passages). The set is designed to analyze gender errors in machine translation like incorrect gender choices in pronouns, possessives and gender agreement.
UI understanding data for UIBert
Datasets for two UI understanding tasks: app similar component retrieval (AppSim) and referring expression component retrieval (RefExp).
UK and Ireland English Dialects
Speech dataset containing about 17,500 transcribed high-quality audio recordings in different English locales from the UK and Ireland. Recordings were made by volunteers, who self-reported their dialect.
UniNum
UniNum is a database of number names for 186 languages, locales, and scripts made available by Google.
Upwelling irradiance from GOES-16
Machine learned models that estimate wideband irradiance from 2km narrow-band radiances (using co-aligned satellite imagery as training data) and so can be used to make satellite-driven estimates of contrail warming.
UserLibri
In UserLibri, the existing popular LibriSpeech dataset is reorganized into individual “user” datasets consisting of paired audio-transcript examples and domain-matching text-only data for each user. This dataset can be used for research in speech personalization or other language processing fields.
Venezuelan Spanish [es-ve] multi-speaker speech
Speech dataset containing about 3,350 transcribed high-quality audio of Venezuelan Spanish [es-ve] sentences recorded by volunteers.
VideoCC
VideoCC is a dataset containing (video-URL, caption) pairs for training video-text machine learning models. It is created using an automatic pipeline starting from the Conceptual Captions Image-Captioning Dataset.
Voice Assistant Failures Dataset
This is a dataset of 199 failures that 107 users have encountered when interacting with commercial voice assistants.
VRDU: Visually Rich Document Understanding
We identify the desiderata for a comprehensive benchmark and propose Visually Rich Document Understanding (VRDU). VRDU contains two datasets that represent several challenges: rich schema including diverse data types, complex templates, and diversity of layouts within a single document type.
What's Cookin'
A list of cooking-related YouTube video IDs, along with time stamps marking the (estimated) start and end of various events.
WikiAtomicEdits
A dataset of atomic Wikipedia edits containing insertions and deletions of a contiguous chunk of text in a sentence. This dataset contains ~43 million edits across 8 languages.
wiki-conciseness dataset
This is a manually curated evaluation set in English for concise rewrites of 2000 Wikipedia sentences.
WikiFact
Wikipedia and WikiData based dataset that can be used to train relationship classifiers and fact extraction models.
Wikilinks: 40 Million Entities in Context
An entity resolution set consisting of pointers to 10 million web pages with 40 million entities that have links to Wikipedia. Links inserted by the web page authors can be used to disambiguate mentions.
Wikipedia Generation Dataset
The task is to generate Wikipedia articles from the references at the end of the Wikipedia page and the top ten search results for the Wikipedia topic.
Wikipedia Translated Clusters
Introductions to English Wikipedia articles and their parallel versions in 10 other languages, with machine translations to English. Also includes synthetic corruptions to the English versions, to be identified with NLI models.
WikiReading: a large-scale NLU task over Wikipedia and Wikidata
This is a publicly available natural language understanding (NLU) dataset with 18 million instances. The task is to predict textual values from the structured knowledge base Wikidata by reading the text of the corresponding Wikipedia articles.
WikiSplit
One million English sentences, each split into two sentences that together preserve the original meaning, extracted from Wikipedia edits.
WIT Wikipedia-based Image Text Dataset
WIT is a large multimodal, multilingual dataset created using Wikipedia data. WIT contains ~37M+ image-text example sets across 108 languages. This makes WIT one of the biggest image-text datasets publicly available, in addition to being very entity-rich and providing contextual information.
Word Vector Models
Dataset of 3M words and phrases represented as 300-dimensional embedding vectors; dataset of 1.4M Freebase machine IDs represented as 1000-dimensional embedding vectors.
XSum Hallucination Annotations
The dataset consists of faithfulness and factuality annotations of XSum summaries from our paper "On Faithfulness and Factuality in Abstractive Summarization" at ACL 2020. We have crowdsourced 3 judgements for each of 500x5 document-system pairs.
YouTube-8M
Large-scale video dataset that consists of millions of YouTube video IDs labeled with over 3,800 visual entities. It includes precomputed audio-visual features to reduce training costs.
YouTube-8M Segments
Annotation and temporal localization dataset with over 200,000 human-verified segments with labels drawn from 1000 classes. It includes precomputed audio-visual features to reduce training costs.
YouTube-BoundingBoxes
A large-scale data set of video URLs with densely-sampled high-quality single-object bounding box annotations for detection and tracking.
YouTube Speakers
List of videos selected from the GoogleTechTalks channel, grouped by speaker.