Google Dataset Search by the Numbers

Omar Benjelloun; Shiyu Chen; Natasha Noy

International Semantic Web Conference (ISWC-2020), In-Use Track (to appear)

Abstract

Scientists, governments, and companies increasingly publish datasets on the
Web. Google's Dataset Search tool extracts dataset metadata, expressed in the
schema.org vocabulary, from webpages in order to make datasets discoverable.
Since the tool's inception, the number of datasets described in schema.org has
grown from about 500K to almost 30M, and the resulting corpus has become a
valuable snapshot of what data on the Web looks like. This paper analyzes the corpus of dataset
metadata we collected. To the best of our knowledge, this corpus is the
largest and most diverse of its kind. We discuss such questions as where the
datasets originate, what topics they cover, what form they take, and what
people searching for datasets are interested in. We describe our methods for
collecting and analyzing this data as well as our observations. We conclude
by identifying gaps and possible future work to help make data more
accessible.
