Google Dataset Search by the Numbers
Abstract
Scientists, governments, and companies increasingly publish datasets on the
Web. Google's Dataset Search tool extracts dataset metadata---expressed in the
schema.org vocabulary---from webpages in order to make datasets discoverable.
Since the tool's inception, the number of datasets described in schema.org has
grown from about 500K to almost 30M, and this metadata has become a valuable snapshot of
what data on the Web looks like. This paper analyzes the corpus of dataset
metadata we collected. To the best of our knowledge, this corpus is the
largest and most diverse of its kind. We discuss such questions as where the
datasets originate, what topics they cover, what form they take, and what
people searching for datasets are interested in. We describe our methods for
collecting and analyzing this data as well as our observations. We conclude
by identifying gaps and possible future work to help make data more
accessible.