
Google Dataset Search by the Numbers

Omar Benjelloun
Shiyu Chen
International Semantic Web Conference (ISWC-2020), In-Use Track (to appear)

Abstract

Scientists, governments, and companies increasingly publish datasets on the Web. Google's Dataset Search tool extracts dataset metadata, expressed in the schema.org vocabulary, from webpages in order to make datasets discoverable. Since the tool's inception, the number of datasets described in schema.org has grown from about 500K to almost 30M, and the resulting corpus has become a valuable snapshot of what data on the Web looks like. This paper analyzes the corpus of dataset metadata we collected. To the best of our knowledge, this corpus is the largest and most diverse of its kind. We discuss questions such as where the datasets originate, what topics they cover, what form they take, and what people searching for datasets are interested in. We describe our methods for collecting and analyzing this data, as well as our observations. We conclude by identifying gaps and possible future work to help make data more accessible.
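To make the idea of extracting schema.org dataset metadata from a webpage concrete, the following is a minimal sketch and not the Dataset Search pipeline itself. It assumes the metadata is embedded as JSON-LD in a `<script type="application/ld+json">` element, which is one of the encodings schema.org supports; the abstract does not say which encodings the tool handles. The names `JsonLdCollector`, `extract_datasets`, and `SAMPLE_PAGE` are hypothetical, and the sample dataset is invented for illustration.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collects the raw text of every <script type="application/ld+json"> element."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        # Script content may arrive in several chunks, so accumulate it.
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append("".join(self._buffer))
            self._in_jsonld = False


def extract_datasets(html):
    """Return the top-level JSON-LD objects whose @type includes "Dataset".

    Assumption: only JSON-LD is handled; Microdata and RDFa are ignored,
    as are datasets nested inside other objects (for example under @graph).
    """
    collector = JsonLdCollector()
    collector.feed(html)
    datasets = []
    for raw in collector.blocks:
        try:
            doc = json.loads(raw)
        except json.JSONDecodeError:
            continue  # Skip malformed markup rather than failing the page.
        items = doc if isinstance(doc, list) else [doc]
        for item in items:
            if not isinstance(item, dict):
                continue
            types = item.get("@type")
            types = types if isinstance(types, list) else [types]
            if "Dataset" in types:
                datasets.append(item)
    return datasets


# Hypothetical page used only to exercise the sketch.
SAMPLE_PAGE = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Dataset",
 "name": "Example rainfall measurements",
 "description": "Hypothetical dataset used only for illustration."}
</script>
</head><body></body></html>
"""

found = extract_datasets(SAMPLE_PAGE)
print([d["name"] for d in found])
```

A production crawler would also need to handle the other schema.org encodings, nested objects, and noisy or incomplete markup, none of which this sketch attempts.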
