Google Dataset Search by the Numbers

Omar Benjelloun

Shiyu Chen

Natasha Noy

International Semantic Web Conference (ISWC-2020), In-Use Track (to appear)


Abstract

Scientists, governments, and companies increasingly publish datasets on the Web. Google's Dataset Search tool extracts dataset metadata, expressed in the schema.org vocabulary, from webpages in order to make datasets discoverable. Since the tool's inception, the number of datasets described in schema.org has grown from about 500K to almost 30M, and the resulting corpus has become a valuable snapshot of what data on the Web looks like. This paper analyzes the corpus of dataset metadata we collected. To the best of our knowledge, this corpus is the largest and most diverse of its kind. We discuss such questions as where the datasets originate, what topics they cover, which form they take, and what people searching for datasets are interested in. We describe our methods for collecting and analyzing this data as well as our observations. We conclude by identifying the gaps and possible future work to help make data more accessible.
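To make the idea of extracting schema.org dataset metadata from a webpage concrete, here is a minimal sketch in Python. It is not the Dataset Search pipeline, whose implementation the abstract does not describe. It assumes the page embeds its metadata as JSON-LD in a `script` element (schema.org markup can also be written as Microdata or RDFa, which this sketch ignores), and it does not handle nested `@graph` structures. The names `JsonLdCollector`, `extract_datasets`, and `SAMPLE_PAGE`, along with the sample dataset itself, are hypothetical and exist only for illustration.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collects the text of every <script type="application/ld+json"> element."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append("".join(self._buffer))
            self._in_jsonld = False


def extract_datasets(html):
    """Return the top-level JSON-LD objects in `html` typed as schema.org Dataset."""
    collector = JsonLdCollector()
    collector.feed(html)
    datasets = []
    for block in collector.blocks:
        try:
            parsed = json.loads(block)
        except json.JSONDecodeError:
            continue  # skip malformed markup instead of failing the whole page
        items = parsed if isinstance(parsed, list) else [parsed]
        for item in items:
            if not isinstance(item, dict):
                continue
            types = item.get("@type", [])
            if isinstance(types, str):
                types = [types]
            if "Dataset" in types:
                datasets.append(item)
    return datasets


# Hypothetical page used only to exercise the sketch.
SAMPLE_PAGE = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Dataset",
 "name": "Example rainfall measurements",
 "description": "Hypothetical dataset used only for illustration."}
</script>
</head><body></body></html>
"""

for dataset in extract_datasets(SAMPLE_PAGE):
    print(dataset["name"])
```

A crawler working at Web scale would also need to reconcile duplicates and handle incomplete or inconsistent markup; the sketch deliberately leaves those concerns out.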

Research Areas

Data Management
