WIT: Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine Learning

Krishna Srinivasan

Karthik Raman

Jiecao Chen

Mike Bendersky

Marc Najork

Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '21) (2021)

Abstract

The milestone improvements brought about by deep representation learning and pre-training techniques have led to large performance gains across downstream NLP, IR and Vision tasks. Multimodal modeling techniques aim to leverage high-quality visio-linguistic datasets for learning complementary information across image and text modalities. In this paper, we introduce the Wikipedia-based Image Text (WIT) Dataset to better facilitate multimodal, multilingual learning. WIT is composed of more than 11 million unique images with over 37 million entity-rich text descriptions associated with these images in Wikipedia, drawn from over 100 languages. Its size enables WIT to be used as a pretraining dataset for multimodal models, as we show when applied to downstream tasks such as image-text retrieval. WIT has four main and unique advantages. First, WIT is the largest multimodal dataset (at the time of writing). Second, it is massively multilingual (the first of its kind), covering more than 100 languages (each of which has at least 10K examples), and it provides cross-lingual texts for many images. Third, it represents a more diverse set of concepts and real-world entities than previous datasets cover. Lastly, as we demonstrate empirically, WIT provides a very challenging real-world test set that highlights the need for learning improvements in tasks such as Retrieval and Captioning.
