The Wikipedia-based Image Text (WIT) Dataset is a large multimodal, multilingual dataset. WIT is composed of a curated set of 37.6 million entity-rich image-text examples with 11.5 million unique images across 108 Wikipedia languages. Its size enables WIT to be used as a pre-training dataset for multimodal machine learning models as well as for evaluations.
A few unique advantages of WIT:
- One of the largest multimodal datasets by the number of image-text examples.
- A massively multilingual dataset (first of its kind) with coverage for 108 languages.
- First image-text dataset with page-level metadata and contextual information.
- A diverse collection of concepts and real-world entities.
- Provides challenging real-world test sets.
WIT was awarded the Wikimedia Research Award of the Year.