Wiki-40B: Multilingual Language Model Dataset

Zihang Dai
Denny Vrandecic
Rami Al-Rfou
LREC 2020

Abstract

We release high-quality processed text of Wikipedia for 40+ languages. We train monolingual causal language models, establishing the first reported baselines for many languages. We also introduce the task of crosslingual causal modeling: we train our baseline model (Transformer-XL) and report results under varying setups. We release our data and trained models for the community to use as baselines for further research in causal language modeling and crosslingual learning.
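
As a usage sketch for the released data, the snippet below shows how the corpus might be loaded through TensorFlow Datasets, assuming distribution under the catalog name "wiki40b" with per-language configs such as "wiki40b/en"; the exact dataset name, configs, and feature keys are assumptions, not guaranteed by this abstract.

```python
import tensorflow_datasets as tfds

# Load the English portion of Wiki-40B (name and config assumed from the
# TensorFlow Datasets catalog; other languages would use e.g. "wiki40b/de").
ds = tfds.load("wiki40b/en", split="train")

for example in ds.take(1):
    # Each record is assumed to carry the cleaned article text plus its
    # Wikidata ID, linking the page across language editions.
    print(example["wikidata_id"].numpy().decode("utf-8"))
    print(example["text"].numpy().decode("utf-8")[:200])
```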