Google Research

Wiki-40B: Multilingual Language Model Dataset

LREC 2020

Abstract

We release high-quality processed text of Wikipedia for 40+ languages. We train monolingual causal language models, establishing the first reported baselines for many languages. We also introduce the task of crosslingual causal modeling: we train our baseline model (Transformer-XL) and report results across varying setups. We release our data and trained models for the community to use as baselines for further research in causal language modeling and crosslingual learning.
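
A minimal sketch of accessing the released data, assuming the dataset is published through TensorFlow Datasets under the name "wiki40b" with per-language configs (e.g. "wiki40b/en", "wiki40b/ja"); adjust the name and config if your TFDS version differs.

```python
# Sketch: load the Wiki-40B English training split via TensorFlow Datasets.
# Assumes the "wiki40b" dataset name and per-language configs are available.
import tensorflow_datasets as tfds

ds = tfds.load("wiki40b/en", split="train")
for example in ds.take(1):
    # Each record is assumed to carry the cleaned article text as a byte
    # string, with structural markers delimiting articles and paragraphs.
    print(example["text"].numpy().decode("utf-8")[:500])
```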
