Wiki-40B: Multilingual Language Model Dataset

Mandy Guo; Zihang Dai; Denny Vrandecic; Rami Al-Rfou

LREC 2020

Abstract

We release high-quality processed Wikipedia text for 40+ languages. We train monolingual causal language models, establishing the first reported baselines for many languages. We also introduce the task of crosslingual causal modeling, train our baseline model (Transformer-XL) on it, and report results under varying setups. We release our data and trained models for the community to use as baselines for further research in causal language modeling and crosslingual learning.

Research Areas

Natural language processing
