Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets

Julia Kreutzer; Isaac Caswell; Lisa Wang; Ahsan Wahab; Daan van Esch; Nasanbayar Ulzii-Orshikh; Allahsera Auguste Tapo; Nishant Subramani; Artem Sokolov; Claytone Sikasote; Monang Setyawan; Supheakmungkol Sarin; Sokhar Samb; Benoît Sagot; Clara E. Rivera; Annette Rios; Isabel Papadimitriou; Salomey Osei; Pedro Javier Ortiz Suárez; Iroro Fred Ọ̀nọ̀mẹ̀ Orife; Kelechi Ogueji; Rubungo Andre Niyongabo; Toan Nguyen; Mathias Müller; André Müller; Shamsuddeen Hassan Muhammad; Nanda Muhammad; Ayanda Mnyakeni; Jamshidbek Mirzakhalov; Tapiwanashe Matangira; Colin Leong; Nze Lawson; Sneha Kudugunta; Yacine Jernite; Mathias Jenny; Orhan Firat; Bonaventure F. P. Dossou; Sakhile Dlamini; Nisansa de Silva; Sakine Çabuk Ballı; Stella Biderman; Alessia Battisti; Ahmed Baruwa; Ankur Bapna; Pallavi Baljekar; Israel Abebe Azime; Ayodele Awokoya; Duygu Ataman; Orevaoghene Ahia; Oghenefego Ahia; Sweta Agrawal; Mofetoluwa Adeyemi

TACL (2022)

Abstract

With the success of large-scale pre-training and multilingual modeling in Natural Language Processing (NLP), recent years have seen a proliferation of large, web-mined text datasets covering hundreds of languages. However, to date there has been no systematic analysis of the quality of these publicly available datasets, or of whether the datasets actually contain content in the languages they claim to represent. In this work, we manually audit the quality of 205 language-specific corpora released with five major public datasets (CCAligned, ParaCrawl, WikiMatrix, OSCAR, mC4), and audit the correctness of language codes in a sixth (JW300). We find that lower-resource corpora have systematic issues: at least 15 corpora are completely erroneous, and in a significant fraction fewer than 50% of sentences are of acceptable quality. Similarly, we find 82 corpora that are mislabeled or use nonstandard/ambiguous language codes. We demonstrate that these issues are easy to detect even for non-speakers of the languages in question, and supplement the human judgements with automatic analyses. Inspired by our analysis, we recommend techniques to evaluate and improve multilingual corpora and discuss the risks that come with low-quality data releases.
