Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets

Shamsuddeen Hassan Muhammad, Alessia Battisti, Annette Rios, Kelechi Ogueji, Sokhar Samb, Nishant Subramani, Yacine Jernite, Claytone Sikasote, Jamshidbek Mirzakhalov, Orevaoghene Ahia, Ahsan Wahab, Mofetoluwa Adeyemi, Bonaventure F. P. Dossou, Benoît Sagot, Sweta Agrawal, Mathias Müller, Ahmed Baruwa, Toan Nguyen, Isabel Papadimitriou, Allahsera Auguste Tapo, Mathias Jenny, Nisansa de Silva, Duygu Ataman, Sakine Çabuk Ballı, Rubungo Andre Niyongabo, Salomey Osei, Israel Abebe Azime, Ayodele Awokoya, Iroro Fred Ọ̀nọ̀mẹ̀ Orife, Nasanbayar Ulzii-Orshikh, Stella Biderman, Pedro Javier Ortiz Suárez, Colin Leong, André Müller, Pallavi Baljekar, Supheakmungkol Sarin, Clara E. Rivera, Julia Kreutzer, Nze Lawson, Tapiwanashe Matangira, Oghenefego Ahia, Sakhile Dlamini, Monang Setyawan, Ayanda Mnyakeni, Nanda Muhammad, Lisa Wang, Artem Sokolov
TACL (2022)

Abstract

With the success of large-scale pre-training and multilingual modeling in Natural Language Processing (NLP), recent years have seen a proliferation of large, web-mined text datasets covering hundreds of languages. However, to date there has been no systematic analysis of the quality of these publicly available datasets, or whether the datasets actually contain content in the languages they claim to represent. In this work, we manually audit the quality of 205 language-specific corpora released with five major public datasets (CCAligned, ParaCrawl, WikiMatrix, OSCAR, mC4), and audit the correctness of language codes in a sixth (JW300). We find that lower-resource corpora have systematic issues: at least 15 corpora are completely erroneous, and a significant fraction contains less than 50% sentences of acceptable quality. Similarly, we find 82 corpora that are mislabeled or use nonstandard/ambiguous language codes. We demonstrate that these issues are easy to detect even for non-speakers of the languages in question, and supplement the human judgements with automatic analyses. Inspired by our analysis, we recommend techniques to evaluate and improve multilingual corpora and discuss the risks that come with low-quality data releases.
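To make the kind of automatic analysis mentioned above concrete, the sketch below shows one simple way a non-speaker could spot-check whether a corpus labeled with a given language code actually contains text in that language, by running an off-the-shelf language identifier over a random sample of sentences. This is an illustration only, not the audit procedure used in the paper; it assumes the fastText lid.176.bin language-ID model is available locally, and the function and file names are hypothetical.

```python
# Minimal sketch of an automatic language-ID spot check over a corpus sample.
# Assumes fastText's lid.176.bin model has been downloaded to the working
# directory; this is an illustration, not the paper's audit pipeline.
import random
import fasttext


def lid_spot_check(sentences, expected_code, sample_size=100, seed=0):
    """Estimate the fraction of sampled sentences whose predicted language
    matches the corpus's declared language code."""
    random.seed(seed)
    sample = random.sample(sentences, min(sample_size, len(sentences)))
    model = fasttext.load_model("lid.176.bin")  # assumed local model path
    hits = 0
    for sent in sample:
        # fastText rejects newlines, so flatten each sentence first.
        labels, _ = model.predict(sent.replace("\n", " "), k=1)
        predicted = labels[0].replace("__label__", "")
        if predicted == expected_code:
            hits += 1
    return hits / len(sample)


# Example: a corpus released under the code "yo" (Yoruba) whose sentences are
# mostly English would score near 0 here, flagging it for manual review.
```

A low match rate does not by itself prove a corpus is unusable (language identifiers are themselves weak for many low-resource languages), but it is a cheap first signal that a manual audit of the kind described in the abstract is warranted.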