
Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets

  • Isaac Caswell
  • Julia Kreutzer
  • Lisa Wang
  • Ahsan Wahab
  • Daan van Esch
  • Nasanbayar Ulzii-Orshikh
  • Allahsera Auguste Tapo
  • Nishant Subramani
  • Artem Sokolov
  • Claytone Sikasote
  • Monang Setyawan
  • Supheakmungkol Sarin
  • Sokhar Samb
  • Benoît Sagot
  • Clara E. Rivera
  • Annette Rios
  • Isabel Papadimitriou
  • Salomey Osei
  • Pedro Javier Ortiz Suárez
  • Iroro Fred Ọ̀nọ̀mẹ̀ Orife
  • Kelechi Ogueji
  • Rubungo Andre Niyongabo
  • Toan Nguyen
  • Mathias Müller
  • André Müller
  • Shamsuddeen Hassan Muhammad
  • Nanda Muhammad
  • Ayanda Mnyakeni
  • Jamshidbek Mirzakhalov
  • Tapiwanashe Matangira
  • Colin Leong
  • Nze Lawson
  • Sneha Kudugunta
  • Yacine Jernite
  • Mathias Jenny
  • Orhan Firat
  • Bonaventure F. P. Dossou
  • Sakhile Dlamini
  • Nisansa de Silva
  • Sakine Çabuk Ballı
  • Stella Biderman
  • Alessia Battisti
  • Ahmed Baruwa
  • Ankur Bapna
  • Pallavi Baljekar
  • Israel Abebe Azime
  • Ayodele Awokoya
  • Duygu Ataman
  • Orevaoghene Ahia
  • Oghenefego Ahia
  • Sweta Agrawal
  • Mofetoluwa Adeyemi
AfricaNLP, Online (2021), to appear

Abstract

With the success of large-scale pre-training and multilingual modeling in Natural Language Processing (NLP), recent years have seen a proliferation of large, web-mined text datasets covering hundreds of languages. However, to date there has been no systematic analysis of the quality of these publicly available datasets, or of whether the datasets actually contain content in the languages they claim to represent. In this work, we manually audit the quality of 205 language-specific corpora released with five major public datasets (CCAligned, ParaCrawl, WikiMatrix, OSCAR, mC4), and audit the correctness of language codes in a sixth (JW300). We find that lower-resource corpora have systematic issues: at least 15 corpora are completely erroneous, and in a significant fraction of them fewer than 50% of the sentences are of acceptable quality. Similarly, we find 82 corpora that are mislabeled or use nonstandard/ambiguous language codes. We demonstrate that these issues are easy to detect even for non-speakers of the languages in question, and we supplement the human judgements with automatic analyses. Informed by our analysis, we recommend techniques for evaluating and improving multilingual corpora, and we discuss the risks that come with low-quality data releases.
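One kind of automatic analysis the abstract alludes to is checking whether a corpus actually contains the language its code claims. The sketch below is illustrative only, not the authors' audit code: it assumes a one-sentence-per-line corpus file, an ISO 639-1 `claimed_lang` code, and the off-the-shelf `langdetect` package, which covers only around 55 languages, so a real audit of low-resource corpora would need a LangID model with far broader coverage.

```python
# A minimal sketch of a language-ID spot check: sample sentences from a
# corpus and report the share whose detected language matches the code the
# corpus claims to contain. Assumes `pip install langdetect`.
import random

from langdetect import DetectorFactory, detect
from langdetect.lang_detect_exception import LangDetectException

DetectorFactory.seed = 0  # make langdetect's detection deterministic


def estimate_in_language_fraction(corpus_path: str, claimed_lang: str,
                                  sample_size: int = 100) -> float:
    """Estimate the fraction of a random sample that is in `claimed_lang`."""
    with open(corpus_path, encoding="utf-8") as f:
        sentences = [line.strip() for line in f if line.strip()]
    sample = random.sample(sentences, min(sample_size, len(sentences)))
    matches = 0
    for sentence in sample:
        try:
            if detect(sentence) == claimed_lang:
                matches += 1
        except LangDetectException:  # e.g. numeric or non-linguistic lines
            pass
    return matches / len(sample) if sample else 0.0


# Hypothetical usage: flag a Swahili corpus as suspect if less than half of
# the sample is detected as "sw", echoing the <50%-acceptable threshold
# discussed in the abstract. The file name is an assumption for illustration.
# print(estimate_in_language_fraction("corpus.sw.txt", "sw"))
```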
