Google Research

One Country, 700+ Languages: NLP Challenges for Underrepresented Languages and Dialects in Indonesia

  • Alham Fikri Aji
  • Genta Indra Winata
  • Fajri Koto
  • Samuel Cahyawijaya
  • Ade Romadhony
  • Rahmad Mahendra
  • Kemal Kurniawan
  • David Moeljadi
  • Radityo Eko Prasojo
  • Timothy Baldwin
  • Jey Han Lau
  • Sebastian Ruder
ACL 2022 (to appear)

Abstract

NLP research is impeded by a lack of resources and awareness of the challenges presented by languages and dialects beyond English. Focusing on the languages spoken in Indonesia, the second most linguistically diverse and the fourth most populous nation in the world, we provide an overview of the current state of NLP research for Indonesia's 700+ languages. We highlight challenges in Indonesian NLP and how these affect the performance of current NLP systems. Finally, we provide general recommendations to help develop NLP technology not only for the languages of Indonesia, but also for other underrepresented languages.
