Lorax: A Multitask, Multilingual Benchmark Suite for 20 Indonesian Languages

Trevor Cohn
Alham Fikri Aji
2025

Abstract

Indonesia is one of the world's most populous countries and is home to 700 languages, yet it lags behind in NLP progress. We introduce Lorax, a benchmark that focuses on the low-resource languages of Indonesia and covers 6 diverse tasks: reading comprehension, open-domain QA, natural language inference, causal reasoning, translation, and cultural QA. We cover 20 languages, and for 3 of them we add 2 politeness registers. Because benchmarks are essential to driving progress, we expect this dataset to be a useful contribution to the community. We evaluate a diverse set of multilingual and region-focused LLMs and find that the benchmark is challenging. We observe a clear gap between performance in Indonesian and performance in the other languages, especially the low-resource ones. Region-specific models show no clear advantage over general multilingual models. Finally, we show that changing the register affects model performance, particularly for registers that are uncommon on social media, such as the high politeness register 'Krama' Javanese.