LinguaMeta: Unified Metadata for Thousands of Languages

Sandy Ritchie; Daan van Esch; Uche Okonkwo; Shikhar Vashishth; Emily Drummond

Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Abstract

We introduce LinguaMeta, a unified resource for language metadata for thousands of languages, including language codes, names, number of speakers, writing systems, countries, official status, coordinates, and language varieties. The data are drawn from various existing repositories and supplemented with our own research. Each data point is tagged with its origin, which lets us trace it back to its source and improve existing resources with more up-to-date and complete metadata. The resource is intended for use by researchers and organizations who aim to extend technology to thousands of languages.
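As an illustration of per-data-point origin tagging, the sketch below pairs each value with a source label and looks up which fields a given source contributed. It is not the LinguaMeta schema: the class names, field names, source labels, and the placeholder language "Examplish" are all hypothetical and chosen only for this example.

```python
from dataclasses import dataclass, field
from typing import Generic, List, Optional, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class Sourced(Generic[T]):
    """A single data point plus a label for where it came from (hypothetical)."""
    value: T
    source: str


@dataclass
class LanguageRecord:
    """Hypothetical record covering a few of the metadata kinds in the abstract."""
    code: Sourced[str]
    names: List[Sourced[str]] = field(default_factory=list)
    speaker_count: Optional[Sourced[int]] = None
    writing_systems: List[Sourced[str]] = field(default_factory=list)


def data_points_from(record: LanguageRecord, source: str) -> List[str]:
    """Return the field names holding at least one data point from `source`."""
    hits = []
    for name, item in vars(record).items():
        # Normalize single values and lists to one iterable of data points.
        points = item if isinstance(item, list) else [item]
        if any(p is not None and p.source == source for p in points):
            hits.append(name)
    return hits


# Placeholder data only; none of these values describe a real language.
record = LanguageRecord(
    code=Sourced("xx", "repository_a"),
    names=[Sourced("Examplish", "repository_a")],
    speaker_count=Sourced(1000, "own_research"),
    writing_systems=[Sourced("Latn", "repository_b")],
)

print(data_points_from(record, "repository_a"))
```

Keeping the source label next to each value, rather than once per record, is what would make it possible to report a correction back to the specific repository a data point came from.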

Research Areas

Natural language processing
