Tamar Lucassen
Authored Publications
Writing System and Speaker Metadata for 2,800+ Language Varieties
Sebastian Ruder
Clara E. Rivera
Proceedings of the Language Resources and Evaluation Conference, European Language Resources Association, Marseille, France (2022), pp. 5035-5046
We describe an open-source dataset providing metadata for about 2,800 language varieties used in the world today. Specifically, the dataset provides the attested writing system(s) for each of these 2,800+ varieties, as well as an estimated speaker count for each variety. This dataset was developed through internal research and has been used for analyses around language technologies. This is the largest publicly available, machine-readable resource with writing system and speaker information for the world's languages. We hope the availability of this data will catalyze research in under-represented languages.
Writing Across the World's Languages: Deep Internationalization for Gboard, the Google Keyboard
Elnaz Sarbar
Jeremy O'Brien
Evan Elizabeth Crew
Chieu Nguyen
arXiv cs.HC (2019)
This technical report describes our deep internationalization program for Gboard, the Google Keyboard. Today, Gboard supports 900+ language varieties across 70+ writing systems, and this report describes how and why we added support for these language varieties from around the globe. Many languages of the world are increasingly used in writing on an everyday basis, and we describe the trends we see. We cover technological and logistical challenges in scaling up a language technology product like Gboard to hundreds of language varieties, and describe how we built systems and processes to operate at scale. Finally, we summarize the key takeaways from user studies we ran with speakers of hundreds of languages from around the world.