
Erin MacMurray van Liemt
I am a Senior Sociotechnical Researcher at Google Research based in the Los Angeles area and have served as both a researcher and linguist at Google for ~10 years. My area of focus is building evidence-based, community-informed datasets, namely culturally and linguistically diverse ontologies for pluralistic AI. In 2020, I was a Google Fellow as part of the Trevor Project Fellowship for the award-winning Crisis Contact Simulator, featured in TIME's 100 Best Inventions of 2021. I later applied this work in a second Google.org Fellowship in partnership with Reflex AI in 2023, where I crafted datasets to support peer-to-peer empathy conversation simulators for the Veteran community. Prior to Google Research, I worked on ontology design for Google’s Knowledge Graph and text/image classification on Ads Privacy and Safety teams.
Authored Publications
Socially Responsible Data for Large Multilingual Language Models
Zara Wudiri
Mbangula Lameck Amugongo
Alex
Stanley Uwakwe
João Sedoc
Edem Wornyo
Seyi Olojo
Amber Ebinama
Suzanne Dikker
2024
Abstract
Large Language Models (LLMs) have rapidly increased in size and apparent capabilities in the last three years, but their training data is largely English text. There is growing interest in language inclusivity in LLMs, and various efforts are striving for models to accommodate language communities outside of the Global North, which include many languages that have been historically underrepresented digitally. These languages have been termed “low-resource languages” or “long-tail languages”, and LLM performance on them is generally poor. While expanding the use of LLMs to more languages may bring many potential benefits, such as assisting cross-community communication and language preservation, great care must be taken to ensure that data collection on these languages is not extractive and that it does not reproduce exploitative practices of the past. Collecting data from languages spoken by previously colonized people, Indigenous people, and non-Western communities raises many complex sociopolitical and ethical questions, e.g., around consent, cultural safety, and data sovereignty. Furthermore, linguistic complexity and cultural nuances are often lost in LLMs. This position paper builds on recent scholarship, and our own work, to outline several relevant social, cultural, and ethical considerations and potential ways to mitigate them through qualitative research, community partnerships, and participatory design approaches. We provide twelve recommendations for consideration when collecting language data on underrepresented language communities outside of the Global North.