Building Machine Translation Systems for the Next Thousand Languages

Ankur Bapna; Isaac Caswell; Julia Kreutzer; Orhan Firat; Daan van Esch; Aditya Siddhant; Mengmeng Niu; Pallavi Nikhil Baljekar; Xavier Garcia; Wolfgang Macherey; Theresa Breiner; Vera Saldinger Axelrod; Jason Riesa; Yuan Cao; Mia Chen; Klaus Macherey; Maxim Krikun; Pidong Wang; Alexander Gutkin; Apu Shah; Yanping Huang; Zhifeng Chen; Yonghui Wu; Macduff Richard Hughes

Google Research (2022)

Abstract

In this paper we share findings from our effort to build practical machine translation (MT) systems capable of translating across more than one thousand languages. We describe results in three research domains: (i) building clean, web-mined datasets by leveraging semi-supervised pre-training for language identification and developing data-driven filtering techniques; (ii) leveraging massively multilingual MT models trained with supervised parallel data for over 100 languages and small monolingual datasets for over 1000 languages to enable translation for several previously under-studied languages; and (iii) studying the limitations of evaluation metrics for long-tail languages and conducting qualitative analysis of the outputs from our MT models. We hope that our work provides useful insights to practitioners working towards building MT systems for long-tail languages, and highlights research directions that can complement the weaknesses of massively multilingual pre-trained models in data-sparse settings.

Research Areas

Machine intelligence
