MURAL: Multimodal, Multitask Retrieval Across Languages

Aashi Jain; Mandy Guo; Krishna Srinivasan; Ting Chen; Sneha Reddy Kudugunta; Chao Jia; Yinfei Yang; Jason Baldridge

EMNLP (2021)

Abstract

Both image-caption pairs and translation pairs provide the means to learn deep representations of and connections between languages. We use both types of pairs in MURAL (MUltimodal, MUltitask Representations Across Languages), a dual encoder that solves two tasks: 1) image-text matching and 2) translation pair matching. By incorporating billions of translation pairs, MURAL extends ALIGN (Jia et al., 2021), a state-of-the-art dual encoder learned from 1.8 billion noisy image-text pairs. When using the same encoders, MURAL's performance matches or exceeds ALIGN's cross-modal retrieval performance on well-resourced languages across several datasets; more importantly, it considerably improves performance on under-resourced languages, showing that text-text learning can overcome a paucity of image-caption examples for these languages. On the Wikipedia Image-Text dataset, for example, MURAL improves zero-shot mean recall by 14.4% on average for eight under-resourced languages and by 6.6% on average when fine-tuning. Interestingly, we also find that text representations learned from MURAL cluster not only by language genealogy but also by areal linguistics, as with the Balkan Sprachbund.
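
To make the two-task setup concrete, the following is a minimal sketch of how a dual encoder might combine an image-text matching loss with a translation pair matching loss. It is not the authors' implementation: the bidirectional in-batch softmax loss, the temperature, the task weights `w_i2t` and `w_t2t`, and all function names are assumptions chosen for illustration, and the encoders themselves are left out (embeddings are passed in as plain lists).

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    # Scale a vector to unit length so that dot products are cosine similarities.
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]

def softmax_xent(logits, target):
    # Cross-entropy of one row of logits against the index of the true match.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]

def contrastive_loss(left, right, temperature=0.1):
    """Bidirectional in-batch softmax loss; left[i] is paired with right[i].

    The temperature value is an assumption for this sketch.
    """
    left = [normalize(v) for v in left]
    right = [normalize(v) for v in right]
    n = len(left)
    sim = [[dot(l, r) / temperature for r in right] for l in left]
    # left -> right: each row should score its own index highest.
    l2r = sum(softmax_xent(sim[i], i) for i in range(n)) / n
    # right -> left: the same check over the columns.
    r2l = sum(softmax_xent([sim[j][i] for j in range(n)], i) for i in range(n)) / n
    return l2r + r2l

def multitask_loss(image_emb, caption_emb, src_emb, tgt_emb, w_i2t=1.0, w_t2t=1.0):
    # Weighted sum of the two matching tasks; the weights are hypothetical.
    image_text = contrastive_loss(image_emb, caption_emb)
    text_text = contrastive_loss(src_emb, tgt_emb)
    return w_i2t * image_text + w_t2t * text_text
```

With toy two-dimensional embeddings, correctly aligned pairs yield a much lower loss than mismatched ones:

```python
aligned = [[1.0, 0.0], [0.0, 1.0]]
shuffled = [[0.0, 1.0], [1.0, 0.0]]
good = multitask_loss(aligned, aligned, aligned, aligned)
bad = multitask_loss(aligned, shuffled, aligned, aligned)
print(good < bad)
```

In this sketch the caption encoder and the translation-pair encoder would be the same text model, which is how text-text pairs could benefit languages with few image-caption examples.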
