MuLan: A Joint Embedding of Music Audio and Natural Language

Qingqing Huang; Aren Jansen; Joonseok Lee; Ravi Ganti; Judith Yue Li; Daniel P. W. Ellis

Proceedings of the 23rd International Society for Music Information Retrieval Conference (ISMIR) (2022) (to appear)

Abstract

Music tagging and content-based retrieval systems have traditionally been constructed using pre-defined ontologies covering a rigid set of music attributes or text queries. This paper presents MuLan: a first attempt at a new generation of acoustic models that link music audio directly to unconstrained natural language music descriptions. MuLan takes the form of a two-tower, joint audio-text embedding model trained using 44 million music recordings (370K hours) and weakly-associated, free-form text annotations. Through its compatibility with a wide range of music genres and text styles (including conventional music tags), the resulting audio-text representation subsumes existing ontologies while graduating to true zero-shot functionalities. We demonstrate the versatility of the MuLan embeddings with a range of experiments including transfer learning, zero-shot music tagging, language understanding in the music domain, and cross-modal retrieval applications.
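To illustrate how a joint audio-text embedding can support zero-shot music tagging, the following is a minimal sketch and not the authors' implementation. It assumes that the audio tower and the text tower each map their input to a vector in a shared space, and that tags are ranked by cosine similarity to the audio embedding; the abstract does not specify the scoring function, so that choice is an assumption. The function names `cosine` and `zero_shot_tags` and the three-dimensional toy vectors are hypothetical and stand in for real model outputs.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_tags(audio_embedding, tag_embeddings):
    """Rank candidate tags by similarity to a single audio embedding.

    tag_embeddings maps free-form tag text to its text-tower embedding.
    No tag-specific classifier is trained; any text can be a candidate.
    """
    scored = [
        (tag, cosine(audio_embedding, vec))
        for tag, vec in tag_embeddings.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy embeddings. In practice these would be produced by
# the trained audio tower and text tower, not written by hand.
audio_embedding = [0.9, 0.1, 0.2]
tag_embeddings = {
    "jazz": [1.0, 0.0, 0.1],
    "heavy metal": [0.0, 1.0, 0.0],
    "ambient": [0.1, 0.2, 1.0],
}

ranking = zero_shot_tags(audio_embedding, tag_embeddings)
for tag, score in ranking:
    print(f"{tag}: {score:.3f}")
```

Because the candidate tags are ordinary text, the same ranking step would apply to conventional tag vocabularies and to free-form descriptions alike, which is the sense in which a shared embedding space can subsume a fixed ontology.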

Research Areas

Machine perception
