Noise2Music: Text-conditioned Music Generation with Diffusion Models

Qingqing Huang

Daniel S. Park

Tao Wang

Timo Denk

Andy Ly

Nanxin Chen

Zhengdong Zhang

Zhishuai Zhang

Jiahui Yu

Christian Frank

Jesse Engel

Quoc V. Le

William Chan

Zhifeng Chen

Wei Han

(2023)


Abstract

We introduce Noise2Music, in which a series of diffusion models is trained to generate high-quality 30-second music clips from text prompts. Two types of diffusion models are trained and applied in succession to generate high-fidelity music: a generator model, which generates an intermediate representation conditioned on text, and a cascader model, which generates high-fidelity audio conditioned on the intermediate representation and possibly the text. We explore two options for the intermediate representation: a spectrogram, and audio of lower fidelity. We find that the generated audio faithfully reflects key elements of the text prompt, such as genre, mood, tempo and instruments. Language models play a key role in this work: they are used to generate paired text for the audio of the training set and to extract embeddings of the text prompts ingested by the diffusion models.
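The two-stage structure described in the abstract (text embedding, then a generator, then a cascader) can be illustrated with a toy sketch. This is not the authors' implementation: every function name, dimension, and step count below is a hypothetical placeholder, the text encoder is replaced by a seeded random vector, and the learned denoising networks are replaced by a simple contraction toward the conditioning signal. It only shows how the stages are chained and what each one is conditioned on.

```python
import random
from typing import List, Optional

Vector = List[float]


def embed_text(prompt: str, dim: int = 8) -> Vector:
    """Hypothetical stand-in for a language-model text encoder.

    Returns a deterministic pseudo-random vector per prompt.
    """
    rng = random.Random(prompt)
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


def toy_denoise(length: int, cond: Vector, steps: int, seed: int) -> Vector:
    """Toy reverse process (assumption, not a real diffusion sampler).

    Starts from Gaussian noise and, at each step, moves every sample
    halfway toward a target derived from the conditioning vector. A real
    model would call a learned denoising network at each step instead.
    """
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(length)]
    for _ in range(steps):
        x = [xi + 0.5 * (cond[i % len(cond)] - xi) for i, xi in enumerate(x)]
    return x


def generator(text_emb: Vector, length: int = 16, steps: int = 20) -> Vector:
    """Stage 1: intermediate representation conditioned on text only."""
    return toy_denoise(length, text_emb, steps, seed=0)


def cascader(intermediate: Vector, text_emb: Optional[Vector] = None,
             upsample: int = 4, steps: int = 20) -> Vector:
    """Stage 2: higher-resolution output conditioned on the intermediate
    representation and, optionally, on the text embedding."""
    # Stretch the intermediate representation to the output length.
    cond = [v for v in intermediate for _ in range(upsample)]
    if text_emb is not None:
        cond = [c + 0.1 * text_emb[i % len(text_emb)]
                for i, c in enumerate(cond)]
    return toy_denoise(len(cond), cond, steps, seed=1)


def generate_music(prompt: str) -> Vector:
    """Chain the stages: text -> embedding -> generator -> cascader."""
    emb = embed_text(prompt)
    intermediate = generator(emb)
    return cascader(intermediate, emb)


if __name__ == "__main__":
    audio = generate_music("upbeat jazz with piano")
    print(len(audio))
```

In this sketch the cascader receives the text embedding as an optional argument, mirroring the abstract's statement that the cascader is conditioned on the intermediate representation "and possibly the text".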
