We introduce Noise2Music, where a series of diffusion models are trained to generate high-quality 30-second music clips from text prompts. Two types of diffusion models, a generator model, which generates an intermediate representation conditioned on text, and a cascader model, which generates high-fidelity audio conditioned on the intermediate representation and possibly the text, are trained and utilized in succession to generate high-fidelity music. We explore two options for the intermediate representation, one in which it is a spectrogram and the other in which it is audio with lower fidelity. We find that the generated audio is able to faithfully reflect key elements of the text prompt such as genre, mood, tempo and instruments. Language models play a key role in this story---they are used to generate paired text for the audio of the training set and to extract embeddings of the text prompts ingested by the diffusion models.View details
Neural vocoder using denoising diffusion probabilistic model (DDPM) has been improved by adaptation of the diffusion noise distribution to given acoustic features. In this study, we propose SpecGrad that adapts the diffusion noise so that its time-varying spectral envelope becomes close to the conditioning log-mel spectrogram. This adaptation by time-varying filtering improves the sound quality especially in the high-frequency bands. It is processed in the time-frequency domain to keep the computational cost almost the same as the conventional DDPM-based neural vocoders. Experimental results showed that SpecGrad generates higher-fidelity speech waveform than conventional DDPM-based neural vocoders in both analysis-synthesis and speech enhancement scenarios. Audio demos are available at [wavegrad.github.io/specgrad/].View details
This paper introduces WaveGrad, a conditional model for waveform generation which estimates gradients of the data density. The model is built on prior work on score matching and diffusion probabilistic models. It starts from a Gaussian white noise signal and iteratively refines the signal via a gradient-based sampler conditioned on the mel-spectrogram. WaveGrad offers a natural way to trade inference speed for sample quality by adjusting the number of refinement steps, and bridges the gap between non-autoregressive and autoregressive models in terms of audio quality. We find that it can generate high fidelity audio samples using as few as six iterations. Experiments reveal WaveGrad to generate high fidelity audio, outperforming adversarial non-autoregressive baselines and matching a strong likelihood-based autoregressive baseline using fewer sequential operations. Audio samples are available at https://wavegrad.github.io/View details
This paper introduces WaveGrad 2, an end-to-end non-autoregressive generative model for text-to-speech synthesis trained to estimate the gradients of the data density. Unlike recent TTS systems which are a cascade of separately learned models, during training the proposed model requires only text or phoneme sequence, learns all parameters end-to-end without intermediate features, and can generate natural speech audio with great varieties. This is achieved by the score matching objective, which optimizes the network to model the score function of the real data distribution. Output waveforms are generated using an iterative refinement process beginning from a random noise sample. Like our prior work, WaveGrad 2 offers a natural way to trade inference speed for sample quality by adjusting the number of refinement steps. Experiments reveal that the model can generate high fidelity audio, closing the gap between end-to-end and contemporary systems, approaching the performance of a state-of-the-art neural TTS system. We further carry out various ablations to study the impact of different model configurations.View details
No Results Found
We're always looking for more talented, passionate people.