Symbolic Music Generation with Diffusion Models
Abstract
Score-based generative models and diffusion probabilistic models have been successful at generating high-quality samples in continuous domains such as images and audio. However, due to their Langevin-inspired sampling mechanisms, their application to discrete and sequential data has been limited. In this work, we present a technique for training diffusion models on sequential data by parameterizing the discrete domain in the continuous latent space of a pre-trained variational autoencoder. Our method is non-autoregressive and learns to generate sequences of latent embeddings through the reverse process of a Markov chain and offers parallel generation with a constant number of iterative refinement steps. We apply this technique to modeling symbolic music and show promising unconditional generation results compared to an autoregressive language model operating over the same continuous embeddings.