Jesse Engel

At Google Brain, I conduct research at the intersection of creativity and learning as part of the Magenta project. I have a UC Berkeley³ degree (BA, PhD, Postdoc), and my research background is diverse, including work in Astrophysics, Materials Science, Chemistry, Electrical Engineering, Computational Neuroscience, and now Machine Learning. For more details on my music and side projects, check out my personal website.
Authored Publications
    MusicLM: Generating Music From Text
    Andrea Agostinelli
    Mauro Verzetti
    Antoine Caillon
    Qingqing Huang
    Neil Zeghidour
    Christian Frank
    Under review (2023)
    We introduce MusicLM, a model generating high-fidelity music from text descriptions such as "a calming violin melody backed by a distorted guitar riff". MusicLM casts the process of conditional music generation as a hierarchical sequence-to-sequence modeling task, and it generates music at 24 kHz that remains consistent over several minutes. Our experiments show that MusicLM outperforms previous systems both in audio quality and adherence to the text description. Moreover, we demonstrate that MusicLM can be conditioned on both text and a melody in that it can transform whistled and hummed melodies according to the style described in a text caption. To support future research, we publicly release MusicCaps, a dataset composed of 5.5k music-text pairs, with rich text descriptions provided by human experts.
    Noise2Music: Text-conditioned Music Generation with Diffusion Models
    Qingqing Huang
    Daniel S. Park
    Tao Wang
    Nanxin Chen
    Zhengdong Zhang
    Zhishuai Zhang
    Jiahui Yu
    Christian Frank
    William Chan
    Wei Han
    (2023)
    We introduce Noise2Music, where a series of diffusion models are trained to generate high-quality 30-second music clips from text prompts. Two types of diffusion models, a generator model, which generates an intermediate representation conditioned on text, and a cascader model, which generates high-fidelity audio conditioned on the intermediate representation and possibly the text, are trained and utilized in succession to generate high-fidelity music. We explore two options for the intermediate representation, one in which it is a spectrogram and the other in which it is audio with lower fidelity. We find that the generated audio is able to faithfully reflect key elements of the text prompt such as genre, mood, tempo and instruments. Language models play a key role in this story: they are used to generate paired text for the audio of the training set and to extract embeddings of the text prompts ingested by the diffusion models.
    Data is the lifeblood of modern machine learning systems, including for those in Music Information Retrieval (MIR). However, MIR has long been mired by small datasets and unreliable labels. In this work, we propose to break this bottleneck using generative models. By pipelining a generative model of notes (Coconet trained on Bach Chorales) with a structured synthesis model of chamber ensembles (MIDI-DDSP trained on URMP), we demonstrate a system capable of producing unlimited amounts of realistic chorale music with rich annotations including mixes, stems, MIDI, note-level performance attributes (staccato, vibrato, etc.), and even fine-grained synthesis parameters (pitch, amplitude, etc.). We call this system the Chamber Ensemble Generator (CEG), and use it to generate a large dataset of chorales from four different chamber ensembles (CocoChorales). We demonstrate that data generated using our approach improves state-of-the-art models for music transcription and source separation, and we release both the system and the dataset as an open-source foundation for future work in the MIR community.
    MT3: Multi-task Multitrack Music Transcription
    Josh Gardner
    Curtis Glenn-Macway Hawthorne
    ICLR 2022 (to appear)
    Automatic Music Transcription (AMT), inferring musical notes from raw audio, is a challenging task at the core of music understanding. Unlike Automatic Speech Recognition (ASR), which typically focuses on the words of a single speaker, AMT often requires transcribing multiple instruments simultaneously, all while preserving fine-scale pitch and timing information. Further, many AMT datasets are "low resource", as even expert musicians find music transcription difficult and time-consuming. Thus, prior work has focused on task-specific architectures, tailored to the individual instruments of each task. In this work, motivated by the promising results of sequence-to-sequence transfer learning for low-resource Natural Language Processing (NLP), we demonstrate that a general-purpose Transformer model can perform multi-task AMT, jointly transcribing arbitrary combinations of musical instruments across several transcription datasets. We show this unified training framework achieves high-quality transcription results across a range of datasets, dramatically improving performance for low-resource instruments (such as guitar), while preserving strong performance for abundant instruments (such as piano). Finally, by expanding the scope of AMT, we expose the need for more consistent evaluation metrics and better dataset alignment, and provide a strong baseline for this new direction of multi-task AMT.
    Automatic Music Transcription (AMT), in particular the problem of automatically extracting notes from audio, has seen much recent progress via the training of neural network models on musical audio recordings paired with aligned ground-truth note labels. However, progress is currently limited by the difficulty of obtaining such note labels for natural audio recordings at scale. In this paper, we take advantage of the fact that for monophonic music, the transcription problem is much easier and largely solved via modern pitch-tracking methods. Specifically, we show that we are able to combine recordings of real monophonic music (and their transcriptions) into artificial and musically-incoherent mixtures, greatly increasing the scale of labeled training data. By pretraining on these mixtures, we can use a larger neural network model and significantly improve upon the state of the art in multi-instrument polyphonic transcription. We demonstrate this improvement across a variety of datasets and in a "zero-shot" setting where the model has not been trained on any data from the evaluation domain.
    MIDI-DDSP: Hierarchical modeling of music for detailed control
    Yusong Wu
    Yi Deng
    Rigel Jacob Swavely
    Kyle Kastner
    Tim Cooijmans
    Aaron Courville
    ICLR 2022 (to appear)
    Musical expression requires control of both what notes are played and how they are performed. Conventional audio synthesizers provide detailed expressive controls, but at the cost of realism. Black-box neural audio synthesis and concatenative samplers can produce realistic audio, but have few mechanisms for control. In this work, we introduce MIDI-DDSP, a hierarchical model of musical instruments that enables both realistic neural audio synthesis and detailed user control. Starting from interpretable Differentiable Digital Signal Processing (DDSP) synthesis parameters, we infer musical notes and high-level properties of their expressive performance (such as timbre, vibrato, dynamics, and articulation). This creates a 3-level hierarchy (notes, performance, synthesis) that affords individuals the option to intervene at each level, or utilize trained priors (performance given notes, synthesis given performance) for creative assistance. Through quantitative experiments and listening tests, we demonstrate that this hierarchy can reconstruct high-fidelity audio, accurately predict performance attributes for a note sequence, independently manipulate the attributes of a given performance, and as a complete system, generate realistic audio from a novel note sequence. By utilizing an interpretable hierarchy, with multiple levels of granularity, MIDI-DDSP opens the door to assistive tools to empower individuals across a diverse range of musical experience.
    Tone Transfer: In-Browser Interactive Neural Audio Synthesis
    Michelle Carney
    Chong Li
    Edwin Toh
    Ping Yu
    https://hai-gen2021.github.io/ (2021) (to appear)
    Tone Transfer lets you transform everyday sounds into musical instruments. Record and upload audio directly into the browser and hear our machine learning models re-render it into saxophones, flutes and more! Don’t fancy singing? Play around with a curated set of samples that will get your creative juices flowing! Tone Transfer was born from a year-long collaboration between two teams within Google Research: Magenta and AIUX. AI Researchers, UX engineers and designers worked together to create an experience that opens up the magic of audio machine learning to a wider audience, from musicians to non-coders alike. Tone Transfer is built on a technology Magenta open-sourced earlier this year called Differentiable Digital Signal Processing or DDSP. At first, Magenta’s only demo was a technical colab notebook intended for folks with coding backgrounds. Through many iterations of design explorations and user research, the AIUX team developed and refined an experience that makes DDSP’s sound transformation approachable for everyone and more fun than ever to play with!
    Symbolic Music Generation with Diffusion Models
    Gautam Mittal
    Curtis Glenn-Macway Hawthorne
    ISMIR 2021 (to appear)
    Score-based generative models and diffusion probabilistic models have been successful at generating high-quality samples in continuous domains such as images and audio. However, due to their Langevin-inspired sampling mechanisms, their application to discrete and sequential data has been limited. In this work, we present a technique for training diffusion models on sequential data by parameterizing the discrete domain in the continuous latent space of a pre-trained variational autoencoder. Our method is non-autoregressive and learns to generate sequences of latent embeddings through the reverse process of a Markov chain and offers parallel generation with a constant number of iterative refinement steps. We apply this technique to modeling symbolic music and show promising unconditional generation results compared to an autoregressive language model operating over the same continuous embeddings.
    Variable-rate Discrete Representation Learning
    Sander Dieleman
    Charlie Nash
    Karen Simonyan
    arXiv (2021)
    Semantically meaningful information content in perceptual signals is usually unevenly distributed. In this work, we propose slow autoencoders (SlowAEs) for unsupervised learning of high-level variable-rate discrete representations of sequences, and apply them to speech signals. We show that the capacity of the resulting event-based representations automatically grows or shrinks depending on the density of salient information in the input signals, while still allowing for faithful signal reconstruction. We develop run-length Transformers (RLTs) for event-based representation modelling and use them to construct language models in the speech domain, which are able to generate grammatical and semantically coherent utterances and continuations.
    Sequence-to-Sequence Piano Transcription with Transformers
    Curtis Glenn-Macway Hawthorne
    Rigel Jacob Swavely
    ISMIR (2021) (to appear)
    Automatic Music Transcription has seen significant progress in recent years by training custom deep neural networks on large datasets. However, these models have required extensive domain-specific design of network architectures, input/output representations, and complex decoding schemes. In this work, we show that equivalent performance can be achieved using a generic encoder-decoder Transformer with standard decoding methods. We demonstrate that the model can learn to translate spectrogram inputs directly to MIDI-like outputs for several transcription tasks. This sequence-to-sequence approach simplifies transcription by jointly modeling audio features and language-like output dependencies, thus removing the need for task-specific architectures. These results point toward possibilities for creating new Music Information Retrieval models by focusing on dataset creation and labeling rather than custom model design.