Jesse Engel

At Google Brain, I perform research at the intersection of creativity and learning as part of the Magenta project. I am a three-time UC Berkeley alum (BA, PhD, Postdoc), and my research background is diverse, including work in Astrophysics, Materials Science, Chemistry, Electrical Engineering, Computational Neuroscience, and now Machine Learning. For more details on my music and side projects, check out my personal website.
Authored Publications
    MusicLM: Generating Music From Text
    Andrea Agostinelli
    Mauro Verzetti
    Antoine Caillon
    Qingqing Huang
    Neil Zeghidour
    Christian Frank
    under review (2023)
    Abstract: We introduce MusicLM, a model generating high-fidelity music from text descriptions such as "a calming violin melody backed by a distorted guitar riff". MusicLM casts the process of conditional music generation as a hierarchical sequence-to-sequence modeling task, and it generates music at 24 kHz that remains consistent over several minutes. Our experiments show that MusicLM outperforms previous systems both in audio quality and adherence to the text description. Moreover, we demonstrate that MusicLM can be conditioned on both text and a melody, in that it can transform whistled and hummed melodies according to the style described in a text caption. To support future research, we publicly release MusicCaps, a dataset composed of 5.5k music-text pairs, with rich text descriptions provided by human experts.
    Noise2Music: Text-conditioned Music Generation with Diffusion Models
    Qingqing Huang
    Daniel S. Park
    Tao Wang
    Zhengdong Zhang
    Zhishuai Zhang
    Jiahui Yu
    Christian Frank
    William Chan
    Wei Han
    (2023)
    Abstract: We introduce Noise2Music, where a series of diffusion models are trained to generate high-quality 30-second music clips from text prompts. Two types of diffusion models, a generator model, which generates an intermediate representation conditioned on text, and a cascader model, which generates high-fidelity audio conditioned on the intermediate representation and possibly the text, are trained and used in succession to generate high-fidelity music. We explore two options for the intermediate representation: one in which it is a spectrogram, and the other in which it is audio of lower fidelity. We find that the generated audio is able to faithfully reflect key elements of the text prompt such as genre, mood, tempo, and instruments. Language models play a key role in this story: they are used to generate paired text for the audio of the training set and to extract embeddings of the text prompts ingested by the diffusion models.
    MT3: Multi-task Multitrack Music Transcription
    Josh Gardner
    Curtis Glenn-Macway Hawthorne
    ICLR 2022 (to appear)
    Abstract: Automatic Music Transcription (AMT), inferring musical notes from raw audio, is a challenging task at the core of music understanding. Unlike Automatic Speech Recognition (ASR), which typically focuses on the words of a single speaker, AMT often requires transcribing multiple instruments simultaneously, all while preserving fine-scale pitch and timing information. Further, many AMT datasets are "low-resource", as even expert musicians find music transcription difficult and time-consuming. Thus, prior work has focused on task-specific architectures, tailored to the individual instruments of each task. In this work, motivated by the promising results of sequence-to-sequence transfer learning for low-resource Natural Language Processing (NLP), we demonstrate that a general-purpose Transformer model can perform multi-task AMT, jointly transcribing arbitrary combinations of musical instruments across several transcription datasets. We show this unified training framework achieves high-quality transcription results across a range of datasets, dramatically improving performance for low-resource instruments (such as guitar), while preserving strong performance for abundant instruments (such as piano). Finally, by expanding the scope of AMT, we expose the need for more consistent evaluation metrics and better dataset alignment, and provide a strong baseline for this new direction of multi-task AMT.
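The unified sequence-to-sequence framing above hinges on serializing multi-instrument notes into a single token stream. Below is a minimal, hypothetical sketch of that idea in Python; the token names, time resolution, and program/pitch encoding are illustrative assumptions, not the vocabulary actually used in MT3.

```python
# Illustrative MIDI-like event vocabulary for multi-instrument transcription
# targets. Token names and value ranges are hypothetical, not the paper's
# actual scheme.

def note_events_to_tokens(notes, time_step=0.01):
    """Flatten (onset, offset, pitch, program) notes into a token sequence.

    notes: list of dicts with 'onset', 'offset', 'pitch', 'program' fields.
    Returns a list of string tokens ordered by event time.
    """
    events = []
    for n in notes:
        events.append((n["onset"], f"program_{n['program']}", f"note_on_{n['pitch']}"))
        events.append((n["offset"], f"program_{n['program']}", f"note_off_{n['pitch']}"))
    events.sort(key=lambda e: e[0])

    tokens, current_time = [], 0.0
    for t, program_tok, note_tok in events:
        # Encode elapsed time as discrete time-shift tokens.
        shift_steps = int(round((t - current_time) / time_step))
        if shift_steps > 0:
            tokens.append(f"time_shift_{shift_steps}")
            current_time += shift_steps * time_step
        tokens.extend([program_tok, note_tok])
    tokens.append("eos")
    return tokens


example = [
    {"onset": 0.0, "offset": 0.5, "pitch": 60, "program": 0},    # piano C4
    {"onset": 0.25, "offset": 0.75, "pitch": 64, "program": 40}, # violin E4
]
print(note_events_to_tokens(example))
```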
    MIDI-DDSP: Hierarchical modeling of music for detailed control
    Yusong Wu
    Yi Deng
    Rigel Jacob Swavely
    Kyle Kastner
    Tim Cooijmans
    Aaron Courville
    ICLR 2022 (to appear)
    Abstract: Musical expression requires control of both what notes are played and how they are performed. Conventional audio synthesizers provide detailed expressive controls, but at the cost of realism. Black-box neural audio synthesis and concatenative samplers can produce realistic audio, but have few mechanisms for control. In this work, we introduce MIDI-DDSP, a hierarchical model of musical instruments that enables both realistic neural audio synthesis and detailed user control. Starting from interpretable Differentiable Digital Signal Processing (DDSP) synthesis parameters, we infer musical notes and high-level properties of their expressive performance (such as timbre, vibrato, dynamics, and articulation). This creates a 3-level hierarchy (notes, performance, synthesis) that affords individuals the option to intervene at each level, or to utilize trained priors (performance given notes, synthesis given performance) for creative assistance. Through quantitative experiments and listening tests, we demonstrate that this hierarchy can reconstruct high-fidelity audio, accurately predict performance attributes for a note sequence, independently manipulate the attributes of a given performance, and, as a complete system, generate realistic audio from a novel note sequence. By utilizing an interpretable hierarchy with multiple levels of granularity, MIDI-DDSP opens the door to assistive tools that empower individuals across a diverse range of musical experience.
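To make the 3-level hierarchy concrete, here is a minimal Python sketch of the notes -> performance -> synthesis flow, with placeholder functions standing in for the trained priors. The attribute names, frame rate, and shapes are illustrative assumptions, not the actual MIDI-DDSP interfaces.

```python
# Sketch of the notes -> performance -> synthesis hierarchy described above.
# Both functions are placeholders for trained modules; attribute names and
# shapes are illustrative assumptions.
import numpy as np

def performance_from_notes(notes):
    """Stand-in for a trained prior that predicts per-note expression."""
    return [{"note": n, "vibrato": 0.3, "volume": 0.7} for n in notes]

def synthesis_params_from_performance(performance, frames_per_second=250):
    """Stand-in for a trained module mapping expression to frame-wise
    DDSP-style controls (f0 in Hz, amplitude)."""
    f0, amp = [], []
    for p in performance:
        n_frames = int(p["note"]["duration"] * frames_per_second)
        hz = 440.0 * 2.0 ** ((p["note"]["pitch"] - 69) / 12.0)
        f0.append(np.full(n_frames, hz))
        amp.append(np.full(n_frames, p["volume"]))
    return np.concatenate(f0), np.concatenate(amp)

# Level 1: notes.
notes = [{"pitch": 60, "duration": 0.5}, {"pitch": 67, "duration": 0.5}]
# Level 2: expressive performance attributes (a user can intervene here).
performance = performance_from_notes(notes)
performance[1]["vibrato"] = 0.8   # e.g. add vibrato to the second note
# Level 3: frame-wise synthesis parameters for a DDSP-style synthesizer.
f0, amplitude = synthesis_params_from_performance(performance)
print(f0.shape, amplitude.shape)
```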
    Abstract: Data is the lifeblood of modern machine learning systems, including those in Music Information Retrieval (MIR). However, MIR has long been mired in small datasets and unreliable labels. In this work, we propose to break this bottleneck using generative models. By pipelining a generative model of notes (Coconet trained on Bach Chorales) with a structured synthesis model of chamber ensembles (MIDI-DDSP trained on URMP), we demonstrate a system capable of producing unlimited amounts of realistic chorale music with rich annotations, including mixes, stems, MIDI, note-level performance attributes (staccato, vibrato, etc.), and even fine-grained synthesis parameters (pitch, amplitude, etc.). We call this system the Chamber Ensemble Generator (CEG), and use it to generate a large dataset of chorales from four different chamber ensembles (CocoChorales). We demonstrate that data generated using our approach improves state-of-the-art models for music transcription and source separation, and we release both the system and the dataset as an open-source foundation for future work in the MIR community.
    Abstract: Automatic Music Transcription (AMT), in particular the problem of automatically extracting notes from audio, has seen much recent progress via the training of neural network models on musical audio recordings paired with aligned ground-truth note labels. However, progress is currently limited by the difficulty of obtaining such note labels for natural audio recordings at scale. In this paper, we take advantage of the fact that for monophonic music, the transcription problem is much easier and largely solved via modern pitch-tracking methods. Specifically, we show that we are able to combine recordings of real monophonic music (and their transcriptions) into artificial and musically-incoherent mixtures, greatly increasing the scale of labeled training data. By pretraining on these mixtures, we can use a larger neural network model and significantly improve upon the state of the art in multi-instrument polyphonic transcription. We demonstrate this improvement across a variety of datasets and in a "zero-shot" setting where the model has not been trained on any data from the evaluation domain.
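The core data-augmentation step described above, summing monophonic recordings and carrying their note labels over to the mixture, can be sketched in a few lines of numpy. This is a simplified illustration (assuming 16 kHz mono arrays and (onset, offset, pitch) labels), not the paper's actual pipeline.

```python
# Minimal sketch of building a musically-incoherent mixture from monophonic
# recordings with known transcriptions. Assumptions: mono numpy arrays at a
# shared sample rate and note labels as (onset, offset, pitch) tuples.
import numpy as np

def mix_monophonic_tracks(tracks, target_len):
    """tracks: list of (audio, notes) pairs from different monophonic sources.
    Returns (mixture_audio, merged_notes)."""
    mixture = np.zeros(target_len, dtype=np.float32)
    merged_notes = []
    for audio, notes in tracks:
        n = min(len(audio), target_len)
        mixture[:n] += audio[:n]
        merged_notes.extend(notes)   # labels carry over to the mixture directly
    # Normalize to avoid clipping after summation.
    peak = np.max(np.abs(mixture))
    if peak > 0:
        mixture /= peak
    return mixture, sorted(merged_notes)

sr = 16000
t = np.arange(sr) / sr
track_a = (0.5 * np.sin(2 * np.pi * 440 * t), [(0.0, 1.0, 69)])
track_b = (0.5 * np.sin(2 * np.pi * 330 * t), [(0.0, 1.0, 64)])
mix, labels = mix_monophonic_tracks([track_a, track_b], sr)
print(mix.shape, labels)
```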
    Sequence-to-Sequence Piano Transcription with Transformers
    Curtis Glenn-Macway Hawthorne
    Rigel Jacob Swavely
    ISMIR (2021) (to appear)
    Abstract: Automatic Music Transcription has seen significant progress in recent years by training custom deep neural networks on large datasets. However, these models have required extensive domain-specific design of network architectures, input/output representations, and complex decoding schemes. In this work, we show that equivalent performance can be achieved using a generic encoder-decoder Transformer with standard decoding methods. We demonstrate that the model can learn to translate spectrogram inputs directly to MIDI-like outputs for several transcription tasks. This sequence-to-sequence approach simplifies transcription by jointly modeling audio features and language-like output dependencies, thus removing the need for task-specific architectures. These results point toward possibilities for creating new Music Information Retrieval models by focusing on dataset creation and labeling rather than custom model design.
    Symbolic Music Generation with Diffusion Models
    Gautam Mittal
    Curtis Glenn-Macway Hawthorne
    ISMIR 2021 (to appear)
    Abstract: Score-based generative models and diffusion probabilistic models have been successful at generating high-quality samples in continuous domains such as images and audio. However, due to their Langevin-inspired sampling mechanisms, their application to discrete and sequential data has been limited. In this work, we present a technique for training diffusion models on sequential data by parameterizing the discrete domain in the continuous latent space of a pre-trained variational autoencoder. Our method is non-autoregressive, learns to generate sequences of latent embeddings through the reverse process of a Markov chain, and offers parallel generation with a constant number of iterative refinement steps. We apply this technique to modeling symbolic music and show promising unconditional generation results compared to an autoregressive language model operating over the same continuous embeddings.
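As a rough illustration of training a diffusion model in the continuous latent space of a pre-trained VAE, the sketch below noises latent embeddings with a simple linear schedule and trains a small network to predict the added noise. The schedule, network, and dimensions are assumptions for illustration only, not the paper's configuration.

```python
# Sketch of training a denoising model on continuous latent embeddings of a
# pretrained sequence VAE (a frozen encoder would supply the clean latents).
import torch
import torch.nn as nn

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

denoiser = nn.Sequential(          # predicts the added noise from (z_t, t)
    nn.Linear(64 + 1, 256), nn.ReLU(), nn.Linear(256, 64)
)
opt = torch.optim.Adam(denoiser.parameters(), lr=1e-4)

def training_step(z0):
    """z0: batch of clean latent embeddings, shape (B, 64)."""
    t = torch.randint(0, T, (z0.shape[0],))
    a = alphas_cumprod[t].unsqueeze(-1)
    noise = torch.randn_like(z0)
    zt = a.sqrt() * z0 + (1 - a).sqrt() * noise          # forward noising
    t_feat = (t.float() / T).unsqueeze(-1)               # crude time embedding
    pred = denoiser(torch.cat([zt, t_feat], dim=-1))
    loss = ((pred - noise) ** 2).mean()                  # epsilon prediction
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# Clean latents would come from the frozen VAE encoder; random here for shape.
print(training_step(torch.randn(8, 64)))
```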
    Tone Transfer: In-Browser Interactive Neural Audio Synthesis
    Michelle Carney
    Chong Li
    Ping Yu
    https://hai-gen2021.github.io/ (2021) (to appear)
    Abstract: Tone Transfer lets you transform everyday sounds into musical instruments. Record and upload audio directly in the browser and hear our machine learning models re-render it into saxophones, flutes, and more! Don't fancy singing? Play around with a curated set of samples that will get your creative juices flowing! Tone Transfer was born from a year-long collaboration between two teams within Google Research: Magenta and AIUX. AI researchers, UX engineers, and designers worked together to create an experience that opens up the magic of audio machine learning to a wider audience, from musicians to non-coders alike. Tone Transfer is built on a technology Magenta open-sourced earlier this year called Differentiable Digital Signal Processing, or DDSP. At first, Magenta's only demo was a technical Colab notebook intended for folks with coding backgrounds. Through many iterations of design explorations and user research, the AIUX team developed and refined an experience that makes DDSP's sound transformation approachable for everyone and more fun than ever to play with!
    Variable-rate Discrete Representation Learning
    Sander Dieleman
    Charlie Nash
    Karen Simonyan
    arXiv (2021)
    Abstract: Semantically meaningful information content in perceptual signals is usually unevenly distributed. In this work, we propose slow autoencoders (SlowAEs) for unsupervised learning of high-level, variable-rate discrete representations of sequences, and apply them to speech signals. We show that the capacity of the resulting event-based representations automatically grows or shrinks depending on the density of salient information in the input signals, while still allowing for faithful signal reconstruction. We develop run-length Transformers (RLTs) for event-based representation modelling and use them to construct language models in the speech domain, which are able to generate grammatical and semantically coherent utterances and continuations.
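Run-length Transformers model event-based sequences as (symbol, run-length) pairs rather than one token per timestep. The short sketch below shows only that encoding/decoding step, which is the part that can be illustrated without the trained SlowAE; it is not the paper's implementation.

```python
# Minimal sketch of run-length encoding an event-based discrete representation,
# the kind of (symbol, count) stream a run-length Transformer would model.
def run_length_encode(symbols):
    """Collapse repeated symbols into (symbol, count) pairs."""
    if not symbols:
        return []
    encoded, current, count = [], symbols[0], 1
    for s in symbols[1:]:
        if s == current:
            count += 1
        else:
            encoded.append((current, count))
            current, count = s, 1
    encoded.append((current, count))
    return encoded

def run_length_decode(pairs):
    return [s for s, count in pairs for _ in range(count)]

seq = [3, 3, 3, 7, 7, 1, 1, 1, 1, 3]
pairs = run_length_encode(seq)
print(pairs)                        # [(3, 3), (7, 2), (1, 4), (3, 1)]
assert run_length_decode(pairs) == seq
```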
    Encoding Musical Style with Transformer Autoencoders
    Kristy Choi
    Curtis Hawthorne
    Monica Dinculescu
    ICML (2020)
    Abstract: Music Transformer is a recently developed generative model that leverages self-attention based on relative positioning to achieve state-of-the-art music generation. However, adapting the trained generative model to user preferences has proven to be cumbersome. In this work, we propose a variety of techniques to enable more fine-grained control of user input. Specifically, we condition on performance and melody inputs to learn musical representations that generalize well across a variety of different musical tasks. Empirically, we demonstrate the effectiveness of our method on the MAESTRO dataset and an internal 10,000+ hour dataset of YouTube piano performances. We achieve improvements in terms of both log-likelihood and mean listening scores.
    Abstract: Classifier metrics, such as accuracy and F1 score, often serve as proxies for performance in downstream tasks. For the case of generative systems that use predicted labels as inputs, accuracy is a good proxy only if it aligns with the perceptual quality of generated outputs. Here, we demonstrate this effect using the example of automatic drum transcription (ADT). We optimize classifiers for downstream generation by predicting expressive dynamics (velocity) and show with listening tests that they produce outputs with improved perceptual quality, despite achieving similar results on classification metrics. To train expressive ADT models, we introduce the Expanded Groove MIDI dataset (E-GMD), a large dataset of human drum performances, with audio recordings annotated in MIDI. E-GMD contains 444 hours of audio from 43 drum kits and is an order of magnitude larger than similar datasets. It is also the first human-performed drum dataset with annotations of velocity. We make this new dataset available under a Creative Commons license along with open source code for training and a pre-trained model for inference.
    DDSP: Differentiable Digital Signal Processing
    Lamtharn (Hanoi) Hantrakul
    Chenjie Gu
    ICLR 2020 (2020)
    Abstract: Most generative models of audio directly generate samples in one of two domains: time or frequency. While sufficient to express any signal, these representations are inefficient, as they do not utilize existing knowledge of how sound is generated and perceived. A third approach (vocoders/synthesizers) successfully incorporates strong domain knowledge of signal processing and perception, but has been less actively researched due to limited expressivity and difficulty integrating with modern auto-differentiation-based machine learning methods. In this paper, we introduce the Differentiable Digital Signal Processing (DDSP) library, which enables direct integration of classic signal processing elements with deep learning methods. Focusing on audio synthesis, we achieve high-fidelity generation without the need for large autoregressive models or adversarial losses, demonstrating that DDSP enables utilizing strong inductive biases without losing the expressive power of neural networks. Further, we show that combining interpretable modules permits manipulation of each separate model component, with applications such as independent control of pitch and loudness, realistic extrapolation to pitches not seen during training, blind dereverberation of room acoustics, transfer of extracted room acoustics to new environments, and transformation of timbre between disparate sources. In short, DDSP enables an interpretable and modular approach to generative modeling without sacrificing the benefits of deep learning. The library will be made available upon paper acceptance, and we encourage further contributions from the community and domain experts.
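At the heart of DDSP-style synthesis is a differentiable additive synthesizer: a sum of sinusoids at harmonics of a time-varying fundamental, scaled by per-harmonic amplitudes. The numpy sketch below shows the signal-processing idea only (non-differentiable, with illustrative frame handling); the actual open-source library from Magenta is considerably more complete.

```python
# A minimal harmonic (additive) synthesizer in the spirit of DDSP:
# sum of sinusoids at integer multiples of a time-varying f0, scaled by
# per-harmonic amplitudes. Shapes and rates are illustrative assumptions.
import numpy as np

def harmonic_synth(f0_hz, harmonic_amps, sample_rate=16000):
    """f0_hz: (n_samples,) fundamental frequency per sample.
    harmonic_amps: (n_samples, n_harmonics) amplitude per harmonic per sample.
    Returns audio of shape (n_samples,)."""
    n_samples, n_harmonics = harmonic_amps.shape
    # Integrate frequency to get phase (this step is differentiable in the
    # real library, enabling end-to-end training).
    phase = 2 * np.pi * np.cumsum(f0_hz) / sample_rate
    harmonics = np.arange(1, n_harmonics + 1)
    # Zero out harmonics above Nyquist to avoid aliasing.
    above_nyquist = np.outer(f0_hz, harmonics) > sample_rate / 2
    amps = np.where(above_nyquist, 0.0, harmonic_amps)
    audio = (amps * np.sin(np.outer(phase, harmonics))).sum(axis=1)
    return audio

sr, seconds = 16000, 1.0
n = int(sr * seconds)
f0 = np.linspace(220.0, 440.0, n)                  # sliding pitch
amps = np.tile(1.0 / np.arange(1, 11), (n, 1))     # 10 harmonics, 1/k rolloff
audio = harmonic_synth(f0, amps, sr)
print(audio.shape)
```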
    Self-supervised Pitch Detection by Inverse Audio Synthesis
    Lamtharn (Hanoi) Hantrakul
    Rigel Jacob Swavely
    Curtis Glenn-Macway Hawthorne
    ICML 2020 Workshop on Self-supervision in Audio and Speech (2020) (to appear)
    Abstract: Audio scene understanding, parsing sound into a hierarchy of meaningful parts, is an open problem in representation learning. Sound is a particularly challenging domain due to its high dimensionality, sequential dependencies, and hierarchical structure. Differentiable Digital Signal Processing (DDSP) greatly simplifies the forward problem of generating audio by introducing differentiable synthesizer and effects modules that combine strong signal priors with end-to-end learning. Here, we focus on the inverse problem, inferring synthesis parameters to approximate an audio scene. We demonstrate that DDSP modules can enable a new approach to self-supervision, generating synthetic audio with differentiable synthesizers and training feature extractor networks to infer the synthesis parameters. By building a hierarchy from sinusoidal to harmonic representations, we show that it is possible to use such an inverse modeling approach to disentangle pitch from timbre, an important task in audio scene understanding.
    Learning via Social Awareness: Improving a Deep Generative Sketching Model with Facial Feedback
    Jennifer McCleary
    David Ha
    Fred Bertsch
    Rosalind Picard
    International Joint Conference on Artificial Intelligence (IJCAI) 2018 (2020), pp. 1-9
    Abstract: A known deficit of modern machine learning (ML) and deep learning (DL) methodology is that models must be carefully fine-tuned in order to solve a particular task. Most algorithms cannot generalize well to even highly similar tasks, let alone exhibit signs of artificial general intelligence (AGI). To address this problem, researchers have explored developing loss functions that act as intrinsic motivators, encouraging an ML or DL agent to learn across a number of domains. This paper argues that an important and useful intrinsic motivator is social interaction. We posit that making an AI agent aware of implicit social feedback from humans can allow for faster learning of more generalizable and useful representations, and could potentially impact AI safety. We collect social feedback in the form of facial expression reactions to samples from Sketch RNN, an LSTM-based variational autoencoder (VAE) designed to produce sketch drawings. We use a Latent Constraints GAN (LC-GAN) to learn from the facial feedback of a small group of viewers, by optimizing the model to produce sketches that it predicts will lead to more positive facial expressions. We show in multiple independent evaluations that the model trained with facial feedback produced sketches that are more highly rated and induce significantly more positive facial expressions. Thus, we establish that implicit social feedback can improve the output of a deep learning model.
    Abstract: End-to-end optimization has achieved state-of-the-art performance on many specific problems, but there is no straightforward way to combine pretrained models for new problems. Here, we explore improving modularity by learning a post-hoc interface between two existing models to solve a new task. Specifically, we take inspiration from neural machine translation and cast the challenging problem of cross-modal domain transfer as unsupervised translation between the latent spaces of pretrained deep generative models. By abstracting away the data representation, we demonstrate that it is possible to transfer across different modalities (e.g., image-to-audio) and even different types of generative models (e.g., VAE-to-GAN). We compare to state-of-the-art techniques and find that a straightforward variational autoencoder is able to best bridge the two generative models through learning a shared latent space. We can further impose supervised alignment of attributes in both domains with a classifier in the shared latent space. Through qualitative and quantitative evaluations, we demonstrate that locality and semantic alignment are preserved through the transfer process, as indicated by high transfer accuracies and smooth interpolations within a class. Finally, the modular structure enables efficient training of new interface models, decoupling from the cost of training the base generative models.
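A minimal sketch of the bridging idea: a small VAE with a shared latent code and per-domain encoders/decoders maps latents of one frozen generative model into latents of another. Dimensions, architecture, and the domain labels below are illustrative assumptions, not the paper's configuration.

```python
# Sketch of a small "bridging" VAE between the latent spaces of two frozen
# pretrained generative models, via a shared latent code.
import torch
import torch.nn as nn

class BridgeVAE(nn.Module):
    def __init__(self, dim_a=128, dim_b=256, dim_shared=32):
        super().__init__()
        self.enc_a = nn.Linear(dim_a, 2 * dim_shared)   # -> (mu, logvar)
        self.enc_b = nn.Linear(dim_b, 2 * dim_shared)
        self.dec_a = nn.Linear(dim_shared, dim_a)
        self.dec_b = nn.Linear(dim_shared, dim_b)

    def encode(self, x, domain):
        h = (self.enc_a if domain == "a" else self.enc_b)(x)
        mu, logvar = h.chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
        return z, mu, logvar

    def transfer(self, x, src, dst):
        """Map a latent from model `src` to a latent for model `dst`."""
        z, _, _ = self.encode(x, src)
        return (self.dec_a if dst == "a" else self.dec_b)(z)

bridge = BridgeVAE()
z_image = torch.randn(4, 128)                 # latents from a frozen image VAE
z_audio = bridge.transfer(z_image, "a", "b")  # latents for a frozen audio GAN
print(z_audio.shape)                          # torch.Size([4, 256])
```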
    Abstract: One of the areas of interest for music generative models is to empower individual expression. But how can a creator personalize a machine learning model to make it their own? Training a custom deep neural network model like Music Transformer, MusicVAE, or SketchRNN from scratch requires significant amounts of data (millions of examples) and compute resources (specialized hardware like GPUs/TPUs), as well as expertise in hyperparameter tuning. Without sufficient data, models are either unable to produce realistic output (underfitting), or they memorize the training examples and are unable to generalize to produce varied outputs (overfitting); it would be like trying to learn all of music theory from a single song. We introduce a new model for sample-efficient adaptation to user data, based on prior work by Engel et al. [1]. We can quickly train this small, personalized model to control a much larger, more general pretrained latent variable model. This allows us to generate samples from only the portions of the latent space we are interested in, without having to retrain the large model from scratch. We demonstrate this technique in an online demo that lets users upload their own MIDI files (either melodies or multi-instrument songs) and generate samples that sound like their input.
    Magenta Studio: Augmenting Creativity with Deep Learning in Ableton Live
    Yotam Mann
    Jon Gillick
    Monica Dinculescu
    Carey Radebaugh
    Curtis Hawthorne
    Proceedings of the International Workshop on Musical Metacreation (MUME) (2019)
    Abstract: The field of Musical Metacreation (MuMe) has produced impressive results for both autonomous and interactive creativity. However, there are few examples of these systems crossing over to the "mainstream" of music creation and consumption. We tie together existing frameworks (Electron, TensorFlow.js, and Max for Live) to develop a system whose purpose is to bring the promise of interactive MuMe to the realm of professional music creators. Combining compelling applications of deep learning based music generation with a focus on ease of installation and use in a popular DAW, we hope to expose more musicians and producers to the potential of using such systems in their creative workflows. Our suite of plug-ins for Ableton Live, named Magenta Studio, is available for download at http://g.co/magenta/studio along with its open source implementation.
    Abstract: Generating musical audio directly with neural networks is notoriously difficult because it requires coherently modeling both long- and short-term structure. Fortunately, most music is also highly structured and primarily composed of discrete note events played on musical instruments. Herein, we show that by using notes as an intermediate representation, we can train a suite of models capable of transcribing, composing, and synthesizing audio waveforms with coherent musical structure on timescales spanning six orders of magnitude (~0.1 ms (8 kHz) to ~100 s). This large advance in the state of the art is enabled by our release of the new MAESTRO (MIDI and Audio Edited for Synchronous TRacks and Organization) dataset, composed of over 172 hours of virtuosic piano performances captured with fine alignment (~3 ms) between note labels and audio waveforms. The networks and the dataset together present a promising approach toward creating new expressive and interpretable neural models of music.
    Fast and Flexible Neural Audio Synthesis
    Lamtharn (Hanoi) Hantrakul
    Chenjie Gu
    ISMIR 2019 (2019) (to appear)
    Abstract: Autoregressive neural networks, such as WaveNet, have opened up new avenues for expressive audio synthesis. High-quality speech synthesis utilizes detailed linguistic features for conditioning, but comparable levels of control have yet to be realized for musical instruments. Here, we demonstrate an autoregressive model capable of synthesizing realistic audio that closely follows fine-scale temporal conditioning for loudness and fundamental frequency. We find that the appropriate choice of conditioning features and architectures improves both the quantitative accuracy of audio resynthesis and the qualitative responsiveness to creative manipulation of conditioning. While large autoregressive models generate audio much slower than real time, we achieve these results with a much more efficient WaveRNN model, opening the door to exploring real-time interactive audio synthesis with neural networks.
    GANSynth: Adversarial Neural Audio Synthesis
    Kumar Krishna Agrawal
    Shuo Chen
    Ishaan Gulrajani
    Chris Donahue
    ICLR (2019)
    Abstract: Efficient audio synthesis is an inherently difficult machine learning task, as human perception is sensitive to both global structure and fine-scale waveform coherence. Autoregressive models, such as WaveNet, model local structure but suffer from slow iterative sampling and a lack of global latent structure, while Generative Adversarial Networks (GANs) have global latent conditioning and efficient parallel sampling, but struggle to generate locally-coherent audio waveforms. Herein, we demonstrate that GANs can in fact generate high-fidelity and locally-coherent audio by modeling log magnitudes and instantaneous frequencies with sufficient frequency resolution in the spectral domain. Through extensive empirical investigations on the difficult NSynth dataset, we demonstrate that GANs are able to outperform strong WaveNet baselines on automated and human evaluation metrics, and efficiently generate audio several orders of magnitude faster than their autoregressive counterparts.
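The key representational choice above, generating log magnitudes and instantaneous frequencies rather than raw waveforms or phase, can be illustrated by how those targets would be computed from an STFT. The numpy sketch below uses assumed FFT/hop sizes and a simple phase-difference definition of instantaneous frequency; it is not the paper's exact preprocessing.

```python
# Sketch of a log-magnitude / instantaneous-frequency representation computed
# from a hand-rolled STFT. Parameters are illustrative, not the paper's.
import numpy as np

def log_mag_and_if(audio, n_fft=1024, hop=256):
    """Return (log_magnitude, instantaneous_frequency) spectrograms."""
    frames = []
    window = np.hanning(n_fft)
    for start in range(0, len(audio) - n_fft + 1, hop):
        frames.append(np.fft.rfft(audio[start:start + n_fft] * window))
    spec = np.stack(frames)                                  # (time, freq)
    log_mag = np.log(np.abs(spec) + 1e-6)
    phase = np.angle(spec)
    # Finite difference of unwrapped phase along time approximates the
    # instantaneous frequency (here scaled to units of pi radians per hop).
    unwrapped = np.unwrap(phase, axis=0)
    inst_freq = np.diff(unwrapped, axis=0, prepend=unwrapped[:1]) / np.pi
    return log_mag, inst_freq

sr = 16000
t = np.arange(sr) / sr
audio = np.sin(2 * np.pi * 440 * t)
log_mag, inst_freq = log_mag_and_if(audio)
print(log_mag.shape, inst_freq.shape)
```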
    Abstract: We explore models for translating abstract musical ideas (scores, rhythms) into expressive performances using seq2seq and recurrent variational information bottleneck (VIB) models. Though seq2seq models usually require painstakingly aligned corpora, we show that it is possible to adapt an approach from the Generative Adversarial Network (GAN) literature (e.g., Pix2Pix, Vid2Vid) to sequences, creating large volumes of paired data by performing simple transformations and training generative models to plausibly invert these transformations. Music, and drumming in particular, provides a strong test case for this approach because many common transformations (quantization, removing voices) have clear semantics, and learning to invert them has real-world applications. Focusing on the case of drum set players, we create and release a new dataset for this purpose, containing over 13 hours of recordings by professional drummers aligned with fine-grained timing and dynamics information. We also explore some of the creative potential of these models, demonstrating improvements on state-of-the-art methods for Humanization (instantiating a performance from a musical score).
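The paired-data trick described above amounts to applying a simple, semantically clear corruption (for example, quantizing timing and flattening dynamics) and training a model to invert it. A toy sketch of that forward transformation, with hypothetical field names, is shown below.

```python
# Sketch of the "simple transformation" idea: quantize a human drum
# performance to a grid to create (quantized, expressive) training pairs, so a
# model can learn to invert quantization (Humanization). Field names and the
# grid size are illustrative assumptions.
def quantize_performance(notes, grid=0.25):
    """notes: list of dicts with 'time', 'velocity', 'pitch'.
    Returns a grid-aligned copy with flattened dynamics."""
    quantized = []
    for n in notes:
        quantized.append({
            "time": round(n["time"] / grid) * grid,  # snap timing to the grid
            "velocity": 80,                          # remove dynamics
            "pitch": n["pitch"],
        })
    return quantized

performance = [
    {"time": 0.03, "velocity": 112, "pitch": 36},  # slightly ahead of the beat
    {"time": 0.52, "velocity": 64, "pitch": 38},   # laid back, quieter
]
score_like = quantize_performance(performance)
training_pair = (score_like, performance)  # input -> target for Humanization
print(training_pair)
```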
    A Hierarchical Latent Vector Model for Learning Long-Term Structure in Music
    Colin Raffel
    Curtis Hawthorne
    International Conference on Machine Learning (ICML) (2018)
    Abstract: The Variational Autoencoder (VAE) has proven to be an effective model for producing semantically meaningful latent representations for natural data. However, it has thus far seen limited application to sequential data, and, as we demonstrate, existing recurrent VAE models have difficulty modeling sequences with long-term structure. To address this issue, we propose the use of a hierarchical decoder, which first outputs embeddings for subsequences of the input and then uses these embeddings to generate each subsequence independently. This structure encourages the model to utilize its latent code, thereby avoiding the "posterior collapse" problem which remains an issue for recurrent VAEs. We apply this architecture to modeling sequences of musical notes and find that it exhibits dramatically better sampling, interpolation, and reconstruction performance than a "flat" baseline model. An implementation of our "MusicVAE" is available online at https://goo.gl/magenta/musicvae-code.
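A minimal sketch of the hierarchical decoder idea: a "conductor" RNN emits one embedding per subsequence from the latent code, and a lower-level RNN decodes each subsequence from its embedding alone. Layer sizes, sequence lengths, and the output vocabulary are illustrative assumptions; the released implementation is linked in the abstract.

```python
# Sketch of a hierarchical (conductor + decoder) architecture for long
# sequences. Sizes are illustrative, not the paper's configuration.
import torch
import torch.nn as nn

class HierarchicalDecoder(nn.Module):
    def __init__(self, z_dim=256, cond_dim=128, token_dim=64,
                 n_subseq=4, subseq_len=8):
        super().__init__()
        self.n_subseq, self.subseq_len = n_subseq, subseq_len
        self.conductor = nn.GRU(z_dim, cond_dim, batch_first=True)
        self.decoder = nn.GRU(cond_dim, token_dim, batch_first=True)
        self.out = nn.Linear(token_dim, 90)   # e.g. 90 melody token classes

    def forward(self, z):
        # Conductor: one embedding per subsequence, conditioned only on z.
        cond_in = z.unsqueeze(1).repeat(1, self.n_subseq, 1)
        embeddings, _ = self.conductor(cond_in)          # (B, n_subseq, cond)
        logits = []
        for i in range(self.n_subseq):
            # Each subsequence is decoded independently from its embedding,
            # encouraging the model to rely on the latent code.
            dec_in = embeddings[:, i:i + 1, :].repeat(1, self.subseq_len, 1)
            h, _ = self.decoder(dec_in)
            logits.append(self.out(h))
        return torch.cat(logits, dim=1)                  # (B, total_len, 90)

print(HierarchicalDecoder()(torch.randn(2, 256)).shape)
```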
    Learning via social awareness: improving sketch representations with facial feedback
    Natasha Jaques
    David Ha
    Fred Bertsch
    Rosalind Picard
    International Conference on Learning Representations (2018)
    Abstract: In the quest towards general artificial intelligence (AI), researchers have explored developing loss functions that act as intrinsic motivators in the absence of external rewards. This paper argues that such research has overlooked an important and useful intrinsic motivator: social interaction. We posit that making an AI agent aware of implicit social feedback from humans can allow for faster learning of more generalizable and useful representations, and could potentially impact AI safety. We collect social feedback in the form of facial expression reactions to samples from Sketch RNN, an LSTM-based variational autoencoder (VAE) designed to produce sketch drawings. We use a Latent Constraints GAN (LC-GAN) to learn from the facial feedback of a small group of viewers, and then show in an independent evaluation with 76 users that this model produced sketches that lead to significantly more positive facial expressions. Thus, we establish that implicit social feedback can improve the output of a deep learning model.
    Abstract: Deep generative neural networks have proven effective at both conditional and unconditional modeling of complex data distributions. Conditional generation enables interactive control, but creating new controls often requires expensive retraining. In this paper, we develop a method to condition generation without retraining the model. By post-hoc learning latent constraints, value functions that identify regions in latent space that generate outputs with desired attributes, we can conditionally sample from these regions with gradient-based optimization or amortized actor functions. Combining attribute constraints with a universal "realism" constraint, which enforces similarity to the data distribution, we generate realistic conditional images from an unconditional variational autoencoder. Further, using gradient-based optimization, we demonstrate identity-preserving transformations that make the minimal adjustment in latent space to modify the attributes of an image. Finally, with discrete sequences of musical notes, we demonstrate zero-shot conditional generation, learning latent constraints in the absence of labeled data or a differentiable reward function.
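The gradient-based variant of conditional sampling described above can be sketched as optimizing a latent vector against a learned attribute value function and a "realism" critic, with a penalty that keeps the result close to the starting point (identity preservation). The critics below are untrained placeholders; weights, dimensions, and hyperparameters are assumptions for illustration.

```python
# Sketch of conditioning a pretrained generator without retraining it:
# gradient-based optimization of a latent code against learned critics.
import torch
import torch.nn as nn

latent_dim = 64
attribute_critic = nn.Linear(latent_dim, 1)   # stand-in: high = has attribute
realism_critic = nn.Linear(latent_dim, 1)     # stand-in: high = looks real

def constrain_latent(z_init, steps=200, lr=0.05):
    """Nudge a latent code toward regions both critics score highly,
    while staying close to the original sample (identity-preserving)."""
    z = z_init.clone().requires_grad_(True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        loss = (-attribute_critic(z).mean()          # satisfy the attribute
                - realism_critic(z).mean()           # stay on the data manifold
                + 0.1 * ((z - z_init) ** 2).mean())  # minimal adjustment
        opt.zero_grad(); loss.backward(); opt.step()
    return z.detach()

z0 = torch.randn(1, latent_dim)
z_conditioned = constrain_latent(z0)
print((z_conditioned - z0).norm())   # a small, targeted move in latent space
```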
    Abstract: Advances in machine learning have the potential to radically reshape interactions between humans and computers. Deep learning makes it possible to discover powerful representations that are capable of capturing the latent structure of high-dimensional data such as music. By creating interactive latent space "palettes" of musical sequences and timbres, we demonstrate interfaces for musical creation made possible by machine learning. We introduce an interface to intuitive, low-dimensional control spaces for high-dimensional note sequences, allowing users to explore a compositional space of melodies or drum beats in a simple 2-D grid. Furthermore, users can define 1-D trajectories in the 2-D space for autonomous, continuous morphing during improvisation. Similarly for timbre, our interface to a learned latent space of audio provides an intuitive and smooth search space for morphing between the timbres of different instruments. We remove technical and computational barriers by embedding pre-trained networks into a browser-based GPU-accelerated framework, making the systems accessible to a wide range of users while maintaining the potential for creative flexibility and personalization.
    Onsets and Frames: Dual-Objective Piano Transcription
    Curtis Hawthorne
    Erich Elsen
    Jialin Song
    Colin Raffel
    Sageev Oore
    Proceedings of the 19th International Society for Music Information Retrieval Conference, ISMIR 2018, Paris, France, 2018
    Abstract: We advance the state of the art in polyphonic piano music transcription by using a deep convolutional and recurrent neural network which is trained to jointly predict onsets and frames. Our model predicts pitch onset events and then uses those predictions to condition framewise pitch predictions. During inference, we restrict the predictions from the framewise detector by not allowing a new note to start unless the onset detector also agrees that an onset for that pitch is present in the frame. We focus on improving onsets and offsets together instead of either in isolation, as we believe this correlates better with human musical perception. Our approach results in over a 100% relative improvement in note F1 score (with offsets) on the MAPS dataset. Furthermore, we extend the model to predict relative velocities of normalized audio, which results in more natural-sounding transcriptions.
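The inference-time restriction described above (a frame may only begin a note when the onset detector agrees) is easy to state in code. The numpy sketch below uses an assumed 0.5 threshold and a toy single-pitch example; it is a simplified illustration, not the paper's decoding code.

```python
# Sketch of onset-gated frame decoding: a frame-wise activation may only start
# a new note if the onset detector also fires for that pitch in that frame.
import numpy as np

def decode_notes(frame_probs, onset_probs, threshold=0.5):
    """frame_probs, onset_probs: (n_frames, n_pitches) probabilities.
    Returns a boolean piano-roll of active notes."""
    frames = frame_probs > threshold
    onsets = onset_probs > threshold
    active = np.zeros_like(frames, dtype=bool)
    for t in range(frames.shape[0]):
        for p in range(frames.shape[1]):
            previously_on = t > 0 and active[t - 1, p]
            # A note may continue if already sounding, but may only start
            # when the onset detector agrees.
            active[t, p] = frames[t, p] and (previously_on or onsets[t, p])
    return active

frame_probs = np.array([[0.9], [0.9], [0.9], [0.2], [0.9]])
onset_probs = np.array([[0.8], [0.1], [0.1], [0.1], [0.1]])
print(decode_notes(frame_probs, onset_probs).ravel())
# [ True  True  True False False]  (last frame suppressed: no onset)
```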
    Abstract: In the quest towards general artificial intelligence (AI), researchers have explored developing loss functions that function as intrinsic motivators in the absence of external rewards. This paper takes the position that current research has overlooked an important and useful intrinsic motivator: social interaction. We posit that making an AI agent aware of implicit social feedback from humans can allow for more rapid learning of more generalizable and useful representations, and could potentially impact AI safety. We collect social feedback in the form of facial expression reactions to samples from Sketch RNN, an LSTM-based variational autoencoder designed to produce sketch drawings. We use a Latent Constraints GAN (LC-GAN) to learn from the facial feedback of a small group of viewers, and then show in an independent evaluation with 76 users that this model produced sketches that lead to significantly more smiling and less frowning than the baseline. Thus, we establish that implicit social feedback can improve the output of a deep learning model.
    Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders
    Cinjon Resnick
    Sander Dieleman
    Karen Simonyan
    Mohammad Norouzi
    ICML (2017)
    Abstract: Generative models in vision have seen rapid progress due to algorithmic improvements and the availability of high-quality image datasets. In this paper, we offer contributions in both these areas to enable similar progress in audio modeling. First, we detail a powerful new WaveNet-style autoencoder model that conditions an autoregressive decoder on temporal codes learned from the raw audio waveform. Second, we introduce NSynth, a large-scale and high-quality dataset of musical notes that is an order of magnitude larger than comparable public datasets. Using NSynth, we demonstrate improved qualitative and quantitative performance of the WaveNet autoencoder over a well-tuned spectral autoencoder baseline. Finally, we show that the model learns a manifold of embeddings that allows for morphing between instruments, meaningfully interpolating in timbre to create new types of sounds that are realistic and expressive.
    Deep Music: Towards Musical Dialogue
    Mason Bretan
    Sageev Oore
    Larry Heck
    AAAI (2017)
    Abstract: Computer dialogue systems are designed with the intention of supporting meaningful interactions with humans. Common modes of communication include speech, text, and physical gestures. In this work, we explore a communication paradigm in which the input and output channels consist of music. Specifically, we examine the musical interaction scenario of call and response. We present a system that utilizes a deep autoencoder to learn semantic embeddings of musical input. The system learns to transform these embeddings in a manner such that reconstructing from these transformation vectors produces appropriate musical responses. In order to generate a response, the system employs a combination of generation and unit selection. Selection is based on a nearest neighbor search within the embedding space, and for real-time application the search space is pruned using vector quantization. The live demo consists of a person playing a MIDI keyboard and the computer generating a response that is played through a loudspeaker.
    Abstract: In this work we develop recurrent variational autoencoders (VAEs) trained to reproduce short musical sequences and demonstrate their use as a creative device both via random sampling and data interpolation. Furthermore, by using a novel hierarchical decoder, we show that we are able to model long sequences with musical structure for both individual instruments and a three-piece band (lead, bass, and drums). Finally, we demonstrate the effectiveness of scheduled sampling in significantly improving our reconstruction accuracy.