Jump to Content
Adam Roberts

Adam Roberts

Adam Roberts earned his PhD in Computer Science from UC Berkeley with a designated emphasis in Computational and Genomic Biology. He has since focused on combining music and technology at Google, where he helped to organize the world's music knowledge for Google Play and is now applying deep learning to the generation of music and art for Google Brain team's Magenta project.
Authored Publications
Google Publications
Other Publications
Sort By
  • Title
  • Title, descending
  • Year
  • Year, descending
    Character-Aware Models Improve Visual Text Rendering
    Chitwan Saharia
    William Chan
    Sharan Narang
    Irina Blok
    RJ Mical
    Mohammad Norouzi
    Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (2023)
    Preview abstract Current image generation models struggle to reliably produce well-formed visual text. In this paper, we investigate a key contributing factor: popular text-to-image models lack character-level input features, making it much harder to predict a word's visual makeup as a series of glyphs. To quantify this effect, we conduct a series of experiments comparing character-aware vs. character-blind text encoders. In the text-only domain, we find that character-aware models provide large gains on a novel spelling task (WikiSpell). Applying our learnings to the visual domain, we train a suite of image generation models, and show that character-aware variants outperform their character-blind counterparts across a range of novel text rendering tasks (our DrawText benchmark). Our models set a much higher state-of-the-art on visual spelling, with 30+ point accuracy gains over competitors on rare words, despite training on far fewer examples. View details
    MusicLM: Generating Music From Text
    Andrea Agostinelli
    Mauro Verzetti
    Antoine Caillon
    Qingqing Huang
    Matt Sharifi
    Neil Zeghidour
    Christian Frank
    under review (2023)
    Preview abstract We introduce MusicLM, a model generating high-fidelity music from text descriptions such as "a calming violin melody backed by a distorted guitar riff". MusicLM casts the process of conditional music generation as a hierarchical sequence-to-sequence modeling task, and it generates music at 24 kHz that remains consistent over several minutes. Our experiments show that MusicLM outperforms previous systems both in audio quality and adherence to the text description. Moreover, we demonstrate that MusicLM can be conditioned on both text and a melody in that it can transform whistled and hummed melodies according to the style described in a text caption. To support future research, we publicly release MusicCaps, a dataset composed of 5.5k music-text pairs, with rich text descriptions provided by human experts. Further links: samples, MusicCaps dataset View details
    The Flan Collection: Designing Data and Methods for Effective Instruction Tuning
    Shayne Longpre
    Le Hou
    Albert Webson
    Hyung Won Chung
    Yi Tay
    Barret Zoph
    Jason Wei
    Proceedings of the 40th International Conference on Machine Learning, PMLR (2023), pp. 22631-22648
    Preview abstract We study the design decisions of publicly available instruction tuning methods, and break down the development of Flan 2022 (Chung et al., 2022). Through careful ablation studies on the Flan Collection of tasks and methods, we tease apart the effect of design decisions which enable Flan-T5 to outperform prior work by 3-17%+ across evaluation settings. We find task balancing and enrichment techniques are overlooked but critical to effective instruction tuning, and in particular, training with mixed prompt settings (zero-shot, few-shot, and chain-of-thought) actually yields stronger (2%+) performance in all settings. In further experiments, we show Flan-T5 requires less finetuning to converge higher and faster than T5 on single downstream tasks, motivating instruction-tuned models as more computationally-efficient starting checkpoints for new tasks. Finally, to accelerate research on instruction tuning, we make the Flan 2022 collection of datasets, templates, and methods publicly available at https://github.com/google-research/FLAN/tree/main/flan/v2. View details
    PaLM: Scaling Language Modeling with Pathways
    Aakanksha Chowdhery
    Sharan Narang
    Jacob Devlin
    Maarten Bosma
    Hyung Won Chung
    Sebastian Gehrmann
    Parker Schuh
    Sasha Tsvyashchenko
    Abhishek Rao
    Yi Tay
    Noam Shazeer
    Nan Du
    Reiner Pope
    James Bradbury
    Guy Gur-Ari
    Toju Duke
    Henryk Michalewski
    Xavier Garcia
    Liam Fedus
    David Luan
    Barret Zoph
    Ryan Sepassi
    David Dohan
    Shivani Agrawal
    Mark Omernick
    Marie Pellat
    Aitor Lewkowycz
    Erica Moreira
    Rewon Child
    Oleksandr Polozov
    Zongwei Zhou
    Brennan Saeta
    Michele Catasta
    Jason Wei
    Slav Petrov
    arxiv:2204.02311 (2022)
    Preview abstract Large language models have been shown to achieve remarkable performance across a variety of natural language tasks using few-shot learning, which drastically reduces the number of task-specific training examples needed to adapt the model to a particular application. To further our understanding of the impact of scale on few-shot learning, we trained a 540-billion parameter, densely activated, Transformer language model, which we call Pathways Language Model PaLM. We trained PaLM on 6144 TPU v4 chips using Pathways, a new ML system which enables highly efficient training across multiple TPU Pods. We demonstrate continued benefits of scaling by achieving state-of-the-art few-shot learning results on hundreds of language understanding and generation benchmarks. On a number of these tasks, PaLM 540B achieves breakthrough performance, outperforming the finetuned state-of-the-art on a suite of multi-step reasoning tasks, and outperforming average human performance on the recently released BIG-bench benchmark. A significant number of BIG-bench tasks showed discontinuous improvements from model scale, meaning that performance steeply increased as we scaled to our largest model. PaLM also has strong capabilities in multilingual tasks and source code generation, which we demonstrate on a wide array of benchmarks. We additionally provide a comprehensive analysis on bias and toxicity, and study the extent of training data memorization with respect to model scale. Finally, we discuss the ethical considerations related to large language models and discuss potential mitigation strategies. View details
    LaMDA: Language Models for Dialog Applications
    Aaron Daniel Cohen
    Alena Butryna
    Alicia Jin
    Apoorv Kulshreshtha
    Ben Zevenbergen
    Chung-ching Chang
    Cosmo Du
    Daniel De Freitas Adiwardana
    Dehao Chen
    Dmitry (Dima) Lepikhin
    Erin Hoffman-John
    Hongrae Lee
    Igor Krivokon
    James Qin
    Jamie Hall
    Joe Fenton
    Johnny Soraker
    Lora Mois Aroyo
    Maarten Paul Bosma
    Marc Joseph Pickett
    Marcelo Amorim Menegali
    Marian Croak
    Maxim Krikun
    Meredith Ringel Morris
    Noam Shazeer
    Rachel Bernstein
    Ravi Rajakumar
    Ray Kurzweil
    Romal Thoppilan
    Steven Zheng
    Taylor Bos
    Toju Duke
    Tulsee Doshi
    Vincent Y. Zhao
    Will Rusch
    Yuanzhong Xu
    arXiv (2022)
    Preview abstract We present LaMDA: Language Models for Dialog Applications. LaMDA is a family of Transformer-based neural language models specialized for dialog, which have up to 137B parameters and arepre-trained on 1.56T words of public dialog data and web text. While model scaling alone canimprove quality, it shows less improvements on safety and factual grounding. We demonstrate thatfine-tuning with annotated data and enabling the model to consult external knowledge sources canlead to significant improvements towards the two key challenges of safety and factual grounding.The first challenge, safety, involves ensuring that the model’s responses are consistent with a set ofhuman values, such as preventing harmful suggestions and unfair bias. We quantify safety using ametric based on an illustrative set of values, and we find that filtering candidate responses using aLaMDA classifier fine-tuned with a small amount of crowdworker-annotated data offers a promisingapproach to improving model safety. The second challenge, factual grounding, involves enabling themodel to consult external knowledge sources, such as an information retrieval system, a languagetranslator, and a calculator. We quantify factuality using a groundedness metric, and we find that ourapproach enables the model to generate responses grounded in known sources, rather than responsesthat merely sound plausible. Finally, we explore the use of LaMDA in the domains of education andcontent recommendations, and analyze their helpfulness and role consistency. View details
    mT5: A massively multilingual pre-trained text-to-text transformer
    Linting Xue
    Aditya Barua
    Colin Raffel
    Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL 2021), Association for Computational Linguistics, Online, pp. 483-498
    Preview abstract The recent “Text-to-Text Transfer Transformer” (T5) leveraged a unified text-to-text format and scale to attain state-of-the-art results on a wide variety of English-language NLP tasks. In this paper, we introduce mT5, a multilingual variant of T5 that was pre-trained on a new Common Crawl-based dataset covering 101 languages. We detail the design and modified training of mT5 and demonstrate its state-of-the-art performance on many multilingual benchmarks. We also describe a simple technique to prevent “accidental translation” in the zero-shot setting, where a generative model chooses to (partially) translate its prediction into the wrong language. All of the code and model checkpoints used in this work are publicly available. View details
    DDSP: Differentiable Digital Signal Processing
    Lamtharn (Hanoi) Hantrakul
    Chenjie Gu
    ICLR 2020 (2020)
    Preview abstract Most generative models of audio directly generate samples in one of two domains: time or frequency. While sufficient to express any signal, these representations are inefficient, as they do not utilize existing knowledge of how sound is generated and perceived. A third approach (vocoders/synthesizers) successfully incorporates strong domain knowledge of signal processing and perception, but has been less actively researched due to limited expressivity and difficulty integrating with modern auto-differentiation-based machine learning methods. In this paper, we introduce the Differentiable Digital Signal Processing (DDSP) library, which enables direct integration of classic signal processing elements with deep learning methods. Focusing on audio synthesis, we achieve high-fidelity generation without the need for large autoregressive models or adversarial losses, demonstrating that DDSP enables utilizing strong inductive biases without losing the expressive power of neural networks. Further, we show that combining interpretable modules permits manipulation of each separate model component, with applications such as independent control of pitch and loudness, realistic extrapolation to pitches not seen during training, blind dereverberation of room acoustics, transfer of extracted room acoustics to new environments, and transformation of timbre between disparate sources. In short, DDSP enables an interpretable and modular approach to generative modeling, without sacrificing the benefits of deep learning. The library will be made available upon paper acceptance and we encourage further contributions from the community and domain experts. View details
    Self-supervised Pitch Detection by Inverse Audio Synthesis
    Lamtharn (Hanoi) Hantrakul
    Rigel Jacob Swavely
    Curtis Glenn-Macway Hawthorne
    ICML 2020 Workshop on Self-supervision in Audio and Speech (2020) (to appear)
    Preview abstract Audio scene understanding, parsing sound into a hierarchy of meaningful parts, is an open problem in representation learning. Sound is a particularly challenging domain due to its high dimensionality, sequential dependencies and hierarchical structure. Differentiable Digital Signal Processing (DDSP) greatly simplifies the forward problem of generating audio by introducing differentiable synthesizer and effects modules that combine strong signal priors with end-to-end learning. Here, we focus on the inverse problem, inferring synthesis parameters to approximate an audio scene. We demonstrate that DDSP modules can enable a new approach to self-supervision, generating synthetic audio with differentiable synthesizers and training feature extractor networks to infer the synthesis parameters. By building a hierarchy from sinusoidal to harmonic representations, we show that it possible to use such an inverse modeling approach to disentangle pitch from timbre, an important task in audio scene understanding. View details
    Fast and Flexible Neural Audio Synthesis
    Lamtharn (Hanoi) Hantrakul
    Chenjie Gu
    ISMIR 2019 (2019) (to appear)
    Preview abstract Autoregressive neural networks, such as WaveNet, have opened up new avenues for expressive audio synthesis. High-quality speech synthesis utilizes detailed linguistic features for conditioning, but comparable levels control have yet to be realized for musical instruments. Here, we demonstrate an autoregressive model capable of synthesizing realistic audio that closely follows fine-scale temporal conditioning for loudness and fundamental frequency. We find the appropriate choice of conditioning features and architectures improves both the quantitative accuracy of audio resynthesis and qualitative responsiveness to creative manipulation of conditioning. While large autoregressive models generate audio much slower than realtime, we achieve these results with a much more efficient WaveRNN model, opening the door for exploring real-time interactive audio synthesis with neural networks. View details
    Bach Doodle: Approachable music composition with machine learning at scale
    Curtis Hawthorne
    Monica Dinculescu
    Leon Hong
    Jacob Howcroft
    Proceedings of the 20th International Society for Music Information Retrieval Conference (ISMIR) (2019)
    Preview abstract Many of us like music, but composing can feel intimidating, not knowing where to begin. Even when we have a melody, without sufficient skills in harmony we are deterred from developing it into a composition. Machine learning could potentially extend our creative abilities by offering generative models that can fill in the missing parts of our composition. To make music composition more approachable, we designed a composition web-app where users can create their own melody and have it harmonized by a machine learning model. For inputting melodies, we designed a simplified sheet music interface that facilitates easy trial and error, and found that users adapted to it quickly even when they were not familiar with western music notation. Users can rapidly explore different possibilities in harmonizations by tweaking their melody and requesting for new harmonizations. The harmonizations are provided by Coconet, a flexible generative model of counterpoint. Several technical challenges had to be overcome to support an interactive experience at scale. First, as most users do not have dedicated hardware to run machine learning models, we re-implemented Coconet in TensorFlow.js so that it could run in the browser. Second, our initial re-implementation took more than 40 seconds to generate two measures of music. By adopting dilated depth-wise separable convolutions and model quantization, we reduced it down to 2 seconds. Third, to prepare for large-scale deployment, we calibrated a speed test to determine if a user’s device is fast enough for running the model in the browser, if not the harmonization requests were sent to remote TPU servers. In three days, the web-app received more than 50 million queries for harmonization around the world. Users could choose to rate their compositions and contribute them to a public dataset, which we are releasing with this paper. We hope that the community might find this dataset useful for, ranging from ethnomusicological studies, to music education to improving machine learning models. We end with a quote from a user: “It's really fun to play with. This might be the first time in my life I feel competent at music.” View details
    Preview abstract Generating musical audio directly with neural networks is notoriously difficult because it requires coherently modeling both long- and short-term structure. Fortunately, most music is also highly structured and primarily composed of discrete note events played on musical instruments. Herein, we show that by using notes as an intermediate representation, we can train a suite of models capable of transcribing, composing, and synthesizing audio waveforms with coherent musical structure on timescales spanning six orders of magnitude (~0.01 ms (8 kHz) to ~100 s). This large advance in the state of the art is enabled by our release of the new MAESTRO (MIDI and Audio Edited for Synchronous TRacks and Organization) dataset, composed of over 172 hours of virtuosic piano performances captured with fine alignment (~3 ms) between note labels and audio waveforms. The networks and the dataset together present a promising approach toward creating new expressive and interpretable neural models of music. View details
    Preview abstract We explore models for translating abstract musical ideas (scores, rhythms) into expressive performances using seq2seq and recurrent variational information bottleneck (VIB) models. Though seq2seq models usually require painstakingly aligned corpora, we show that it is possible to adapt an approach from the Generative Adversarial Network (GAN) literature (e.g. Pix2Pix, Vid2Vid) to sequences, creating large volumes of paired data by performing simple transformations and training generative models to plausibly invert these transformations. Music, and drumming in particular, provides a strong test case for this approach because many common transformations (quantization, removing voices) have clear semantics, and learning to invert them has real-world applications. Focusing on the case of drum set players, we create and release a new dataset for this purpose, containing over 13 hours of recordings by professional drummers aligned with fine-grained timing and dynamics information. We also explore some of the creative potential of these models, demonstrating improvements on state-of-the-art methods for Humanization (instantiating a performance from a musical score). View details
    Preview abstract One of the areas of interest for music generative models is to empower individual expression. But how can a creator personalize a machine learning model to make it their own? Training a custom deep neural network model like Music Transformer, MusicVAE or SketchRNN from scratch requires significant amounts of data (millions of examples) and compute resources (specialized hardware like GPUs/TPUs) as well as expertise in hyper parameter tuning. Without sufficient data, models are either unable to produce realistic output (underfitting), or they memorize the training examples and are unable to generalize to produce varied outputs (overfitting) – it would be like trying to learn all of music theory from a single song. We introduce a new model for sample-efficient adaptation to user data, based on prior work by Engel et al [1]. We can quickly train this small, personalized model to control a much larger, more general pretrained latent variable model. This allows us to generate samples from only the portions of the latent space we are interested in without having to retrain the large model from scratch. We demonstrate this technique in an online demo, that lets users upload their own MIDI files (either melodies or multi-instrument songs) and generate samples that sound like their input. View details
    Preview abstract Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a lower-resource downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning for NLP by introducing a unified framework which casts every language problem as a text-to-text task. Our systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of text understanding tasks. By combining the insights gained in our exploration with scale and a new giant unlabeled text dataset, we achieve state-of-the-art results in most of the tasks we consider. To facilitate future work on text understanding, we release our dataset, pre-trained models, and code. View details
    Magenta Studio: Augmenting Creativity with Deep Learning in Ableton Live
    Yotam Mann
    Jon Gillick
    Monica Dinculescu
    Carey Radebaugh
    Curtis Hawthorne
    Proceedings of the International Workshop on Musical Metacreation (MUME) (2019)
    Preview abstract The field of Musical Metacreation (MuMe) has pro-duced impressive results for both autonomous and in-teractive creativity. However, there are few examplesof these systems crossing over to the “mainstream” ofmusic creation and consumption. We tie together ex-isting frameworks (Electron, TensorFlow.js, and MaxFor Live) to develop a system whose purpose is tobring the promise of interactive MuMe to the realmof professional music creators. Combining compellingapplications of deep learning based music generationwith a focus on ease of installation and use in a pop-ular DAW, we hope to expose more musicians and pro-ducers to the potential of using such systems in theircreative workflows. Our suite of plug-ins for AbletonLive, named Magenta Studio, is available for downloadathttp://g.co/magenta/studioalong with itsopen source implementation. View details
    GANSynth: Adversarial Neural Audio Synthesis
    Kumar Krishna Agrawal
    Shuo Chen
    Ishaan Gulrajani
    Chris Donahue
    ICLR (2019)
    Preview abstract Efficient audio synthesis is an inherently difficult machine learning task, as human perception is sensitive to both global structure and fine-scale waveform coherence. Autoregressive models, such as WaveNet, model local structure at the expense of global latent structure and slow iterative sampling, while Generative Adversarial Networks (GANs), have global latent conditioning and efficient parallel sampling, but struggle to generate locally-coherent audio waveforms. Herein, we demonstrate that GANs can in fact generate high-fidelity and locally-coherent audio by modelling log magnitudes and instantaneous frequencies with sufficient frequency resolution in the spectral domain. Through extensive emperical investigations on the difficult NSynth dataset, we demonstrate that GANs are able to outperform strong WaveNet baselines on automated and human evaluation metrics, and efficiently generate audio several orders of magnitude faster than their autoregressive counterparts. View details
    Preview abstract Creative generative machine learning interfaces are stronger when multiple actors bearing different points of view actively contribute to them. User experience (UX) research and design involvement in the creation of machine learning (ML) models help ML research scientists to more effectively identify human needs that ML models will fulfill. The People and AI Research (PAIR) group within Google developed a novel program method in which UXers are embedded into an ML research group for three months to provide a human-centered perspective on the creation of ML models. The first full-time cohort of UXers were embedded in a team of ML research scientists focused on deep generative models to assist in music composition. Here, we discuss the structure and goals of the program, challenges we faced during execution, and insights gained as a result of the process. We offer practical suggestions for how to foster communication between UX and ML research teams and recommended UX design processes for building creative generative machine learning interfaces. View details
    Onsets and Frames: Dual-Objective Piano Transcription
    Curtis Hawthorne
    Erich Elsen
    Jialin Song
    Ian Simon
    Colin Raffel
    Sageev Oore
    Proceedings of the 19th International Society for Music Information Retrieval Conference, ISMIR 2018, Paris, France, 2018
    Preview abstract We advance the state of the art in polyphonic piano music transcription by using a deep convolutional and recurrent neural network which is trained to jointly predict onsets and frames. Our model predicts pitch onset events and then uses those predictions to condition framewise pitch predictions. During inference, we restrict the predictions from the framewise detector by not allowing a new note to start unless the onset detector also agrees that an onset for that pitch is present in the frame. We focus on improving onsets and offsets together instead of either in isolation as we believe this correlates better with human musical perception. Our approach results in over a 100% relative improvement in note F1 score (with offsets) on the MAPS dataset. Furthermore, we extend the model to predict relative velocities of normalized audio which results in more natural-sounding transcriptions. View details
    Preview abstract Deep generative neural networks have proven effective at both conditional and unconditional modeling of complex data distributions. Conditional generation enables interactive control, but creating new controls often requires expensive retraining. In this paper, we develop a method to condition generation without retraining the model. By post-hoc learning latent constraints, value functions that identify regions in latent space that generate outputs with desired attributes, we can conditionally sample from these regions with gradient-based optimization or amortized actor functions. Combining attribute constraints with a universal “realism” constraint, which enforces similarity to the data distribution, we generate realistic conditional images from an unconditional variational autoencoder. Further, using gradient-based optimization, we demonstrate identity-preserving transformations that make the minimal adjustment in latent space to modify the attributes of an image. Finally, with discrete sequences of musical notes, we demonstrate zero-shot conditional generation, learning latent constraints in the absence of labeled data or a differentiable reward function. View details
    Magenta.js: A JavaScript API for Augmenting Creativity with Deep Learning
    Curtis Hawthorne
    Ian Simon
    Joint Workshop on Machine Learning for Music (ICML) (2018)
    Preview abstract New methods for modeling musical scores, often based on deep learning, have made it possible to automatically generate ever more convincing compositions. However, to enhance the creativity of a new generation of artists, they must be empowered to easily play and experiment with these models on their own terms. Application developers are already imagining ways to apply their design expertise to this new class of AI technologies but often lack a sufficient background in machine learning. Magenta.js is a new open source library with a simple JavaScript API intended to bridge this gap by abstracting away technical details, making it easier than ever for app developers to create new interfaces to generative models. Furthermore, because it is easily extensible, we hope that Magenta.js can foster a connection between the broader research community and creative developers through contributions from both groups. Finally, Magenta.js will open up the possibility for a new type of compositional tool that adapts to user preferences and behaviors in real-time. Code and documentation are available online at https://goo.gl/magenta/js. View details
    Preview abstract Advances in machine learning have the potential to radically reshape interactions between humans and computers. Deep learning makes it possible to discover powerful representations that are capable of capturing the latent structure of highdimensional data such as music. By creating interactive latent space “palettes” of musical sequences and timbres, we demonstrate interfaces for musical creation made possible by machine learning. We introduce an interface to the intuitive, low-dimensional control spaces for high-dimensional note sequences, allowing users to explore a compositional space of melodies or drum beats in a simple 2-D grid. Furthermore, users can define 1-D trajectories in the 2-D space for autonomous, continuous morphing during improvisation. Similarly for timbre, our interface to a learned latent space of audio provides an intuitive and smooth search space for morphing between the timbres of different instruments. We remove technical and computational barriers by embedding pre-trained networks into a browser-based GPU-accelerated framework, making the systems accessible to a wide range of users while maintaining potential for creative flexibility and personalization. View details
    A Hierarchical Latent Vector Model for Learning Long-Term Structure in Music
    Colin Raffel
    Curtis Hawthorne
    International Conference on Machine Learning (ICML) (2018)
    Preview abstract The Variational Autoencoder (VAE) has proven to be an effective model for producing semantically meaningful latent representations for natural data. However, it has thus far seen limited application to sequential data, and, as we demonstrate, existing recurrent VAE models have difficulty modeling sequences with long-term structure. To address this issue, we propose the use of a hierarchical decoder, which first outputs embeddings for subsequences of the input and then uses these embeddings to generate each subsequence independently. This structure encourages the model to utilize its latent code, thereby avoiding the "posterior collapse" problem which remains an issue for recurrent VAEs. We apply this architecture to modeling sequences of musical notes and find that it exhibits dramatically better sampling, interpolation, and reconstruction performance than a "flat" baseline model. An implementation of our "MusicVAE" is available online at https://goo.gl/magenta/musicvae-code. View details
    Preview abstract In this work we develop recurrent variational autoencoders (VAEs) trained to reproduce short musical sequences and demonstrate their use as a creative device both via random sampling and data interpolation. Furthermore, by using a novel hierarchical decoder, we show that we are able to model long sequences with musical structure for both individual instruments and a three-piece band (lead, bass, and drums). Finally, we demonstrate the effectiveness of scheduled sampling in significantly improving our reconstruction accuracy. View details
    Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders
    Cinjon Resnick
    Sander Dieleman
    Karen Simonyan
    Mohammad Norouzi
    ICML (2017)
    Preview abstract Generative models in vision have seen rapid progress due to algorithmic improvements and the availability of high-quality image datasets. In this paper, we offer contributions in both these areas to enable similar progress in audio modeling. First, we detail a powerful new WaveNet-style autoencoder model that conditions an autoregressive decoder on temporal codes learned from the raw audio waveform. Second, we introduce NSynth, a large-scale and high-quality dataset of musical notes that is an order of magnitude larger than comparable public datasets. Using NSynth, we demonstrate improved qualitative and quantitative performance of the WaveNet autoencoder over a well-tuned spectral autoencoder baseline. Finally, we show that the model learns a manifold of embeddings that allows for morphing between instruments, meaningfully interpolating in timbre to create new types of sounds that are realistic and expressive. View details
    Counterpoint by Convolution
    Tim Cooijmans
    Aaron Courville
    Proceedings of ISMIR 2017
    Preview abstract Machine learning models of music typically break down the task of composition into a chronological process, composing a piece of music in a single pass from beginning to end. On the contrary, human composers write music in a nonlinear fashion, scribbling motifs here and there, often revisiting choices previously made. We explore the use of blocked Gibbs sampling as an analogue to the human approach, and introduce COCONET, a convolutional neural network in the NADE family of generative models (Uria et al., 2016). Despite ostensibly sampling from the same distribution as the NADE ancestral sampling procedure, we find that a blocked Gibbs approach significantly improves sample quality. We provide evidence that this is due to some conditional distributions being poorly modeled. Moreover, we show that even the cheap approximate blocked Gibbs procedure from Yao et al. (2014) yields better samples than ancestral sampling. We demonstrate the versatility of our method on unconditioned polyphonic music generation. View details
    Multi-Task Convolutional Music Models
    Cinjon Resnick
    Diego Ardila
    BayLearn (2016)
    Preview abstract The paper is itself a short abstract for BayLearn. View details
    Audio Deepdream: Optimizing raw audio with convolutional networks
    Cinjon Resnick
    Diego Ardila
    International Society for Music Information Retrieval Conference, Google Brain (2016)
    Preview abstract The hallucinatory images of DeepDream opened up the floodgates for a recent wave of artwork generated by neural networks. In this work, we take first steps to applying this to audio. We believe a key to solving this problem is training a deep neural network to perform a music perception task on raw audio. Consequently, we have followed in the footsteps of Van den Oord et al and trained a network to predict embeddings that were themselves the result of a collaborative filtering model. A key difference is that we learn features directly from the raw audio, which creates a chain of differentiable functions from raw audio to high level features. We then use gradient descent on the network to extract samples of "dreamed" audio. View details
    Streaming fragment assignment for real-time analysis of sequencing experiments.
    Lior Pachter
    Nature Methods, vol. 10 (2013)
    Ambiguous fragment assignment for high-throughput sequencing experiments.
    Ph.D. Thesis, University of California, Berkeley (2013)
    Improving RNA-Seq expression estimates by correcting for fragment bias
    Cole Trapnell
    Julie Donaghey
    John L Rinn
    Lior Pachter
    Genome Biology, vol. 12 (2010)