Douglas Eck

Doug is a Senior Research Director at Google, and leads research efforts at Google DeepMind in Generative Media, including image, video, 3D, music and audio generation. He also leads a broader group active in areas including Fundamental Learning Algorithms, Natural Language Processing, Multimodal Learning, Reinforcement Learning, Computer Vision and Generative Models. His own research lies at the intersection of machine learning and human-computer interaction (HCI). Doug created Magenta, an ongoing research project exploring the role of AI in art and music creation. He is also an advocate for PAIR, a multidisciplinary team that explores the human side of AI through fundamental research, building tools, creating design frameworks, and working with diverse communities.

Before joining Google in 2010, Doug did research in music perception, aspects of music performance, machine learning for large audio datasets and music recommendation. He completed his PhD in Computer Science and Cognitive Science at Indiana University in 2000 and went on to a postdoctoral fellowship with Juergen Schmidhuber at IDSIA in Lugano, Switzerland. From 2003 to 2010, Doug was faculty in Computer Science in the University of Montreal machine learning group (now MILA machine learning lab), where he became Associate Professor.

Authored Publications
    PaLM: Scaling Language Modeling with Pathways
    Sharan Narang
    Jacob Devlin
    Maarten Bosma
    Hyung Won Chung
    Sebastian Gehrmann
    Parker Schuh
    Sasha Tsvyashchenko
    Abhishek Rao
    Yi Tay
    Noam Shazeer
    Nan Du
    Reiner Pope
    James Bradbury
    Guy Gur-Ari
    Toju Duke
    Henryk Michalewski
    Xavier Garcia
    Liam Fedus
    David Luan
    Barret Zoph
    Ryan Sepassi
    David Dohan
    Shivani Agrawal
    Mark Omernick
    Marie Pellat
    Aitor Lewkowycz
    Erica Moreira
    Rewon Child
    Oleksandr Polozov
    Zongwei Zhou
    Michele Catasta
    Jason Wei
    arxiv:2204.02311 (2022)
    Preview abstract Large language models have been shown to achieve remarkable performance across a variety of natural language tasks using few-shot learning, which drastically reduces the number of task-specific training examples needed to adapt the model to a particular application. To further our understanding of the impact of scale on few-shot learning, we trained a 540-billion parameter, densely activated, Transformer language model, which we call Pathways Language Model (PaLM). We trained PaLM on 6144 TPU v4 chips using Pathways, a new ML system which enables highly efficient training across multiple TPU Pods. We demonstrate continued benefits of scaling by achieving state-of-the-art few-shot learning results on hundreds of language understanding and generation benchmarks. On a number of these tasks, PaLM 540B achieves breakthrough performance, outperforming the finetuned state-of-the-art on a suite of multi-step reasoning tasks, and outperforming average human performance on the recently released BIG-bench benchmark. A significant number of BIG-bench tasks showed discontinuous improvements from model scale, meaning that performance steeply increased as we scaled to our largest model. PaLM also has strong capabilities in multilingual tasks and source code generation, which we demonstrate on a wide array of benchmarks. We additionally provide a comprehensive analysis on bias and toxicity, and study the extent of training data memorization with respect to model scale. Finally, we discuss the ethical considerations related to large language models and discuss potential mitigation strategies. View details
    Preview abstract As large language models scale up, researchers and engineers have chosen to use larger datasets of loosely-filtered internet text instead of curated texts. We find that existing NLP datasets are highly repetitive and contain duplicated examples. For example, there is an example in the training dataset C4 that has over 200,000 near duplicates. As a whole, we find that 1.68% of C4 consists of near-duplicates. Worse, we find a 1% overlap between the training and testing sets in these datasets. Duplicate examples in training data inappropriately bias the distribution of rare/common sequences. Models trained with non-deduplicated datasets are more likely to generate "memorized" examples. Additionally, if those models are used for downstream applications, such as scoring likelihoods of given sequences, we find that models trained on non-deduplicated and deduplicated datasets have a difference in accuracy of on average TODO. View details
    Preview abstract Joint attention — the ability to purposefully coordinate your attention with another person, and mutually attend to the same thing — is an important milestone in human cognitive development. In this paper, we ask whether joint attention can be useful as a mechanism for improving multi-agent coordination and social learning. We first develop deep reinforcement learning (RL) agents with a recurrent visual attention architecture. We then train agents to minimize the difference between the attention weights that they apply to the environment at each timestep, and the attention of other agents. Our results show that this joint attention incentive improves agents’ ability to solve difficult coordination tasks, by helping overcome the problem of exploring the combinatorial multi-agent action space. Joint attention leads to higher performance than a competitive centralized critic baseline across multiple environments. Further, we show that joint attention enhances agents’ ability to learn from experts present in their environment, even when performing single-agent tasks. Taken together, these findings suggest that joint attention may be a useful inductive bias for improving multi-agent learning. View details
    Preview abstract Social learning is a key component of human and animal intelligence. By taking cues from the behavior of experts in their environment, social learners can acquire sophisticated behavior and rapidly adapt to new circumstances. This paper investigates whether independent reinforcement learning (RL) agents in a multi-agent environment can learn to use social learning to improve their performance. We find that in most circumstances, vanilla model-free RL agents do not use social learning. We analyze the reasons for this deficiency, and show that by imposing constraints on the training environment and introducing a model-based auxiliary loss we are able to obtain generalized social learning policies which enable agents to: i) discover complex skills that are not learned from single-agent training, and ii) adapt online to novel environments by taking cues from experts present in the new environment. In contrast, agents trained with model-free RL or imitation learning generalize poorly and do not succeed in the transfer tasks. By mixing multi-agent and solo training, we can obtain agents that use social learning to gain skills that they can deploy when alone, even out-performing agents trained alone from the start. View details
    Automatic Detection of Generated Text is Easiest when Humans are Fooled
    Chris Callison-Burch
    Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (2020), pp. 1808-1822
    Preview abstract Recent advancements in neural language modelling make it possible to rapidly generate vast amounts of human-sounding text. The capabilities of humans and automatic discriminators to detect machine-generated text have been a large source of research interest, but humans and machines rely on different cues to make their decisions. Here, we perform careful benchmarking and analysis of three popular sampling-based decoding strategies (top-k, nucleus sampling, and untruncated random sampling) and show that improvements in decoding methods have primarily optimized for fooling humans. This comes at the expense of introducing statistical abnormalities that make detection easy for automatic systems. We also show that though both human and automatic detector performance improve with longer excerpt length, even multi-sentence excerpts can fool expert human raters over 30% of the time. Our findings reveal the importance of using both human and automatic detectors to assess the humanness of text generation systems. View details
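    The three decoding strategies named in the abstract above differ only in how they truncate the model's next-token distribution before sampling. The following is a minimal sketch in plain Python, not the authors' code: the function names and the toy distribution are illustrative assumptions, and a real system would operate on model logits rather than a hand-written list of probabilities.

```python
import random

def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize (top-k truncation)."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = set(order[:k])
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]

def nucleus_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, mass = set(), 0.0
    for i in order:
        keep.add(i)
        mass += probs[i]
        if mass >= p:
            break
    return [probs[i] / mass if i in keep else 0.0 for i in range(len(probs))]

def sample(probs, rng=random):
    """Draw one token index from a (possibly truncated) distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Hypothetical next-token distribution over a 4-token vocabulary.
probs = [0.5, 0.25, 0.125, 0.125]
untruncated = sample(probs)                    # untruncated random sampling
from_top_k = sample(top_k_filter(probs, 2))    # top-k with k = 2
from_nucleus = sample(nucleus_filter(probs, 0.7))  # nucleus with p = 0.7
```

    Both truncation methods zero out the low-probability tail, which is one plausible source of the statistical abnormalities the abstract says automatic detectors can pick up.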
    Learning via Social Awareness: Improving a Deep Generative Sketching Model with Facial Feedback
    Jennifer McCleary
    David Ha
    Fred Bertsch
    Rosalind Picard
    International Joint Conference on Artificial Intelligence (IJCAI) 2018 (2020), pp. 1-9
    Preview abstract A known deficit of modern machine learning (ML) and deep learning (DL) methodology is that models must be carefully fine-tuned in order to solve a particular task. Most algorithms cannot generalize well to even highly similar tasks, let alone exhibit signs of general artificial intelligence (AGI). To address this problem, researchers have explored developing loss functions that act as intrinsic motivators that could motivate an ML or DL agent to learn across a number of domains. This paper argues that an important and useful intrinsic motivator is that of social interaction. We posit that making an AI agent aware of implicit social feedback from humans can allow for faster learning of more generalizable and useful representations, and could potentially impact AI safety. We collect social feedback in the form of facial expression reactions to samples from Sketch RNN, an LSTM-based variational autoencoder (VAE) designed to produce sketch drawings. We use a Latent Constraints GAN (LC-GAN) to learn from the facial feedback of a small group of viewers, by optimizing the model to produce sketches that it predicts will lead to more positive facial expressions. We show in multiple independent evaluations that the model trained with facial feedback produced sketches that are more highly rated, and induce significantly more positive facial expressions. Thus, we establish that implicit social feedback can improve the output of a deep learning model. View details
    Towards Better Storylines with Sentence-Level Language Models
    David Grangier
    Chris Callison-Burch
    Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (2020), pp. 1808-1822
    Preview abstract This work proposes a sentence-level language model which predicts the next sentence in a story given the embeddings of the previous sentences. The model operates at the sentence level and selects the next sentence within a finite set of fluent alternatives. By working with sentence embeddings instead of word embeddings, our model is able to efficiently consider a large number of alternative sentences. By considering only fluent sentences, our model is relieved from modeling fluency and can focus on longer range dependencies. Our method achieves state-of-the-art accuracy on the StoryCloze task in the unsupervised setting. View details
    Unsupervised Hierarchical Story Infilling
    David Grangier
    Chris Callison-Burch
    NAACL 2019 Workshop on Narrative Understanding, Minneapolis, MN (2019)
    Preview abstract Story infilling involves predicting words to go into a missing span from a story. This challenging task has the potential to transform interactive tools for creative writing. However, state-of-the-art conditional language models have trouble balancing fluency and coherence with novelty and diversity. We address this limitation with a hierarchical model which first selects a set of rare words and then generates text conditioned on that set. By relegating the high entropy task of picking rare words to a word-sampling model, the second-stage model conditioned on those words can achieve high fluency and coherence by searching for likely sentences, without sacrificing diversity. View details
    Preview abstract Generating musical audio directly with neural networks is notoriously difficult because it requires coherently modeling both long- and short-term structure. Fortunately, most music is also highly structured and primarily composed of discrete note events played on musical instruments. Herein, we show that by using notes as an intermediate representation, we can train a suite of models capable of transcribing, composing, and synthesizing audio waveforms with coherent musical structure on timescales spanning six orders of magnitude (~0.1 ms (8 kHz) to ~100 s). This large advance in the state of the art is enabled by our release of the new MAESTRO (MIDI and Audio Edited for Synchronous TRacks and Organization) dataset, composed of over 172 hours of virtuosic piano performances captured with fine alignment (~3 ms) between note labels and audio waveforms. The networks and the dataset together present a promising approach toward creating new expressive and interpretable neural models of music. View details
    A Learned Representation of Scalable Vector Graphics
    Rapha Gontijo Lopes
    David Ha
    Jon Shlens
    ICCV (2019)
    Preview abstract Dramatic advances in generative models have resulted in near photographic quality for artificially rendered faces, animals and other objects in the natural world. In spite of such advances, a higher level understanding of vision and imagery does not arise from exhaustively modeling an object, but instead identifying higher-level attributes that best summarize the aspects of an object. In this work we attempt to model the drawing process of fonts by building sequential generative models of vector graphics. This model has the benefit of providing a scale-invariant representation for imagery whose latent representation may be systematically manipulated and exploited to perform style propagation. We demonstrate these results on a large dataset of fonts crawled from the web and highlight how such a model captures the statistical dependencies and richness of this dataset. We envision that our model can find use as a tool for graphic designers to facilitate font design. View details
    Preview abstract We explore models for translating abstract musical ideas (scores, rhythms) into expressive performances using seq2seq and recurrent variational information bottleneck (VIB) models. Though seq2seq models usually require painstakingly aligned corpora, we show that it is possible to adapt an approach from the Generative Adversarial Network (GAN) literature (e.g. Pix2Pix, Vid2Vid) to sequences, creating large volumes of paired data by performing simple transformations and training generative models to plausibly invert these transformations. Music, and drumming in particular, provides a strong test case for this approach because many common transformations (quantization, removing voices) have clear semantics, and learning to invert them has real-world applications. Focusing on the case of drum set players, we create and release a new dataset for this purpose, containing over 13 hours of recordings by professional drummers aligned with fine-grained timing and dynamics information. We also explore some of the creative potential of these models, demonstrating improvements on state-of-the-art methods for Humanization (instantiating a performance from a musical score). View details
    Preview abstract Creative generative machine learning interfaces are stronger when multiple actors bearing different points of view actively contribute to them. User experience (UX) research and design involvement in the creation of machine learning (ML) models help ML research scientists to more effectively identify human needs that ML models will fulfill. The People and AI Research (PAIR) group within Google developed a novel program method in which UXers are embedded into an ML research group for three months to provide a human-centered perspective on the creation of ML models. The first full-time cohort of UXers were embedded in a team of ML research scientists focused on deep generative models to assist in music composition. Here, we discuss the structure and goals of the program, challenges we faced during execution, and insights gained as a result of the process. We offer practical suggestions for how to foster communication between UX and ML research teams and recommended UX design processes for building creative generative machine learning interfaces. View details
    Preview abstract Music relies heavily on repetition to build structure and meaning. Self-reference occurs on multiple timescales, from motifs to phrases to reusing of entire sections of music, such as in pieces with ABA structure. The Transformer (Vaswani et al., 2017), a sequence model based on self-attention, has achieved compelling results in many generation tasks that require maintaining long-range coherence. This suggests that self-attention might also be well-suited to modeling music. In musical composition and performance, however, relative timing is critically important. Existing approaches for representing relative positional information in the Transformer modulate attention based on pairwise distance (Shaw et al., 2018). This is impractical for long sequences such as musical compositions since their memory complexity for intermediate relative information is quadratic in the sequence length. We propose an algorithm that reduces their intermediate memory requirement to linear in the sequence length. This enables us to demonstrate that a Transformer with our modified relative attention mechanism can generate minute-long compositions (thousands of steps, four times the length modeled in Oore et al., 2018) with compelling structure, generate continuations that coherently elaborate on a given motif, and in a seq2seq setup generate accompaniments conditioned on melodies. We evaluate the Transformer with our relative attention mechanism on two datasets, JSB Chorales and Piano-e-Competition, and obtain state-of-the-art results on the latter. View details
    Magenta Studio: Augmenting Creativity with Deep Learning in Ableton Live
    Yotam Mann
    Jon Gillick
    Monica Dinculescu
    Carey Radebaugh
    Curtis Hawthorne
    Proceedings of the International Workshop on Musical Metacreation (MUME) (2019)
    Preview abstract The field of Musical Metacreation (MuMe) has produced impressive results for both autonomous and interactive creativity. However, there are few examples of these systems crossing over to the “mainstream” of music creation and consumption. We tie together existing frameworks (Electron, TensorFlow.js, and Max For Live) to develop a system whose purpose is to bring the promise of interactive MuMe to the realm of professional music creators. Combining compelling applications of deep learning based music generation with a focus on ease of installation and use in a popular DAW, we hope to expose more musicians and producers to the potential of using such systems in their creative workflows. Our suite of plug-ins for Ableton Live, named Magenta Studio, is available for download at http://g.co/magenta/studio along with its open source implementation. View details
    Preview abstract Advances in machine learning have the potential to radically reshape interactions between humans and computers. Deep learning makes it possible to discover powerful representations that are capable of capturing the latent structure of high-dimensional data such as music. By creating interactive latent space “palettes” of musical sequences and timbres, we demonstrate interfaces for musical creation made possible by machine learning. We introduce an interface to the intuitive, low-dimensional control spaces for high-dimensional note sequences, allowing users to explore a compositional space of melodies or drum beats in a simple 2-D grid. Furthermore, users can define 1-D trajectories in the 2-D space for autonomous, continuous morphing during improvisation. Similarly for timbre, our interface to a learned latent space of audio provides an intuitive and smooth search space for morphing between the timbres of different instruments. We remove technical and computational barriers by embedding pre-trained networks into a browser-based GPU-accelerated framework, making the systems accessible to a wide range of users while maintaining potential for creative flexibility and personalization. View details
    Preview abstract In the quest towards general artificial intelligence (AI), researchers have explored developing loss functions that function as intrinsic motivators in the absence of external rewards. This paper takes the position that current research has overlooked an important and useful intrinsic motivator: social interaction. We posit that making an AI agent aware of implicit social feedback from humans can allow for more rapid learning of more generalizable and useful representations, and could potentially impact AI safety. We collect social feedback in the form of facial expression reactions to samples from Sketch RNN, an LSTM-based variational autoencoder designed to produce sketch drawings. We use a Latent Constraints GAN (LC-GAN) to learn from the facial feedback of a small group of viewers, and then show in an independent evaluation with 76 users that this model produced sketches that lead to significantly more smiling and less frowning than the baseline. Thus, we establish that implicit social feedback can improve the output of a deep learning model. View details
    Preview abstract We present sketch-rnn, a recurrent neural network (RNN) able to construct stroke-based drawings of common objects. The model is trained on thousands of crude human-drawn images representing hundreds of classes. We outline a framework for conditional and unconditional sketch generation, and describe new robust training methods for generating coherent sketch drawings in a vector format. View details
    Visualizing Music Self-Attention
    Monica Dinculescu
    Ashish Vaswani
    NIPS Workshop on Interpretability and Robustness in Audio, Speech, and Language (2018)
    Preview abstract Like language, music can be represented as a sequence of discrete symbols that form a hierarchical syntax, with notes being roughly like characters and motifs of notes like words. Unlike text however, music relies heavily on repetition on multiple timescales to build structure and meaning. The Music Transformer has shown compelling results in generating music with structure (Huang et al., 2018). In this paper, we introduce a tool for visualizing self-attention on polyphonic music with an interactive pianoroll. We use music transformer as both a descriptive tool and a generative model. For the former, we use it to analyze existing music to see if the resulting self-attention structure corroborates with the musical structure known from music theory. For the latter, we inspect the model's self-attention during generation, in order to understand how past notes affect future ones. We also compare and contrast the attention structure of regular attention to that of relative attention (Shaw et al., 2018; Huang et al., 2018), and examine its impact on the resulting generated music. For example, for the JSB Chorales dataset, a model trained with relative attention is more consistent in attending to all the voices in the preceding timestep and the chords before, and at cadences to the beginning of a phrase, allowing it to create an arc. We hope that our analyses will offer more evidence for relative self-attention as a powerful inductive bias for modeling music. We invite the reader to check out video animations of music attention and interact with the visualizations at https://storage.googleapis.com/nips-workshop-visualization/index.html. View details
    Preview abstract We argue for the benefit of designing deep generative models through mixed-initiative combinations of deep learning algorithms and human specifications for authoring sequential content, such as stories and music. Sequence models have shown increasingly convincing results in domains such as auto-completion, speech to text, and translation; however, longer-term structure remains a major challenge. Given lengthy inputs and outputs, deep generative systems still lack reliable representations of beginnings, middles, and ends, which are standard aspects of creating content in domains such as music composition. This paper aims to contribute a framework for mixed-initiative learning approaches, specifically for creative deep generative systems, and presents a case study of a deep generative model for music, Counterpoint by Convolutional Neural Network (Coconet). View details
    Onsets and Frames: Dual-Objective Piano Transcription
    Curtis Hawthorne
    Erich Elsen
    Jialin Song
    Colin Raffel
    Sageev Oore
    Proceedings of the 19th International Society for Music Information Retrieval Conference, ISMIR 2018, Paris, France, 2018
    Preview abstract We advance the state of the art in polyphonic piano music transcription by using a deep convolutional and recurrent neural network which is trained to jointly predict onsets and frames. Our model predicts pitch onset events and then uses those predictions to condition framewise pitch predictions. During inference, we restrict the predictions from the framewise detector by not allowing a new note to start unless the onset detector also agrees that an onset for that pitch is present in the frame. We focus on improving onsets and offsets together instead of either in isolation as we believe this correlates better with human musical perception. Our approach results in over a 100% relative improvement in note F1 score (with offsets) on the MAPS dataset. Furthermore, we extend the model to predict relative velocities of normalized audio which results in more natural-sounding transcriptions. View details
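    The inference rule described in the abstract above (a framewise detection may only begin a note when the onset detector agrees) can be sketched as a simple gating pass over thresholded predictions. This is an illustrative sketch, not the paper's implementation: the function name and the boolean-matrix input format are assumptions, and it presumes both detectors' outputs have already been thresholded to True/False.

```python
def restrict_frames_by_onsets(frames, onsets):
    """Gate framewise pitch predictions with onset predictions.

    frames[t][p] and onsets[t][p] are booleans for time frame t and pitch p.
    A pitch may become active only in a frame where an onset is predicted;
    once active, it stays active while the frame detector keeps firing.
    """
    n_frames = len(frames)
    n_pitches = len(frames[0]) if frames else 0
    out = [[False] * n_pitches for _ in range(n_frames)]
    for p in range(n_pitches):
        active = False
        for t in range(n_frames):
            if not frames[t][p]:
                active = False            # frame detector says the note is off
            elif active or onsets[t][p]:
                active = True             # continue a note, or start one at an onset
            out[t][p] = active            # frame-only activity without onset is dropped
    return out

# Hypothetical single-pitch example over five frames.
frames = [[True], [True], [False], [True], [True]]
onsets = [[False], [True], [False], [False], [False]]
gated = restrict_frames_by_onsets(frames, onsets)
```

    In this toy input the frame detector fires in frames 0 and 3 to 4 without any onset, so only frame 1 survives the gating.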
    A Hierarchical Latent Vector Model for Learning Long-Term Structure in Music
    Colin Raffel
    Curtis Hawthorne
    International Conference on Machine Learning (ICML) (2018)
    Preview abstract The Variational Autoencoder (VAE) has proven to be an effective model for producing semantically meaningful latent representations for natural data. However, it has thus far seen limited application to sequential data, and, as we demonstrate, existing recurrent VAE models have difficulty modeling sequences with long-term structure. To address this issue, we propose the use of a hierarchical decoder, which first outputs embeddings for subsequences of the input and then uses these embeddings to generate each subsequence independently. This structure encourages the model to utilize its latent code, thereby avoiding the "posterior collapse" problem which remains an issue for recurrent VAEs. We apply this architecture to modeling sequences of musical notes and find that it exhibits dramatically better sampling, interpolation, and reconstruction performance than a "flat" baseline model. An implementation of our "MusicVAE" is available online at https://goo.gl/magenta/musicvae-code. View details
    Learning via social awareness: improving sketch representations with facial feedback
    Natasha Jaques
    David Ha
    Fred Bertsch
    Rosalind Picard
    International Conference on Learning Representations (2018)
    Preview abstract In the quest towards general artificial intelligence (AI), researchers have explored developing loss functions that act as intrinsic motivators in the absence of external rewards. This paper argues that such research has overlooked an important and useful intrinsic motivator: social interaction. We posit that making an AI agent aware of implicit social feedback from humans can allow for faster learning of more generalizable and useful representations, and could potentially impact AI safety. We collect social feedback in the form of facial expression reactions to samples from Sketch RNN, an LSTM-based variational autoencoder (VAE) designed to produce sketch drawings. We use a Latent Constraints GAN (LC-GAN) to learn from the facial feedback of a small group of viewers, and then show in an independent evaluation with 76 users that this model produced sketches that lead to significantly more positive facial expressions. Thus, we establish that implicit social feedback can improve the output of a deep learning model. View details
    Tuning Recurrent Neural Networks With Reinforcement Learning
    Natasha Jaques
    Shixiang Gu
    Dzmitry Bahdanau
    Jose Miguel Hernandez Lobato
    Richard E. Turner
    ICLR Workshop (2017)
    Preview abstract This paper proposes a general method for improving the structure and quality of sequences generated by a recurrent neural network (RNN), while maintaining information originally learned from data, as well as sample diversity. An RNN is first pre-trained on data using maximum likelihood estimation (MLE), and the probability distribution over the next token in the sequence learned by this model is treated as a prior policy. Another RNN is then trained using reinforcement learning (RL) to generate higher-quality outputs that account for domain-specific incentives while retaining proximity to the prior policy of the MLE RNN. To formalize this objective, we derive novel off-policy RL methods for RNNs from KL-control. The effectiveness of the approach is demonstrated on two applications; 1) generating novel musical melodies, and 2) computational molecular generation. For both problems, we show that the proposed method improves the desired properties and structure of the generated sequences, while maintaining information learned from data. View details
    Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders
    Cinjon Resnick
    Sander Dieleman
    Karen Simonyan
    Mohammad Norouzi
    ICML (2017)
    Preview abstract Generative models in vision have seen rapid progress due to algorithmic improvements and the availability of high-quality image datasets. In this paper, we offer contributions in both these areas to enable similar progress in audio modeling. First, we detail a powerful new WaveNet-style autoencoder model that conditions an autoregressive decoder on temporal codes learned from the raw audio waveform. Second, we introduce NSynth, a large-scale and high-quality dataset of musical notes that is an order of magnitude larger than comparable public datasets. Using NSynth, we demonstrate improved qualitative and quantitative performance of the WaveNet autoencoder over a well-tuned spectral autoencoder baseline. Finally, we show that the model learns a manifold of embeddings that allows for morphing between instruments, meaningfully interpolating in timbre to create new types of sounds that are realistic and expressive. View details
    Deep Music: Towards Musical Dialogue
    Mason Bretan
    Sageev Oore
    Larry Heck
    AAAI (2017)
    Preview abstract Computer dialogue systems are designed with the intention of supporting meaningful interactions with humans. Common modes of communication include speech, text, and physical gestures. In this work we explore a communication paradigm in which the input and output channels consist of music. Specifically, we examine the musical interaction scenario of call and response. We present a system that utilizes a deep autoencoder to learn semantic embeddings of musical input. The system learns to transform these embeddings in a manner such that reconstructing from these transformation vectors produces appropriate musical responses. In order to generate a response the system employs a combination of generation and unit selection. Selection is based on a nearest neighbor search within the embedding space and for real-time application the search space is pruned using vector quantization. The live demo consists of a person playing a MIDI keyboard and the computer generating a response that is played through a loudspeaker. View details
    Sequence Tutor: Conservative Fine-Tuning of Sequence Generation Models with KL-control
    Natasha Jaques
    Shixiang Gu
    Dzmitry Bahdanau
    José Miguel Hernández-Lobato
    Richard E. Turner
    ICML (2017)
    Preview abstract This paper proposes a general method for improving the structure and quality of sequences generated by a recurrent neural network (RNN), while maintaining information originally learned from data, as well as sample diversity. An RNN is first pre-trained on data using maximum likelihood estimation (MLE), and the probability distribution over the next token in the sequence learned by this model is treated as a prior policy. Another RNN is then trained using reinforcement learning (RL) to generate higher-quality outputs that account for domain-specific incentives while retaining proximity to the prior policy of the MLE RNN. To formalize this objective, we derive novel off-policy RL methods for RNNs from KL-control. The effectiveness of the approach is demonstrated on two applications: 1) generating novel musical melodies, and 2) computational molecular generation. For both problems, we show that the proposed method improves the desired properties and structure of the generated sequences, while maintaining information learned from data. View details
    Learning to Create Piano Performances
    Sageev Oore
    Sander Dieleman
    NIPS 2017 Workshop on Machine Learning and Creativity
    Preview abstract Nearly all previous work on music generation has focused on creating pieces that are, effectively, musical scores. In contrast, we learn to create piano performances: besides predicting the notes to be played, we also predict expressive variations in the timing and musical dynamics (loudness). We provided samples generated by our system for informal feedback to a set of professional musicians and composers, and the samples were well-received. Overall, the comments indicate that our system is generating music that, while lacking high-level structure, does indeed sound very much like human performance, and is closely reminiscent of the classical piano repertoire. View details
    Preview abstract Recurrent neural network models with an attention mechanism have proven to be extremely effective on a wide variety of sequence-to-sequence problems. However, the fact that soft attention mechanisms perform a pass over the entire input sequence when producing each element in the output sequence precludes their use in online settings and results in a quadratic time complexity. Based on the insight that the alignment between input and output sequence elements is monotonic in many problems of interest, we propose an end-to-end differentiable method for learning monotonic alignments which, at test time, enables computing attention online and in linear time. We validate our approach on sentence summarization, machine translation, and online speech recognition problems and achieve results competitive with existing sequence-to-sequence models. View details
    Preview abstract In this work we develop recurrent variational autoencoders (VAEs) trained to reproduce short musical sequences and demonstrate their use as a creative device both via random sampling and data interpolation. Furthermore, by using a novel hierarchical decoder, we show that we are able to model long sequences with musical structure for both individual instruments and a three-piece band (lead, bass, and drums). Finally, we demonstrate the effectiveness of scheduled sampling in significantly improving our reconstruction accuracy. View details
    Improving image generative models with human interactions
    Andrew Lampinen
    David Richard So
    Fred Bertsch
    arXiv (2017)
    Preview abstract GANs provide a framework for training generative models which mimic a data distribution. However, in many cases we wish to train a generative model to optimize some auxiliary objective function within the data it generates, such as making more aesthetically pleasing images. In some cases, these objective functions are difficult to evaluate, e.g. they may require human interaction. Here, we develop a system for efficiently training a GAN to increase a generic rate of positive user interactions, which could represent aesthetic ratings or any other objective. To do this, we build a model of human behavior in the targeted domain from a relatively small set of interactions, and then use this behavioral model as an auxiliary loss function to improve the generative model. As a proof of concept, we demonstrate that this system is successful at improving positive interaction rates simulated from a variety of objectives, and characterize some factors that affect its performance. View details
    Counterpoint by Convolution
    Tim Cooijmans
    Aaron Courville
    Proceedings of ISMIR 2017
    Preview abstract Machine learning models of music typically break down the task of composition into a chronological process, composing a piece of music in a single pass from beginning to end. On the contrary, human composers write music in a nonlinear fashion, scribbling motifs here and there, often revisiting choices previously made. We explore the use of blocked Gibbs sampling as an analogue to the human approach, and introduce COCONET, a convolutional neural network in the NADE family of generative models (Uria et al., 2016). Despite ostensibly sampling from the same distribution as the NADE ancestral sampling procedure, we find that a blocked Gibbs approach significantly improves sample quality. We provide evidence that this is due to some conditional distributions being poorly modeled. Moreover, we show that even the cheap approximate blocked Gibbs procedure from Yao et al. (2014) yields better samples than ancestral sampling. We demonstrate the versatility of our method on unconditioned polyphonic music generation. View details
    Tuning Recurrent Neural Networks with Reinforcement Learning
    Shixiang Shane Gu
    Richard E. Turner
    Proceedings of the International Conference on Learning Representations (ICLR) (2016)
    Preview abstract The approach of training sequence models using supervised learning and next-step prediction suffers from known failure modes. For example, it is notoriously difficult to ensure multi-step generated sequences have coherent global structure. We propose a novel sequence-learning approach in which we use a pre-trained Recurrent Neural Network (RNN) to supply part of the reward value in a Reinforcement Learning (RL) model. Thus, we can refine a sequence predictor by optimizing for some imposed reward functions, while maintaining good predictive properties learned from data. We propose efficient ways to solve this by augmenting deep Q-learning with a cross-entropy reward and deriving novel off-policy methods for RNNs from KL control. We explore the usefulness of our approach in the context of music generation. An LSTM is trained on a large corpus of songs to predict the next note in a musical sequence. This Note-RNN is then refined using our method and rules of music theory. We show that by combining maximum likelihood (ML) and RL in this way, we can not only produce more pleasing melodies, but significantly reduce unwanted behaviors and failure modes of the RNN, while maintaining information learned from data. View details
    Generating Music by Fine-Tuning Recurrent Neural Networks with Reinforcement Learning
    Natasha Jaques
    Shixiang Gu
    Richard E. Turner
    Deep Reinforcement Learning Workshop, NIPS (2016)
    Preview abstract Supervised learning with next-step prediction is a common way to train a sequence prediction model; however, it suffers from known failure modes, and it is notoriously difficult to train such models to learn certain properties, such as having a coherent global structure. Reinforcement learning can be used to impose arbitrary properties on generated data by choosing appropriate reward functions. In this paper we propose a novel approach for sequence training, where we refine a sequence predictor by optimizing for some imposed reward functions, while maintaining good predictive properties learned from data. We propose efficient ways to solve this by augmenting deep Q-learning with a cross-entropy reward and deriving novel off-policy methods for RNNs from stochastic optimal control (SOC). We explore the usefulness of our approach in the context of music generation. An LSTM is trained on a large corpus of songs to predict the next note in a musical sequence. This Note-RNN is then refined using RL, where the reward function is a combination of rewards based on rules of music theory, as well as the output of another trained Note-RNN. We show that this combination of ML and RL can not only produce more pleasing melodies, but that it can significantly reduce unwanted behaviors and failure modes of the RNN. View details
    Audio Deepdream: Optimizing raw audio with convolutional networks
    Cinjon Resnick
    Diego Ardila
    International Society for Music Information Retrieval Conference, Google Brain (2016)
    Preview abstract The hallucinatory images of DeepDream opened up the floodgates for a recent wave of artwork generated by neural networks. In this work, we take first steps toward applying this to audio. We believe a key to solving this problem is training a deep neural network to perform a music perception task on raw audio. Consequently, we have followed in the footsteps of Van den Oord et al. and trained a network to predict embeddings that were themselves the result of a collaborative filtering model. A key difference is that we learn features directly from the raw audio, which creates a chain of differentiable functions from raw audio to high level features. We then use gradient descent on the network to extract samples of "dreamed" audio. View details
    Multi-Task Convolutional Music Models
    Cinjon Resnick
    Diego Ardila
    BayLearn (2016)
    Preview abstract The paper is itself a short abstract for BayLearn. View details
    Building Musically-relevant Audio Features through Multiple Timescale Representations
    Yoshua Bengio
    Proceedings of the 13th International Society for Music Information Retrieval Conference, Porto, Portugal (2012)
    The Need for Music Information Retrieval with User-Centered and Multimodal Strategies
    Cynthia C.S. Liem
    Meinard Müller
    George Tzanetakis
    Alan Hanjalic
    MIRUM '11, ACM, Scottsdale, Arizona (2011), pp. 1-6
    Preview abstract Music is a widely enjoyed content type, existing in many multifaceted representations. With the digital information age, a lot of digitized music information has theoretically become available at the user’s fingertips. However, the abundance of information is too large in scale and too diverse to annotate, oversee and present in a consistent and human manner, motivating the development of automated Music Information Retrieval (Music-IR) techniques. In this paper, we encourage considering music content beyond a monomodal audio signal and argue that Music-IR approaches with multimodal and user-centered strategies are necessary to serve real-life usage patterns and to maintain and improve accessibility of digital music data. After discussing relevant existing work in these directions, we show that the field of Music-IR faces challenges similar to those of neighboring fields, and thus suggest opportunities for joint collaboration and mutual inspiration. View details
    Temporal pooling and multiscale learning for automatic annotation and ranking of music audio
    Simon Lemieux
    Yoshua Bengio
    International Society for Music Information Retrieval (ISMIR 2011)
    Probabilistic Models for Melodic Prediction
    Jean-Francois Paiement
    Samy Bengio
    Artificial Intelligence Journal, vol. 173 (2009), pp. 1266-1274
    Preview abstract Chord progressions are the building blocks from which tonal music is constructed. The choice of a particular representation for chords has a strong impact on statistical modeling of the dependence between chord symbols and the actual sequences of notes in polyphonic music. Melodic prediction is used in this paper as a benchmark task to evaluate the quality of four chord representations using two probabilistic model architectures derived from Input/Output Hidden Markov Models (IOHMMs). Likelihoods and conditional and unconditional prediction error rates are used as complementary measures of the quality of each of the proposed chord representations. We observe empirically that different chord representations are optimal depending on the chosen evaluation metric. Also, representing chords only by their roots appears to be a good compromise in most of the reported experiments. View details
    A Distance Model for Rhythms
    Jean-Francois Paiement
    Yves Grandvalet
    Samy Bengio
    International Conference on Machine Learning (ICML) (2008)
    Preview abstract Modeling long-term dependencies in time series has proved very difficult to achieve with traditional machine learning methods. This problem occurs when considering music data. In this paper, we introduce a model for rhythms based on the distributions of distances between subsequences. A specific implementation of the model when considering Hamming distances over a simple rhythm representation is described. The proposed model consistently outperforms a standard Hidden Markov Model in terms of conditional prediction accuracy on two different music databases. View details
    A Generative Model for Rhythms
    Jean-Francois Paiement
    Samy Bengio
    Yves Grandvalet
    Neural Information Processing Systems, Workshop on Brain, Music and Cognition (2008)
    Preview abstract Modeling music involves capturing long-term dependencies in time series, which has proved very difficult to achieve with traditional statistical methods. The same problem occurs when only considering rhythms. In this paper, we introduce a generative model for rhythms based on the distributions of distances between subsequences. A specific implementation of the model when considering Hamming distances over a simple rhythm representation is described. The proposed model consistently outperforms a standard Hidden Markov Model in terms of conditional prediction accuracy on two different music databases. View details
    A Generative Model for Distance Patterns in Music
    Jean-Francois Paiement
    Yves Grandvalet
    Samy Bengio
    NIPS Workshop on Music, Brain and Cognition (2007)
    Preview abstract In order to cope with the difficult problem of long-term dependencies in sequential data in general, and in musical data in particular, a generative model for distance patterns especially designed for music is introduced. A specific implementation of the model when considering Hamming distances over rhythms is described. The proposed model consistently outperforms a standard Hidden Markov Model in terms of conditional prediction accuracy over two different music databases. View details
    Acoustic Space Sampling and the Grand Piano in a Non-Anechoic Environment: a recordist-centric approach to musical acoustic study
    B. Leonard
    G. Sikora
    M. De Francisco
    129th Audio Engineering Society (AES) Convention, London (2010)
    An Infinite Factor Model Hierarchy Via a Noisy-Or Mechanism
    A. Courville
    Y. Bengio
    Neural Information Processing Systems Conference 22 (NIPS'09) (2010)
    Steerable Playlist Generation by Learning Song Similarity from Radio Station Playlists
    F. Maillet
    G. Desjardins
    P. Lamere
    Proceedings of the 10th International Conference on Music Information Retrieval (ISMIR 2009)
    Towards a musical beat emphasis function
    M. Davies
    M. Plumbley
    Proceedings of IEEE WASPAA, New Paltz, NY (2009)
    Automatic identification of instrument classes in polyphonic and poly-instrument audio
    P. Hamel
    S. Wood
    Proceedings of the 10th International Conference on Music Information Retrieval (ISMIR 2009)
    Autotagger: A Model For Predicting Social Tags from Acoustic Features on Large Music Databases
    T. Bertin-Mahieux
    F. Maillet
    P. Lamere
    Journal of New Music Research, vol. 37 (2008), pp. 115-135
    On the use of Sparse Time Relative Auditory Codes for Music
    P-A. Manzagol
    T. Bertin-Mahieux
    Proceedings of the 9th International Conference on Music Information Retrieval (ISMIR 2008)
    A generative model for rhythms
    J.-F. Paiement
    Y. Grandvalet
    S. Bengio
    ICML '08: Proceedings of the 25th International Conference on Machine Learning (2008)
    Automatic generation of social tags for music recommendation
    P. Lamere
    T. Bertin-Mahieux
    S. Green
    Neural Information Processing Systems Conference 20 (NIPS'07) (2008)
    A Supervised Classification Algorithm For Note Onset Detection
    A. Lacoste
    EURASIP Journal on Applied Signal Processing, vol. 2007 (2007), pp. 1-13
    Can't get you out of my head: A connectionist model of cyclic rehearsal
    H. Jaeger
    Modeling Communications with Robots and Virtual Humans, Springer-Verlag (2007)
    Autotagging music using supervised machine learning
    T. Bertin-Mahieux
    P. Lamere
    Proceedings of the 8th International Conference on Music Information Retrieval (ISMIR 2007)
    Using 3D Visualizations to Explore and Discover Music
    P. Lamere
    Proceedings of the 8th International Conference on Music Information Retrieval (ISMIR 2007)
    Beat Tracking Using an Autocorrelation Phase Matrix
    Proceedings of the 2007 International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE Signal Processing Society, pp. 1313-1316
    Probabilistic Melodic Harmonization
    J.-F. Paiement
    S. Bengio
    Advances in Artificial Intelligence: 19th Conference of the Canadian Society for Computational Studies of Intelligence, Canadian AI, Lecture Notes in Computer Science, Springer-Verlag (2006), pp. 218-229
    Beat Induction Using an Autocorrelation Phase Matrix
    The Proceedings of the 9th International Conference on Music Perception and Cognition (ICMPC9), Causal Productions (2006), pp. 931-932
    Finding Long-Timescale Musical Structure with an Autocorrelation Phase Matrix
    Music Perception, vol. 24 (2006), pp. 167-176
    Aggregate Features and AdaBoost for Music Classification
    J. Bergstra
    N. Casagrande
    D. Erhan
    B. Kégl
    Machine Learning, vol. 65 (2006), pp. 473-484
    Predicting genre labels for artists using FreeDB
    J. Bergstra
    A. Lacoste
    Proceedings of the 7th International Conference on Music Information Retrieval (ISMIR 2006), pp. 85-88
    Probabilistic Melodic Harmonization
    J.-F. Paiement
    S. Bengio
    Canadian Conference on AI, Springer (2006), pp. 218-229
    A Probabilistic Model for Chord Progressions
    J.-F. Paiement
    S. Bengio
    International Conference on Music Information Retrieval, ISMIR (2005)
    A Graphical Model for Chord Progressions Embedded in a Psychoacoustic Space
    J.-F. Paiement
    S. Bengio
    D. Barber
    International Conference on Machine Learning, ICML (2005)
    Editorial: New Research in Rhythm Perception and Production
    S. K. Scott
    Music Perception, vol. 22 (2005), pp. 371-388
    A Probabilistic Model for Chord Progressions
    J.-F. Paiement
    S. Bengio
    Proceedings of the 6th International Conference on Music Information Retrieval (ISMIR 2005), London: University of London, pp. 312-319
    Geometry in Sound: A Speech/Music Audio Classifier Inspired by an Image Classifier
    N. Casagrande
    B. Kégl
    Proceedings of the International Computer Music Conference (ICMC) (2005), pp. 207-210
    Music Perception, Guest Editor, Special Issue on Rhythm Perception and Production
    S. K. Scott
    Music Perception, vol. 22 (3) (2005)
    Frame-Level Audio Feature Extraction using AdaBoost
    N. Casagrande
    B. Kégl
    Proceedings of the 6th International Conference on Music Information Retrieval (ISMIR 2005), London: University of London, pp. 345-350
    A graphical model for chord progressions embedded in a psychoacoustic space
    J.-F. Paiement
    S. Bengio
    D. Barber
    ICML '05: Proceedings of the 22nd International Conference on Machine Learning, ACM Press, New York, NY, USA (2005), pp. 641-648
    Finding Meter in Music Using an Autocorrelation Phase Matrix and Shannon Entropy
    N. Casagrande
    Proceedings of the 6th International Conference on Music Information Retrieval (ISMIR 2005), London: University of London, pp. 504-509
    Biologically Plausible Speech Recognition with LSTM Neural Nets
    A. Graves
    N. Beringer
    J. Schmidhuber
    Proceedings of the First Int'l Workshop on Biologically Inspired Approaches to Advanced Information Technology (Bio-ADIT) (2004), pp. 127-136
    A Machine-Learning Approach to Musical Sequence Induction That Uses Autocorrelation to Bridge Long Timelags
    The Proceedings of the Eighth International Conference on Music Perception and Cognition (ICMPC8), Causal Productions, Adelaide (2004), pp. 542-543
    Kalman filters improve LSTM network performance in problems unsolvable by traditional recurrent nets
    J.A. Pérez-Ortiz
    F.A. Gers
    J. Schmidhuber
    Neural Networks, vol. 16 (2003), pp. 241-250
    Learning The Long-Term Structure of the Blues
    J. Schmidhuber
    Artificial Neural Networks -- ICANN 2002 (Proceedings), Springer, Berlin, pp. 284-289
    DEKF-LSTM
    F.A. Gers
    J.A. Pérez-Ortiz
    J. Schmidhuber
    Proceedings of the 10th European Symposium on Artificial Neural Networks, ESANN 2002
    Finding Temporal Structure in Music: Blues Improvisation with LSTM Recurrent Networks
    J. Schmidhuber
    Neural Networks for Signal Processing XII, Proceedings of the 2002 IEEE Workshop, IEEE, New York, pp. 747-756
    Learning Nonregular Languages: A Comparison of Simple Recurrent Networks and LSTM
    J. Schmidhuber
    F.A. Gers
    Neural Computation, vol. 14 (2002), pp. 2039-2041
    Finding Downbeats with a Relaxation Oscillator
    Psychological Research, vol. 66 (2002), pp. 18-25
    Learning Context Sensitive Languages with LSTM Trained with Kalman Filters
    F.A. Gers
    J.A. Pérez-Ortiz
    J. Schmidhuber
    Artificial Neural Networks -- ICANN 2002 (Proceedings), Springer, Berlin, pp. 655-660
    Improving Long-Term Online Prediction with Decoupled Extended Kalman Filters
    J.A. Pérez-Ortiz
    J. Schmidhuber
    F.A. Gers
    Artificial Neural Networks -- ICANN 2002 (Proceedings), Springer, Berlin, pp. 1055-1060
    A Network of Relaxation Oscillators that Finds Downbeats in Rhythms
    Artificial Neural Networks -- ICANN 2001 (Proceedings), Springer, Berlin, pp. 1239-1247
    Applying LSTM to Time Series Predictable Through Time-Window Approaches
    F.A. Gers
    J. Schmidhuber
    Artificial Neural Networks -- ICANN 2001 (Proceedings), Springer, Berlin, pp. 669-676
    A Positive-Evidence Model for Rhythmical Beat Induction
    Journal of New Music Research, vol. 30 (2001), pp. 187-200
    Meter Through Synchrony: Processing Rhythmical Patterns with Relaxation Oscillators
    Ph.D. Thesis, Indiana University, Bloomington, IN (2000)
    Dynamics and Embodiment in Beat Induction
    M. Gasser
    Robert Port
    Rhythm Perception and Production, Swets and Zeitlinger, Lisse, The Netherlands (2000), pp. 157-170
    Meter as Mechanism: A Neural Network Model that Learns Metrical Patterns
    M. Gasser
    R. Port
    Connect. Sci., vol. 11, no. 2 (1999), pp. 187-216
    Learning Simple Metrical Preferences in a Network of Fitzhugh-Nagumo Oscillators
    The Proceedings of the Twenty-First Annual Conference of the Cognitive Science Society, Lawrence Erlbaum Associates, New Jersey (1999)
    An Exploration of Representational Complexity via Coupled Oscillators
    T. Chemero
    Proceedings of the Tenth Midwest Artificial Intelligence and Cognitive Science Society, MIT Press, Cambridge, Mass. (1999)
    Perception of Simple Rhythmic Patterns in a Network of Oscillators
    M. Gasser
    The Proceedings of the Eighteenth Annual Conference of the Cognitive Science Society, Lawrence Erlbaum Associates, New Jersey (1996)
    Representing Rhythmic Patterns in a Network of Oscillators
    M. Gasser
    The Proceedings of the International Conference on Music Perception and Cognition, Lawrence Erlbaum Associates, New Jersey (1996), pp. 361-366