Douglas Eck

Doug is a Senior Research Director at Google, and leads research efforts at Google DeepMind in Generative Media, including image, video, 3D, music and audio generation. He also leads a broader group active in areas including Fundamental Learning Algorithms, Natural Language Processing, Multimodal Learning, Reinforcement Learning, Computer Vision and Generative Models. His own research lies at the intersection of machine learning and human-computer interaction (HCI). Doug created Magenta, an ongoing research project exploring the role of AI in art and music creation. He is also an advocate for PAIR, a multidisciplinary team that explores the human side of AI through fundamental research, building tools, creating design frameworks, and working with diverse communities.

Before joining Google in 2010, Doug did research in music perception, aspects of music performance, machine learning for large audio datasets, and music recommendation. He completed his PhD in Computer Science and Cognitive Science at Indiana University in 2000 and went on to a postdoctoral fellowship with Juergen Schmidhuber at IDSIA in Lugano, Switzerland. From 2003 to 2010, Doug was a faculty member in Computer Science in the University of Montreal machine learning group (now the MILA machine learning lab), where he became Associate Professor.

Authored Publications
    PaLM: Scaling Language Modeling with Pathways
    Aakanksha Chowdhery
    Sharan Narang
    Jacob Devlin
    Maarten Bosma
    Hyung Won Chung
    Sebastian Gehrmann
    Parker Schuh
    Sasha Tsvyashchenko
    Abhishek Rao
    Yi Tay
    Noam Shazeer
    Nan Du
    Reiner Pope
    James Bradbury
    Guy Gur-Ari
    Toju Duke
    Henryk Michalewski
    Xavier Garcia
    Liam Fedus
    David Luan
    Barret Zoph
    Ryan Sepassi
    David Dohan
    Shivani Agrawal
    Mark Omernick
    Marie Pellat
    Aitor Lewkowycz
    Erica Moreira
    Rewon Child
    Oleksandr Polozov
    Zongwei Zhou
    Brennan Saeta
    Michele Catasta
    Jason Wei
    Slav Petrov
    arXiv:2204.02311 (2022)
    Large language models have been shown to achieve remarkable performance across a variety of natural language tasks using few-shot learning, which drastically reduces the number of task-specific training examples needed to adapt the model to a particular application. To further our understanding of the impact of scale on few-shot learning, we trained a 540-billion parameter, densely activated, Transformer language model, which we call Pathways Language Model (PaLM). We trained PaLM on 6144 TPU v4 chips using Pathways, a new ML system which enables highly efficient training across multiple TPU Pods. We demonstrate continued benefits of scaling by achieving state-of-the-art few-shot learning results on hundreds of language understanding and generation benchmarks. On a number of these tasks, PaLM 540B achieves breakthrough performance, outperforming the finetuned state-of-the-art on a suite of multi-step reasoning tasks, and outperforming average human performance on the recently released BIG-bench benchmark. A significant number of BIG-bench tasks showed discontinuous improvements from model scale, meaning that performance steeply increased as we scaled to our largest model. PaLM also has strong capabilities in multilingual tasks and source code generation, which we demonstrate on a wide array of benchmarks. We additionally provide a comprehensive analysis on bias and toxicity, and study the extent of training data memorization with respect to model scale. Finally, we discuss the ethical considerations related to large language models and discuss potential mitigation strategies.
    Deduplicating Training Data Makes Language Models Better
    Andrew Nystrom
    Chiyuan Zhang
    Chris Callison-Burch
    Nicholas Carlini
    (2022) (to appear)
    As large language models scale up, researchers and engineers have chosen to use larger datasets of loosely-filtered internet text instead of curated texts. We find that existing NLP datasets are highly repetitive and contain duplicated examples. For example, there is an example in the training dataset C4 that has over 200,000 near duplicates. As a whole, we find that 1.68% of C4 consists of near-duplicates. Worse, we find a 1% overlap between the training and testing sets in these datasets. Duplicate examples in training data inappropriately bias the distribution of rare/common sequences. Models trained on non-deduplicated datasets are more likely to generate "memorized" examples. Additionally, if those models are used for downstream applications, such as scoring likelihoods of given sequences, models trained on non-deduplicated and deduplicated datasets differ in accuracy by an average of TODO.
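The paper's pipeline combines exact substring matching (via suffix arrays) with approximate matching (via MinHash). As a rough sketch of the approximate half only, the toy MinHash check below flags near-duplicate documents; the shingle size, number of hash functions, and 0.7 threshold are illustrative assumptions, not values from the paper.

```python
# Toy MinHash near-duplicate detection. Everything here (shingle size,
# hash count, similarity threshold) is an illustrative assumption.
import hashlib
from itertools import combinations

def shingles(text, n=3):
    """Set of word n-grams ("shingles") in a document."""
    words = text.split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def minhash_signature(shingle_set, num_hashes=64):
    """Summarize a set by its minimum hash under many seeded hash functions."""
    return [
        min(int.from_bytes(hashlib.md5(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in shingle_set)
        for seed in range(num_hashes)
    ]

def estimated_jaccard(sig_a, sig_b):
    """The fraction of agreeing minhashes estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

docs = [
    "the cat sat on the mat and looked around the quiet room today",
    "the cat sat on the mat and looked around the quiet room yesterday",
    "an entirely different training example about language models",
]
sigs = [minhash_signature(shingles(d)) for d in docs]
for i, j in combinations(range(len(docs)), 2):
    if estimated_jaccard(sigs[i], sigs[j]) > 0.7:
        print(f"docs {i} and {j} look like near-duplicates")
```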
    Emergent Social Learning via Multi-agent Reinforcement Learning
    Kamal Ndousse
    Sergey Levine
    Natasha Jaques
    International Conference on Machine Learning (ICML) (2021)
    Social learning is a key component of human and animal intelligence. By taking cues from the behavior of experts in their environment, social learners can acquire sophisticated behavior and rapidly adapt to new circumstances. This paper investigates whether independent reinforcement learning (RL) agents in a multi-agent environment can learn to use social learning to improve their performance. We find that in most circumstances, vanilla model-free RL agents do not use social learning. We analyze the reasons for this deficiency, and show that by imposing constraints on the training environment and introducing a model-based auxiliary loss we are able to obtain generalized social learning policies which enable agents to: i) discover complex skills that are not learned from single-agent training, and ii) adapt online to novel environments by taking cues from experts present in the new environment. In contrast, agents trained with model-free RL or imitation learning generalize poorly and do not succeed in the transfer tasks. By mixing multi-agent and solo training, we can obtain agents that use social learning to gain skills that they can deploy when alone, even outperforming agents trained alone from the start.
    Joint Attention for Multi-Agent Coordination and Social Learning
    Dennis Lee
    Natasha Jaques
    Jiaxing Wu
    Dale Schuurmans
    ICRA Workshop on Social Intelligence in Humans and Robots (2021)
    Joint attention — the ability to purposefully coordinate your attention with another person, and mutually attend to the same thing — is an important milestone in human cognitive development. In this paper, we ask whether joint attention can be useful as a mechanism for improving multi-agent coordination and social learning. We first develop deep reinforcement learning (RL) agents with a recurrent visual attention architecture. We then train agents to minimize the difference between the attention weights that they apply to the environment at each timestep, and the attention of other agents. Our results show that this joint attention incentive improves agents’ ability to solve difficult coordination tasks, by helping overcome the problem of exploring the combinatorial multi-agent action space. Joint attention leads to higher performance than a competitive centralized critic baseline across multiple environments. Further, we show that joint attention enhances agents’ ability to learn from experts present in their environment, even when performing single-agent tasks. Taken together, these findings suggest that joint attention may be a useful inductive bias for improving multi-agent learning.
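A minimal sketch of the incentive described above, under stated assumptions: attention maps are treated as spatial distributions and agents are penalized for pairwise disagreement. The maps, shapes, and the choice of an L1 divergence are illustrative stand-ins, not the paper's exact formulation.

```python
# Sketch of a joint-attention auxiliary loss: mean pairwise disagreement
# between agents' (normalized) attention maps. Shapes and the L1 choice
# are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n_agents, H, W = 3, 8, 8
attn = rng.random((n_agents, H, W))
attn /= attn.sum(axis=(1, 2), keepdims=True)  # each map sums to 1

def joint_attention_loss(maps):
    total, pairs = 0.0, 0
    for i in range(len(maps)):
        for j in range(i + 1, len(maps)):
            total += np.abs(maps[i] - maps[j]).sum()  # pairwise disagreement
            pairs += 1
    return total / pairs

print(joint_attention_loss(attn))  # lower means attention is more "joint"
```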
    Automatic Detection of Generated Text is Easiest when Humans are Fooled
    Chris Callison-Burch
    Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (2020), pp. 1808-1822
    Recent advancements in neural language modelling make it possible to rapidly generate vast amounts of human-sounding text. The capabilities of humans and automatic discriminators to detect machine-generated text have been a large source of research interest, but humans and machines rely on different cues to make their decisions. Here, we perform careful benchmarking and analysis of three popular sampling-based decoding strategies — top-k, nucleus sampling, and untruncated random sampling — and show that improvements in decoding methods have primarily optimized for fooling humans. This comes at the expense of introducing statistical abnormalities that make detection easy for automatic systems. We also show that though both human and automatic detector performance improve with longer excerpt length, even multi-sentence excerpts can fool expert human raters over 30% of the time. Our findings reveal the importance of using both human and automatic detectors to assess the humanness of text generation systems.
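Since the comparison turns on how each decoding strategy truncates the next-token distribution, a toy sketch may be useful. The distribution and the k and top-p values below are invented; this is not the paper's evaluation code.

```python
# The three sampling-based decoding strategies compared above, applied to
# one made-up next-token distribution.
import numpy as np

rng = np.random.default_rng(0)
probs = np.array([0.4, 0.3, 0.15, 0.1, 0.05])

def sample_untruncated(p):
    """Untruncated random sampling: draw from the full distribution."""
    return rng.choice(len(p), p=p)

def sample_top_k(p, k=2):
    """Top-k: keep the k most probable tokens, renormalize, sample."""
    keep = np.argsort(p)[-k:]
    q = np.zeros_like(p)
    q[keep] = p[keep]
    return rng.choice(len(p), p=q / q.sum())

def sample_nucleus(p, top_p=0.8):
    """Nucleus: keep the smallest prefix of tokens whose mass reaches top_p."""
    order = np.argsort(p)[::-1]
    cutoff = np.searchsorted(np.cumsum(p[order]), top_p) + 1
    q = np.zeros_like(p)
    q[order[:cutoff]] = p[order[:cutoff]]
    return rng.choice(len(p), p=q / q.sum())

print(sample_untruncated(probs), sample_top_k(probs), sample_nucleus(probs))
```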
    Learning via Social Awareness: Improving a Deep Generative Sketching Model with Facial Feedback
    Natasha Jaques
    Jennifer McCleary
    David Ha
    Fred Bertsch
    Rosalind Picard
    International Joint Conference on Artificial Intelligence (IJCAI) 2018 (2020), pp. 1-9
    A known deficit of modern machine learning (ML) and deep learning (DL) methodology is that models must be carefully fine-tuned in order to solve a particular task. Most algorithms cannot generalize well to even highly similar tasks, let alone exhibit signs of general artificial intelligence (AGI). To address this problem, researchers have explored developing loss functions that act as intrinsic motivators that could motivate an ML or DL agent to learn across a number of domains. This paper argues that an important and useful intrinsic motivator is that of social interaction. We posit that making an AI agent aware of implicit social feedback from humans can allow for faster learning of more generalizable and useful representations, and could potentially impact AI safety. We collect social feedback in the form of facial expression reactions to samples from Sketch RNN, an LSTM-based variational autoencoder (VAE) designed to produce sketch drawings. We use a Latent Constraints GAN (LC-GAN) to learn from the facial feedback of a small group of viewers, by optimizing the model to produce sketches that it predicts will lead to more positive facial expressions. We show in multiple independent evaluations that the model trained with facial feedback produced sketches that were more highly rated and induced significantly more positive facial expressions. Thus, we establish that implicit social feedback can improve the output of a deep learning model.
    Towards Better Storylines with Sentence-Level Language Models
    David Grangier
    Chris Callison-Burch
    Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (2020), pp. 1808-1822
    This work proposes a sentence-level language model which predicts the next sentence in a story given the embeddings of the previous sentences. The model operates at the sentence level and selects the next sentence within a finite set of fluent alternatives. By working with sentence embeddings instead of word embeddings, our model is able to efficiently consider a large number of alternative sentences. By considering only fluent sentences, our model is relieved from modeling fluency and can focus on longer range dependencies. Our method achieves state-of-the-art accuracy on the StoryCloze task in the unsupervised setting.
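Schematically, selection works by scoring each candidate next-sentence embedding against a summary of the story so far. The paper learns this scoring function; in the sketch below, a mean-pooled dot-product scorer and random vectors stand in for a trained model and a real sentence encoder.

```python
# Toy sentence-level "next sentence" selection. The random embeddings and
# the dot-product scorer are stand-ins for a real encoder and learned model.
import numpy as np

rng = np.random.default_rng(0)
dim = 16
context = rng.normal(size=(4, dim))       # embeddings of 4 previous sentences
candidates = rng.normal(size=(100, dim))  # embeddings of 100 fluent alternatives

query = context.mean(axis=0)              # crude context summary (assumption)
scores = candidates @ query               # one score per candidate sentence
print("pick candidate sentence", int(np.argmax(scores)))
```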
    Magenta Studio: Augmenting Creativity with Deep Learning in Ableton Live
    Yotam Mann
    Jon Gillick
    Monica Dinculescu
    Carey Radebaugh
    Curtis Hawthorne
    Proceedings of the International Workshop on Musical Metacreation (MUME) (2019)
    The field of Musical Metacreation (MuMe) has produced impressive results for both autonomous and interactive creativity. However, there are few examples of these systems crossing over to the “mainstream” of music creation and consumption. We tie together existing frameworks (Electron, TensorFlow.js, and Max For Live) to develop a system whose purpose is to bring the promise of interactive MuMe to the realm of professional music creators. Combining compelling applications of deep learning based music generation with a focus on ease of installation and use in a popular DAW, we hope to expose more musicians and producers to the potential of using such systems in their creative workflows. Our suite of plug-ins for Ableton Live, named Magenta Studio, is available for download at http://g.co/magenta/studio along with its open source implementation.
    We explore models for translating abstract musical ideas (scores, rhythms) into expressive performances using seq2seq and recurrent variational information bottleneck (VIB) models. Though seq2seq models usually require painstakingly aligned corpora, we show that it is possible to adapt an approach from the Generative Adversarial Network (GAN) literature (e.g. Pix2Pix, Vid2Vid) to sequences, creating large volumes of paired data by performing simple transformations and training generative models to plausibly invert these transformations. Music, and drumming in particular, provides a strong test case for this approach because many common transformations (quantization, removing voices) have clear semantics, and learning to invert them has real-world applications. Focusing on the case of drum set players, we create and release a new dataset for this purpose, containing over 13 hours of recordings by professional drummers aligned with fine-grained timing and dynamics information. We also explore some of the creative potential of these models, demonstrating improvements on state-of-the-art methods for Humanization (instantiating a performance from a musical score).
    Music Transformer: Generating Music with Long-Term Structure
    Ashish Vaswani
    Jakob Uszkoreit
    Noam Shazeer
    Ian Simon
    Curtis Hawthorne
    Monica Dinculescu
    ICLR (2019)
    Music relies heavily on repetition to build structure and meaning. Self-reference occurs on multiple timescales, from motifs to phrases to reusing of entire sections of music, such as in pieces with ABA structure. The Transformer (Vaswani et al., 2017), a sequence model based on self-attention, has achieved compelling results in many generation tasks that require maintaining long-range coherence. This suggests that self-attention might also be well-suited to modeling music. In musical composition and performance, however, relative timing is critically important. Existing approaches for representing relative positional information in the Transformer modulate attention based on pairwise distance (Shaw et al., 2018). This is impractical for long sequences such as musical compositions since their memory complexity for intermediate relative information is quadratic in the sequence length. We propose an algorithm that reduces their intermediate memory requirement to linear in the sequence length. This enables us to demonstrate that a Transformer with our modified relative attention mechanism can generate minute-long compositions (thousands of steps, four times the length modeled in Oore et al., 2018) with compelling structure, generate continuations that coherently elaborate on a given motif, and in a seq2seq setup generate accompaniments conditioned on melodies. We evaluate the Transformer with our relative attention mechanism on two datasets, JSB Chorales and Piano-e-Competition, and obtain state-of-the-art results on the latter.
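The memory savings come from computing all relative logits as a single (L, L) matrix product against the relative-position embeddings and then re-aligning the result (the paper's "skewing" trick), instead of materializing an (L, L, D) tensor. Below is a numpy sketch with toy dimensions, checked against the naive computation on the causal triangle.

```python
# Sketch of relative attention "skewing": one matmul plus a pad/reshape
# replaces the O(L^2 D) intermediate tensor. Toy sizes throughout.
import numpy as np

rng = np.random.default_rng(0)
L, D = 6, 4
Q = rng.normal(size=(L, D))   # queries
E = rng.normal(size=(L, D))   # E[r] embeds relative distance r - (L - 1)

S = Q @ E.T                   # (L, L): S[i, r] = q_i . e_r

# Pad one column on the left, reshape, drop the first row. Afterwards
# rel[i, j] = q_i . E[(j - i) + (L - 1)] for every causal position j <= i.
padded = np.concatenate([np.zeros((L, 1)), S], axis=1)  # (L, L + 1)
rel = padded.reshape(L + 1, L)[1:, :]                   # (L, L)

for i in range(L):            # verify against the naive computation
    for j in range(i + 1):
        assert np.isclose(rel[i, j], Q[i] @ E[(j - i) + (L - 1)])
print("skewed relative logits match the naive computation")
```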
    A Learned Representation of Scalable Vector Graphics
    Rapha Gontijo Lopes
    David Ha
    Jon Shlens
    ICCV (2019)
    Dramatic advances in generative models have resulted in near photographic quality for artificially rendered faces, animals and other objects in the natural world. In spite of such advances, a higher level understanding of vision and imagery does not arise from exhaustively modeling an object, but instead identifying higher-level attributes that best summarize the aspects of an object. In this work we attempt to model the drawing process of fonts by building sequential generative models of vector graphics. This model has the benefit of providing a scale-invariant representation for imagery whose latent representation may be systematically manipulated and exploited to perform style propagation. We demonstrate these results on a large dataset of fonts crawled from the web and highlight how such a model captures the statistical dependencies and richness of this dataset. We envision that our model can find use as a tool for graphic designers to facilitate font design.
    Generating musical audio directly with neural networks is notoriously difficult because it requires coherently modeling both long- and short-term structure. Fortunately, most music is also highly structured and primarily composed of discrete note events played on musical instruments. Herein, we show that by using notes as an intermediate representation, we can train a suite of models capable of transcribing, composing, and synthesizing audio waveforms with coherent musical structure on timescales spanning six orders of magnitude (~0.01 ms (8 kHz) to ~100 s). This large advance in the state of the art is enabled by our release of the new MAESTRO (MIDI and Audio Edited for Synchronous TRacks and Organization) dataset, composed of over 172 hours of virtuosic piano performances captured with fine alignment (~3 ms) between note labels and audio waveforms. The networks and the dataset together present a promising approach toward creating new expressive and interpretable neural models of music.
    Unsupervised Hierarchical Story Infilling
    David Grangier
    Chris Callison-Burch
    NAACL 2019 Workshop on Narrative Understanding, Minneapolis, MN (2019)
    Story infilling involves predicting words to go into a missing span from a story. This challenging task has the potential to transform interactive tools for creative writing. However, state-of-the-art conditional language models have trouble balancing fluency and coherence with novelty and diversity. We address this limitation with a hierarchical model which first selects a set of rare words and then generates text conditioned on that set. By relegating the high entropy task of picking rare words to a word-sampling model, the second-stage model conditioned on those words can achieve high fluency and coherence by searching for likely sentences, without sacrificing diversity.
    Creative generative machine learning interfaces are stronger when multiple actors bearing different points of view actively contribute to them. User experience (UX) research and design involvement in the creation of machine learning (ML) models help ML research scientists to more effectively identify human needs that ML models will fulfill. The People and AI Research (PAIR) group within Google developed a novel program method in which UXers are embedded into an ML research group for three months to provide a human-centered perspective on the creation of ML models. The first full-time cohort of UXers were embedded in a team of ML research scientists focused on deep generative models to assist in music composition. Here, we discuss the structure and goals of the program, challenges we faced during execution, and insights gained as a result of the process. We offer practical suggestions for how to foster communication between UX and ML research teams and recommended UX design processes for building creative generative machine learning interfaces.
    We present sketch-rnn, a recurrent neural network (RNN) able to construct stroke-based drawings of common objects. The model is trained on thousands of crude human-drawn images representing hundreds of classes. We outline a framework for conditional and unconditional sketch generation, and describe new robust training methods for generating coherent sketch drawings in a vector format.
    Visualizing Music Self-Attention
    Monica Dinculescu
    Ashish Vaswani
    NIPS Workshop on Interpretability and Robustness in Audio, Speech, and Language (2018)
    Like language, music can be represented as a sequence of discrete symbols that form a hierarchical syntax, with notes being roughly like characters and motifs of notes like words. Unlike text however, music relies heavily on repetition on multiple timescales to build structure and meaning. The Music Transformer has shown compelling results in generating music with structure (Huang et al., 2018). In this paper, we introduce a tool for visualizing self-attention on polyphonic music with an interactive pianoroll. We use the Music Transformer as both a descriptive tool and a generative model. For the former, we use it to analyze existing music to see if the resulting self-attention structure corroborates the musical structure known from music theory. For the latter, we inspect the model's self-attention during generation, in order to understand how past notes affect future ones. We also compare and contrast the attention structure of regular attention to that of relative attention (Shaw et al., 2018; Huang et al., 2018), and examine its impact on the resulting generated music. For example, for the JSB Chorales dataset, a model trained with relative attention is more consistent in attending to all the voices in the preceding timestep and the chords before, and at cadences to the beginning of a phrase, allowing it to create an arc. We hope that our analyses will offer more evidence for relative self-attention as a powerful inductive bias for modeling music. We invite the reader to check out video animations of music attention and interact with the visualizations at https://storage.googleapis.com/nips-workshop-visualization/index.html.
    We argue for the benefit of designing deep generative models through mixed-initiative combinations of deep learning algorithms and human specifications for authoring sequential content, such as stories and music. Sequence models have shown increasingly convincing results in domains such as auto-completion, speech to text, and translation; however, longer-term structure remains a major challenge. Given lengthy inputs and outputs, deep generative systems still lack reliable representations of beginnings, middles, and ends, which are standard aspects of creating content in domains such as music composition. This paper aims to contribute a framework for mixed-initiative learning approaches, specifically for creative deep generative systems, and presents a case study of a deep generative model for music, Counterpoint by Convolutional Neural Network (Coconet).
    Advances in machine learning have the potential to radically reshape interactions between humans and computers. Deep learning makes it possible to discover powerful representations that are capable of capturing the latent structure of high-dimensional data such as music. By creating interactive latent space “palettes” of musical sequences and timbres, we demonstrate interfaces for musical creation made possible by machine learning. We introduce an interface to the intuitive, low-dimensional control spaces for high-dimensional note sequences, allowing users to explore a compositional space of melodies or drum beats in a simple 2-D grid. Furthermore, users can define 1-D trajectories in the 2-D space for autonomous, continuous morphing during improvisation. Similarly for timbre, our interface to a learned latent space of audio provides an intuitive and smooth search space for morphing between the timbres of different instruments. We remove technical and computational barriers by embedding pre-trained networks into a browser-based GPU-accelerated framework, making the systems accessible to a wide range of users while maintaining potential for creative flexibility and personalization.
    Learning via Social Awareness: Improving a Deep Generative Sketching Model with Facial Feedback
    Natasha Jaques
    Jennifer McCleary
    David Ha
    Fred Bertsch
    Rosalind Picard
    ICLR 2018 Workshop
    In the quest towards general artificial intelligence (AI), researchers have explored developing loss functions that function as intrinsic motivators in the absence of external rewards. This paper takes the position that current research has overlooked an important and useful intrinsic motivator: social interaction. We posit that making an AI agent aware of implicit social feedback from humans can allow for more rapid learning of more generalizable and useful representations, and could potentially impact AI safety. We collect social feedback in the form of facial expression reactions to samples from Sketch RNN, an LSTM-based variational autoencoder designed to produce sketch drawings. We use a Latent Constraints GAN (LC-GAN) to learn from the facial feedback of a small group of viewers, and then show in an independent evaluation with 76 users that this model produced sketches that led to significantly more smiling and less frowning than the baseline. Thus, we establish that implicit social feedback can improve the output of a deep learning model.
    Onsets and Frames: Dual-Objective Piano Transcription
    Curtis Hawthorne
    Erich Elsen
    Jialin Song
    Ian Simon
    Colin Raffel
    Sageev Oore
    Proceedings of the 19th International Society for Music Information Retrieval Conference (ISMIR 2018), Paris, France
    We advance the state of the art in polyphonic piano music transcription by using a deep convolutional and recurrent neural network which is trained to jointly predict onsets and frames. Our model predicts pitch onset events and then uses those predictions to condition framewise pitch predictions. During inference, we restrict the predictions from the framewise detector by not allowing a new note to start unless the onset detector also agrees that an onset for that pitch is present in the frame. We focus on improving onsets and offsets together instead of either in isolation as we believe this correlates better with human musical perception. Our approach results in over a 100% relative improvement in note F1 score (with offsets) on the MAPS dataset. Furthermore, we extend the model to predict relative velocities of normalized audio which results in more natural-sounding transcriptions.
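The inference-time restriction is simple to state in code: a framewise activation may start a new note only where the onset detector also fires. Below is a toy single-pitch sketch; the 0.5 threshold and the probability arrays are invented for illustration.

```python
# Gate frame predictions with onset predictions: a note may begin only
# at time steps where the onset detector agrees. Values are invented.
import numpy as np

frames = np.array([0.2, 0.8, 0.9, 0.9, 0.3, 0.9, 0.9])  # framewise probs
onsets = np.array([0.1, 0.9, 0.2, 0.1, 0.1, 0.2, 0.1])  # onset probs
thresh = 0.5

active = np.zeros(len(frames), dtype=bool)
for t in range(len(frames)):
    if frames[t] > thresh:
        continuing = t > 0 and active[t - 1]
        active[t] = continuing or onsets[t] > thresh  # new note needs an onset
print(active)  # second burst (t = 5, 6) is suppressed: no onset detected
```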
    A Hierarchical Latent Vector Model for Learning Long-Term Structure in Music
    Colin Raffel
    Curtis Hawthorne
    International Conference on Machine Learning (ICML) (2018)
    The Variational Autoencoder (VAE) has proven to be an effective model for producing semantically meaningful latent representations for natural data. However, it has thus far seen limited application to sequential data, and, as we demonstrate, existing recurrent VAE models have difficulty modeling sequences with long-term structure. To address this issue, we propose the use of a hierarchical decoder, which first outputs embeddings for subsequences of the input and then uses these embeddings to generate each subsequence independently. This structure encourages the model to utilize its latent code, thereby avoiding the "posterior collapse" problem which remains an issue for recurrent VAEs. We apply this architecture to modeling sequences of musical notes and find that it exhibits dramatically better sampling, interpolation, and reconstruction performance than a "flat" baseline model. An implementation of our "MusicVAE" is available online at https://goo.gl/magenta/musicvae-code.
    Learning via social awareness: improving sketch representations with facial feedback
    Natasha Jaques
    David Ha
    Fred Bertsch
    Rosalind Picard
    International Conference on Learning Representations (2018)
    In the quest towards general artificial intelligence (AI), researchers have explored developing loss functions that act as intrinsic motivators in the absence of external rewards. This paper argues that such research has overlooked an important and useful intrinsic motivator: social interaction. We posit that making an AI agent aware of implicit social feedback from humans can allow for faster learning of more generalizable and useful representations, and could potentially impact AI safety. We collect social feedback in the form of facial expression reactions to samples from Sketch RNN, an LSTM-based variational autoencoder (VAE) designed to produce sketch drawings. We use a Latent Constraints GAN (LC-GAN) to learn from the facial feedback of a small group of viewers, and then show in an independent evaluation with 76 users that this model produced sketches that led to significantly more positive facial expressions. Thus, we establish that implicit social feedback can improve the output of a deep learning model.
    Sequence Tutor: Conservative Fine-Tuning of Sequence Generation Models with KL-control
    Natasha Jaques
    Shixiang Gu
    Dzmitry Bahdanau
    José Miguel Hernández-Lobato
    Richard E. Turner
    ICML (2017)
    This paper proposes a general method for improving the structure and quality of sequences generated by a recurrent neural network (RNN), while maintaining information originally learned from data, as well as sample diversity. An RNN is first pre-trained on data using maximum likelihood estimation (MLE), and the probability distribution over the next token in the sequence learned by this model is treated as a prior policy. Another RNN is then trained using reinforcement learning (RL) to generate higher-quality outputs that account for domain-specific incentives while retaining proximity to the prior policy of the MLE RNN. To formalize this objective, we derive novel off-policy RL methods for RNNs from KL-control. The effectiveness of the approach is demonstrated on two applications: 1) generating novel musical melodies, and 2) computational molecular generation. For both problems, we show that the proposed method improves the desired properties and structure of the generated sequences, while maintaining information learned from data.
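The core objective can be summarized in one line: the RL reward mixes a task-specific term with a penalty for drifting from the pre-trained prior policy. A minimal sketch with made-up numbers follows; the mixing weight c is an assumption, and the paper derives several off-policy variants of this objective rather than applying it this directly.

```python
# KL-control-style reward: task reward plus a penalty for moving away
# from the MLE-trained prior. Numbers and the weight c are made up.
import math

def kl_control_reward(task_reward, log_p_prior, log_p_policy, c=0.5):
    """r(s, a) = r_task(s, a) + c * (log p_prior(a|s) - log p_policy(a|s)).
    Summed over a trajectory, the second term is a KL penalty keeping the
    fine-tuned policy close to the prior."""
    return task_reward + c * (log_p_prior - log_p_policy)

# An action the current policy over-uses relative to the prior is penalized:
print(kl_control_reward(1.0, math.log(0.30), math.log(0.60)))  # < 1.0
```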
    Tuning Recurrent Neural Networks With Reinforcement Learning
    Natasha Jaques
    Shixiang Gu
    Dzmitry Bahdanau
    José Miguel Hernández-Lobato
    Richard E. Turner
    ICLR Workshop (2017)
    This paper proposes a general method for improving the structure and quality of sequences generated by a recurrent neural network (RNN), while maintaining information originally learned from data, as well as sample diversity. An RNN is first pre-trained on data using maximum likelihood estimation (MLE), and the probability distribution over the next token in the sequence learned by this model is treated as a prior policy. Another RNN is then trained using reinforcement learning (RL) to generate higher-quality outputs that account for domain-specific incentives while retaining proximity to the prior policy of the MLE RNN. To formalize this objective, we derive novel off-policy RL methods for RNNs from KL-control. The effectiveness of the approach is demonstrated on two applications: 1) generating novel musical melodies, and 2) computational molecular generation. For both problems, we show that the proposed method improves the desired properties and structure of the generated sequences, while maintaining information learned from data.
    Counterpoint by Convolution
    Tim Cooijmans
    Aaron Courville
    Proceedings of ISMIR 2017
    Machine learning models of music typically break down the task of composition into a chronological process, composing a piece of music in a single pass from beginning to end. On the contrary, human composers write music in a nonlinear fashion, scribbling motifs here and there, often revisiting choices previously made. We explore the use of blocked Gibbs sampling as an analogue to the human approach, and introduce COCONET, a convolutional neural network in the NADE family of generative models (Uria et al., 2016). Despite ostensibly sampling from the same distribution as the NADE ancestral sampling procedure, we find that a blocked Gibbs approach significantly improves sample quality. We provide evidence that this is due to some conditional distributions being poorly modeled. Moreover, we show that even the cheap approximate blocked Gibbs procedure from Yao et al. (2014) yields better samples than ancestral sampling. We demonstrate the versatility of our method on unconditioned polyphonic music generation.
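A schematic of the blocked Gibbs loop, with a placeholder where the trained COCONET network would go: repeatedly choose a block of piano-roll cells, mask them, and have the model re-sample them conditioned on the rest. Shapes, block size, and sweep count are invented so the sketch runs end to end.

```python
# Blocked Gibbs sampling skeleton. `model_sample` is a stand-in for a
# trained NADE-family network; here it fills masked cells at random.
import numpy as np

rng = np.random.default_rng(0)
T, P = 32, 46                                  # timesteps x pitches

def model_sample(roll, mask):
    """Placeholder for sampling p(masked cells | visible cells)."""
    filled = roll.copy()
    filled[mask] = rng.integers(0, 2, size=int(mask.sum()))
    return filled

roll = rng.integers(0, 2, size=(T, P))         # random initialization
for _ in range(100):                           # Gibbs sweeps
    mask = rng.random((T, P)) < 0.25           # block to resample
    roll = model_sample(roll, mask)
print(int(roll.sum()), "notes after blocked Gibbs resampling")
```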
    Learning to Create Piano Performances
    Sageev Oore
    Ian Simon
    Sander Dieleman
    NIPS 2017 Workshop on Machine Learning and Creativity
    Nearly all previous work on music generation has focused on creating pieces that are, effectively, musical scores. In contrast, we learn to create piano performances: besides predicting the notes to be played, we also predict expressive variations in the timing and musical dynamics (loudness). We provided samples generated by our system for informal feedback to a set of professional musicians and composers, and the samples were well-received. Overall, the comments indicate that our system is generating music that, while lacking high-level structure, does indeed sound very much like human performance, and is closely reminiscent of the classical piano repertoire.
    Improving image generative models with human interactions
    Andrew Lampinen
    David Richard So
    Fred Bertsch
    arXiv (2017)
    GANs provide a framework for training generative models which mimic a data distribution. However, in many cases we wish to train a generative model to optimize some auxiliary objective function within the data it generates, such as making more aesthetically pleasing images. In some cases, these objective functions are difficult to evaluate, e.g. they may require human interaction. Here, we develop a system for efficiently training a GAN to increase a generic rate of positive user interactions, which could represent aesthetic ratings or any other objective. To do this, we build a model of human behavior in the targeted domain from a relatively small set of interactions, and then use this behavioral model as an auxiliary loss function to improve the generative model. As a proof of concept, we demonstrate that this system is successful at improving positive interaction rates simulated from a variety of objectives, and characterize some factors that affect its performance.
    In this work we develop recurrent variational autoencoders (VAEs) trained to reproduce short musical sequences and demonstrate their use as a creative device both via random sampling and data interpolation. Furthermore, by using a novel hierarchical decoder, we show that we are able to model long sequences with musical structure for both individual instruments and a three-piece band (lead, bass, and drums). Finally, we demonstrate the effectiveness of scheduled sampling in significantly improving our reconstruction accuracy.
    Online and Linear-Time Attention by Enforcing Monotonic Alignments
    Colin Raffel
    Peter Liu
    Thirty-fourth International Conference on Machine Learning (2017)
    Recurrent neural network models with an attention mechanism have proven to be extremely effective on a wide variety of sequence-to-sequence problems. However, the fact that soft attention mechanisms perform a pass over the entire input sequence when producing each element in the output sequence precludes their use in online settings and results in a quadratic time complexity. Based on the insight that the alignment between input and output sequence elements is monotonic in many problems of interest, we propose an end-to-end differentiable method for learning monotonic alignments which, at test time, enables computing attention online and in linear time. We validate our approach on sentence summarization, machine translation, and online speech recognition problems and achieve results competitive with existing sequence-to-sequence models.
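Test-time decoding can be sketched in a few lines: the attention head scans monotonically forward from where it last stopped and commits to the first input whose selection probability crosses 0.5, which is what makes attention online and linear-time. The energies below are random stand-ins for a learned energy function.

```python
# Hard monotonic attention at test time: a forward-only pointer that never
# revisits earlier inputs. p_choose is a random stand-in for learned energies.
import numpy as np

rng = np.random.default_rng(1)
T_in, T_out = 10, 4
p_choose = 1 / (1 + np.exp(-rng.normal(size=(T_out, T_in))))  # sigmoids

pos, alignment = 0, []
for i in range(T_out):
    while pos < T_in - 1 and p_choose[i, pos] < 0.5:
        pos += 1                  # skip inputs the model declines to attend to
    alignment.append(pos)         # commit; the pointer never moves backwards
print(alignment)                  # a non-decreasing alignment, computed online
```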
    Generative models in vision have seen rapid progress due to algorithmic improvements and the availability of high-quality image datasets. In this paper, we offer contributions in both these areas to enable similar progress in audio modeling. First, we detail a powerful new WaveNet-style autoencoder model that conditions an autoregressive decoder on temporal codes learned from the raw audio waveform. Second, we introduce NSynth, a large-scale and high-quality dataset of musical notes that is an order of magnitude larger than comparable public datasets. Using NSynth, we demonstrate improved qualitative and quantitative performance of the WaveNet autoencoder over a well-tuned spectral autoencoder baseline. Finally, we show that the model learns a manifold of embeddings that allows for morphing between instruments, meaningfully interpolating in timbre to create new types of sounds that are realistic and expressive.
    Deep Music: Towards Musical Dialogue
    Mason Bretan
    Sageev Oore
    Larry Heck
    AAAI (2017)
    Computer dialogue systems are designed with the intention of supporting meaningful interactions with humans. Common modes of communication include speech, text, and physical gestures. In this work we explore a communication paradigm in which the input and output channels consist of music. Specifically, we examine the musical interaction scenario of call and response. We present a system that utilizes a deep autoencoder to learn semantic embeddings of musical input. The system learns to transform these embeddings in a manner such that reconstructing from these transformation vectors produces appropriate musical responses. In order to generate a response the system employs a combination of generation and unit selection. Selection is based on a nearest neighbor search within the embedding space, and for real-time application the search space is pruned using vector quantization. The live demo consists of a person playing a MIDI keyboard and the computer generating a response that is played through a loudspeaker.
    Audio Deepdream: Optimizing raw audio with convolutional networks
    Cinjon Resnick
    Diego Ardila
    International Society for Music Information Retrieval Conference (2016)
    The hallucinatory images of DeepDream opened up the floodgates for a recent wave of artwork generated by neural networks. In this work, we take first steps toward applying this to audio. We believe a key to solving this problem is training a deep neural network to perform a music perception task on raw audio. Consequently, we have followed in the footsteps of van den Oord et al. and trained a network to predict embeddings that were themselves the result of a collaborative filtering model. A key difference is that we learn features directly from the raw audio, which creates a chain of differentiable functions from raw audio to high level features. We then use gradient descent on the network to extract samples of "dreamed" audio.
    Generating Music by Fine-Tuning Recurrent Neural Networks with Reinforcement Learning
    Natasha Jaques
    Shixiang Gu
    Richard E. Turner
    Deep Reinforcement Learning Workshop, NIPS (2016)
    Supervised learning with next-step prediction is a common way to train a sequence prediction model; however, it suffers from known failure modes, and it is notoriously difficult to train models to learn certain properties, such as having a coherent global structure. Reinforcement learning can be used to impose arbitrary properties on generated data by choosing appropriate reward functions. In this paper we propose a novel approach for sequence training, where we refine a sequence predictor by optimizing for some imposed reward functions, while maintaining good predictive properties learned from data. We propose efficient ways to solve this by augmenting deep Q-learning with a cross-entropy reward and deriving novel off-policy methods for RNNs from stochastic optimal control (SOC). We explore the usefulness of our approach in the context of music generation. An LSTM is trained on a large corpus of songs to predict the next note in a musical sequence. This Note-RNN is then refined using RL, where the reward function is a combination of rewards based on rules of music theory, as well as the output of another trained Note-RNN. We show that this combination of ML and RL can not only produce more pleasing melodies, but that it can significantly reduce unwanted behaviors and failure modes of the RNN.
    Tuning Recurrent Neural Networks with Reinforcement Learning
    Natasha Jaques
    Shixiang Shane Gu
    Richard E. Turner
    Proceedings of the International Conference on Learning Representations (ICLR) (2016)
    The approach of training sequence models using supervised learning and next-step prediction suffers from known failure modes. For example, it is notoriously difficult to ensure multi-step generated sequences have coherent global structure. We propose a novel sequence-learning approach in which we use a pre-trained Recurrent Neural Network (RNN) to supply part of the reward value in a Reinforcement Learning (RL) model. Thus, we can refine a sequence predictor by optimizing for some imposed reward functions, while maintaining good predictive properties learned from data. We propose efficient ways to solve this by augmenting deep Q-learning with a cross-entropy reward and deriving novel off-policy methods for RNNs from KL control. We explore the usefulness of our approach in the context of music generation. An LSTM is trained on a large corpus of songs to predict the next note in a musical sequence. This Note-RNN is then refined using our method and rules of music theory. We show that by combining maximum likelihood (ML) and RL in this way, we can not only produce more pleasing melodies, but significantly reduce unwanted behaviors and failure modes of the RNN, while maintaining information learned from data.
    Building Musically-relevant Audio Features through Multiple Timescale Representations
    Yoshua Bengio
    Proceedings of the 13th International Society for Music Information Retrieval Conference, Porto, Portugal (2012)
    Temporal pooling and multiscale learning for automatic annotation and ranking of music audio
    Simon Lemieux
    Yoshua Bengio
    International Society for Music Information Retrieval (ISMIR 2011)
    The Need for Music Information Retrieval with User-Centered and Multimodal Strategies
    Cynthia C.S. Liem
    Meinard Müller
    George Tzanetakis
    Alan Hanjalic
    MIRUM '11, ACM, Scottsdale, Arizona (2011), pp. 1-6
    Music is a widely enjoyed content type, existing in many multifaceted representations. With the digital information age, a lot of digitized music information has theoretically become available at the user’s fingertips. However, the abundance of information is too large-scale and too diverse to annotate, oversee and present in a consistent and human manner, motivating the development of automated Music Information Retrieval (Music-IR) techniques. In this paper, we encourage the community to consider music content beyond a monomodal audio signal and argue that Music-IR approaches with multimodal and user-centered strategies are necessary to serve real-life usage patterns and maintain and improve accessibility of digital music data. After discussing relevant existing work in these directions, we show that the field of Music-IR faces similar challenges as neighboring fields, and thus suggest opportunities for joint collaboration and mutual inspiration.
    Probabilistic Models for Melodic Prediction
    Jean-Francois Paiement
    Samy Bengio
    Artificial Intelligence Journal, 173 (2009), pp. 1266-1274
    Chord progressions are the building blocks from which tonal music is constructed. The choice of a particular representation for chords has a strong impact on statistical modeling of the dependence between chord symbols and the actual sequences of notes in polyphonic music. Melodic prediction is used in this paper as a benchmark task to evaluate the quality of four chord representations using two probabilistic model architectures derived from Input/Output Hidden Markov Models (IOHMMs). Likelihoods and conditional and unconditional prediction error rates are used as complementary measures of the quality of each of the proposed chord representations. We observe empirically that different chord representations are optimal depending on the chosen evaluation metric. Also, representing chords only by their roots appears to be a good compromise in most of the reported experiments.
    A Distance Model for Rhythms
    Jean-Francois Paiement
    Yves Grandvalet
    Samy Bengio
    International Conference on Machine Learning (ICML) (2008)
    Modeling long-term dependencies in time series has proved very difficult to achieve with traditional machine learning methods. This problem occurs when considering music data. In this paper, we introduce a model for rhythms based on the distributions of distances between subsequences. A specific implementation of the model when considering Hamming distances over a simple rhythm representation is described. The proposed model consistently outperforms a standard Hidden Markov Model in terms of conditional prediction accuracy on two different music databases.
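The model is built on distributions of distances between subsequences, so the underlying statistic is easy to show. Here is a toy sketch with an invented 16-step rhythm split into four 4-step bars, using Hamming distance as in the paper's concrete implementation.

```python
# Hamming distances between equal-length subsequences of a binary rhythm
# (1 = onset, 0 = rest). The rhythm and bar length are invented.
import numpy as np

rhythm = np.array([1,0,0,1, 1,0,0,1, 1,0,1,1, 1,0,0,1])
bar = 4
subseqs = rhythm.reshape(-1, bar)              # one row per bar

n = len(subseqs)
dist = np.zeros((n, n), dtype=int)
for i in range(n):
    for j in range(n):
        dist[i, j] = int(np.sum(subseqs[i] != subseqs[j]))
print(dist)  # the third bar differs from the others in exactly one position
```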
    A Generative Model for Rhythms
    Jean-Francois Paiement
    Samy Bengio
    Yves Grandvalet
    Neural Information Processing Systems, Workshop on Brain, Music and Cognition (2008)
    Modeling music involves capturing long-term dependencies in time series, which has proved very difficult to achieve with traditional statistical methods. The same problem occurs when only considering rhythms. In this paper, we introduce a generative model for rhythms based on the distributions of distances between subsequences. A specific implementation of the model when considering Hamming distances over a simple rhythm representation is described. The proposed model consistently outperforms a standard Hidden Markov Model in terms of conditional prediction accuracy on two different music databases.
    A Generative Model for Distance Patterns in Music
    Jean-Francois Paiement
    Yves Grandvalet
    Samy Bengio
    NIPS Workshop on Music, Brain and Cognition (2007)
    In order to cope with the difficult problem of long term dependencies in sequential data in general, and in musical data in particular, a generative model for distance patterns especially designed for music is introduced. A specific implementation of the model when considering Hamming distances over rhythms is described. The proposed model consistently outperforms a standard Hidden Markov Model in terms of conditional prediction accuracy over two different music databases.
    Acoustic Space Sampling and the Grand Piano in a Non-Anechoic Environment: a recordist-centric approach to musical acoustic study
    B. Leonard
    G. Sikora
    M. De Francisco
    129th Audio Engineering Society (AES) Convention, London (2010)
    An Infinite Factor Model Hierarchy Via a Noisy-Or Mechanism
    A. Courville
    Y. Bengio
    Neural Information Processing Systems Conference 22 (NIPS'09) (2010)
    Steerable Playlist Generation by Learning Song Similarity from Radio Station Playlists
    F. Maillet
    G. Desjardins
    P. Lamere
    Proceedings of the 10th International Conference on Music Information Retrieval (ISMIR 2009)
    Automatic identification of instrument classes in polyphonic and poly-instrument audio
    P. Hamel
    S. Wood
    Proceedings of the 10th International Conference on Music Information Retrieval (ISMIR 2009)
    Towards a musical beat emphasis function
    M. Davies
    M. Plumbley
    Proceedings of IEEE WASPAA, New Paltz, NY (2009)
    A generative model for rhythms
    J.-F. Paiement
    Y. Grandvalet
    S. Bengio
    ICML '08: Proceedings of the 25th International Conference on Machine Learning (2008)
    Automatic generation of social tags for music recommendation
    P. Lamere
    T. Bertin-Mahieux
    S. Green
    Neural Information Processing Systems Conference 20 (NIPS'07) (2008)
    On the use of Sparse Time Relative Auditory Codes for Music
    P-A. Manzagol
    T. Bertin-Mahieux
    Proceedings of the 9th International Conference on Music Information Retrieval (ISMIR 2008)
    Autotagger: A Model For Predicting Social Tags from Acoustic Features on Large Music Databases
    T. Bertin-Mahieux
    F. Maillet
    P. Lamere
    Journal of New Music Research, 37 (2008), pp. 115-135
    Autotagging music using supervised machine learning
    T. Bertin-Mahieux
    P. Lamere
    Proceedings of the 8th International Conference on Music Information Retrieval (ISMIR 2007)
    Can't get you out of my head: A connectionist model of cyclic rehearsal
    H. Jaeger
    Modeling Communications with Robots and Virtual Humans, Springer-Verlag (2007)
    Using 3D Visualizations to Explore and Discover Music
    P. Lamere
    Proceedings of the 8th International Conference on Music Information Retrieval (ISMIR 2007)
    A Supervised Classification Algorithm For Note Onset Detection
    A. Lacoste
    EURASIP Journal on Applied Signal Processing, 2007 (2007), pp. 1-13
    Beat Tracking Using an Autocorrelation Phase Matrix
    Proceedings of the 2007 International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE Signal Processing Society, pp. 1313-1316
    Probabilistic Melodic Harmonization
    J.-F. Paiement
    S. Bengio
    Advances in Artificial Intelligence: 19th Conference of the Canadian Society for Computational Studies of Intelligence, Canadian AI, Lecture Notes in Computer Science, Springer-Verlag (2006), pp. 218-229
    We propose a representation for musical chords that allows us to include domain knowledge in probabilistic models. We then introduce a graphical model for harmonization of melodies that considers every structural component in chord notation. We show empirically that root note progressions exhibit global dependencies that can be better captured with a tree structure related to the meter than with a simple dynamical HMM that concentrates on local dependencies. However, a local model seems to be sufficient for generating proper harmonizations when root note progressions are provided. The trained probabilistic models can be sampled to generate very interesting chord progressions given other polyphonic music components such as melody or root note progressions.
    Probabilistic Melodic Harmonization
    J.-F. Paiement
    S. Bengio
    Canadian Conference on AI, Springer (2006), pp. 218-229
    Aggregate Features and AdaBoost for Music Classification
    J. Bergstra
    N. Casagrande
    D. Erhan
    B. Kégl
    Machine Learning, 65 (2006), pp. 473-484
    Beat Induction Using an Autocorrelation Phase Matrix
    The Proceedings of the 9th International Conference on Music Perception and Cognition (ICMPC9), Causal Productions (2006), pp. 931-932
    Predicting genre labels for artists using FreeDB
    J. Bergstra
    A. Lacoste
    Proceedings of the 7th International Conference on Music Information Retrieval (ISMIR 2006), pp. 85-88
    Finding Long-Timescale Musical Structure with an Autocorrelation Phase Matrix
    Music Perception, 24 (2006), pp. 167-176
    A Graphical Model for Chord Progressions Embedded in a Psychoacoustic Space
    J.-F. Paiement
    S. Bengio
    D. Barber
    International Conference on Machine Learning, ICML (2005)
    Chord progressions are the building blocks from which tonal music is constructed. Inferring chord progressions is thus an essential step towards modeling long term dependencies in music. In this paper, a distributed representation for chords is designed such that Euclidean distances roughly correspond to psychoacoustic dissimilarities. Parameters in the graphical models are learnt with the EM algorithm and the classical Junction Tree algorithm. Various model architectures are compared in terms of conditional out-of-sample likelihood. Both perceptual and statistical evidence show that binary trees related to meter are well suited to capture chord dependencies.
    A graphical model for chord progressions embedded in a psychoacoustic space
    J.-F. Paiement
    S. Bengio
    D. Barber
    ICML '05: Proceedings of the 22nd international conference on Machine learning, ACM Press, New York, NY, USA (2005), pp. 641-648
    A Probabilistic Model for Chord Progressions
    J.-F. Paiement
    S. Bengio
    Proceedings of the 6th International Conference on Music Information Retrieval (ISMIR 2005), London: University of London, pp. 312-319
    Frame-Level Audio Feature Extraction using AdaBoost
    N. Casagrande
    B. Kégl
    Proceedings of the 6th International Conference on Music Information Retrieval (ISMIR 2005), London: University of London, pp. 345-350
    Editorial: New Research in Rhythm Perception and Production
    S. K. Scott
    Music Perception, 22 (2005), pp. 371-388
    A Probabilistic Model for Chord Progressions
    J.-F. Paiement
    S. Bengio
    International Conference on Music Information Retrieval, ISMIR (2005)
    Chord progressions are the building blocks from which tonal music is constructed. Inferring chord progressions is thus an essential step towards modeling long term dependencies in music. In this paper, a distributed representation for chords is designed such that Euclidean distances roughly correspond to psychoacoustic dissimilarities. Estimated probabilities of chord substitutions are derived from this representation and are used to introduce smoothing in graphical models observing chord progressions. Parameters in the graphical models are learnt with the EM algorithm and the classical Junction Tree algorithm is used for inference. Various model architectures are compared in terms of conditional out-of-sample likelihood. Both perceptual and statistical evidence show that binary trees related to meter are well suited to capture chord dependencies.
    Finding Meter in Music Using an Autocorrelation Phase Matrix and Shannon Entropy
    N. Casagrande
    Proceedings of the 6th International Conference on Music Information Retrieval (ISMIR 2005), London: University of London, pp. 504-509
    Geometry in Sound: A Speech/Music Audio Classifier Inspired by an Image Classifier
    N. Casagrande
    B. Kegl
    Proceedings of the International Computer Music Conference (ICMC) (2005), pp. 207-210
    Music Perception, Guest Editor, Special Issue on Rhythm Perception and Production
    S. K. Scott
    Music Perception, 22(3) (2005)
    Biologically Plausible Speech Recognition with LSTM Neural Nets
    A. Graves
    N. Beringer
    J. Schmidhuber
    Proceedings of the First Int'l Workshop on Biologically Inspired Approaches to Advanced Information Technology (Bio-ADIT) (2004), pp. 127-136
    Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) are local in space and time and closely related to a biological model of memory in the prefrontal cortex. Not only are they more biologically plausible than previous artificial RNNs, they also outperformed them on many artificially generated sequential processing tasks. This encouraged us to apply LSTM to more realistic problems, such as the recognition of spoken digits. Without any modification of the underlying algorithm, we achieved results comparable to state-of-the-art Hidden Markov Model (HMM) based recognisers on both the TIDIGITS and TI46 speech corpora. We conclude that LSTM should be further investigated as a biologically plausible basis for a bottom-up, neural net-based approach to speech recognition.
    A Machine-Learning Approach to Musical Sequence Induction That Uses Autocorrelation to Bridge Long Timelags
    The Proceedings of the Eighth International Conference on Music Perception and Cognition (ICMPC8), Causal Productions, Adelaide(2004), pp. 542-543
    Preview abstract One major challenge in using statistical sequence learning methods in the domain of music lies in bridging the long timelags that separate important musical events. Consider, for example, the chord changes that convey the basic structure of a pop song. A sequence learner that cannot predict chord changes will almost certainly not be able to generate new examples in a musical style or to categorize songs by style. Yet it is surprisingly difficult for a sequence learner to bridge the long timelags necessary to identify when a chord change will occur and what its new value will be. This is the case because chord changes can be separated by dozens or hundreds of intervening notes. One could solve this problem by treating chords as being special (as did Mozer, NIPS 1991), but this is impractical: it requires chords to be labeled specially in the dataset, limiting the applicability of the model to non-labeled examples, and furthermore it does not address the general issue of nested temporal structure in music. I will briefly describe this temporal structure (known commonly as "meter") and present a model that uses to its advantage an assumption that sequences are metrical. The model consists of an autocorrelation-based filtration that estimates online the most likely metrical tree (i.e., the frequency and phase of beat, measure, phrase, etc.) and uses that to generate a series of sequences varying at different rates. These sequences correspond to each level in the hierarchy. Multiple learners can be used to treat each series separately, and their predictions can be combined to perform composition and categorization. I will present preliminary results that demonstrate the usefulness of this approach. Time permitting, I will also compare the model to alternate approaches. View details
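    The period-finding step at the heart of the filtration can be sketched in a few lines. This illustrative version recovers only the beat period of an idealized onset sequence; the full model also tracks phase and the higher levels of the metrical tree.

```python
# Sketch of the autocorrelation idea: estimate the beat period of a
# binary onset sequence by finding the lag with maximal self-similarity.
import numpy as np

onsets = np.tile([1.0, 0.0, 0.0], 16)       # isochronous pattern, period 3
n = len(onsets)
ac = np.correlate(onsets, onsets, mode='full')[n - 1:]  # lags 0 .. n-1
lag = 1 + np.argmax(ac[1:n // 2])           # skip lag 0, search small lags
print(lag)                                  # -> 3, the underlying period
```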
    Kalman filters improve LSTM network performance in problems unsolvable by traditional recurrent nets
    J.A. Pérez-Ortiz
    F. A. Gers
    J. Schmidhuber
    Neural Networks, 16(2003), pp. 241-250
    Preview abstract The Long Short-Term Memory (LSTM) network trained by gradient descent solves difficult problems which traditional recurrent neural networks in general cannot. We have recently observed that the decoupled extended Kalman filter training algorithm allows for even better performance, reducing significantly the number of training steps when compared to the original gradient descent training algorithm. In this paper we present a set of experiments which are unsolvable by classical recurrent networks but which are solved elegantly, robustly, and quickly by LSTM combined with Kalman filters. View details
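    For readers unfamiliar with the method, the following is a hedged sketch of one decoupled extended Kalman filter (DEKF) step for a scalar-output network, with the weights partitioned into independently filtered groups. In the papers the required gradients are obtained by backpropagating through the LSTM; here the function dekf_step and its per-group gradient inputs h_grads are illustrative stand-ins, not the authors' implementation.

```python
# Hedged sketch of one DEKF update for a scalar-output network whose
# weights are split into decoupled groups (one covariance per group).
import numpy as np

def dekf_step(weights, covs, h_grads, error, r=1.0, q=1e-4):
    """weights/covs/h_grads: per-group lists; error = target - output."""
    # Global innovation scale, summing contributions from every group.
    s = r + sum(h @ P @ h for h, P in zip(h_grads, covs))
    for i, (w, P, h) in enumerate(zip(weights, covs, h_grads)):
        k = (P @ h) / s                      # Kalman gain for this group
        weights[i] = w + k * error           # state (weight) update
        covs[i] = P - np.outer(k, h @ P) + q * np.eye(len(w))
    return weights, covs

# Toy usage with two weight groups of sizes 3 and 2:
ws = [np.zeros(3), np.zeros(2)]
Ps = [np.eye(3), np.eye(2)]
hs = [np.array([0.1, -0.2, 0.3]), np.array([0.5, 0.0])]
ws, Ps = dekf_step(ws, Ps, hs, error=1.0)
```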
    Finding Temporal Structure in Music: Blues Improvisation with LSTM Recurrent Networks
    J. Schmidhuber
    Neural Networks for Signal Processing XII, Proceedings of the 2002 IEEE Workshop, IEEE, New York, pp. 747-756
    Preview abstract Few types of signal streams are as ubiquitous as music. Here we consider the problem of extracting essential ingredients of music signals, such as well-defined global temporal structure in the form of nested periodicities (or meter). Can we construct an adaptive signal processing device that learns by example how to generate new instances of a given musical style? Because recurrent neural networks can in principle learn the temporal structure of a signal, they are good candidates for such a task. Unfortunately, music composed by standard recurrent neural networks (RNNs) often lacks global coherence. The reason for this failure seems to be that RNNs cannot keep track of temporally distant events that indicate global music structure. Long Short-Term Memory (LSTM) has succeeded in similar domains where other RNNs have failed, such as timing & counting and learning of context-sensitive languages. In the current study we show that LSTM is also a good mechanism for learning to compose music. We present experimental results showing that LSTM successfully learns a form of blues music and is able to compose novel (and we believe pleasing) melodies in that style. Remarkably, once the network has found the relevant structure it does not drift from it: LSTM is able to play the blues with good timing and proper structure as long as one is willing to listen. View details
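    A present-day reconstruction of the setup might look like the sketch below: a symbol-level LSTM trained to predict the next note/chord token. The original experiments used a purpose-built LSTM implementation and a hand-designed blues representation; the vocabulary size, layer dimensions, and dummy batch here are placeholders.

```python
# Sketch of next-symbol prediction with an LSTM, in the spirit of the
# blues-composition experiments. Sizes and data are illustrative only.
import torch
import torch.nn as nn

VOCAB = 26  # assumed: pitches + chord symbols + rest

class MelodyLSTM(nn.Module):
    def __init__(self, vocab=VOCAB, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab, 32)
        self.lstm = nn.LSTM(32, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab)

    def forward(self, x, state=None):
        h, state = self.lstm(self.embed(x), state)
        return self.out(h), state

model = MelodyLSTM()
loss_fn = nn.CrossEntropyLoss()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# One training step on a dummy batch of symbol sequences:
seqs = torch.randint(0, VOCAB, (8, 64))            # (batch, time)
opt.zero_grad()
logits, _ = model(seqs[:, :-1])                    # predict next symbol
loss = loss_fn(logits.reshape(-1, VOCAB), seqs[:, 1:].reshape(-1))
loss.backward()
opt.step()
```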
    Learning Nonregular Languages: A Comparison of Simple Recurrent Networks and LSTM
    J. Schmidhuber
    F.A. Gers
    Neural Computation, 14(2002), pp. 2039-2041
    Preview abstract In response to Rodriguez's recent article (Rodriguez 2001), we compare the performance of simple recurrent nets and "Long Short-Term Memory" (LSTM) recurrent nets on context-free and context-sensitive languages. View details
    Learning Context-Sensitive Languages with LSTM Trained with Kalman Filters
    F.A. Gers
    J.A. Pérez-Ortiz
    J. Schmidhuber
    Artificial Neural Networks -- ICANN 2002 (Proceedings), Springer, Berlin, pp. 655-660
    Preview abstract Unlike traditional recurrent neural networks, the Long Short-Term Memory (LSTM) model generalizes well when presented with training sequences derived from regular and also simple nonregular languages. Our novel combination of LSTM and the decoupled extended Kalman filter, however, learns even faster and generalizes even better, requiring only the 10 shortest exemplars (n <= 10) of the context-sensitive language a^n b^n c^n to deal correctly with values of n up to 1000 and more. Even when we consider the relatively high update complexity per timestep, in many cases the hybrid offers faster learning than LSTM by itself. View details
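    The training and test material for such experiments is easy to reproduce. The sketch below generates the short exemplars (n <= 10) used for training and a long string as a generalization probe; exact probe lengths in the paper may differ.

```python
# Strings of the context-sensitive language a^n b^n c^n.
def anbncn(n):
    return 'a' * n + 'b' * n + 'c' * n

train = [anbncn(n) for n in range(1, 11)]   # training set: n = 1..10
probe = anbncn(1000)                        # generalization probe: n = 1000
print(train[2])                             # -> 'aaabbbccc'
```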
    Finding Downbeats with a Relaxation Oscillator
    Psychological Research, 66(2002), pp. 18-25
    Preview abstract A relaxation oscillator model of neural spiking dynamics is applied to the task of finding downbeats in rhythmical patterns. The importance of downbeat discovery or beat induction is discussed, and the relaxation oscillator model is compared to other oscillator models. In a set of computer simulations the model is tested on 35 rhythmical patterns from Povel & Essens (1985). The model performs well, making good predictions in 34 of 35 cases. In an analysis we identify some shortcomings of the model and relate model behavior to dynamical properties of relaxation oscillators. View details
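    As a rough illustration of the ingredients, the sketch below integrates a single FitzHugh-Nagumo relaxation oscillator driven by a periodic pulse train. The parameter values and the drive are illustrative, not those of the paper's model.

```python
# Hedged sketch: a FitzHugh-Nagumo relaxation oscillator under rhythmic
# pulse input, integrated with the Euler method.
import numpy as np

def simulate(pulses, dt=0.01, eps=0.08, a=0.7, b=0.8):
    v, w, trace = -1.0, -0.5, []
    for I in pulses:
        dv = v - v ** 3 / 3 - w + I        # fast (voltage-like) variable
        dw = eps * (v + a - b * w)         # slow recovery variable
        v, w = v + dt * dv, w + dt * dw
        trace.append(v)
    return np.array(trace)

t = np.arange(0, 60, 0.01)
pulses = (t % 2.0 < 0.05) * 0.8            # a brief pulse every 2 time units
trace = simulate(pulses)
# Spikes in `trace` that align with pulses mark candidate downbeats.
```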
    DEKF-LSTM
    F.A. Gers
    J.A. Pérez-Ortiz
    J. Schmidhuber
    Proceedings of the 10th European Symposium on Artificial Neural Networks, ESANN 2002
    Learning The Long-Term Structure of the Blues
    J. Schmidhuber
    Artificial Neural Networks -- ICANN 2002 (Proceedings), Springer, Berlin, pp. 284-289
    Preview abstract In general, music composed by recurrent neural networks (RNNs) suffers from a lack of global structure. Though networks can learn note-by-note transition probabilities and even reproduce phrases, they have been unable to learn an entire musical form and use that knowledge to guide composition. In this study, we describe model details and present experimental results showing that LSTM successfully learns a form of blues music and is able to compose novel (and some listeners believe pleasing) melodies in that style. Remarkably, once the network has found the relevant structure it does not drift from it: LSTM is able to play the blues with good timing and proper structure as long as one is willing to listen. View details
    Improving Long-Term Online Prediction with Decoupled Extended Kalman Filters
    J.A. Pérez-Ortiz
    J. Schmidhuber
    F.A. Gers
    Artificial Neural Networks -- ICANN 2002 (Proceedings), Springer, Berlin, pp. 1055-1060
    Preview abstract Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform traditional RNNs when dealing with sequences involving not only short-term but also long-term dependencies. The decoupled extended Kalman filter learning algorithm (DEKF) works well in online environments and reduces significantly the number of training steps when compared to the standard gradient-descent algorithms. Previous work on LSTM, however, has always used a form of gradient descent and has not focused on true online situations. Here we combine LSTM with DEKF and show that this new hybrid improves upon the original learning algorithm when applied to online processing. View details
    A Network of Relaxation Oscillators that Finds Downbeats in Rhythms
    Artificial Neural Networks -- ICANN 2001 (Proceedings), Springer, Berlin, pp. 1239-1247
    Preview abstract A network of relaxation oscillators is used to find downbeats in rhythmical patterns. In this study, a novel model is described in detail. Its behavior is tested by exposing it to patterns having various levels of rhythmic complexity. We analyze the performance of the model and relate its success to previous work dealing with fast synchrony in coupled oscillators. View details
    A Positive-Evidence Model for Rhythmical Beat Induction
    Journal of New Music Research, 30(2001), pp. 187-200
    Preview abstract The Normalized Positive (NPOS) model is a rule-based model that predicts downbeat location and pattern complexity in rhythmical patterns. Though derived from several existing models, the NPOS model is particularly effective at making correct predictions while at the same time having low complexity. In this paper, the details of the model are explored and a comparison is made to existing models. Several datasets are used to examine the complexity predictions of the model. Special attention is paid to the model's ability to account for the effects of musical experience on beat induction. View details
    Applying LSTM to Time Series Predictable Through Time-Window Approaches
    F. A. Gers
    J. Schmidhuber
    Artificial Neural Networks -- ICANN 2001 (Proceedings), Springer, Berlin, pp. 669-676
    Preview abstract Long Short-Term Memory (LSTM) is able to solve many time series tasks unsolvable by feed-forward networks using fixed-size time windows. Here we find that LSTM's superiority does not carry over to certain simpler time series tasks solvable by time-window approaches: the Mackey-Glass series and the Santa Fe FIR laser emission series (Set A). This suggests using LSTM only when simpler traditional approaches fail. View details
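    The "time-window approach" that suffices on such series can be sketched directly: generate the Mackey-Glass series, then fit a predictor on fixed-size input windows. A plain least-squares fit is used below for brevity, where the literature typically uses a feed-forward network; step size and window length are illustrative.

```python
# Sketch of a time-window predictor for the Mackey-Glass series.
import numpy as np

def mackey_glass(n=3000, tau=17, beta=0.2, gamma=0.1, dt=1.0):
    """Euler integration of the Mackey-Glass delay equation."""
    x = np.full(n + tau, 1.2)
    for t in range(tau, n + tau - 1):
        x[t + 1] = x[t] + dt * (beta * x[t - tau] / (1 + x[t - tau] ** 10)
                                - gamma * x[t])
    return x[tau:]

series = mackey_glass()
W = 6                                           # time-window size
X = np.array([series[i:i + W] for i in range(len(series) - W)])
y = series[W:]                                  # next value after each window
coef, *_ = np.linalg.lstsq(X, y, rcond=None)    # least-squares fit
print(np.mean((X @ coef - y) ** 2))             # in-sample MSE
```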
    Meter Through Synchrony: Processing Rhythmical Patterns with Relaxation Oscillators
    Ph.D. Thesis, Indiana University, Bloomington, IN(2000)
    Preview abstract This dissertation uses a network of relaxation oscillators to beat along with temporal signals. Relaxation oscillators exhibit interspersed slow-fast movement and model a wide array of biological oscillations. The model is built up gradually: first a single relaxation oscillator is exposed to rhythms and shown to be good at finding downbeats in them. Then large networks of oscillators are mutually coupled in an exploration of their internal synchronization behavior. It is demonstrated that appropriate weights on coupling connections cause a network to form multiple pools of oscillators having stable phase relationships. This is a promising first step towards networks that can recreate a rhythmical pattern from memory. In the full model, a coupled network of relaxation oscillators is exposed to rhythmical patterns. It is shown that the network finds downbeats in patterns while continuing to exhibit good internal stability. A novel non-dynamical model of downbeat induction called the Normalized Positive (NP) clock model is proposed, analyzed, and used to generate comparison predictions for the oscillator model. The oscillator model compares favorably to other dynamical approaches to beat induction such as adaptive oscillators. However, the relaxation oscillator model takes advantage of intrinsic synchronization stability to allow the creation of large coupled networks. This research lays the groundwork for a long-term research goal, a robotic arm that responds to rhythmical signals by tapping along. It also opens the door to future work in connectionist learning of long rhythmical patterns. View details
    Dynamics and Embodiment in Beat Induction
    M. Gasser
    Robert Port
    Rhythm Perception and Production, Swets and Zeitlinger, Lisse, The Netherlands(2000), pp. 157-170
    Preview abstract We provide an argument for using dynamical systems theory in the domain of beat induction. We motivate the study of beat induction and relate it to the more general study of human rhythm cognition. In doing so we compare a dynamical, embodied approach to a symbolic (traditional AI) one, paying particular attention to how the modeling approach brings with it tacit assumptions about what is being modeled. Please note that this is a philosophy paper about research that was, at the time of writing, very much in progress. View details
    Learning Simple Metrical Preferences in a Network of Fitzhugh-Nagumo Oscillators
    The Proceedings of the Twenty-First Annual Conference of the Cognitive Science Society, Lawrence Erlbaum Associates, New Jersey(1999)
    Preview abstract Hebbian learning is used to train a network of oscillators to prefer periodic signals of pulses over aperiodic signals. Target signals consisted of metronome-like voltage pulses with varying amounts of inter-onset noise injected (0% noise yielding a periodic signal, and more noise yielding increasingly aperiodic signals). The oscillators, piecewise-linear approximations (Abbott, 1990) to Fitzhugh-Nagumo oscillators, are trained using mean phase coherence as an objective function. Before training, a network is shown to readily synchronize with signals having a wide range of noise. After training on a series of noise-free signals, a network is shown to synchronize only with signals having little or no noise. This represents a bias towards periodicity and is explained by strong positive coupling connections between oscillators having harmonically related periods. View details
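    The objective itself is compact enough to sketch: mean phase coherence measures how tightly the phase difference of two oscillators is concentrated, reaching 1 under perfect phase locking and falling toward 0 for unrelated phases. The signals below are synthetic illustrations, not the paper's stimuli.

```python
# Mean phase coherence of two phase series: R = |<exp(i * dphi)>|.
import numpy as np

def mean_phase_coherence(phi1, phi2):
    return np.abs(np.mean(np.exp(1j * (phi1 - phi2))))

t = np.linspace(0, 100, 5000)
locked = mean_phase_coherence(2 * np.pi * t, 2 * np.pi * t + 0.3)
random = mean_phase_coherence(2 * np.pi * t,
                              2 * np.pi * 50 * np.random.rand(5000))
print(locked, random)   # -> 1.0 vs. a value near 0
```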
    Meter as Mechanism: A Neural Network Model that Learns Metrical Patterns
    M. Gasser
    R. Port
    Connect. Sci., 11, no. 2(1999), pp. 187-216
    Preview abstract One kind of prosodic structure that apparently underlies both music and some examples of speech production is meter. Yet detailed measurements of the timing of both music and speech show that the nested periodicities that define metrical structure can be quite noisy in time. What kind of system could produce or perceive such variable metrical timing patterns? And what would it take to be able to store and reproduce particular metrical patterns from long-term memory? We have developed a network of coupled oscillators that both produces and perceives patterns of pulses that conform to particular meters. In addition, beginning with an initial state with no biases, it can learn to prefer the particular meter that it has been previously exposed to. View details
    An Exploration of Representational Complexity via Coupled Oscillators
    T. Chemero
    Proceedings of the Tenth Midwest Artificial Intelligence and Cognitive Science Society, MIT Press, Cambridge, Mass.(1999)
    Preview abstract We note some inconsistencies in a view of representation which takes decoupling to be of key importance. We explore these inconsistencies using examples of representational vehicles taken from coupled oscillator theory and suggest a new way to reconcile coupling with absence. Finally, we tie these views to a teleological definition of representation. View details
    Perception of Simple Rhythmic Patterns in a Network of Oscillators
    M. Gasser
    The Proceedings of the Eighteenth Annual Conference of the Cognitive Science Society, Lawrence Erlbaum Associates, New Jersey(1996)
    Preview abstract This paper is concerned with the complex capacity to recognize and reproduce rhythmic patterns. While this capacity has not been well investigated, in broad qualitative terms it is clear that people can learn to identify and produce recurring patterns defined in terms of sequences of beats of varying intensity and rests: the rhythms behind waltzes, reels, sambas, etc. Our short-term goal is a model which is "hard-wired" with knowledge of a set of such patterns. Presented with a portion of one of the patterns or a label for a pattern, the model should reproduce the pattern and continue to do so when the input is turned off. Our long-term goal is a model which can learn to adjust the connection strengths which implement particular patterns as it is exposed to input patterns. View details
    Representing Rhythmic Patterns in a Network of Oscillators
    M. Gasser
    The Proceedings of the International Conference on Music Perception and Cognition, Lawrence Erlbaum Associates, New Jersey(1996), pp. 361-366
    Preview abstract This paper describes an evolving computational model of the perception and production of simple rhythmic patterns. The model consists of a network of oscillators of different resting frequencies which couple with input patterns and with each other. Oscillators whose frequencies match periodicities in the input tend to become activated. Metrical structure is represented explicitly in the network in the form of clusters of oscillators whose frequencies and phase angles are constrained to maintain the harmonic relationships that characterize meter. Rests in rhythmic patterns are represented by explicit rest oscillators in the network, which become activated when an expected beat in the pattern fails to appear. The model makes predictions about the relative difficulty of patterns and the effect of deviations from periodicity in the input. View details