![Douglas Eck](https://storage.googleapis.com/gweb-research2023-media/pubtools/154.png)
Douglas Eck
Doug is a Senior Research Director at Google, and leads research efforts at Google DeepMind in Generative Media, including image, video, 3D, music and audio generation. He also leads a broader group active in areas including Fundamental Learning Algorithms, Natural Language Processing, Multimodal Learning, Reinforcement Learning, Computer Vision and Generative Models. His own research lies at the intersection of machine learning and human-computer interaction (HCI). Doug created Magenta, an ongoing research project exploring the role of AI in art and music creation. He is also an advocate for PAIR, a multidisciplinary team that explores the human side of AI through fundamental research, building tools, creating design frameworks, and working with diverse communities.
Before joining Google in 2010, Doug did research in music perception, aspects of music performance, machine learning for large audio datasets and music recommendation. He completed his PhD in Computer Science and Cognitive Science at Indiana University in 2000 and went on to a postdoctoral fellowship with Juergen Schmidhuber at IDSIA in Lugano, Switzerland. From 2003 to 2010, Doug was on the Computer Science faculty in the University of Montreal machine learning group (now the MILA machine learning lab), where he became Associate Professor.
Authored Publications
PaLM: Scaling Language Modeling with Pathways
Aakanksha Chowdhery
Sharan Narang
Jacob Devlin
Maarten Bosma
Hyung Won Chung
Sebastian Gehrmann
Parker Schuh
Sasha Tsvyashchenko
Abhishek Rao
Yi Tay
Noam Shazeer
Nan Du
Reiner Pope
James Bradbury
Guy Gur-Ari
Toju Duke
Henryk Michalewski
Xavier Garcia
Liam Fedus
David Luan
Barret Zoph
Ryan Sepassi
David Dohan
Shivani Agrawal
Mark Omernick
Marie Pellat
Aitor Lewkowycz
Erica Moreira
Rewon Child
Oleksandr Polozov
Zongwei Zhou
Brennan Saeta
Michele Catasta
Jason Wei
Slav Petrov
arxiv:2204.02311(2022)
Preview abstract
Large language models have been shown to achieve remarkable performance across a variety of natural language tasks using few-shot learning, which drastically reduces the number of task-specific training examples needed to adapt the model to a particular application. To further our understanding of the impact of scale on few-shot learning, we trained a 540-billion parameter, densely activated, Transformer language model, which we call Pathways Language Model (PaLM). We trained PaLM on 6144 TPU v4 chips using Pathways, a new ML system which enables highly efficient training across multiple TPU Pods. We demonstrate continued benefits of scaling by achieving state-of-the-art few-shot learning results on hundreds of language understanding and generation benchmarks. On a number of these tasks, PaLM 540B achieves breakthrough performance, outperforming the finetuned state-of-the-art on a suite of multi-step reasoning tasks, and outperforming average human performance on the recently released BIG-bench benchmark. A significant number of BIG-bench tasks showed discontinuous improvements from model scale, meaning that performance steeply increased as we scaled to our largest model. PaLM also has strong capabilities in multilingual tasks and source code generation, which we demonstrate on a wide array of benchmarks. We additionally provide a comprehensive analysis on bias and toxicity, and study the extent of training data memorization with respect to model scale. Finally, we discuss the ethical considerations related to large language models and discuss potential mitigation strategies.
View details
Deduplicating Training Data Makes Language Models Better
Andrew Nystrom
Chiyuan Zhang
Chris Callison-Burch
Nicholas Carlini
(2022) (to appear)
Preview abstract
As large language models scale up, researchers and engineers have chosen to use larger datasets of loosely-filtered internet text instead of curated texts.
We find that existing NLP datasets are highly repetitive and contain duplicated examples.
For example, there is an example in the training dataset C4 that has over 200,000 near duplicates.
Overall, we find that 1.68% of the examples in C4 are near-duplicates.
Worse, we find a 1% overlap between the training and testing sets in these datasets.
Duplicate examples in training data inappropriately bias the distribution of rare/common sequences.
Models trained with non-deduplicated datasets are more likely to generate "memorized" examples.
Additionally, if those models are used for downstream applications, such as scoring likelihoods of given sequences, we find that models trained on non-deduplicated and deduplicated datasets have a difference in accuracy of on average TODO.
View details
Emergent Social Learning via Multi-agent Reinforcement Learning
Kamal Ndousse
Sergey Levine
Natasha Jaques
International Conference on Machine Learning (ICML)(2021)
Joint Attention for Multi-Agent Coordination and Social Learning
Dennis Lee
Natasha Jaques
Jiaxing Wu
Dale Schuurmans
ICRA Workshop on Social Intelligence in Humans and Robots(2021)
Preview abstract
Joint attention — the ability to purposefully coordinate your attention with another person, and mutually attend to the same thing — is an important milestone in human cognitive development. In this paper, we ask whether joint attention can be useful as a mechanism for improving multi-agent coordination and social learning. We first develop deep reinforcement learning (RL) agents with a recurrent visual attention architecture. We then train agents to minimize the difference between the attention weights that they apply to the environment at each timestep, and the attention of other agents. Our results show that this joint attention incentive improves agents’ ability to solve difficult coordination tasks, by helping overcome the problem of exploring the combinatorial multi-agent action space. Joint attention leads to higher performance than a competitive centralized critic baseline across multiple environments. Further, we show that joint attention enhances agents’ ability to learn from experts present in their environment, even when performing single-agent tasks. Taken together, these findings suggest that joint attention may be a useful inductive bias for improving multi-agent learning.
View details
Automatic Detection of Generated Text is Easiest when Humans are Fooled
Chris Callison-Burch
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics(2020), pp. 1808-1822
Preview abstract
Recent advancements in neural language modelling make it possible to rapidly generate vast amounts of human-sounding text. The capabilities of humans and automatic discriminators to detect machine-generated text have been a large source of research interest, but humans and machines rely on different cues to make their decisions. Here, we perform careful benchmarking and analysis of three popular sampling-based decoding strategies—top-_k_, nucleus sampling, and untruncated random sampling—and show that improvements in decoding methods have primarily optimized for fooling humans. This comes at the expense of introducing statistical abnormalities that make detection easy for automatic systems. We also show that though both human and automatic detector performance improve with longer excerpt length, even multi-sentence excerpts can fool expert human raters over 30% of the time. Our findings reveal the importance of using both human and automatic detectors to assess the humanness of text generation systems.
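As a rough illustration of two of the decoding strategies named in this abstract, the sketch below truncates a next-token distribution with top-k and with nucleus (top-p) filtering and then renormalizes it. It is a minimal sketch using only the Python standard library; the function names and the toy distribution are assumptions made for illustration, not code from the paper.

```python
import random


def top_k_filter(probs, k):
    """Keep the k most probable tokens, zero out the rest, and renormalize."""
    if k <= 0:
        raise ValueError("k must be positive")
    # Token indices sorted by descending probability (ties broken by index).
    order = sorted(range(len(probs)), key=lambda i: (-probs[i], i))
    kept = set(order[:k])
    total = sum(probs[i] for i in kept)
    return [probs[i] / total if i in kept else 0.0 for i in range(len(probs))]


def nucleus_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: (-probs[i], i))
    kept, mass = set(), 0.0
    for i in order:
        kept.add(i)
        mass += probs[i]
        if mass >= p:
            break
    return [probs[i] / mass if i in kept else 0.0 for i in range(len(probs))]


def sample(probs, rng):
    """Draw one token index from a (possibly truncated) distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Hypothetical next-token distribution over a four-token vocabulary.
toy_probs = [0.5, 0.3, 0.15, 0.05]
top2 = top_k_filter(toy_probs, 2)
nucleus = nucleus_filter(toy_probs, 0.9)
token = sample(nucleus, random.Random(0))
```

Untruncated random sampling corresponds to calling `sample` on the original distribution without either filter.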
View details
Learning via Social Awareness: Improving a Deep Generative Sketching Model with Facial Feedback
Natasha Jaques
Jennifer McCleary
David Ha
Fred Bertsch
Rosalind Picard
International Joint Conference on Artificial Intelligence (IJCAI) 2018(2020), pp. 1-9
Preview abstract
A known deficit of modern machine learning (ML) and deep learning (DL) methodology is that models must be carefully fine-tuned in order to solve a particular task. Most algorithms cannot generalize well to even highly similar tasks, let alone exhibit signs of general artificial intelligence (AGI). To address this problem, researchers have explored developing loss functions that act as intrinsic motivators that could motivate an ML or DL agent to learn across a number of domains. This paper argues that an important and useful intrinsic motivator is that of social interaction. We posit that making an AI agent aware of implicit social feedback from humans can allow for faster learning of more generalizable and useful representations, and could potentially impact AI safety. We collect social feedback in the form of facial expression reactions to samples from Sketch RNN, an LSTM-based variational autoencoder (VAE) designed to produce sketch drawings. We use a Latent Constraints GAN (LC-GAN) to learn from the facial feedback of a small group of viewers, by optimizing the model to produce sketches that it predicts will lead to more positive facial expressions. We show in multiple independent evaluations that the model trained with facial feedback produced sketches that are more highly rated, and induce significantly more positive facial expressions. Thus, we establish that implicit social feedback can improve the output of a deep learning model.
View details
Towards Better Storylines with Sentence-Level Language Models
David Grangier
Chris Callison-Burch
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics(2020), pp. 1808-1822
Preview abstract
This work proposes a sentence-level language model which predicts the next sentence in a story given the embeddings of the previous sentences. The model operates at the sentence level and selects the next sentence within a finite set of fluent alternatives. By working with sentence embeddings instead of word embeddings, our model is able to efficiently consider a large number of alternative sentences. By considering only fluent sentences, our model is relieved from modeling fluency and can focus on longer range dependencies. Our method achieves state-of-the-art accuracy on the StoryCloze task in the unsupervised setting.
View details
Magenta Studio: Augmenting Creativity with Deep Learning in Ableton Live
Yotam Mann
Jon Gillick
Monica Dinculescu
Carey Radebaugh
Curtis Hawthorne
Proceedings of the International Workshop on Musical Metacreation (MUME)(2019)
Preview abstract
The field of Musical Metacreation (MuMe) has produced impressive results for both autonomous and interactive creativity. However, there are few examples of these systems crossing over to the “mainstream” of music creation and consumption. We tie together existing frameworks (Electron, TensorFlow.js, and Max For Live) to develop a system whose purpose is to bring the promise of interactive MuMe to the realm of professional music creators. Combining compelling applications of deep learning based music generation with a focus on ease of installation and use in a popular DAW, we hope to expose more musicians and producers to the potential of using such systems in their creative workflows. Our suite of plug-ins for Ableton Live, named Magenta Studio, is available for download at http://g.co/magenta/studio along with its open source implementation.
View details
Preview abstract
We explore models for translating abstract musical ideas (scores, rhythms) into expressive performances using seq2seq and recurrent variational information bottleneck (VIB) models. Though seq2seq models usually require painstakingly aligned corpora, we show that it is possible to adapt an approach from the Generative Adversarial Network (GAN) literature (e.g. Pix2Pix, Vid2Vid) to sequences, creating large volumes of paired data by performing simple transformations and training generative models to plausibly invert these transformations. Music, and drumming in particular, provides a strong test case for this approach because many common transformations (quantization, removing voices) have clear semantics, and learning to invert them has real-world applications. Focusing on the case of drum set players, we create and release a new dataset for this purpose, containing over 13 hours of recordings by professional drummers aligned with fine-grained timing and dynamics information. We also explore some of the creative potential of these models, demonstrating improvements on state-of-the-art methods for Humanization (instantiating a performance from a musical score).
View details
Music Transformer: Generating Music with Long-Term Structure
Ashish Vaswani
Jakob Uszkoreit
Noam Shazeer
Ian Simon
Curtis Hawthorne
Monica Dinculescu
ICLR(2019)
Preview abstract
Music relies heavily on repetition to build structure and meaning. Self-reference occurs on multiple timescales, from motifs to phrases to reusing of entire sections of music, such as in pieces with ABA structure. The Transformer (Vaswani et al., 2017), a sequence model based on self-attention, has achieved compelling results in many generation tasks that require maintaining long-range coherence. This suggests that self-attention might also be well-suited to modeling music. In musical composition and performance, however, relative timing is critically important. Existing approaches for representing relative positional information in the Transformer modulate attention based on pairwise distance (Shaw et al., 2018). This is impractical for long sequences such as musical compositions since their memory complexity for intermediate relative information is quadratic in the sequence length. We propose an algorithm that reduces their intermediate memory requirement to linear in the sequence length. This enables us to demonstrate that a Transformer with our modified relative attention mechanism can generate minute-long compositions (thousands of steps, four times the length modeled in Oore et al., 2018) with compelling structure, generate continuations that coherently elaborate on a given motif, and in a seq2seq setup generate accompaniments conditioned on melodies. We evaluate the Transformer with our relative attention mechanism on two datasets, JSB Chorales and Piano-e-Competition, and obtain state-of-the-art results on the latter.
View details
Preview abstract
Dramatic advances in generative models have resulted in near photographic quality for artificially rendered faces, animals and other objects in the natural world. In spite of such advances, a higher level understanding of vision and imagery does not arise from exhaustively modeling an object, but instead identifying higher-level attributes that best summarize the aspects of an object. In this work we attempt to model the drawing process of fonts by building sequential generative models of vector graphics. This model has the benefit of providing a scale-invariant representation for imagery whose latent representation may be systematically manipulated and exploited to perform style propagation. We demonstrate these results on a large dataset of fonts crawled from the web and highlight how such a model captures the statistical dependencies and richness of this dataset. We envision that our model can find use as a tool for graphic designers to facilitate font design.
View details
Enabling Factorized Piano Music Modeling and Generation with the MAESTRO Dataset
Curtis Hawthorne
Andrew Stasyuk
Ian Simon
Sander Dieleman
Erich Elsen
ICLR(2019)
Preview abstract
Generating musical audio directly with neural networks is notoriously difficult because it requires coherently modeling both long- and short-term structure. Fortunately, most music is also highly structured and primarily composed of discrete note events played on musical instruments. Herein, we show that by using notes as an intermediate representation, we can train a suite of models capable of transcribing, composing, and synthesizing audio waveforms with coherent musical structure on timescales spanning six orders of magnitude (~0.1 ms (8 kHz) to ~100 s). This large advance in the state of the art is enabled by our release of the new MAESTRO (MIDI and Audio Edited for Synchronous TRacks and Organization) dataset, composed of over 172 hours of virtuosic piano performances captured with fine alignment (~3 ms) between note labels and audio waveforms. The networks and the dataset together present a promising approach toward creating new expressive and interpretable neural models of music.
View details
Unsupervised Hierarchical Story Infilling
David Grangier
Chris Callison-Burch
NAACL 2019 Workshop on Narrative Understanding, Minneapolis, MN(2019)
Preview abstract
Story infilling involves predicting words to go into a missing span from a story.
This challenging task has the potential to transform interactive tools for creative writing.
However, state-of-the-art conditional language models have trouble balancing fluency and coherence with novelty and diversity. We address this limitation with a hierarchical model which first selects a set of rare words and then generates text conditioned on that set. By relegating the high entropy task of picking rare words to a word-sampling model, the second-stage model conditioned on those words can achieve high fluency and coherence by searching for likely sentences, without sacrificing diversity.
View details
Identifying the intersections: User experience + research scientist collaboration in a generative machine learning interface
Jess Scon Holbrook
ACM CHI Conference 2019(2019)
Preview abstract
Creative generative machine learning interfaces are stronger when multiple actors bearing different points of view actively contribute to them. User experience (UX) research and design involvement in the creation of machine learning (ML) models help ML research scientists to more effectively identify human needs that ML models will fulfill. The People and AI Research (PAIR) group within Google developed a novel program method in which UXers are embedded into an ML research group for three months to provide a human-centered perspective on the creation of ML models. The first full-time cohort of UXers were embedded in a team of ML research scientists focused on deep generative models to assist in music composition. Here, we discuss the structure and goals of the program, challenges we faced during execution, and insights gained as a result of the process. We offer practical suggestions for how to foster communication between UX and ML research teams and recommended UX design processes for building creative generative machine learning interfaces.
View details
Preview abstract
We present sketch-rnn, a recurrent neural network (RNN) able to construct stroke-based drawings of common objects. The model is trained on thousands of crude human-drawn images representing hundreds of classes. We outline a framework for conditional and unconditional sketch generation, and describe new robust training methods for generating coherent sketch drawings in a vector format.
View details
Visualizing Music Self-Attention
Monica Dinculescu
Ashish Vaswani
NIPS Workshop on Interpretability and Robustness in Audio, Speech, and Language(2018)
Preview abstract
Like language, music can be represented as a sequence of discrete symbols that form a hierarchical syntax, with notes being roughly like characters and motifs of notes like words. Unlike text, however, music relies heavily on repetition on multiple timescales to build structure and meaning.
The Music Transformer has shown compelling results in generating music with structure (Huang et al., 2018).
In this paper, we introduce a tool for visualizing self-attention on polyphonic music with an interactive pianoroll. We use Music Transformer as both a descriptive tool and a generative model. For the former, we use it to analyze existing music to see if the resulting self-attention structure corroborates the musical structure known from music theory. For the latter, we inspect the model's self-attention during generation, in order to understand how past notes affect future ones. We also compare and contrast the attention structure of regular attention to that of relative attention (Shaw et al., 2018; Huang et al., 2018), and examine its impact on the resulting generated music. For example, for the JSB Chorales dataset, a model trained with relative attention is more consistent in attending to all the voices in the preceding timestep and the chords before, and at cadences to the beginning of a phrase, allowing it to create an arc. We hope that our analyses will offer more evidence for relative self-attention as a powerful inductive bias for modeling music. We invite the reader to check out video animations of music attention and interact with the visualizations at https://storage.googleapis.com/nips-workshop-visualization/index.html.
View details
Preview abstract
We argue for the benefit of designing deep generative models through mixed-initiative combinations of deep learning algorithms and human specifications for authoring sequential content, such as stories and music.
Sequence models have shown increasingly convincing results in domains such as auto-completion, speech to text, and translation; however, longer-term structure remains a major challenge. Given lengthy inputs and outputs, deep generative systems still lack reliable representations of beginnings, middles, and ends, which are standard aspects of creating content in domains such as music composition. This paper aims to contribute a framework for mixed-initiative learning approaches, specifically for creative deep generative systems, and presents a case study of a deep generative model for music, Counterpoint by Convolutional Neural Network (Coconet).
View details
Preview abstract
Advances in machine learning have the potential to radically reshape interactions between humans and computers. Deep learning makes it possible to discover powerful representations that are capable of capturing the latent structure of high-dimensional data such as music. By creating interactive latent space “palettes” of musical sequences and timbres, we demonstrate interfaces for musical creation made possible by machine learning. We introduce an interface to the intuitive, low-dimensional control spaces for high-dimensional note sequences, allowing users to explore a compositional space of melodies or drum beats in a simple 2-D grid. Furthermore, users can define 1-D trajectories in the 2-D space for autonomous, continuous morphing during improvisation. Similarly for timbre, our interface to a learned latent space of audio provides an intuitive and smooth search space for morphing between the timbres of different instruments. We remove technical and computational barriers by embedding pre-trained networks into a browser-based GPU-accelerated framework, making the systems accessible to a wide range of users while maintaining potential for creative flexibility and personalization.
View details
Learning via Social Awareness: Improving a Deep Generative Sketching Model with Facial Feedback
Natasha Jaques
Jennifer McCleary
David Ha
Fred Bertsch
Rosalind Picard
ICLR 2018 Workshop
Preview abstract
In the quest towards general artificial intelligence (AI), researchers have explored developing loss functions that function as intrinsic motivators in the absence of external rewards. This paper takes the position that current research has overlooked an important and useful intrinsic motivator: social interaction. We posit that making an AI agent aware of implicit social feedback from humans can allow for more rapid learning of more generalizable and useful representations, and could potentially impact AI safety. We collect social feedback in the form of facial expression reactions to samples from Sketch RNN, an LSTM-based variational autoencoder designed to produce sketch drawings. We use a Latent Constraints GAN (LC-GAN) to learn from the facial feedback of a small group of viewers, and then show in an independent evaluation with 76 users that this model produced sketches that lead to significantly more smiling and less frowning than the baseline. Thus, we establish that implicit social feedback can improve the output of a deep learning model.
View details
Onsets and Frames: Dual-Objective Piano Transcription
Curtis Hawthorne
Erich Elsen
Jialin Song
Ian Simon
Colin Raffel
Sageev Oore
Proceedings of the 19th International Society for Music Information Retrieval Conference, ISMIR 2018, Paris, France, 2018
Preview abstract
We advance the state of the art in polyphonic piano music transcription by using a deep convolutional and recurrent neural network which is trained to jointly predict onsets and frames. Our model predicts pitch onset events and then uses those predictions to condition framewise pitch predictions. During inference, we restrict the predictions from the framewise detector by not allowing a new note to start unless the onset detector also agrees that an onset for that pitch is present in the frame. We focus on improving onsets and offsets together instead of either in isolation as we believe this correlates better with human musical perception. Our approach results in over a 100% relative improvement in note F1 score (with offsets) on the MAPS dataset. Furthermore, we extend the model to predict relative velocities of normalized audio which results in more natural-sounding transcriptions.
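The inference-time restriction described in this abstract can be sketched as a simple gating rule: a pitch may only become active in a frame where the onset detector also fires, and it stays active while the framewise detector keeps predicting it. The code below is a minimal sketch under stated assumptions; the function name, the list-of-lists input layout (frames by pitches), and the 0.5 threshold are all illustrative choices, not details taken from the paper.

```python
def decode_notes(frame_probs, onset_probs, threshold=0.5):
    """Gate framewise pitch predictions with onset predictions.

    frame_probs and onset_probs are lists of frames, each a list of
    per-pitch probabilities. Returns a boolean piano roll of the same shape.
    The threshold value is an assumption made for this sketch.
    """
    n_frames = len(frame_probs)
    n_pitches = len(frame_probs[0]) if n_frames else 0
    roll = [[False] * n_pitches for _ in range(n_frames)]
    for pitch in range(n_pitches):
        active = False
        for t in range(n_frames):
            frame_on = frame_probs[t][pitch] >= threshold
            onset_on = onset_probs[t][pitch] >= threshold
            if not frame_on:
                # The framewise detector says the pitch is silent.
                active = False
            elif onset_on:
                # Both detectors agree, so a note may start here.
                active = True
            # Otherwise the frame is on without an onset: the note only
            # continues if it was already sounding in the previous frame.
            roll[t][pitch] = active
    return roll


# Hypothetical single-pitch example: the second burst of frame activity
# (frames 3 and 4) has no matching onset, so it is suppressed.
frames = [[0.9], [0.9], [0.1], [0.9], [0.9]]
onsets = [[0.8], [0.1], [0.1], [0.2], [0.1]]
piano_roll = decode_notes(frames, onsets)
```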
View details
A Hierarchical Latent Vector Model for Learning Long-Term Structure in Music
Colin Raffel
Curtis Hawthorne
International Conference on Machine Learning (ICML)(2018)
Preview abstract
The Variational Autoencoder (VAE) has proven to be an effective model for producing semantically meaningful latent representations for natural data. However, it has thus far seen limited application to sequential data, and, as we demonstrate, existing recurrent VAE models have difficulty modeling sequences with long-term structure. To address this issue, we propose the use of a hierarchical decoder, which first outputs embeddings for subsequences of the input and then uses these embeddings to generate each subsequence independently. This structure encourages the model to utilize its latent code, thereby avoiding the "posterior collapse" problem which remains an issue for recurrent VAEs. We apply this architecture to modeling sequences of musical notes and find that it exhibits dramatically better sampling, interpolation, and reconstruction performance than a "flat" baseline model. An implementation of our "MusicVAE" is available online at https://goo.gl/magenta/musicvae-code.
View details
Learning via social awareness: improving sketch representations with facial feedback
Natasha Jaques
David Ha
Fred Bertsch
Rosalind Picard
International Conference on Learning Representations(2018)
Preview abstract
In the quest towards general artificial intelligence (AI), researchers have explored developing loss functions that act as intrinsic motivators in the absence of external rewards. This paper argues that such research has overlooked an important and useful intrinsic motivator: social interaction. We posit that making an AI agent aware of implicit social feedback from humans can allow for faster learning of more generalizable and useful representations, and could potentially impact AI safety. We collect social feedback in the form of facial expression reactions to samples from Sketch RNN, an LSTM-based variational autoencoder (VAE) designed to produce sketch drawings. We use a Latent Constraints GAN (LC-GAN) to learn from the facial feedback of a small group of viewers, and then show in an independent evaluation with 76 users that this model produced sketches that lead to significantly more positive facial expressions. Thus, we establish that implicit social feedback can improve the output of a deep learning model.
View details
Sequence Tutor: Conservative Fine-Tuning of Sequence Generation Models with KL-control
Natasha Jaques
Shixiang Gu
Dzmitry Bahdanau
José Miguel Hernández-Lobato
Richard E. Turner
ICML(2017)
Preview abstract
This paper proposes a general method for improving the structure and quality of sequences generated by a recurrent neural network (RNN), while maintaining information originally learned from data, as well as sample diversity. An RNN is first pre-trained on data using maximum likelihood estimation (MLE), and the probability distribution over the next token in the sequence learned by this model is treated as a prior policy. Another RNN is then trained using reinforcement learning (RL) to generate higher-quality outputs that account for domain-specific incentives while retaining proximity to the prior policy of the MLE RNN. To formalize this objective, we derive novel off-policy RL methods for RNNs from KL-control. The effectiveness of the approach is demonstrated on two applications; 1) generating novel musical melodies, and 2) computational molecular generation. For both problems, we show that the proposed method improves the desired properties and structure of the generated sequences, while maintaining information learned from data.
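One generic way to express "retaining proximity to the prior policy" is a KL-penalized return, where each step's task reward is reduced by the log-ratio between the fine-tuned policy and the pre-trained prior for the action taken. The sketch below shows only that generic idea; it is not the paper's derivation, and the function name, the `beta` weight, and the example numbers are assumptions.

```python
import math


def kl_regularized_return(task_rewards, policy_probs, prior_probs, beta=1.0):
    """Sum over steps of r_t - beta * (log pi(a_t|s_t) - log p(a_t|s_t)).

    policy_probs[t] and prior_probs[t] are the probabilities that the
    fine-tuned policy and the pre-trained prior assign to the action
    actually taken at step t. beta (an assumed knob) sets how strongly
    deviation from the prior is penalized.
    """
    total = 0.0
    for reward, pi, prior in zip(task_rewards, policy_probs, prior_probs):
        total += reward - beta * (math.log(pi) - math.log(prior))
    return total


# Hypothetical three-step sequence.
rewards = [1.0, 0.0, 2.0]
same = kl_regularized_return(rewards, [0.5, 0.2, 0.4], [0.5, 0.2, 0.4])
drifted = kl_regularized_return(rewards, [0.9, 0.9, 0.9], [0.5, 0.2, 0.4])
```

When the policy matches the prior on the sampled actions, the penalty vanishes and the return equals the plain sum of task rewards; a policy that concentrates more probability than the prior on its chosen actions receives a lower regularized return.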
View details
Tuning Recurrent Neural Networks With Reinforcement Learning
Natasha Jaques
Shixiang Gu
Dzmitry Bahdanau
Jose Miguel Hernandez Lobato
Richard E. Turner
ICLR Workshop(2017)
Preview abstract
This paper proposes a general method for improving the structure and quality of sequences generated by a recurrent neural network (RNN), while maintaining information originally learned from data, as well as sample diversity. An RNN is first pre-trained on data using maximum likelihood estimation (MLE), and the probability distribution over the next token in the sequence learned by this model is treated as a prior policy. Another RNN is then trained using reinforcement learning (RL) to generate higher-quality outputs that account for domain-specific incentives while retaining proximity to the prior policy of the MLE RNN. To formalize this objective, we derive novel off-policy RL methods for RNNs from KL-control. The effectiveness of the approach is demonstrated on two applications; 1) generating novel musical melodies, and 2) computational molecular generation. For both problems, we show that the proposed method improves the desired properties and structure of the generated sequences, while maintaining information learned from data.
View details
Counterpoint by Convolution
Tim Cooijmans
Aaron Courville
Proceedings of ISMIR 2017
Preview abstract
Machine learning models of music typically break down the task of composition into a chronological process, composing a piece of music in a single pass from beginning to end. On the contrary, human composers write music in a nonlinear fashion, scribbling motifs here and there, often revisiting choices previously made. We explore the use of blocked Gibbs sampling as an analogue to the human approach, and introduce COCONET, a convolutional neural network in the NADE family of generative models (Uria et al., 2016). Despite ostensibly sampling from the same distribution as the NADE ancestral sampling procedure, we find that a blocked Gibbs approach significantly improves sample quality. We provide evidence that this is due to some conditional distributions being poorly modeled. Moreover, we show that even the cheap approximate blocked Gibbs procedure from Yao et al. (2014) yields better samples than ancestral sampling. We demonstrate the versatility of our method on unconditioned polyphonic music generation.
View details
Learning to Create Piano Performances
Sageev Oore
Ian Simon
Sander Dieleman
NIPS 2017 Workshop on Machine Learning and Creativity
Preview abstract
Nearly all previous work on music generation has focused on creating pieces that are, effectively, musical scores. In contrast, we learn to create piano performances: besides predicting the notes to be played, we also predict expressive variations in the timing and musical dynamics (loudness). We provided samples generated by our system for informal feedback to a set of professional musicians and composers, and the samples were well-received. Overall, the comments indicate that our system is generating music that, while lacking high-level structure, does indeed sound very much like human performance, and is closely reminiscent of the classical piano repertoire.
View details
Preview abstract
GANs provide a framework for training generative models which mimic a data distribution. However, in many cases we wish to train a generative model to optimize some auxiliary objective function within the data it generates, such as making more aesthetically pleasing images. In some cases, these objective functions are difficult to evaluate, e.g. they may require human interaction. Here, we develop a system for efficiently training a GAN to increase a generic rate of positive user interactions, which could represent aesthetic ratings or any other objective. To do this, we build a model of human behavior in the targeted domain from a relatively small set of interactions, and then use this behavioral model as an auxiliary loss function to improve the generative model. As a proof of concept, we demonstrate that this system is successful at improving positive interaction rates simulated from a variety of objectives, and characterize some factors that affect its performance.
View details
Preview abstract
In this work we develop recurrent variational autoencoders (VAEs) trained to
reproduce short musical sequences and demonstrate their use as a creative device
both via random sampling and data interpolation. Furthermore, by using a novel
hierarchical decoder, we show that we are able to model long sequences with
musical structure for both individual instruments and a three-piece band (lead, bass,
and drums). Finally, we demonstrate the effectiveness of scheduled sampling in
significantly improving our reconstruction accuracy.
View details
Online and Linear-Time Attention by Enforcing Monotonic Alignments
Colin Raffel
Peter Liu
Thirty-fourth International Conference on Machine Learning(2017)
Preview abstract
Recurrent neural network models with an attention mechanism have proven to be extremely effective on a wide variety of sequence-to-sequence problems.
However, the fact that soft attention mechanisms perform a pass over the entire input sequence when producing each element in the output sequence precludes their use in online settings and results in a quadratic time complexity.
Based on the insight that the alignment between input and output sequence elements is monotonic in many problems of interest, we propose an end-to-end differentiable method for learning monotonic alignments which, at test time, enables computing attention online and in linear time.
We validate our approach on sentence summarization, machine translation, and online speech recognition problems and achieve results competitive with existing sequence-to-sequence models.
View details
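The test-time behavior described in the abstract above can be illustrated with a small sketch. It is not the authors' implementation: it assumes the attention energies are already available as a matrix (a real model would compute them on the fly from the decoder state), it uses a fixed 0.5 threshold on the sigmoid of each energy, and the function name `hard_monotonic_attend` is hypothetical. How the scan pointer is handled when no position is selected is also an assumption made here to keep the total work linear.

```python
import math
from typing import List, Optional


def sigmoid(x: float) -> float:
    """Logistic function mapping an energy to a selection probability."""
    return 1.0 / (1.0 + math.exp(-x))


def hard_monotonic_attend(
    energies: List[List[float]], threshold: float = 0.5
) -> List[Optional[int]]:
    """Pick one input index per output step, never moving backward.

    energies[i][j] is the (assumed precomputed) energy between output
    step i and input position j. Returns the attended input index for
    each output step, or None when no position was selected.
    """
    attended: List[Optional[int]] = []
    start = 0  # scanning resumes where the previous output step stopped
    for row in energies:
        chosen: Optional[int] = None
        for j in range(start, len(row)):
            if sigmoid(row[j]) > threshold:
                chosen = j
                break
        attended.append(chosen)
        # Assumption: if the scan ran off the end, leave the pointer at the
        # end so that no input position is ever inspected again.
        start = chosen if chosen is not None else len(row)
    return attended


# Hypothetical energies for 3 output steps over 4 input positions.
example_energies = [
    [-2.0, 3.0, -1.0, -1.0],
    [-5.0, -1.0, 2.0, 0.0],
    [-5.0, -5.0, -5.0, -5.0],
]
alignment = hard_monotonic_attend(example_energies)
```

Because the pointer only moves forward, the inner loop restarts at most once per output step from an already visited position, so the total number of energy evaluations grows with the sum of input and output lengths rather than their product.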
Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders
Cinjon Resnick
Sander Dieleman
Karen Simonyan
Mohammad Norouzi
ICML(2017)
Preview abstract
Generative models in vision have seen rapid progress due to algorithmic improvements and the availability of high-quality image datasets. In this paper, we offer contributions in both these areas to enable similar progress in audio modeling. First, we detail a powerful new WaveNet-style autoencoder model that conditions an autoregressive decoder on temporal codes learned from the raw audio waveform. Second, we introduce NSynth, a large-scale and high-quality dataset of musical notes that is an order of magnitude larger than comparable public datasets. Using NSynth, we demonstrate improved qualitative and quantitative performance of the WaveNet autoencoder over a well-tuned spectral autoencoder baseline. Finally, we show that the model learns a manifold of embeddings that allows for morphing between instruments, meaningfully interpolating in timbre to create new types of sounds that are realistic and expressive.
View details
Preview abstract
Computer dialogue systems are designed with the intention of supporting meaningful interactions with humans. Common modes of communication include speech, text, and physical gestures. In this work we explore a communication paradigm in which the input and output channels consist of music. Specifically, we examine the musical interaction scenario of call and response. We present a system that utilizes a deep autoencoder to learn semantic embeddings of musical input. The system learns to transform these embeddings in a manner such that reconstructing from these transformation vectors produces appropriate musical responses. In order to generate a response the system employs a combination of generation and unit selection. Selection is based on a nearest neighbor search within the embedding space and for real-time application the search space is pruned using vector quantization. The live demo consists of a person playing a MIDI keyboard and the computer generating a response that is played through a loudspeaker.
View details
Audio Deepdream: Optimizing raw audio with convolutional networks
Cinjon Resnick
Diego Ardila
International Society for Music Information Retrieval Conference, Google Brain(2016)
Preview abstract
The hallucinatory images of DeepDream opened up the floodgates for a recent wave of artwork generated by neural networks. In this work, we take first steps toward applying this to audio. We believe a key to solving this problem is training a deep neural network to perform a music perception task on raw audio. Consequently, we have followed in the footsteps of Van den Oord et al. and trained a network to predict embeddings that were themselves the result of a collaborative filtering model. A key difference is that we learn features directly from the raw audio, which creates a chain of differentiable functions from raw audio to high level features. We then use gradient descent on the network to extract samples of "dreamed" audio.
View details
Preview abstract
The paper is itself a short abstract for BayLearn.
View details
Generating Music by Fine-Tuning Recurrent Neural Networks with Reinforcement Learning
Natasha Jaques
Shixiang Gu
Richard E. Turner
Deep Reinforcement Learning Workshop, NIPS(2016)
Preview abstract
Supervised learning with next-step prediction is a common way to train a sequence prediction model; however, it suffers from known failure modes and is notoriously difficult to train models to learn certain properties, such as having a coherent global structure. Reinforcement learning can be used to impose arbitrary properties on generated data by choosing appropriate reward functions. In this paper we propose a novel approach for sequence training, where we refine a sequence predictor by optimizing for some imposed reward functions, while maintaining good predictive properties learned from data. We propose efficient ways to solve this by augmenting deep Q-learning with a cross-entropy reward and deriving novel off-policy methods for RNNs from stochastic optimal control (SOC). We explore the usefulness of our approach in the context of music generation. An LSTM is trained on a large corpus of songs to predict the next note in a musical sequence. This Note-RNN is then refined using RL, where the reward function is a combination of rewards based on rules of music theory, as well as the output of another trained Note-RNN. We show that this combination of ML and RL can not only produce more pleasing melodies, but that it can significantly reduce unwanted behaviors and failure modes of the RNN.
View details
Tuning Recurrent Neural Networks with Reinforcement Learning
Natasha Jaques
Shixiang Shane Gu
Richard E. Turner
Proceedings of the International Conference on Learning Representations (ICLR)(2016)
Preview abstract
The approach of training sequence models using supervised learning and next-step prediction suffers from known failure modes. For example, it is notoriously difficult to ensure multi-step generated sequences have coherent global structure. We propose a novel sequence-learning approach in which we use a pre-trained Recurrent Neural Network (RNN) to supply part of the reward value in a Reinforcement Learning (RL) model. Thus, we can refine a sequence predictor by optimizing for some imposed reward functions, while maintaining good predictive properties learned from data. We propose efficient ways to solve this by augmenting deep Q-learning with a cross-entropy reward and deriving novel off-policy methods for RNNs from KL control. We explore the usefulness of our approach in the context of music generation. An LSTM is trained on a large corpus of songs to predict the next note in a musical sequence. This Note-RNN is then refined using our method and rules of music theory. We show that by combining maximum likelihood (ML) and RL in this way, we can not only produce more pleasing melodies, but significantly reduce unwanted behaviors and failure modes of the RNN, while maintaining information learned from data.
View details
Building Musically-relevant Audio Features through Multiple Timescale Representations
Preview
Yoshua Bengio
Proceedings of the 13th International Society for Music Information Retrieval Conference, Porto, Portugal(2012)
Temporal pooling and multiscale learning for automatic annotation and ranking of music audio
Preview
Simon Lemieux
Yoshua Bengio
International Society for Music Information Retrieval (ISMIR 2011)
The Need for Music Information Retrieval with User-Centered and Multimodal Strategies
Cynthia C.S. Liem
Meinard Müller
George Tzanetakis
Alan Hanjalic
MIRUM '11, ACM, Scottsdale, Arizona(2011), pp. 1-6
Preview abstract
Music is a widely enjoyed content type, existing in many multifaceted representations. With the digital information age, a lot of digitized music information has theoretically become available at the user’s fingertips. However, the abundance of information is too large-scaled and too diverse to annotate, oversee and present in a consistent and human manner, motivating the development of automated Music Information Retrieval (Music-IR) techniques.
In this paper, we encourage considering music content beyond a monomodal audio signal and argue that Music-IR approaches with multimodal and user-centered strategies are necessary to serve real-life usage patterns and maintain and improve accessibility of digital music data. After discussing relevant existing work in these directions, we show that the field of Music-IR faces challenges similar to those of neighboring fields, and thus suggest opportunities for joint collaboration and mutual inspiration.
View details
Probabilistic Models for Melodic Prediction
Jean-Francois Paiement
Samy Bengio
Artificial Intelligence Journal, 173(2009), pp. 1266-1274
Preview abstract
Chord progressions are the building blocks from which tonal music is constructed. The choice of a particular representation for chords has a strong impact on statistical modeling of the dependence between chord symbols and the actual sequences of notes in polyphonic music. Melodic prediction is used in this paper as a benchmark task to evaluate the quality of four chord representations using two probabilistic model architectures derived from Input/Output Hidden Markov Models (IOHMMs). Likelihoods and conditional and unconditional prediction error rates are used as complementary measures of the quality of each of the proposed chord representations. We observe empirically that different chord representations are optimal depending on the chosen evaluation metric. Also, representing chords only by their roots appears to be a good compromise in most of the reported experiments.
View details
A Distance Model for Rhythms
Jean-Francois Paiement
Yves Grandvalet
Samy Bengio
International Conference on Machine Learning (ICML)(2008)
Preview abstract
Modeling long-term dependencies in time series has proved very
difficult to achieve with traditional machine learning methods. This
problem occurs when considering music data. In this paper, we
introduce a model for rhythms based on the distributions
of distances between subsequences. A specific implementation of the
model when considering Hamming distances over a simple rhythm
representation is described. The proposed model consistently
outperforms a standard Hidden Markov Model in terms of conditional
prediction accuracy on two different music databases.
View details
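As a rough illustration of the idea in the abstract above, the sketch below computes Hamming distances between equal-length subsequences of a rhythm. The encoding is an assumption made here for simplicity (1 for a note onset, 0 otherwise, split into fixed-length bars); the paper's own rhythm representation and its probabilistic model over these distances are not reproduced. All function names are hypothetical.

```python
from typing import List, Sequence


def hamming(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def split_into_bars(rhythm: Sequence[int], bar_len: int) -> List[List[int]]:
    """Cut a rhythm into consecutive subsequences of bar_len steps."""
    if bar_len <= 0 or len(rhythm) % bar_len != 0:
        raise ValueError("rhythm length must be a positive multiple of bar_len")
    return [list(rhythm[i:i + bar_len]) for i in range(0, len(rhythm), bar_len)]


def distance_matrix(rhythm: Sequence[int], bar_len: int) -> List[List[int]]:
    """Pairwise Hamming distances between all bars of a rhythm."""
    bars = split_into_bars(rhythm, bar_len)
    return [[hamming(p, q) for q in bars] for p in bars]


# Assumed toy rhythm: three bars of four steps; the first two bars repeat.
toy_rhythm = [1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0]
toy_distances = distance_matrix(toy_rhythm, bar_len=4)
```

A matrix like `toy_distances` summarizes which parts of a rhythm repeat or vary; a model of the kind described would then place a distribution over such distances rather than using them directly.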
A Generative Model for Rhythms
Jean-Francois Paiement
Samy Bengio
Yves Grandvalet
Neural Information Processing Systems, Workshop on Brain, Music and Cognition(2008)
Preview abstract
Modeling music involves capturing long-term dependencies in time series, which has proved very difficult to achieve with traditional statistical methods. The same problem occurs when only considering rhythms. In this paper, we introduce a generative model for rhythms based on the distributions of distances between subsequences. A specific implementation of the model when considering Hamming distances over a simple rhythm representation is described. The proposed model consistently outperforms a standard Hidden Markov Model in terms of conditional prediction accuracy on two different music databases.
View details
A Generative Model for Distance Patterns in Music
Jean-Francois Paiement
Yves Grandvalet
Samy Bengio
NIPS Workshop on Music, Brain and Cognition(2007)
Preview abstract
In order to cope with the difficult problem of long term dependencies in sequential data in general, and in musical data in particular, a generative model for distance patterns especially designed for music is introduced. A specific implementation of the model when considering Hamming distances over rhythms is described. The proposed model consistently outperforms a standard Hidden Markov Model in terms of conditional prediction accuracy over two different music databases.
View details
Acoustic Space Sampling and the Grand Piano in a Non-Anechoic Environment: a recordist-centric approach to musical acoustic study
B. Leonard
G. Sikora
M. De Francisco
129th Audio Engineering Society (AES) Convention, London(2010)
An Infinite Factor Model Hierarchy Via a Noisy-Or Mechanism
A. Courville
Y. Bengio
Neural Information Processing Systems Conference 22 (NIPS'09)(2010)
Steerable Playlist Generation by Learning Song Similarity from Radio Station Playlists
F. Maillet
G. Desjardins
P. Lamere
Proceedings of the 10th International Conference on Music Information Retrieval (ISMIR 2009)
Automatic identification of instrument classes in polyphonic and poly-instrument audio
P. Hamel
S. Wood
Proceedings of the 10th International Conference on Music Information Retrieval (ISMIR 2009)
Towards a musical beat emphasis function
A generative model for rhythms
J.-F. Paiement
Y. Grandvalet
S. Bengio
ICML '08: Proceedings of the 25th International Conference on Machine Learning(2008)
Automatic generation of social tags for music recommendation
P. Lamere
T. Bertin-Mahieux
S. Green
Neural Information Processing Systems Conference 20 (NIPS'07)(2008)
On the use of Sparse Time Relative Auditory Codes for Music
P-A. Manzagol
T. Bertin-Mahieux
Proceedings of the 9th International Conference on Music Information Retrieval (ISMIR 2008)
Autotagger: A Model For Predicting Social Tags from Acoustic Features on Large Music Databases
T. Bertin-Mahieux
F. Maillet
P. Lamere
Journal of New Music Research, 37(2008), pp. 115-135
Autotagging music using supervised machine learning
T. Bertin-Mahieux
P. Lamere
Proceedings of the 8th International Conference on Music Information Retrieval (ISMIR 2007)
Can't get you out of my head: A connectionist model of cyclic rehearsal
Using 3D Visualizations to Explore and Discover Music
P. Lamere
Proceedings of the 8th International Conference on Music Information Retrieval (ISMIR 2007)
A Supervised Classification Algorithm For Note Onset Detection
Beat Tracking Using an Autocorrelation Phase Matrix
Proceedings of the 2007 International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE Signal Processing Society, pp. 1313-1316
Probabilistic Melodic Harmonization
J.-F. Paiement
S. Bengio
Advances in Artificial Intelligence: 19th Conference of the Canadian Society for Computational Studies of Intelligence, Canadian AI, Lecture Notes in Computer Science, Springer-Verlag(2006), pp. 218-229
Preview abstract
We propose a representation for musical chords that allows us to include domain knowledge in probabilistic models. We then introduce a graphical model for harmonization of melodies that considers every structural component in chord notation. We show empirically that root note progressions exhibit global dependencies that can be better captured with a tree structure related to the meter than with a simple dynamical HMM that concentrates on local dependencies. However, a local model seems to be sufficient for generating proper harmonizations when root note progressions are provided. The trained probabilistic models can be sampled to generate very interesting chord progressions given other polyphonic music components such as melody or root note progressions.
View details
Probabilistic Melodic Harmonization
Aggregate Features and AdaBoost for Music Classification
Beat Induction Using an Autocorrelation Phase Matrix
The Proceedings of the 9th International Conference on Music Perception and Cognition (ICMPC9), Causal Productions(2006), pp. 931-932
Predicting genre labels for artists using FreeDB
J. Bergstra
A. Lacoste
Proceedings of the 7th International Conference on Music Information Retrieval (ISMIR 2006), pp. 85-88
Finding Long-Timescale Musical Structure with an Autocorrelation Phase Matrix
Music Perception, 24(2006), pp. 167-176
A Graphical Model for Chord Progressions Embedded in a Psychoacoustic Space
Preview abstract
Chord progressions are the building blocks from which tonal music is constructed. Inferring chord progressions is thus an essential step towards modeling long term dependencies in music. In this paper, a distributed representation for chords is designed such that Euclidean distances roughly correspond to psychoacoustic dissimilarities. Parameters in the graphical models are learnt with the EM algorithm and the classical Junction Tree algorithm. Various model architectures are compared in terms of conditional out-of-sample likelihood. Both perceptual and statistical evidence show that binary trees related to meter are well suited to capture chord dependencies.
View details
A graphical model for chord progressions embedded in a psychoacoustic space
J.-F. Paiement
S. Bengio
D. Barber
ICML '05: Proceedings of the 22nd international conference on Machine learning, ACM Press, New York, NY, USA(2005), pp. 641-648
A Probabilistic Model for Chord Progressions
J.-F. Paiement
S. Bengio
Proceedings of the 6th International Conference on Music Information Retrieval (ISMIR 2005), London: University of London, pp. 312-319
Frame-Level Audio Feature Extraction using AdaBoost
N. Casagrande
B. Kégl
Proceedings of the 6th International Conference on Music Information Retrieval (ISMIR 2005), London: University of London, pp. 345-350
Editorial: New Research in Rhythm Perception and Production
A Probabilistic Model for Chord Progressions
J.-F. Paiement
S. Bengio
International Conference on Music Information Retrieval, ISMIR(2005)
Preview abstract
Chord progressions are the building blocks from which tonal music is constructed. Inferring chord progressions is thus an essential step towards modeling long term dependencies in music. In this paper, a distributed representation for chords is designed such that Euclidean distances roughly correspond to psychoacoustic dissimilarities. Estimated probabilities of chord substitutions are derived from this representation and are used to introduce smoothing in graphical models observing chord progressions. Parameters in the graphical models are learnt with the EM algorithm and the classical Junction Tree algorithm is used for inference. Various model architectures are compared in terms of conditional out-of-sample likelihood. Both perceptual and statistical evidence show that binary trees related to meter are well suited to capture chord dependencies.
View details
Finding Meter in Music Using an Autocorrelation Phase Matrix and Shannon Entropy
N. Casagrande
Proceedings of the 6th International Conference on Music Information Retrieval (ISMIR 2005), London: University of London, pp. 504-509
Geometry in Sound: A Speech/Music Audio Classifier Inspired by an Image Classifier
N. Casagrande
B. Kégl
Proceedings of the International Computer Music Conference (ICMC)(2005), pp. 207-210
Music Perception, Guest Editor, Special Issue on Rhythm Perception and Production
Biologically Plausible Speech Recognition with LSTM Neural Nets
A. Graves
N. Beringer
J. Schmidhuber
Proceedings of the First Int'l Workshop on Biologically Inspired Approaches to Advanced Information Technology (Bio-ADIT)(2004), pp. 127-136
Preview abstract
Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) are local in space and time and closely related to a biological model of memory in the prefrontal cortex. Not only are they more biologically plausible than previous artificial RNNs, they also outperformed them on many artificially generated sequential processing tasks. This encouraged us to apply LSTM to more realistic problems, such as the recognition of spoken digits. Without any modification of the underlying algorithm, we achieved results comparable to state-of-the-art Hidden Markov Model (HMM) based recognisers on both the TIDIGITS and TI46 speech corpora. We conclude that LSTM should be further investigated as a biologically plausible basis for a bottom-up, neural net-based approach to speech recognition.
View details
A Machine-Learning Approach to Musical Sequence Induction That Uses Autocorrelation to Bridge Long Timelags
The Proceedings of the Eighth International Conference on Music Perception and Cognition (ICMPC8), Causal Productions, Adelaide(2004), pp. 542-543
Preview abstract
One major challenge in using statistical sequence learning methods in the domain of music lies in bridging the long timelags that separate important musical events. Consider, for example, the chord changes that convey the basic structure of a pop song. A sequence learner that cannot predict chord changes will almost certainly not be able to generate new examples in a musical style or to categorize songs by style. Yet, it is surprisingly difficult for a sequence learner to bridge the long timelags necessary to identify when a chord change will occur and what its new value will be. This is the case because chord changes can be separated by dozens or hundreds of intervening notes. One could solve this problem by treating chords as being special (as did Mozer, NIPS 1991). But this is impractical (it requires chords to be labeled specially in the dataset, limiting the applicability of the model to non-labeled examples) and furthermore does not address the general issue of nested temporal structure in music. I will briefly describe this temporal structure (known commonly as "meter") and present a model that uses to its advantage an assumption that sequences are metrical. The model consists of an autocorrelation-based filtration that estimates online the most likely metrical tree (i.e. the frequency and phase of beat, measure, phrase, etc.) and uses that to generate a series of sequences varying at different rates. These sequences correspond to each level in the hierarchy. Multiple learners can be used to treat each series separately and their predictions can be combined to perform composition and categorization. I will present preliminary results that demonstrate the usefulness of this approach. Time permitting, I will also compare the model to alternate approaches.
View details
Kalman filters improve LSTM network performance in problems unsolvable by traditional recurrent nets
Preview abstract
The Long Short-Term Memory (LSTM) network trained by gradient descent solves difficult problems which traditional recurrent neural networks in general cannot. We have recently observed that the decoupled extended Kalman filter training algorithm allows for even better performance, reducing significantly the number of training steps when compared to the original gradient descent training algorithm. In this paper we present a set of experiments which are unsolvable by classical recurrent networks but which are solved elegantly and robustly and quickly by LSTM combined with Kalman filters.
View details
Finding Temporal Structure in Music: Blues Improvisation with LSTM Recurrent Networks
J. Schmidhuber
Neural Networks for Signal Processing XII, Proceedings of the 2002 IEEE Workshop, IEEE, New York, pp. 747-756
Preview abstract
Few types of signal streams are as ubiquitous as music. Here we consider the problem of extracting essential ingredients of music signals, such as well-defined global temporal structure in the form of nested periodicities (or meter). Can we construct an adaptive signal processing device that learns by example how to generate new instances of a given musical style? Because recurrent neural networks can in principle learn the temporal structure of a signal, they are good candidates for such a task. Unfortunately, music composed by standard recurrent neural networks (RNNs) often lacks global coherence. The reason for this failure seems to be that RNNs cannot keep track of temporally distant events that indicate global music structure. Long Short-Term Memory (LSTM) has succeeded in similar domains where other RNNs have failed, such as timing and counting and learning of context sensitive languages. In the current study we show that LSTM is also a good mechanism for learning to compose music. We present experimental results showing that LSTM successfully learns a form of blues music and is able to compose novel (and we believe pleasing) melodies in that style. Remarkably, once the network has found the relevant structure it does not drift from it: LSTM is able to play the blues with good timing and proper structure as long as one is willing to listen.
View details
Learning Nonregular Languages: A Comparison of Simple Recurrent Networks and LSTM
Preview abstract
In response to Rodriguez' recent article (Rodriguez 2001) we compare the performance of simple recurrent nets and Long Short-Term Memory (LSTM) recurrent nets on context-free and context-sensitive languages.
View details
Learning Context Sensitive Languages with LSTM Trained with Kalman Filters
F.A. Gers
J.A. Pérez-Ortiz
J. Schmidhuber
Artificial Neural Networks -- ICANN 2002 (Proceedings), Springer, Berlin, pp. 655-660
Preview abstract
Unlike traditional recurrent neural networks, the Long Short-Term Memory (LSTM) model generalizes well when presented with training sequences derived from regular and also simple nonregular languages. Our novel combination of LSTM and the decoupled extended Kalman filter, however, learns even faster and generalizes even better, requiring only the 10 shortest exemplars (n ≤ 10) of the context sensitive language a^n b^n c^n to deal correctly with values of n up to 1000 and more. Even when we consider the relatively high update complexity per timestep, in many cases the hybrid offers faster learning than LSTM by itself.
View details
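To make the task in the abstract above concrete, the sketch below builds the 10 shortest strings of the language a^n b^n c^n (n from 1 to 10) and a simple membership check that could score predictions on much longer strings. It only illustrates the data, not the LSTM or Kalman filter training; the function names are hypothetical.

```python
from typing import List


def make_exemplar(n: int) -> str:
    """Return the string a^n b^n c^n for a positive integer n."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return "a" * n + "b" * n + "c" * n


def is_member(s: str) -> bool:
    """True if s has the form a^n b^n c^n for some n >= 1."""
    n, remainder = divmod(len(s), 3)
    return remainder == 0 and n >= 1 and s == make_exemplar(n)


# The 10 shortest exemplars, as in the training setup the abstract describes.
training_set: List[str] = [make_exemplar(n) for n in range(1, 11)]

# A much longer string of the kind used to probe generalization.
long_probe = make_exemplar(1000)
```

The language is context sensitive because recognizing it requires matching three counts at once, which is why generalizing from n ≤ 10 to n near 1000 is a demanding test.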
Finding Downbeats with a Relaxation Oscillator
Psychological Research, 66(2002), pp. 18-25
Preview abstract
A relaxation oscillator model of neural spiking dynamics is applied to the task of finding downbeats in rhythmical patterns. The importance of downbeat discovery or beat induction is discussed, and the relaxation oscillator model is compared to other oscillator models. In a set of computer simulations the model is tested on 35 rhythmical patterns from Povel & Essens (1985). The model performs well, making good predictions in 34 of 35 cases. In an analysis we identify some shortcomings of the model and relate model behavior to dynamical properties of relaxation oscillators.
View details
DEKF-LSTM
F.A. Gers
J.A. Pérez-Ortiz
J. Schmidhuber
Proceedings of the 10th European Symposium on Artificial Neural Networks, ESANN 2002
Learning The Long-Term Structure of the Blues
J. Schmidhuber
Artificial Neural Networks -- ICANN 2002 (Proceedings), Springer, Berlin, pp. 284-289
Preview abstract
In general music composed by recurrent neural networks (RNNs) suffers from a lack of global structure. Though networks can learn note-by-note transition probabilities and even reproduce phrases, they have been unable to learn an entire musical form and use that knowledge to guide composition. In this study, we describe model details and present experimental results showing that LSTM successfully learns a form of blues music and is able to compose novel (and some listeners believe pleasing) melodies in that style. Remarkably, once the network has found the relevant structure it does not drift from it: LSTM is able to play the blues with good timing and proper structure as long as one is willing to listen.
View details
Improving Long-Term Online Prediction with Decoupled Extended Kalman Filters
J.A. Pérez-Ortiz
J. Schmidhuber
F.A. Gers
Artificial Neural Networks -- ICANN 2002 (Proceedings), Springer, Berlin, pp. 1055-1060
Preview abstract
Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform traditional RNNs when dealing with sequences involving not only short-term but also long-term dependencies. The decoupled extended Kalman filter learning algorithm (DEKF) works well in online environments and reduces significantly the number of training steps when compared to the standard gradient-descent algorithms. Previous work on LSTM, however, has always used a form of gradient descent and has not focused on true online situations. Here we combine LSTM with DEKF and show that this new hybrid improves upon the original learning algorithm when applied to online processing.
View details
A Network of Relaxation Oscillators that Finds Downbeats in Rhythms
Artificial Neural Networks -- ICANN 2001 (Proceedings), Springer, Berlin, pp. 1239-1247
Preview abstract
A network of relaxation oscillators is used to find downbeats in rhythmical patterns. In this study, a novel model is described in detail. Its behavior is tested by exposing it to patterns having various levels of rhythmic complexity. We analyze the performance of the model and relate its success to previous work dealing with fast synchrony in coupled oscillators.
View details
A Positive-Evidence Model for Rhythmical Beat Induction
Journal of New Music Research, 30(2001), pp. 187-200
Preview abstract
The Normalized Positive (NPOS) model is a rule-based model that predicts downbeat location and pattern complexity in rhythmical patterns. Though derived from several existing models, the NPOS model is particularly effective at making correct predictions while at the same time having low complexity. In this paper, the details of the model are explored and a comparison is made to existing models. Several datasets are used to examine the complexity predictions of the model. Special attention is paid to the model's ability to account for the effects of musical experience on beat induction.
View details
Applying LSTM to Time Series Predictable Through Time-Window Approaches
F. A. Gers
J. Schmidhuber
Artificial Neural Networks -- ICANN 2001 (Proceedings), Springer, Berlin, pp. 669-676
Preview abstract
Long Short-Term Memory (LSTM) is able to solve many time series tasks unsolvable by feed-forward networks using fixed size time windows. Here we find that LSTM's superiority does not carry over to certain simpler time series tasks solvable by time window approaches: the Mackey-Glass series and the Santa Fe FIR laser emission series (Set A). This suggests using LSTM only when simpler traditional approaches fail.
View details
Meter Through Synchrony: Processing Rhythmical Patterns with Relaxation Oscillators
Ph.D. Thesis, Indiana University, Bloomington, IN (2000)
This dissertation uses a network of relaxation oscillators to beat along with temporal signals. Relaxation oscillators exhibit interspersed slow-fast movement and model a wide array of biological oscillations. The model is built up gradually: first a single relaxation oscillator is exposed to rhythms and shown to be good at finding downbeats in them. Then large networks of oscillators are mutually coupled in an exploration of their internal synchronization behavior. It is demonstrated that appropriate weights on coupling connections cause a network to form multiple pools of oscillators having stable phase relationships. This is a promising first step towards networks that can recreate a rhythmical pattern from memory. In the full model, a coupled network of relaxation oscillators is exposed to rhythmical patterns. It is shown that the network finds downbeats in patterns while continuing to exhibit good internal stability. A novel non-dynamical model of downbeat induction called the Normalized Positive (NP) clock model is proposed, analyzed, and used to generate comparison predictions for the oscillator model. The oscillator model compares favorably to other dynamical approaches to beat induction such as adaptive oscillators. However, the relaxation oscillator model takes advantage of intrinsic synchronization stability to allow the creation of large coupled networks. This research lays the groundwork for a long-term research goal, a robotic arm that responds to rhythmical signals by tapping along. It also opens the door to future work in connectionist learning of long rhythmical patterns.
Dynamics and Embodiment in Beat Induction
M. Gasser
Robert Port
Rhythm Perception and Production, Swets and Zeitlinger, Lisse, The Netherlands (2000), pp. 157-170
We provide an argument for using dynamical systems theory in the domain of beat induction. We motivate the study of beat induction and relate it to the more general study of human rhythm cognition. In doing so we compare a dynamical, embodied approach to a symbolic (traditional AI) one, paying particular attention to how the modeling approach brings with it tacit assumptions about what is being modeled. Please note that this is a philosophy paper about research that was, at the time of writing, very much in progress.
Learning Simple Metrical Preferences in a Network of Fitzhugh-Nagumo Oscillators
The Proceedings of the Twenty-First Annual Conference of the Cognitive Science Society, Lawrence Erlbaum Associates, New Jersey (1999)
Hebbian learning is used to train a network of oscillators to prefer periodic signals of pulses over aperiodic signals. Target signals consisted of metronome-like voltage pulses with varying amounts of inter-onset noise injected (0% noise yields a periodic signal, and more noise yields increasingly aperiodic signals). The oscillators, piecewise-linear approximations (Abbott, 1990) to Fitzhugh-Nagumo oscillators, are trained using mean phase coherence as an objective function. Before training, a network is shown to readily synchronize with signals having a wide range of noise. After training on a series of noise-free signals, a network is shown to synchronize only with signals having little or no noise. This represents a bias towards periodicity and is explained by strong positive coupling connections between oscillators having harmonically related periods.
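For readers unfamiliar with the objective mentioned above, here is a minimal sketch of one common way to compute mean phase coherence between two phase sequences: the magnitude of the average unit vector of their phase differences. This is an assumption about the definition; the paper may use a different formulation, and the function name `mean_phase_coherence` is hypothetical.

```python
import cmath


def mean_phase_coherence(phases_a, phases_b):
    """Return |mean(exp(i * (a - b)))| over paired phase samples (radians).

    The value is close to 1 when the phase difference stays constant
    (synchrony) and close to 0 when the differences are spread evenly
    around the circle. This is an assumed, commonly used definition.
    """
    if len(phases_a) != len(phases_b) or len(phases_a) == 0:
        raise ValueError("phase sequences must be non-empty and equal length")
    total = sum(cmath.exp(1j * (a - b)) for a, b in zip(phases_a, phases_b))
    return abs(total / len(phases_a))


if __name__ == "__main__":
    oscillator = [0.1 * k for k in range(100)]
    signal = [p + 0.5 for p in oscillator]  # constant phase lag
    print(mean_phase_coherence(oscillator, signal))
```

Under this definition a constant phase lag still counts as full coherence, so the measure rewards stable phase relationships rather than zero lag.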
Meter as Mechanism: A Neural Network Model that Learns Metrical Patterns
One kind of prosodic structure that apparently underlies both music and some examples of speech production is meter. Yet detailed measurements of the timing of both music and speech show that the nested periodicities that define metrical structure can be quite noisy in time. What kind of system could produce or perceive such variable metrical timing patterns? And what would it take to be able to store and reproduce particular metrical patterns from long-term memory? We have developed a network of coupled oscillators that both produces and perceives patterns of pulses that conform to particular meters. In addition, beginning with an initial state with no biases, it can learn to prefer the particular meter that it has been previously exposed to.
An Exploration of Representational Complexity via Coupled Oscillators
T. Chemero
Proceedings of the Tenth Midwest Artificial Intelligence and Cognitive Science Society, MIT Press, Cambridge, Mass. (1999)
We note some inconsistencies in a view of representation which takes decoupling to be of key importance. We explore these inconsistencies using examples of representational vehicles taken from coupled oscillator theory and suggest a new way to reconcile coupling with absence. Finally, we tie these views to a teleological definition of representation.
Perception of Simple Rhythmic Patterns in a Network of Oscillators
M. Gasser
The Proceedings of the Eighteenth Annual Conference of the Cognitive Science Society, Lawrence Erlbaum Associates, New Jersey (1996)
This paper is concerned with the complex capacity to recognize and reproduce rhythmic patterns. While this capacity has not been well investigated, in broad qualitative terms it is clear that people can learn to identify and produce recurring patterns defined in terms of sequences of beats of varying intensity and rests: the rhythms behind waltzes, reels, sambas, etc. Our short-term goal is a model which is "hard-wired" with knowledge of a set of such patterns. Presented with a portion of one of the patterns or a label for a pattern, the model should reproduce the pattern and continue to do so when the input is turned off. Our long-term goal is a model which can learn to adjust the connection strengths which implement particular patterns as it is exposed to input patterns.
Representing Rhythmic Patterns in a Network of Oscillators
M. Gasser
The Proceedings of the International Conference on Music Perception and Cognition, Lawrence Erlbaum Associates, New Jersey (1996), pp. 361-366
This paper describes an evolving computational model of the perception and production of simple rhythmic patterns. The model consists of a network of oscillators of different resting frequencies which couple with input patterns and with each other. Oscillators whose frequencies match periodicities in the input tend to become activated. Metrical structure is represented explicitly in the network in the form of clusters of oscillators whose frequencies and phase angles are constrained to maintain the harmonic relationships that characterize meter. Rests in rhythmic patterns are represented by explicit rest oscillators in the network, which become activated when an expected beat in the pattern fails to appear. The model makes predictions about the relative difficulty of patterns and the effect of deviations from periodicity in the input.