Ian Simon

I joined Google in July 2016 to work on the Magenta project. Our goal is to create music and other art using machine intelligence. Before joining Google, I worked on music recommendations at Smule and music recognition at Microsoft. I received my PhD in Computer Science from the University of Washington, doing work with Steve Seitz on understanding scenes via large online image collections.
Authored Publications
    Automatic Music Transcription (AMT), in particular the problem of automatically extracting notes from audio, has seen much recent progress via the training of neural network models on musical audio recordings paired with aligned ground-truth note labels. However, progress is currently limited by the difficulty of obtaining such note labels for natural audio recordings at scale. In this paper, we take advantage of the fact that for monophonic music, the transcription problem is much easier and largely solved via modern pitch-tracking methods. Specifically, we show that we are able to combine recordings of real monophonic music (and their transcriptions) into artificial and musically incoherent mixtures, greatly increasing the scale of labeled training data. By pretraining on these mixtures, we can use a larger neural network model and significantly improve upon the state of the art in multi-instrument polyphonic transcription. We demonstrate this improvement across a variety of datasets and in a "zero-shot" setting where the model has not been trained on any data from the evaluation domain.
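The mixing idea in the abstract above can be illustrated with a minimal sketch. It assumes each monophonic clip already comes with note labels (for example from a pitch tracker); the function name `mix_monophonic`, the `(onset, offset, pitch)` label format, and the peak normalization are hypothetical choices for illustration, not details taken from the paper.

```python
import numpy as np

def mix_monophonic(clips):
    """Sum monophonic clips into one mixture and merge their note labels.

    clips: list of (audio, notes) pairs, where audio is a 1-D float array and
    notes is a list of (onset_s, offset_s, pitch) tuples (assumed format).
    Returns the mixture and a sorted list of (onset_s, offset_s, pitch, source).
    """
    length = max(len(audio) for audio, _ in clips)
    mix = np.zeros(length, dtype=np.float32)
    labels = []
    for source, (audio, notes) in enumerate(clips):
        # Shorter clips are implicitly zero-padded at the end.
        mix[:len(audio)] += audio
        # Tag each note with the index of the clip it came from.
        labels.extend((on, off, pitch, source) for on, off, pitch in notes)
    # Assumed safeguard: rescale only if the sum would clip.
    peak = np.max(np.abs(mix))
    if peak > 1.0:
        mix /= peak
    labels.sort()
    return mix, labels
```

Because the clips are chosen independently, the result is musically incoherent, but every note in the mixture still has an exact label.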
    MT3: Multi-task Multitrack Music Transcription
    Josh Gardner
    Curtis Glenn-Macway Hawthorne
    ICLR 2022 (to appear)
    Automatic Music Transcription (AMT), inferring musical notes from raw audio, is a challenging task at the core of music understanding. Unlike Automatic Speech Recognition (ASR), which typically focuses on the words of a single speaker, AMT often requires transcribing multiple instruments simultaneously, all while preserving fine-scale pitch and timing information. Further, many AMT datasets are "low-resource", as even expert musicians find music transcription difficult and time-consuming. Thus, prior work has focused on task-specific architectures, tailored to the individual instruments of each task. In this work, motivated by the promising results of sequence-to-sequence transfer learning for low-resource Natural Language Processing (NLP), we demonstrate that a general-purpose Transformer model can perform multi-task AMT, jointly transcribing arbitrary combinations of musical instruments across several transcription datasets. We show this unified training framework achieves high-quality transcription results across a range of datasets, dramatically improving performance for low-resource instruments (such as guitar), while preserving strong performance for abundant instruments (such as piano). Finally, by expanding the scope of AMT, we expose the need for more consistent evaluation metrics and better dataset alignment, and provide a strong baseline for this new direction of multi-task AMT.
    Data is the lifeblood of modern machine learning systems, including for those in Music Information Retrieval (MIR). However, MIR has long been mired by small datasets and unreliable labels. In this work, we propose to break this bottleneck using generative models. By pipelining a generative model of notes (Coconet trained on Bach Chorales) with a structured synthesis model of chamber ensembles (MIDI-DDSP trained on URMP), we demonstrate a system capable of producing unlimited amounts of realistic chorale music with rich annotations including mixes, stems, MIDI, note-level performance attributes (staccato, vibrato, etc.), and even fine-grained synthesis parameters (pitch, amplitude, etc.). We call this system the Chamber Ensemble Generator (CEG), and use it to generate a large dataset of chorales from four different chamber ensembles (CocoChorales). We demonstrate that data generated using our approach improves state-of-the-art models for music transcription and source separation, and we release both the system and the dataset as an open-source foundation for future work in the MIR community.
    Sequence-to-Sequence Piano Transcription with Transformers
    Curtis Glenn-Macway Hawthorne
    Rigel Jacob Swavely
    ISMIR (2021) (to appear)
    Automatic Music Transcription has seen significant progress in recent years by training custom deep neural networks on large datasets. However, these models have required extensive domain-specific design of network architectures, input/output representations, and complex decoding schemes. In this work, we show that equivalent performance can be achieved using a generic encoder-decoder Transformer with standard decoding methods. We demonstrate that the model can learn to translate spectrogram inputs directly to MIDI-like outputs for several transcription tasks. This sequence-to-sequence approach simplifies transcription by jointly modeling audio features and language-like output dependencies, thus removing the need for task-specific architectures. These results point toward possibilities for creating new Music Information Retrieval models by focusing on dataset creation and labeling rather than custom model design.
    Symbolic Music Generation with Diffusion Models
    Gautam Mittal
    Curtis Glenn-Macway Hawthorne
    ISMIR (2021) (to appear)
    Score-based generative models and diffusion probabilistic models have been successful at generating high-quality samples in continuous domains such as images and audio. However, due to their Langevin-inspired sampling mechanisms, their application to discrete and sequential data has been limited. In this work, we present a technique for training diffusion models on sequential data by parameterizing the discrete domain in the continuous latent space of a pre-trained variational autoencoder. Our method is non-autoregressive: it learns to generate sequences of latent embeddings through the reverse process of a Markov chain, and it offers parallel generation with a constant number of iterative refinement steps. We apply this technique to modeling symbolic music and show promising unconditional generation results compared to an autoregressive language model operating over the same continuous embeddings.
    Encoding Musical Style with Transformer Autoencoders
    Kristy Choi
    Curtis Hawthorne
    Monica Dinculescu
    ICML (2020)
    Music Transformer is a recently developed generative model that leverages self-attention based on relative positioning to achieve state-of-the-art music generation. However, adapting the trained generative model to user preferences has proven to be cumbersome. In this work, we propose a variety of techniques to enable more fine-grained control of user input. Specifically, we condition on performance and melody inputs to learn musical representations that generalize well across a variety of different musical tasks. Empirically, we demonstrate the effectiveness of our method on the MAESTRO dataset and an internal 10,000+ hour dataset of YouTube piano performances, achieving improvements in both log-likelihood and mean listening scores.
    Music Transformer: Generating Music with Long-Term Structure
    Ashish Vaswani
    Jakob Uszkoreit
    Noam Shazeer
    Curtis Hawthorne
    Matt Hoffman
    Monica Dinculescu
    ICLR (2019)
    Music relies heavily on repetition to build structure and meaning. Self-reference occurs on multiple timescales, from motifs to phrases to reusing of entire sections of music, such as in pieces with ABA structure. The Transformer (Vaswani et al., 2017), a sequence model based on self-attention, has achieved compelling results in many generation tasks that require maintaining long-range coherence. This suggests that self-attention might also be well-suited to modeling music. In musical composition and performance, however, relative timing is critically important. Existing approaches for representing relative positional information in the Transformer modulate attention based on pairwise distance (Shaw et al., 2018). This is impractical for long sequences such as musical compositions since their memory complexity for intermediate relative information is quadratic in the sequence length. We propose an algorithm that reduces their intermediate memory requirement to linear in the sequence length. This enables us to demonstrate that a Transformer with our modified relative attention mechanism can generate minute-long compositions (thousands of steps, four times the length modeled in Oore et al., 2018) with compelling structure, generate continuations that coherently elaborate on a given motif, and in a seq2seq setup generate accompaniments conditioned on melodies. We evaluate the Transformer with our relative attention mechanism on two datasets, JSB Chorales and Piano-e-Competition, and obtain state-of-the-art results on the latter.
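The abstract above does not spell out the memory-saving algorithm, so the following is only a sketch of one well-known way to compute causal relative-position logits without building a per-pair embedding tensor: multiply queries by the relative embeddings once, then realign the result with a pad-and-reshape "skew". Treating this as the paper's exact procedure is an assumption; the function names and the convention that row `L - 1` of `e` holds distance 0 are hypothetical.

```python
import numpy as np

def relative_logits_reference(q, e):
    """Reference loop: logits[i, j] = q[i] . e[j - i + L - 1] for j <= i."""
    L = q.shape[0]
    out = np.zeros((L, L))
    for i in range(L):
        for j in range(i + 1):
            out[i, j] = q[i] @ e[j - i + L - 1]
    return out

def relative_logits_skewed(q, e):
    """Same logits via one matmul plus pad, reshape, and slice."""
    L = q.shape[0]
    qe = q @ e.T                              # (L, L): column r is distance r - (L - 1)
    padded = np.pad(qe, ((0, 0), (1, 0)))     # (L, L + 1): zero column on the left
    skewed = padded.reshape(L + 1, L)[1:]     # (L, L): entry [i, j] now holds qe[i, j - i + L - 1]
    return np.tril(skewed)                    # zero out future positions (j > i)

rng = np.random.default_rng(0)
q = rng.normal(size=(6, 4))   # queries: sequence length 6, depth 4
e = rng.normal(size=(6, 4))   # one embedding per relative distance
same = np.allclose(relative_logits_reference(q, e), relative_logits_skewed(q, e))
```

The skewed version only ever holds matrices of size L x L or L x (L + 1), rather than a separate depth-sized embedding for every query-key pair.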
    Piano Genie
    Chris Donahue
    Sander Dieleman
    ACM IUI (2019)
    We present Piano Genie, a generative musical instrument which allows non-musicians to play the piano. With Piano Genie, a user performs on a simple interface with eight buttons, and their performance is decoded into the space of plausible piano music in real time. To learn a suitable mapping procedure for this problem, we train recurrent neural network autoencoders with discrete bottlenecks: an encoder learns an appropriate sequence of buttons corresponding to a piano piece, and a decoder learns to map this sequence back to the original piece. During performance, we substitute a user’s input for the encoder output, and play the decoder’s prediction each time the user presses a button. To improve the interpretability of Piano Genie’s performance mechanics, we impose musically-informed constraints over the encoder’s outputs.
    Generating musical audio directly with neural networks is notoriously difficult because it requires coherently modeling both long- and short-term structure. Fortunately, most music is also highly structured and primarily composed of discrete note events played on musical instruments. Herein, we show that by using notes as an intermediate representation, we can train a suite of models capable of transcribing, composing, and synthesizing audio waveforms with coherent musical structure on timescales spanning six orders of magnitude (~0.1 ms (8 kHz) to ~100 s). This large advance in the state of the art is enabled by our release of the new MAESTRO (MIDI and Audio Edited for Synchronous TRacks and Organization) dataset, composed of over 172 hours of virtuosic piano performances captured with fine alignment (~3 ms) between note labels and audio waveforms. The networks and the dataset together present a promising approach toward creating new expressive and interpretable neural models of music.
    Onsets and Frames: Dual-Objective Piano Transcription
    Curtis Hawthorne
    Erich Elsen
    Jialin Song
    Colin Raffel
    Sageev Oore
    Proceedings of the 19th International Society for Music Information Retrieval Conference, ISMIR 2018, Paris, France, 2018
    We advance the state of the art in polyphonic piano music transcription by using a deep convolutional and recurrent neural network which is trained to jointly predict onsets and frames. Our model predicts pitch onset events and then uses those predictions to condition framewise pitch predictions. During inference, we restrict the predictions from the framewise detector by not allowing a new note to start unless the onset detector also agrees that an onset for that pitch is present in the frame. We focus on improving onsets and offsets together instead of either in isolation as we believe this correlates better with human musical perception. Our approach results in over a 100% relative improvement in note F1 score (with offsets) on the MAPS dataset. Furthermore, we extend the model to predict relative velocities of normalized audio which results in more natural-sounding transcriptions.
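The inference rule described in the abstract above (a note may start only where the onset detector agrees) can be sketched as a small decoding loop. This is an illustration under stated assumptions, not the authors' implementation: the function name `decode_notes`, the shared 0.5 threshold, and the `[frames, pitches]` array layout are all hypothetical.

```python
import numpy as np

def decode_notes(onset_probs, frame_probs, threshold=0.5):
    """Turn onset and frame probabilities into (pitch, start, end) notes.

    Both inputs are arrays of shape [num_frames, num_pitches]; end is exclusive.
    A note starts only where both detectors fire, and it is sustained for as
    long as the frame detector stays active.
    """
    onsets = onset_probs >= threshold
    frames = frame_probs >= threshold
    num_frames, num_pitches = frames.shape
    notes = []
    for pitch in range(num_pitches):
        start = None
        for t in range(num_frames):
            if start is None:
                # Frame activity without an onset is ignored.
                if onsets[t, pitch] and frames[t, pitch]:
                    start = t
            elif not frames[t, pitch]:
                notes.append((pitch, start, t))
                start = None
        if start is not None:
            # Note still sounding at the end of the excerpt.
            notes.append((pitch, start, num_frames))
    return notes
```

In this sketch, a pitch whose frame detector fires without any matching onset produces no note at all, which is the filtering effect the abstract describes.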