Visualizing Music Self-Attention

Monica Dinculescu
Ashish Vaswani
NIPS Workshop on Interpretability and Robustness in Audio, Speech, and Language (2018)

Abstract

Like language, music can be represented as a sequence of discrete symbols that form a hierarchical syntax, with notes roughly analogous to characters and motifs of notes to words. Unlike text, however, music relies heavily on repetition at multiple timescales to build structure and meaning. The Music Transformer has shown compelling results in generating music with structure (Huang et al., 2018). In this paper, we introduce a tool for visualizing self-attention on polyphonic music with an interactive pianoroll. We use the Music Transformer as both a descriptive tool and a generative model. For the former, we use it to analyze existing music and check whether the resulting self-attention structure corroborates the musical structure known from music theory. For the latter, we inspect the model's self-attention during generation in order to understand how past notes affect future ones. We also compare the attention structure of regular attention with that of relative attention (Shaw et al., 2018; Huang et al., 2018) and examine its impact on the resulting generated music. For example, on the JSB Chorales dataset, a model trained with relative attention is more consistent in attending to all the voices in the preceding timestep and the chords before it, and, at cadences, to the beginning of the phrase, allowing it to create an arc. We hope that our analyses will offer further evidence for relative self-attention as a powerful inductive bias for modeling music. We invite the reader to watch video animations of music attention and interact with the visualizations at https://storage.googleapis.com/nips-workshop-visualization/index.html.
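
For readers unfamiliar with the distinction the abstract draws, the sketch below illustrates, under simplifying assumptions, how relative self-attention in the style of Shaw et al. (2018) adds a distance-dependent term to the attention logits of regular self-attention, and how the resulting attention matrix is what a pianoroll overlay would visualize. This is not the authors' code; the function names, shapes, and the NumPy toy example are illustrative assumptions only.

```python
# Minimal sketch (not the paper's implementation) contrasting regular and
# relative self-attention for a single head, using NumPy for clarity.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(q, k, v, rel_emb=None):
    """q, k, v: (seq_len, depth); rel_emb: (seq_len, seq_len, depth) or None."""
    depth = q.shape[-1]
    logits = q @ k.T  # content-based term: regular attention
    if rel_emb is not None:
        # Relative term: each query also scores an embedding of its
        # distance to every key position (Shaw et al., 2018 style).
        logits = logits + np.einsum('qd,qkd->qk', q, rel_emb)
    weights = softmax(logits / np.sqrt(depth))  # attention matrix to visualize
    return weights @ v, weights

# Toy example: 8 "notes" with 4-dimensional representations.
L, D = 8, 4
rng = np.random.default_rng(0)
q = k = v = rng.normal(size=(L, D))
# One embedding per relative distance, gathered into shape (L, L, D).
dist_emb = rng.normal(size=(2 * L - 1, D))
rel = dist_emb[np.arange(L)[None, :] - np.arange(L)[:, None] + L - 1]
_, attn = self_attention(q, k, v, rel_emb=rel)  # attn: what the pianoroll tool overlays
```

In a trained model, the `weights` matrix for each head is what reveals whether a generated note attends to the preceding timestep, an earlier chord, or the start of a phrase.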
