Abstract
Semantically meaningful information content in perceptual signals is usually unevenly distributed. In this work, we propose slow autoencoders (SlowAEs) for unsupervised learning of high-level variable-rate discrete representations of sequences, and apply them to speech signals. We show that the capacity of the resulting event-based representations automatically grows or shrinks depending on the density of salient information in the input signals, while still allowing for faithful signal reconstruction. We develop run-length Transformers (RLTs) for event-based representation modelling and use them to construct language models in the speech domain, which can generate grammatical and semantically coherent utterances and continuations.
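
To make the idea of a variable-rate, event-based representation concrete, the following minimal sketch assumes that events are obtained by run-length encoding a slowly varying sequence of discrete codes, as the name "run-length Transformer" suggests. The function names and the toy code sequences are hypothetical and purely illustrative; they are not taken from the SlowAE implementation.

```python
from itertools import groupby


def run_length_encode(codes):
    """Collapse each run of repeated symbols into a (value, length) event."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(codes)]


def run_length_decode(events):
    """Expand (value, length) events back into the fixed-rate sequence."""
    return [value for value, length in events for _ in range(length)]


# Hypothetical code sequences of equal length (8 time steps each).
# "dense" changes often, standing in for a segment rich in salient information;
# "sparse" changes rarely, standing in for a segment with little new content.
dense = [3, 3, 1, 7, 7, 2, 5, 5]
sparse = [4, 4, 4, 4, 4, 4, 9, 9]

dense_events = run_length_encode(dense)
sparse_events = run_length_encode(sparse)

# The number of events, not the number of time steps, sets the cost of
# each segment, and decoding recovers the original sequence exactly.
print(len(dense_events), len(sparse_events))
```

In this toy setting the two segments span the same duration, yet the slowly changing one is described by fewer events, which is the sense in which the capacity of an event-based representation can follow the density of information in the input.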