Nal Kalchbrenner

Authored Publications
    Do Better ImageNet Classifiers Assess Perceptual Similarity Better?
    Human-like perceptual similarity is an emergent property in the intermediate feature space of ImageNet-pretrained classifiers. Perceptual distances between images, as measured in the space of pre-trained image embeddings, have significantly outperformed earlier low-level metrics at assessing image similarity. This has led to the wide adoption of perceptual distances both as an evaluation metric and as an auxiliary training objective for image synthesis tasks. Yet while image classification has improved by leaps and bounds, the de facto standard for computing perceptual distances still relies on older, less accurate models such as VGG and AlexNet. Motivated by this, we evaluate the perceptual scores of modern networks: ResNets, EfficientNets and Vision Transformers. Surprisingly, we observe an inverse correlation between ImageNet accuracy and perceptual scores: better classifiers achieve worse perceptual scores. Studying the accuracy/perceptual-score relationship under different hyperparameter configurations, we find that improving accuracy improves perceptual scores up to a certain point, beyond which we uncover a Pareto frontier between accuracies and perceptual scores. We explore this relationship further using distortion invariance, spatial frequency sensitivity and alternative perceptual functions. Based on our study, we find an ImageNet-trained ResNet-6 network whose emergent perceptual score matches the best prior score obtained with networks trained explicitly on a perceptual similarity task.
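    As context for how such metrics are used, here is a minimal LPIPS-style sketch of a perceptual distance: embed both images with a frozen backbone, unit-normalize each layer's channels, and average squared differences. The toy extract_features below is a hypothetical stand-in for a pretrained classifier, not the paper's exact setup.

        import numpy as np

        rng = np.random.default_rng(0)
        # Toy stand-in for a frozen pretrained backbone: two fixed random
        # channel projections playing the role of intermediate layers.
        _PROJS = [rng.standard_normal((3, 16)), rng.standard_normal((3, 32))]

        def extract_features(image):
            # image: (H, W, 3) array -> list of (H, W, C) feature maps.
            return [image @ p for p in _PROJS]

        def perceptual_distance(img_a, img_b):
            total = 0.0
            feats = list(zip(extract_features(img_a), extract_features(img_b)))
            for fa, fb in feats:
                # Unit-normalize the channel vector at each spatial position.
                fa = fa / (np.linalg.norm(fa, axis=-1, keepdims=True) + 1e-10)
                fb = fb / (np.linalg.norm(fb, axis=-1, keepdims=True) + 1e-10)
                total += np.mean(np.sum((fa - fb) ** 2, axis=-1))
            return total / len(feats)

        print(perceptual_distance(rng.random((8, 8, 3)), rng.random((8, 8, 3))))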
    Colorization Transformer
    We present the Colorization Transformer, a novel approach for diverse high-fidelity image colorization based on self-attention. Given a grayscale image, colorization proceeds in three steps. We first use an autoregressive transformer to produce a low-resolution coarse coloring of the grayscale image; the architecture adopts conditional self-attention blocks to effectively capture the grayscale input. Two subsequent fully parallel networks then upsample the coarse colored low-resolution image into a finely colored high-resolution image. Sampling from the Colorization Transformer produces diverse colorings whose fidelity outperforms the previous state of the art on ImageNet colorization, as measured both by FID and by a human evaluation on Mechanical Turk. Remarkably, in more than 60% of cases human evaluators prefer the highest-rated of three generated colorings over the ground truth.
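    The three-step pipeline composes one slow autoregressive stage with two fast parallel ones; a schematic sketch in which every stage is a hypothetical placeholder rather than the paper's trained networks:

        import numpy as np

        def coarse_autoregressive_color(gray_lr):
            # Stage 1 (placeholder): an autoregressive transformer would
            # sample a coarse low-resolution coloring pixel by pixel.
            h, w = gray_lr.shape
            return np.random.rand(h, w, 3)

        def color_upsampler(coarse_rgb):
            # Stage 2 (placeholder): a fully parallel network refines the
            # coarse colors to full color depth.
            return coarse_rgb

        def spatial_upsampler(rgb_lr, factor=4):
            # Stage 3 (placeholder): a fully parallel network upsamples to
            # high resolution (nearest-neighbor stands in here).
            return rgb_lr.repeat(factor, axis=0).repeat(factor, axis=1)

        def colorize(gray_lr):
            return spatial_upsampler(color_upsampler(coarse_autoregressive_color(gray_lr)))

        print(colorize(np.random.rand(64, 64)).shape)   # (256, 256, 3)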
    A Spectral Energy Distance for Parallel Speech Synthesis
    Speech synthesis is an important practical generative modeling problem that has seen great progress over the last few years, with likelihood-based autoregressive neural models now outperforming traditional concatenative systems. A downside of such autoregressive models is that they require executing tens of thousands of sequential operations per second of generated audio, making them ill-suited for deployment on specialized deep learning hardware. Here, we propose a new learning method that allows us to train highly parallel models of speech without requiring access to an analytical likelihood function. Our approach is based on a generalized energy distance between the distributions of the generated and real audio. This spectral energy distance is a proper scoring rule with respect to the distribution over magnitude spectrograms of the generated waveform audio and offers statistical consistency guarantees. The distance can be calculated from minibatches without bias and does not involve adversarial learning, yielding a stable and consistent method for training implicit generative models. Empirically, we achieve state-of-the-art generation quality among implicit generative models, as judged by the recently proposed cFDSD metric. When combining our method with adversarial techniques, we also improve upon the recently proposed GAN-TTS model in terms of Mean Opinion Score as judged by trained human evaluators.
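    A minimal numpy sketch of a generalized energy distance over magnitude spectrograms, estimated from a minibatch. The attractive term pairs real with generated audio; the repulsive terms pair independent samples from the same side, which is what keeps the minibatch estimate unbiased. The frame size, hop and single-scale spectrogram are simplifying assumptions, not the paper's configuration.

        import numpy as np

        def mag_spectrogram(wave, frame=256, hop=128):
            frames = [wave[i:i + frame] for i in range(0, len(wave) - frame + 1, hop)]
            return np.abs(np.fft.rfft(np.stack(frames), axis=-1))

        def spec_dist(a, b):
            return np.linalg.norm(mag_spectrogram(a) - mag_spectrogram(b))

        def spectral_energy_distance(real, fake):
            # 2 E d(x, y) - E d(x, x') - E d(y, y'), with (x, x') and (y, y')
            # pairs of distinct samples formed by shifting the batch by one.
            cross = np.mean([spec_dist(x, y) for x, y in zip(real, fake)])
            rr = np.mean([spec_dist(x, x2) for x, x2 in zip(real, np.roll(real, 1, axis=0))])
            ff = np.mean([spec_dist(y, y2) for y, y2 in zip(fake, np.roll(fake, 1, axis=0))])
            return 2 * cross - rr - ff

        rng = np.random.default_rng(0)
        real = rng.standard_normal((8, 1024))
        fake = rng.standard_normal((8, 1024)) * 0.5
        print(spectral_energy_distance(real, fake))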
    MetNet: A Neural Weather Model for Precipitation Forecasting
    Weather forecasting is a long-standing scientific challenge with direct social and economic impact. The task is well suited to deep neural networks given the vast amounts of continuously collected data and a rich spatial and temporal structure that presents long-range dependencies. We introduce MetNet, a neural network that forecasts precipitation up to 8 hours into the future at a high spatial resolution of 1 km and a temporal resolution of 2 minutes, with a latency on the order of seconds. MetNet takes as input radar and satellite data together with the forecast lead time and produces a probabilistic precipitation map. The architecture uses axial self-attention to aggregate the global context from a large input patch corresponding to a million square kilometers. We evaluate the performance of MetNet at various precipitation thresholds and find that MetNet outperforms Numerical Weather Prediction for forecasts of up to 7 to 8 hours on the scale of the continental United States.
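    A schematic of axial self-attention in numpy: full self-attention over an H x W grid costs O((HW)^2), while attending along rows and then columns reaches every position at O(HW(H+W)) cost. Projections are omitted (queries = keys = values), so this illustrates the connectivity pattern, not MetNet's trained layer.

        import numpy as np

        def softmax(x, axis=-1):
            e = np.exp(x - x.max(axis=axis, keepdims=True))
            return e / e.sum(axis=axis, keepdims=True)

        def axial_attention(x, axis):
            # x: (H, W, C) feature map; attend along the chosen spatial axis,
            # treating the other axis as a batch dimension.
            x = np.moveaxis(x, axis, 1)
            scores = np.einsum('bic,bjc->bij', x, x) / np.sqrt(x.shape[-1])
            out = np.einsum('bij,bjc->bic', softmax(scores), x)
            return np.moveaxis(out, 1, axis)

        feats = np.random.rand(8, 8, 16)
        out = axial_attention(axial_attention(feats, axis=0), axis=1)
        print(out.shape)   # (8, 8, 16), every position now sees the whole grid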
    Bayesian Inference for Large Scale Image Classification
    Bayesian inference promises to ground and improve the performance of deep neural networks: to be robust to overfitting, to simplify the training procedure and the space of hyperparameters, and to provide a calibrated measure of uncertainty that can enhance decision making, agent exploration and prediction fairness. Markov chain Monte Carlo (MCMC) methods enable Bayesian inference by generating samples from the posterior distribution over model parameters. Despite the theoretical advantages of Bayesian inference and the similarity between MCMC and optimization methods, the performance of sampling methods has so far lagged behind that of optimization methods for large-scale deep learning tasks. We aim to fill this gap and introduce ATMC, an adaptive noise MCMC algorithm that estimates and is able to sample from the posterior of a neural network. ATMC dynamically adjusts the amount of momentum and noise applied to each parameter update in order to compensate for the use of stochastic gradients. We use a ResNet architecture without batch normalization to test ATMC on the CIFAR-10 benchmark and the large-scale ImageNet benchmark and show that, despite the absence of batch normalization, ATMC outperforms a strong optimization baseline in terms of both classification accuracy and test log-likelihood. We show that ATMC is intrinsically robust to overfitting on the training data and that it provides a better-calibrated measure of uncertainty compared to the optimization baseline.
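    For orientation, the simplest sampler in this family is stochastic gradient Langevin dynamics, which turns a gradient step into a posterior sampler by injecting Gaussian noise matched to the step size; ATMC additionally adapts momentum and noise per parameter, which this sketch deliberately omits.

        import numpy as np

        def sgld_step(theta, grad_log_post, step, rng):
            # theta <- theta + (step / 2) * grad log p(theta | data) + N(0, step),
            # where grad_log_post is a minibatch estimate of the gradient.
            noise = rng.standard_normal(theta.shape) * np.sqrt(step)
            return theta + 0.5 * step * grad_log_post(theta) + noise

        # Toy posterior: standard normal, so grad log p(theta) = -theta.
        rng = np.random.default_rng(0)
        theta, samples = np.zeros(3), []
        for _ in range(5000):
            theta = sgld_step(theta, lambda t: -t, 1e-2, rng)
            samples.append(theta.copy())
        print(np.std(samples))   # approaches 1, the posterior's std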
    Generating High Fidelity Images with Subscale Pixel Networks and Multidimensional Upscaling
    The unconditional generation of high-fidelity images is a long-standing benchmark for testing the performance of image decoders. Autoregressive image models have been able to generate small images unconditionally, but extending these methods to large images, where fidelity can be more readily assessed, has remained an open problem. Among the major challenges are the capacity needed to encode the vast previous context and the sheer difficulty of learning a distribution that preserves both global semantic coherence and exactness of detail. To address the former challenge, we propose the Subscale Pixel Network (SPN), a conditional decoder architecture that generates an image as a sequence of sub-images of equal size. The SPN compactly captures image-wide spatial dependencies and requires a fraction of the memory and computation of other fully autoregressive models. To address the latter challenge, we propose Multidimensional Upscaling, which grows an image in both size and depth via intermediate stages utilising distinct SPNs. We evaluate SPNs on the unconditional generation of CelebA-HQ at size 256 and of ImageNet at sizes 32 to 256. We achieve state-of-the-art likelihood results in multiple settings, establish new benchmark results in previously unexplored settings, and generate very high fidelity large-scale samples on both datasets.
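    The subscale decomposition is a pure indexing trick; a sketch of the slicing and its inverse for a single channel:

        import numpy as np

        def to_subimages(image, s=2):
            # Sub-image (i, j) holds every s-th pixel starting at (i, j);
            # the SPN generates these s*s equally sized slices in sequence.
            return [image[i::s, j::s] for i in range(s) for j in range(s)]

        def from_subimages(subs, s=2):
            h, w = subs[0].shape
            out = np.empty((h * s, w * s), dtype=subs[0].dtype)
            for k, (i, j) in enumerate([(i, j) for i in range(s) for j in range(s)]):
                out[i::s, j::s] = subs[k]
            return out

        img = np.arange(16).reshape(4, 4)
        assert np.array_equal(from_subimages(to_subimages(img)), img)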
    Parallel WaveNet: Fast High-Fidelity Speech Synthesis
    Aäron van den Oord
    Yazhe Li
    Igor Babuschkin
    Karen Simonyan
    Koray Kavukcuoglu
    George van den Driessche
    Luis Carlos Cobo Rus
    Florian Stimberg
    Norman Casagrande
    Dominik Grewe
    Seb Noury
    Sander Dieleman
    Erich Elsen
    Alexander Graves
    Helen King
    Thomas Walters
    Demis Hassabis
    Google DeepMind (2017)
    The recently developed WaveNet architecture [27] is the current state of the art in realistic speech synthesis, consistently rated as more natural sounding for many different languages than any previous system. However, because WaveNet relies on sequential generation of one audio sample at a time, it is poorly suited to today's massively parallel computers, and therefore hard to deploy in a real-time production setting. This paper introduces Probability Density Distillation, a new method for training a parallel feed-forward network from a trained WaveNet with no significant difference in quality. The resulting system is capable of generating high-fidelity speech samples more than 20 times faster than real time, and is deployed online by the Google Assistant, serving multiple English and Japanese voices.
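    The core of Probability Density Distillation is a KL divergence from student to teacher that only ever requires sampling from the fast student and scoring under both models. A toy Monte Carlo sketch with one-dimensional Gaussians standing in for the two networks; all callables here are hypothetical placeholders:

        import numpy as np

        def distillation_loss(student_sample, student_logprob, teacher_logprob, z_batch):
            # KL(student || teacher) = E_{x ~ student}[log p_s(x) - log p_t(x)],
            # estimated by mapping latent noise z through the parallel student.
            losses = []
            for z in z_batch:
                x = student_sample(z)                 # fast feed-forward sampling
                losses.append(student_logprob(x, z) - teacher_logprob(x))
            return np.mean(losses)

        rng = np.random.default_rng(0)
        logpdf = lambda x, mu: -0.5 * ((x - mu) ** 2 + np.log(2 * np.pi))
        loss = distillation_loss(
            student_sample=lambda z: z + 0.5,           # student: N(0.5, 1)
            student_logprob=lambda x, z: logpdf(x, 0.5),
            teacher_logprob=lambda x: logpdf(x, 0.0),   # teacher: N(0, 1)
            z_batch=rng.standard_normal(10000),
        )
        print(loss)   # analytic KL here is 0.5**2 / 2 = 0.125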
    Efficient Neural Audio Synthesis
    Erich Elsen
    Karen Simonyan
    Seb Noury
    Norman Casagrande
    Edward Lockhart
    Florian Stimberg
    Aäron van den Oord
    Sander Dieleman
    Koray Kavukcuoglu
    Proceedings of the 35th International Conference on Machine Learning, PMLR 80 (2018), pp. 2410-2419
    Sequential models achieve state-of-the-art results in the audio, visual and textual domains with respect to both estimating the data distribution and generating high-quality samples. Efficient sampling for this class of models has, however, remained an elusive problem. With a focus on text-to-speech synthesis, we describe a set of general techniques for reducing sampling time while maintaining high output quality. We first describe a single-layer recurrent neural network, the WaveRNN, with a dual softmax layer that matches the quality of the state-of-the-art WaveNet model. The compact form of the network makes it possible to generate 24 kHz 16-bit audio 4x faster than real time on a GPU. Second, we apply a weight pruning technique to reduce the number of weights in the WaveRNN. We find that, for a constant number of parameters, large sparse networks perform better than small dense networks, and that this relationship holds for sparsity levels beyond 96%. The small number of weights in a Sparse WaveRNN makes it possible to sample high-fidelity audio on a mobile CPU in real time. Finally, we propose a new generation scheme based on subscaling that folds a long sequence into a batch of shorter sequences and allows one to generate multiple samples at once. The Subscale WaveRNN produces 16 samples per step without loss of quality and offers an orthogonal method for increasing sampling efficiency.
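    Of the three techniques, weight pruning is the easiest to illustrate. A one-shot magnitude-pruning sketch; in the paper sparsity is instead grown gradually during training, so this only shows the selection rule:

        import numpy as np

        def magnitude_prune(weights, sparsity):
            # Zero the smallest-magnitude entries so that a `sparsity`
            # fraction of the matrix is exactly zero; the survivors are
            # the large weights.
            k = int(weights.size * sparsity)
            if k == 0:
                return weights.copy()
            threshold = np.partition(np.abs(weights).ravel(), k - 1)[k - 1]
            pruned = weights.copy()
            pruned[np.abs(pruned) <= threshold] = 0.0
            return pruned

        w = np.random.randn(1024, 1024)
        w_sparse = magnitude_prune(w, 0.96)   # sparsity beyond 96% still helped
        print((w_sparse == 0).mean())         # ~0.96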
    Mastering the game of Go with deep neural networks and tree search
    David Silver
    Aja Huang
    Christopher J. Maddison
    Arthur Guez
    Laurent Sifre
    George van den Driessche
    Julian Schrittwieser
    Ioannis Antonoglou
    Veda Panneershelvam
    Marc Lanctot
    Sander Dieleman
    Dominik Grewe
    John Nham
    Ilya Sutskever
    Timothy Lillicrap
    Madeleine Leach
    Koray Kavukcuoglu
    Thore Graepel
    Demis Hassabis
    Nature, vol. 529 (2016), pp. 484-489
    The game of Go has long been viewed as the most challenging of classic games for artificial intelligence owing to its enormous search space and the difficulty of evaluating board positions and moves. Here we introduce a new approach to computer Go that uses ‘value networks’ to evaluate board positions and ‘policy networks’ to select moves. These deep neural networks are trained by a novel combination of supervised learning from human expert games and reinforcement learning from games of self-play. Without any lookahead search, the neural networks play Go at the level of state-of-the-art Monte Carlo tree search programs that simulate thousands of random games of self-play. We also introduce a new search algorithm that combines Monte Carlo simulation with value and policy networks. Using this search algorithm, our program AlphaGo achieved a 99.8% winning rate against other Go programs, and defeated the human European Go champion by 5 games to 0. This is the first time that a computer program has defeated a human professional player in the full-sized game of Go, a feat previously thought to be at least a decade away.
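    The way the policy network guides the search can be boiled down to one selection rule. A PUCT-style sketch of how a tree node picks its next child to explore, balancing the mean action value against the policy prior; the array names and constant are illustrative, not the exact AlphaGo parameters:

        import numpy as np

        def select_move(prior, visits, total_value, c_puct=1.0):
            # Q: mean value of each move so far. U: exploration bonus that is
            # large for moves the policy network favors but the search has
            # rarely tried; it decays as visits accumulate.
            q = np.divide(total_value, visits, out=np.zeros_like(total_value),
                          where=visits > 0)
            u = c_puct * prior * np.sqrt(visits.sum() + 1) / (1.0 + visits)
            return int(np.argmax(q + u))

        prior = np.array([0.5, 0.3, 0.2])
        print(select_move(prior, np.array([10., 2., 0.]), np.array([6., 1., 0.])))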
    Conditional Image Generation with PixelCNN Decoders
    Aäron van den Oord
    Koray Kavukcuoglu
    Alexander Graves
    Advances in Neural Information Processing Systems 29, Curran Associates, Inc. (2016), pp. 4790-4798
    This work explores conditional image generation with a new image density model based on the PixelCNN architecture. The model can be conditioned on any vector, including descriptive labels or tags, or latent embeddings created by other networks. When conditioned on class labels from the ImageNet database, the model is able to generate diverse, realistic scenes representing distinct animals, objects, landscapes and structures. When conditioned on an embedding produced by a convolutional network given a single image of an unseen face, it generates a variety of new portraits of the same person with different facial expressions, poses and lighting conditions. We also show that conditional PixelCNN can serve as a powerful decoder in an image autoencoder. Additionally, the gated convolutional layers in the proposed model improve the log-likelihood of PixelCNN to match the state-of-the-art performance of PixelRNN on ImageNet, with greatly reduced computational cost.
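    The gated layer has a compact form: y = tanh(W_f x + V_f h) * sigmoid(W_g x + V_g h), where h is the conditioning vector. A sketch with per-position channel projections standing in for the masked convolutions of the real model:

        import numpy as np

        def sigmoid(x):
            return 1.0 / (1.0 + np.exp(-x))

        def gated_layer(x, w_f, w_g, v_f, v_g, h):
            # x: (H, W, C_in) features; h: conditioning vector, e.g. a class
            # embedding. The tanh path carries content, the sigmoid path gates it.
            return np.tanh(x @ w_f + h @ v_f) * sigmoid(x @ w_g + h @ v_g)

        rng = np.random.default_rng(0)
        x, h = rng.standard_normal((8, 8, 16)), rng.standard_normal(10)
        out = gated_layer(x,
                          rng.standard_normal((16, 32)), rng.standard_normal((16, 32)),
                          rng.standard_normal((10, 32)), rng.standard_normal((10, 32)),
                          h)
        print(out.shape)   # (8, 8, 32)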
    Pixel Recurrent Neural Networks
    Aäron van den Oord
    ICML (2016)
    Modelling the distribution of natural images is a landmark problem in unsupervised learning. We train a deep recurrent neural network to sequentially predict the pixels in an image. The network models the discrete joint probability of the raw pixel values. The distribution, though formally simple, can be arbitrarily complex and multimodal; it is tractable, and its ability to generalize is readily measured. Within a pixel the colors are also predicted sequentially, each depending on the others and on the previous context. We design two types of parallel spatial LSTM layers to make the network fast and scalable. Our main result is a compression score of 3.00 bits per color channel on CIFAR-10, considerably better than the previous state of the art. We also set new benchmarks on 32 x 32 and 64 x 64 ImageNet. Samples generated from the ImageNet model appear sharp, varied and globally coherent.
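    The factorization behind this "formally simple" distribution, and the conversion from log-likelihood to the reported compression score:

        import numpy as np

        # The model factorizes the joint over subpixels in raster-scan order,
        # with R, G, B predicted in turn within each pixel:
        #     p(x) = prod_i p(x_i | x_1, ..., x_{i-1})
        # so the total negative log-likelihood is a sum of per-subpixel
        # cross-entropies over a 256-way softmax.

        def bits_per_dim(total_nll_nats, image_shape=(32, 32, 3)):
            # 3.00 bits/dim on CIFAR-10 means the model codes each 8-bit
            # color-channel value in about 3 bits on average.
            return total_nll_nats / (np.prod(image_shape) * np.log(2))

        print(bits_per_dim(3.00 * 32 * 32 * 3 * np.log(2)))   # round-trips to 3.0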
    WaveNet: A Generative Model for Raw Audio
    Aäron van den Oord
    Sander Dieleman
    Karen Simonyan
    Alexander Graves
    Koray Kavukcuoglu
    arXiv (2016)
    This paper introduces WaveNet, a deep generative neural network trained end-to-end to model raw audio waveforms, which can be applied to text-to-speech and music generation. Current approaches to text-to-speech are focused on non-parametric, example-based generation (which stitches together short audio signal segments from a large training set), and parametric, model-based generation (in which a model generates acoustic features synthesized into a waveform with a vocoder). In contrast, we show that directly generating wideband audio signals at tens of thousands of samples per second is not only feasible, but also achieves results that significantly outperform the prior art. A single trained WaveNet can be used to generate different voices by conditioning on the speaker identity. We also show that the same approach can be used for music audio generation and speech recognition.
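    WaveNet's key component is the causal dilated convolution: output t sees only inputs at t, t-d, t-2d, ..., and stacking layers with doubling dilations grows the receptive field exponentially. A one-layer numpy sketch:

        import numpy as np

        def causal_dilated_conv(x, w, dilation):
            # y[t] = sum_i w[i] * x[t - i * dilation], with zero left-padding
            # so that no future sample leaks into the output.
            k, pad = len(w), (len(w) - 1) * dilation
            xp = np.concatenate([np.zeros(pad), x])
            t = np.arange(len(x))
            return sum(w[i] * xp[pad - i * dilation + t] for i in range(k))

        x = np.random.randn(16)
        y = causal_dilated_conv(x, np.array([0.5, 0.3, 0.2]), dilation=2)
        # One layer covers (k - 1) * dilation + 1 = 5 past samples; a stack
        # with dilations 1, 2, 4, ..., 512 covers over a thousand.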
    Neural Machine Translation in Linear Time
    Karen Simonyan
    Aäron van den Oord
    Alexander Graves
    Koray Kavukcuoglu
    arXiv (2016)
    We present a neural architecture for sequences, the ByteNet, that has two core features: it runs in time that is linear in the length of the sequences and it preserves the sequences' temporal resolution. The ByteNet is a stack of two dilated convolutional neural networks, one to encode the source and one to decode the target, where the target decoder unfolds dynamically to generate variable-length outputs. We show that the ByteNet decoder attains state-of-the-art performance on character-level language modelling and outperforms recurrent neural networks. We also show that the ByteNet achieves a performance on raw character-level machine translation that approaches that of the best neural translation models that run in quadratic time. A visualization technique reveals the latent alignment structure learnt by the ByteNet.
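    The linear-time property follows from the convolutional structure: each layer does constant work per position, and doubling dilations give the stack a wide receptive field without recurrence or pairwise attention. A small calculation:

        def receptive_field(kernel_size, dilations):
            # Each dilated layer widens the field by (k - 1) * d positions,
            # so dilations 1, 2, 4, ... give exponential growth while total
            # computation stays linear in sequence length.
            return 1 + sum((kernel_size - 1) * d for d in dilations)

        # Five layers with kernel size 3 and dilations 1, 2, 4, 8, 16:
        print(receptive_field(3, [1, 2, 4, 8, 16]))   # -> 63 characters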