Nal Kalchbrenner
Authored Publications
Preview abstract
Human-like perceptual similarity is an emergent property in the intermediate feature space of ImageNet-pretrained classifiers. Perceptual distances between images, as measured in the space of pre-trained image embeddings, have outperformed prior low-level metrics significantly on assessing image similarity. This has led to the wide adoption of perceptual distances as both an evaluation metric and an auxiliary training objective for image synthesis tasks. While image classification has improved by leaps and bounds, the de facto standard for computing perceptual distances uses older, less accurate models such as VGG and AlexNet. Motivated by this, we evaluate the perceptual scores of modern networks: ResNets, EfficientNets and Vision Transformers. Surprisingly, we observe an inverse correlation between ImageNet accuracy and perceptual scores: better classifiers achieve worse perceptual scores. We dive deeper into this, studying the ImageNet accuracy/perceptual score relationship under different hyperparameter configurations. Improving accuracy improves perceptual scores up to a certain point, but beyond this point we uncover a Pareto frontier between accuracies and perceptual scores. We explore this relationship further using distortion invariance, spatial frequency sensitivity and alternative perceptual functions. Based on our study, we find an ImageNet-trained ResNet-6 network whose emergent perceptual score matches the best prior score obtained with networks trained explicitly on a perceptual similarity task.
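As a rough illustration of what a perceptual distance in embedding space can look like, the sketch below unit-normalizes each channel vector of two sets of feature maps and averages the squared differences per layer. This is an LPIPS-style simplification without learned channel weights and is not necessarily the scoring procedure used in the paper. The names `perceptual_distance`, `unit_normalize` and `fake_layer` are hypothetical, and random vectors stand in for real network activations.

```python
import math
import random

def unit_normalize(vec, eps=1e-10):
    """Scale a channel vector to (approximately) unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / (norm + eps) for v in vec]

def perceptual_distance(feats_a, feats_b):
    """Sum over layers of the mean squared difference between normalized features.

    feats_*: list of layers; each layer is a list of per-position channel vectors.
    """
    total = 0.0
    for layer_a, layer_b in zip(feats_a, feats_b):
        sq = 0.0
        for va, vb in zip(layer_a, layer_b):
            na, nb = unit_normalize(va), unit_normalize(vb)
            sq += sum((x - y) ** 2 for x, y in zip(na, nb))
        total += sq / len(layer_a)
    return total

random.seed(0)

def fake_layer(positions, channels):
    """Hypothetical stand-in for one layer of activations (assumption, not real data)."""
    return [[random.gauss(0.0, 1.0) for _ in range(channels)] for _ in range(positions)]

feats_x = [fake_layer(16, 8), fake_layer(4, 16)]
# A slightly perturbed copy of feats_x and an unrelated set of features.
feats_near = [[[v + 0.05 * random.gauss(0.0, 1.0) for v in vec] for vec in layer]
              for layer in feats_x]
feats_far = [fake_layer(16, 8), fake_layer(4, 16)]

d_near = perceptual_distance(feats_x, feats_near)
d_far = perceptual_distance(feats_x, feats_far)
```

In this toy setup the perturbed copy should score much closer to the original than the unrelated features do, which is the behaviour a perceptual score is meant to capture.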
Preview abstract
We present the Colorization Transformer, a novel approach for diverse high fidelity image colorization based on self-attention. Given a grayscale image, the colorization proceeds in three steps. We first use an autoregressive transformer to produce a low resolution coarse coloring of the grayscale image. Our architecture adopts conditional self-attention blocks to effectively capture grayscale input. Two subsequent fully parallel networks upsample the coarse colored low resolution image into a finely colored high resolution image. Sampling from the Colorization Transformer produces diverse colorings whose fidelity outperforms the previous state-of-the-art on colorising ImageNet based on FID results and based on a human evaluation in a Mechanical Turk test. Remarkably, in more than 60% of cases human evaluators prefer the highest rated among three generated colorings over the ground truth.
A Spectral Energy Distance for Parallel Speech Synthesis
Rianne van den Berg
(2020)
Preview abstract
Speech synthesis is an important practical generative modeling problem that has seen great progress over the last few years, with likelihood-based autoregressive neural models now outperforming traditional concatenative systems. A downside of such autoregressive models is that they require executing tens of thousands of sequential operations per second of generated audio, making them ill-suited for deployment on specialized deep learning hardware. Here, we propose a new learning method that allows us to train highly parallel models of speech, without requiring access to an analytical likelihood function. Our approach is based on a generalized energy distance between the distributions of the generated and real audio. This spectral energy distance is a proper scoring rule with respect to the distribution over magnitude-spectrograms of the generated waveform audio and offers statistical consistency guarantees. The distance can be calculated from minibatches without bias, and does not involve adversarial learning, yielding a stable and consistent method for training implicit generative models. Empirically, we achieve state-of-the-art generation quality among implicit generative models, as judged by the recently proposed cFDSD metric. When combining our method with adversarial techniques, we also improve upon the recently proposed GAN-TTS model in terms of Mean Opinion Score as judged by trained human evaluators.
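The generalized energy distance mentioned above has the form D² = 2·E[d(x, y)] − E[d(x, x′)] − E[d(y, y′)], where x, x′ are real samples and y, y′ are generated ones. The following minimal sketch estimates it from two minibatches, leaving out the i = j pairs in the within-set terms so that those terms are unbiased. It assumes a plain Euclidean distance on toy vectors; the paper defines its distance on spectrograms, which is not reproduced here. All function names are hypothetical.

```python
import math

def euclidean(a, b):
    """Plain Euclidean distance (assumption: stands in for a spectrogram distance)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean_cross(xs, ys, dist):
    """Average distance over all pairs (x, y) drawn from the two sets."""
    return sum(dist(x, y) for x in xs for y in ys) / (len(xs) * len(ys))

def mean_within(xs, dist):
    """Average distance over distinct pairs within one set (needs at least 2 samples)."""
    n = len(xs)
    total = sum(dist(xs[i], xs[j]) for i in range(n) for j in range(n) if i != j)
    return total / (n * (n - 1))

def energy_distance_sq(xs, ys, dist=euclidean):
    """Minibatch estimate of 2 E[d(x, y)] - E[d(x, x')] - E[d(y, y')]."""
    return 2.0 * mean_cross(xs, ys, dist) - mean_within(xs, dist) - mean_within(ys, dist)

# Toy minibatches: every real sample sits at the origin, every generated one at (3, 4).
real = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
fake = [[3.0, 4.0], [3.0, 4.0], [3.0, 4.0]]
d2 = energy_distance_sq(real, fake)
```

Because the estimate only needs pairwise distances between samples, no likelihood function and no discriminator are involved, which matches the abstract's point about avoiding adversarial training.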
MetNet: A Neural Weather Model for Precipitation Forecasting
Casper Kaae Sønderby
Avital Oliver
Jason Hickey
Submission to journal (2020)
Preview abstract
Weather forecasting is a long standing scientific challenge with direct social and economic impact. The task is suitable for deep neural networks due to vast amounts of continuously collected data and a rich spatial and temporal structure that presents long range dependencies. We introduce MetNet, a neural network that forecasts precipitation up to 8 hours into the future at the high spatial resolution of 1 km and at the temporal resolution of 2 minutes with a latency in the order of seconds. MetNet takes as input radar and satellite data and forecast lead time and produces a probabilistic precipitation map. The architecture uses axial self-attention to aggregate the global context from a large input patch corresponding to a million square kilometers. We evaluate the performance of MetNet at various precipitation thresholds and find that MetNet outperforms Numerical Weather Prediction at forecasts of up to 7 to 8 hours on the scale of the continental United States.
Preview abstract
Bayesian inference promises to ground and improve the performance of deep neural networks. It promises to be robust to overfitting, to simplify the training procedure and the space of hyperparameters, and to provide a calibrated measure of uncertainty that can enhance decision making, agent exploration and prediction fairness.
Markov Chain Monte Carlo (MCMC) methods enable Bayesian inference by generating samples from the posterior distribution over model parameters.
Despite the theoretical advantages of Bayesian inference and the similarity between MCMC and optimization methods, the performance of sampling methods has so far lagged behind optimization methods for large scale deep learning tasks.
We aim to fill this gap and introduce ATMC, an adaptive noise MCMC algorithm that estimates and is able to sample from the posterior of a neural network.
ATMC dynamically adjusts the amount of momentum and noise applied to each parameter update in order to compensate for the use of stochastic gradients.
We use a ResNet architecture without batch normalization to test ATMC on the CIFAR-10 benchmark and the large scale ImageNet benchmark and show that, despite the absence of batch normalization, ATMC outperforms a strong optimization baseline in terms of both classification accuracy and test log-likelihood. We show that ATMC is intrinsically robust to overfitting on the training data and that ATMC provides a better calibrated measure of uncertainty compared to the optimization baseline.
Preview abstract
The unconditional generation of high fidelity images is a longstanding benchmark
for testing the performance of image decoders. Autoregressive image models have
been able to generate small images unconditionally, but the extension of these
methods to large images where fidelity can be more readily assessed has remained
an open problem. Among the major challenges are the capacity to encode the vast
previous context and the sheer difficulty of learning a distribution that preserves
both global semantic coherence and exactness of detail. To address the former
challenge, we propose the Subscale Pixel Network (SPN), a conditional decoder
architecture that generates an image as a sequence of sub-images of equal size. The
SPN compactly captures image-wide spatial dependencies and requires a fraction
of the memory and the computation required by other fully autoregressive models.
To address the latter challenge, we propose to use Multidimensional Upscaling
to grow an image in both size and depth via intermediate stages utilising distinct
SPNs. We evaluate SPNs on the unconditional generation of CelebAHQ of size
256 and of ImageNet from size 32 to 256. We achieve state-of-the-art likelihood
results in multiple settings, set up new benchmark results in previously unexplored
settings and are able to generate very high fidelity large scale samples on the basis
of both datasets.
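One way to picture "a sequence of sub-images of equal size" is interleaved subsampling: taking every s-th pixel in each direction, starting from each of the s × s possible offsets. The sketch below assumes this interleaved scheme and a row-major ordering of the offsets; the abstract itself does not spell out either detail. The function names are hypothetical.

```python
def split_into_subimages(image, s):
    """Return the s*s interleaved sub-images of a 2-D list, in row-major (i, j) offset order."""
    return [[row[j::s] for row in image[i::s]] for i in range(s) for j in range(s)]

def merge_subimages(subimages, s):
    """Inverse of split_into_subimages: interleave the sub-images back into one image."""
    sub_h, sub_w = len(subimages[0]), len(subimages[0][0])
    image = [[None] * (sub_w * s) for _ in range(sub_h * s)]
    for idx, sub in enumerate(subimages):
        i, j = divmod(idx, s)  # row and column offset of this sub-image
        for r, row in enumerate(sub):
            for c, value in enumerate(row):
                image[r * s + i][c * s + j] = value
    return image

# A 4x4 toy "image" whose pixel values are their row-major indices.
image = [[4 * r + c for c in range(4)] for r in range(4)]
subs = split_into_subimages(image, 2)   # four 2x2 sub-images
restored = merge_subimages(subs, 2)
```

Each sub-image is a coarse view of the whole picture, so a decoder that generates them one after another can condition on image-wide context while only handling one small slice at a time.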
Parallel WaveNet: Fast High-Fidelity Speech Synthesis
Aäron van den Oord
Yazhe Li
Igor Babuschkin
Karen Simonyan
Koray Kavukcuoglu
George van den Driessche
Luis Carlos Cobo Rus
Florian Stimberg
Norman Casagrande
Dominik Grewe
Seb Noury
Sander Dieleman
Erich Elsen
Alexander Graves
Helen King
Thomas Walters
Demis Hassabis
Google DeepMind (2017)
Preview abstract
The recently developed WaveNet architecture [27] is the current state of the art in realistic speech synthesis, consistently rated as more natural sounding for many different languages than any previous system. However, because WaveNet relies on sequential generation of one audio sample at a time, it is poorly suited to today’s massively parallel computers, and therefore hard to deploy in a real-time production setting. This paper introduces Probability Density Distillation, a new method for training a parallel feed-forward network from a trained WaveNet with no significant difference in quality. The resulting system is capable of generating high-fidelity speech samples more than 20 times faster than real time, and is deployed online by Google Assistant, including serving multiple English and Japanese voices.
Efficient Neural Audio Synthesis
Erich Elsen
Karen Simonyan
Seb Noury
Norman Casagrande
Edward Lockhart
Florian Stimberg
Aaron van den Oord
Sander Dieleman
Koray Kavukcuoglu
Proceedings of the 35th International Conference on Machine Learning, vol. PMLR 80 (2017), pp. 2410-2419 (to appear)
Preview abstract
Sequential models achieve state-of-the-art results in audio, visual and textual domains with respect to both estimating the data distribution and generating high-quality samples. Efficient sampling for this class of models has, however, remained an elusive problem. With a focus on text-to-speech synthesis, we describe a set of general techniques for reducing sampling time while maintaining high output quality. We first describe a single-layer recurrent neural network, the WaveRNN, with a dual softmax layer that matches the quality of the state-of-the-art WaveNet model. The compact form of the network makes it possible to generate 24 kHz 16-bit audio 4x faster than real time on a GPU. Second, we apply a weight pruning technique to reduce the number of weights in the WaveRNN. We find that, for a constant number of parameters, large sparse networks perform better than small dense networks and this relationship holds for sparsity levels beyond 96%. The small number of weights in a Sparse WaveRNN makes it possible to sample high-fidelity audio on a mobile CPU in real time. Finally, we propose a new generation scheme based on subscaling that folds a long sequence into a batch of shorter sequences and allows one to generate multiple samples at once. The Subscale WaveRNN produces 16 samples per step without loss of quality and offers an orthogonal method for increasing sampling efficiency.
Mastering the game of Go with deep neural networks and tree search
David Silver
Aja Huang
Christopher J. Maddison
Arthur Guez
Laurent Sifre
George van den Driessche
Julian Schrittwieser
Ioannis Antonoglou
Veda Panneershelvam
Marc Lanctot
Sander Dieleman
Dominik Grewe
John Nham
Ilya Sutskever
Timothy Lillicrap
Madeleine Leach
Koray Kavukcuoglu
Thore Graepel
Demis Hassabis
Nature, vol. 529 (2016), pp. 484-503
Preview abstract
The game of Go has long been viewed as the most challenging of classic games for artificial intelligence owing to its enormous search space and the difficulty of evaluating board positions and moves. Here we introduce a new approach to computer Go that uses ‘value networks’ to evaluate board positions and ‘policy networks’ to select moves. These deep neural networks are trained by a novel combination of supervised learning from human expert games, and reinforcement learning from games of self-play. Without any lookahead search, the neural networks play Go at the level of state-of-the-art Monte Carlo tree search programs that simulate thousands of random games of self-play. We also introduce a new search algorithm that combines Monte Carlo simulation with value and policy networks. Using this search algorithm, our program AlphaGo achieved a 99.8% winning rate against other Go programs, and defeated the human European Go champion by 5 games to 0. This is the first time that a computer program has defeated a human professional player in the full-sized game of Go, a feat previously thought to be at least a decade away.
Preview abstract
Modelling the distribution of natural images is a landmark problem in unsupervised learning.
We train a deep recurrent neural network to sequentially predict the pixels in an image. The network models the discrete joint probability of the raw pixel values. The distribution, though formally simple, can be arbitrarily complex and multimodal. The distribution is tractable and its ability to generalize is readily measured.
Within a pixel the colors are also predicted sequentially and depend on each other and the previous context. We design two types of parallel spatial LSTM layers to make the network fast and scalable.
Our main result is a compression score of 3.00 bits per color on CIFAR-10, which is considerably better than previous art. We also set new benchmarks on 32 x 32 and 64 x 64 ImageNet. Samples generated from the ImageNet model turn out general, sharp and globally coherent.
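To put a figure such as "bits per color" in context, the sketch below converts a total negative log-likelihood in nats into bits per dimension, where a dimension is one color channel of one pixel. It assumes 8-bit channels with 256 possible values, so a model that assigns equal probability to every value scores exactly 8 bits; the helper name `bits_per_dim` is hypothetical.

```python
import math

def bits_per_dim(nll_nats, num_dims):
    """Convert a total negative log-likelihood in nats to bits per dimension."""
    return nll_nats / (num_dims * math.log(2.0))

# One 32 x 32 RGB image has 32 * 32 * 3 color values.
num_dims = 32 * 32 * 3

# Baseline: a uniform distribution over the 256 values of each 8-bit channel.
uniform_nll = num_dims * math.log(256.0)
uniform_bpd = bits_per_dim(uniform_nll, num_dims)
```

Against that 8-bit baseline, a score of 3.00 bits per color corresponds to encoding each color value in well under half the bits of the raw representation.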
Neural Machine Translation in Linear Time
Karen Simonyan
Aäron van den Oord
Alexander Graves
Koray Kavukcuoglu
Arxiv (2016)
Preview abstract
We present a neural architecture for sequences, the ByteNet, that has two core features: it runs in time that is linear in the length of the sequences and it preserves the sequences' temporal resolution. The ByteNet is a stack of two dilated convolutional neural networks, one to encode the source and one to decode the target, where the target decoder unfolds dynamically to generate variable length outputs. We show that the ByteNet decoder attains state-of-the-art performance on character-level language modelling and outperforms recurrent neural networks. We also show that the ByteNet achieves a performance on raw character-level machine translation that approaches that of the best neural translation models that run in quadratic time. A visualization technique reveals the latent alignment structure learnt by the ByteNet.
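The sketch below illustrates why a stack of dilated convolutions can cover long contexts at a cost that grows linearly with sequence length: each layer does a fixed amount of work per position, while the receptive field grows with the sum of the dilations. It assumes a causal convolution with kernel size 2 and dilations that double per layer; these values and the function names are illustrative choices, not the ByteNet configuration.

```python
def causal_dilated_conv(xs, weights, dilation):
    """y[t] = sum_k weights[k] * x[t - k * dilation], treating x before t = 0 as zero."""
    out = []
    for t in range(len(xs)):
        acc = 0.0
        for k, w in enumerate(weights):
            idx = t - k * dilation
            if idx >= 0:
                acc += w * xs[idx]
        out.append(acc)
    return out

def receptive_field(kernel_size, dilations):
    """Number of input positions that can influence one output of the stack."""
    return 1 + (kernel_size - 1) * sum(dilations)

dilations = [1, 2, 4, 8]  # assumption: dilation doubles at every layer

# Push a unit impulse through the stack and measure how far it spreads.
signal = [1.0] + [0.0] * 19
for d in dilations:
    signal = causal_dilated_conv(signal, [1.0, 1.0], d)
reach = max(t for t, v in enumerate(signal) if v != 0.0) + 1
```

With four layers the impulse reaches 16 positions, matching `receptive_field(2, dilations)`, and every layer preserves the length of the sequence, in line with the abstract's point about keeping temporal resolution.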
Conditional Image Generation with PixelCNN Decoders
Aäron van den Oord
Koray Kavukcuoglu
Alexander Graves
Advances in Neural Information Processing Systems 29, Curran Associates, Inc. (2016), pp. 4790-4798 (to appear)
Preview abstract
This work explores conditional image generation with a new image density model
based on the PixelCNN architecture. The model can be conditioned on any vector,
including descriptive labels or tags, or latent embeddings created by other networks.
When conditioned on class labels from the ImageNet database, the model is able to
generate diverse, realistic scenes representing distinct animals, objects, landscapes
and structures. When conditioned on an embedding produced by a convolutional
network given a single image of an unseen face, it generates a variety of new
portraits of the same person with different facial expressions, poses and lighting
conditions. We also show that conditional PixelCNN can serve as a powerful
decoder in an image autoencoder. Additionally, the gated convolutional layers in
the proposed model improve the log-likelihood of PixelCNN to match the state-of-the-art performance of PixelRNN on ImageNet, with greatly reduced computational
cost.
WaveNet: A Generative Model for Raw Audio
Aäron van den Oord
Sander Dieleman
Karen Simonyan
Alexander Graves
Koray Kavukcuoglu
Arxiv (2016)
Preview abstract
This paper introduces WaveNet, a deep generative neural network trained end-to-end to model raw audio waveforms, which can be applied to text-to-speech and music generation. Current approaches to text-to-speech are focused on non-parametric, example-based generation (which stitches together short audio signal segments from a large training set), and parametric, model-based generation (in which a model generates acoustic features synthesized into a waveform with a vocoder). In contrast, we show that directly generating wideband audio signals at tens of thousands of samples per second is not only feasible, but also achieves results that significantly outperform the prior art. A single trained WaveNet can be used to generate different voices by conditioning on the speaker identity. We also show that the same approach can be used for music audio generation and speech recognition.