Diederik P. (Durk) Kingma

I do research on principled and scalable methods for machine learning, with a focus on generative models. My contributions include the Variational Autoencoder (VAE), the Adam optimizer, Glow, and Variational Diffusion Models; see Google Scholar for a more complete list. I was part of the founding team of OpenAI in 2015, obtained a PhD (cum laude) from the University of Amsterdam in 2017, and joined Google in 2018.

Authored Publications
    On Linear Identifiability of Learned Representations
    Geoffrey Roeder
    Luke Metz
    Proceedings of ICML'21 (2021)
    Identifiability is a desirable property of a statistical model: it implies that the true model parameters may be estimated to any desired precision, given sufficient computational resources and data. We study identifiability in the context of representation learning: discovering nonlinear data representations that are optimal with respect to some downstream task. When parameterized as deep neural networks, such representation functions lack identifiability in parameter space, because they are over-parameterized by design. In this paper, building on recent advances in nonlinear Independent Components Analysis, we aim to rehabilitate identifiability by showing that a large family of discriminative models are in fact identifiable in function space, up to a linear indeterminacy. Many representation-learning models across a wide variety of domains, including text, images, and audio (state-of-the-art at time of publication), are identifiable in this sense. We derive sufficient conditions for linear identifiability and provide empirical support for the result on both simulated and real-world data.
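
The result suggests a simple empirical check: if two independently trained representation functions agree up to a linear map, regressing one set of features onto the other should give a near-perfect fit. Below is a minimal sketch of such a check, assuming hypothetical (n, d) feature matrices `feats_a` and `feats_b` computed on the same inputs:

```python
import numpy as np

def linear_fit_r2(feats_a, feats_b):
    """Crude aggregate R^2 of the least-squares linear map feats_a -> feats_b.

    Values near 1.0 indicate the two representations agree up to a linear
    transformation, consistent with linear identifiability.
    """
    A, _, _, _ = np.linalg.lstsq(feats_a, feats_b, rcond=None)
    residuals = feats_b - feats_a @ A
    return 1.0 - residuals.var() / feats_b.var()
```
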
    Variational Diffusion Models
    Jonathan Ho
    Advances in Neural Information Processing Systems 34 (NeurIPS 2021)
    Diffusion-based generative models have demonstrated a capacity for perceptually impressive synthesis, but can they also be great likelihood-based models? We answer this in the affirmative, and introduce a family of diffusion-based generative models that obtain state-of-the-art likelihoods on standard image density estimation benchmarks. Unlike other diffusion-based models, our method allows for efficient optimization of the noise schedule jointly with the rest of the model. We show that the variational lower bound (VLB) simplifies to a remarkably short expression in terms of the signal-to-noise ratio of the diffused data, thereby improving our theoretical understanding of this model class. Using this insight, we prove an equivalence between several models proposed in the literature. In addition, we show that the continuous-time VLB is invariant to the noise schedule, except for the signal-to-noise ratio at its endpoints. This enables us to learn a noise schedule that minimizes the variance of the resulting VLB estimator, leading to faster optimization. Combining these advances with architectural improvements, we obtain state-of-the-art likelihoods on image density estimation benchmarks, outperforming autoregressive models that have dominated these benchmarks for many years, often with significantly faster optimization. In addition, we show how to use the model as part of a bits-back compression scheme, and demonstrate lossless compression rates close to the theoretical optimum.
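
To make the SNR parameterization concrete, here is a minimal sketch (not the paper's code) of a Monte Carlo estimate of the continuous-time diffusion loss, 0.5 * E[gamma'(t) * ||eps - eps_hat(z_t; t)||^2], where gamma(t) = -log SNR(t). The linear schedule, its endpoints, and the placeholder denoising network `eps_model` are illustrative assumptions:

```python
import numpy as np

GAMMA_MIN, GAMMA_MAX = -6.0, 6.0  # illustrative log-SNR endpoints

def gamma(t):
    """Toy noise schedule gamma(t) = -log SNR(t), linear in t."""
    return GAMMA_MIN + (GAMMA_MAX - GAMMA_MIN) * t

def diffusion_loss(x, eps_model, rng):
    """One-sample Monte Carlo estimate of 0.5 * gamma'(t) * ||eps - eps_hat||^2."""
    t = rng.uniform()
    g = gamma(t)
    alpha = np.sqrt(1.0 / (1.0 + np.exp(g)))   # alpha_t^2 = sigmoid(-gamma)
    sigma = np.sqrt(1.0 / (1.0 + np.exp(-g)))  # sigma_t^2 = sigmoid(gamma)
    eps = rng.standard_normal(x.shape)
    z_t = alpha * x + sigma * eps              # diffused data
    dgamma_dt = GAMMA_MAX - GAMMA_MIN          # constant for a linear schedule
    return 0.5 * dgamma_dt * np.sum((eps - eps_model(z_t, t)) ** 2)
```
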
    Score-Based Generative Modeling through Stochastic Differential Equations
    ICLR (2021)
    Creating noise from data is easy; creating data from noise is generative modeling. We present a stochastic differential equation (SDE) that smoothly transforms a complex data distribution to a known prior distribution by slowly injecting noise, and a corresponding reverse-time SDE that transforms the prior distribution back into the data distribution by slowly removing the noise. Crucially, the reverse-time SDE depends only on the time-dependent gradient field (a.k.a. score) of the perturbed data distribution. By leveraging advances in score-based generative modeling, we can accurately estimate these scores with neural networks, and use numerical SDE solvers to generate samples. We show that this framework encapsulates previous approaches in score-based generative modeling and diffusion probabilistic modeling, allowing for new sampling procedures and new modeling capabilities. In particular, we introduce a predictor-corrector framework to correct errors in the evolution of the discretized reverse-time SDE. We also derive an equivalent neural ODE that samples from the same distribution as the SDE, but additionally enables exact likelihood computation and improved sampling efficiency. In addition, we provide a new way to solve inverse problems with score-based models, as demonstrated with experiments on class-conditional generation, image inpainting, and colorization. Combined with multiple architectural improvements, we achieve record-breaking performance for unconditional image generation on CIFAR-10 with an Inception score of 9.89 and FID of 2.20, a competitive likelihood of 2.99 bits/dim, and demonstrate high-fidelity generation of 1024x1024 images for the first time from a score-based generative model.
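
As a concrete illustration of the reverse-time sampler described above, here is a minimal Euler-Maruyama sketch (not the authors' code), assuming a simple forward SDE dx = g(t) dw with zero drift; `score_fn` stands in for a learned score network:

```python
import numpy as np

def reverse_sde_sample(score_fn, shape, g=lambda t: 1.0, n_steps=1000, seed=0):
    """Integrate the reverse-time SDE dx = -g(t)^2 * score * dt + g(t) dw
    backwards from t = 1 to t = 0 with Euler-Maruyama steps."""
    rng = np.random.default_rng(seed)
    dt = 1.0 / n_steps
    x = rng.standard_normal(shape)  # sample from the prior at t = 1
    for i in range(n_steps, 0, -1):
        t = i / n_steps
        x = (x + (g(t) ** 2) * score_fn(x, t) * dt
             + g(t) * np.sqrt(dt) * rng.standard_normal(shape))
    return x
```
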
    Wave-Tacotron: Spectrogram-Free End-to-End Text-to-Speech Synthesis
    ICASSP (2021)
    We describe a sequence-to-sequence neural network which can directly generate speech waveforms from text inputs. The architecture extends the Tacotron model by incorporating a normalizing flow in the decoder loop. Output waveforms are modeled as a sequence of non-overlapping fixed-length frames, each one containing hundreds of samples. The inter-dependencies of waveform samples within each frame are modeled using the normalizing flow, enabling parallel training and synthesis. Longer-term dependencies are handled autoregressively by conditioning each flow on its preceding frames. The model allows for straightforward optimization towards the maximum-likelihood objective, without using intermediate spectral features or additional loss terms. Contemporary state-of-the-art TTS systems use a sequence of separately learned models: one (such as Tacotron) which generates intermediate features (such as spectrograms) from text, followed by a vocoder model (such as WaveRNN) which generates waveform samples from the intermediate features. The proposed system, in contrast, does not use a fixed intermediate representation, and learns all parameters end-to-end. To the best of our knowledge, it is the first system in the literature to do so successfully. Experiments show that the quality of speech generated from the proposed model is nearly competitive with state-of-the-art neural TTS methods, with significantly improved generation speed.
    How to Train Your Energy-Based Models
    arXiv preprint (2021)
    Energy-Based Models (EBMs), also known as non-normalized probabilistic models, specify probability density or mass functions up to an unknown normalizing constant. Unlike most other probabilistic models, EBMs do not place a restriction on the tractability of the normalizing constant, and are thus more flexible to parameterize and can model a more expressive family of probability distributions. However, the unknown normalizing constant of EBMs makes training particularly difficult. Our goal is to provide a friendly introduction to modern approaches for EBM training. We start by explaining maximum likelihood training with Markov chain Monte Carlo (MCMC), and proceed to elaborate on MCMC-free approaches, including Score Matching (SM) and Noise Contrastive Estimation (NCE). We highlight theoretical connections among these three approaches, and end with a brief survey on alternative training methods, which are still under active research. Our tutorial is targeted at an audience with a basic understanding of generative models who want to apply EBMs or start a research project in this direction.
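
Of the MCMC-free approaches mentioned above, the denoising variant of Score Matching is particularly compact. A minimal sketch, assuming a hypothetical network `score_fn` that returns grad_x log p(x): the model's score at a Gaussian-perturbed point is regressed onto the score of the perturbation kernel.

```python
import numpy as np

def denoising_score_matching_loss(score_fn, x, sigma=0.1, seed=0):
    """Denoising score matching: on average, the optimal score_fn satisfies
    score_fn(x_tilde) = -(x_tilde - x) / sigma^2."""
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(x.shape)
    x_tilde = x + sigma * noise
    # Score of the perturbation kernel: -(x_tilde - x)/sigma^2 = -noise/sigma
    target = -noise / sigma
    return 0.5 * np.mean((score_fn(x_tilde) - target) ** 2)
```
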
    Learning Energy-Based Models by Diffusion Recovery Likelihood
    ICLR (2021)
    While energy-based models (EBMs) exhibit a number of desirable properties, training and sampling on high-dimensional datasets remains challenging. Inspired by recent progress on diffusion probabilistic models, we present a diffusion recovery likelihood method to tractably learn and sample from a sequence of EBMs trained on increasingly noisy versions of a dataset. Each EBM is trained with recovery likelihood, which maximizes the conditional probability of the data at a certain noise level given their noisy versions at a higher noise level. Optimizing recovery likelihood is more tractable than marginal likelihood, as sampling from the conditional distributions is much easier than sampling from the marginal distributions. After training, synthesized images can be generated by a sampling process that initializes from a Gaussian white-noise distribution and progressively samples the conditional distributions at successively lower noise levels. Our method generates high-fidelity samples on various image datasets. On unconditional CIFAR-10 our method achieves FID 9.58 and Inception score 8.30, superior to the majority of GANs. Moreover, we demonstrate that unlike in previous work on EBMs, our long-run MCMC samples from the conditional distributions do not diverge and still represent realistic images, allowing us to accurately estimate the normalized density of data even for high-dimensional datasets.
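
The recovery likelihood itself has a simple closed form. A minimal sketch, assuming a placeholder `energy` function and additive Gaussian noise x_noisy = x + sigma * eps: up to a constant, log p(x | x_noisy) combines the EBM energy with a quadratic reconstruction term.

```python
import numpy as np

def recovery_log_density(x, x_noisy, energy, sigma):
    """Unnormalized conditional log-density maximized during training:
    log p(x | x_noisy) = -energy(x) - ||x_noisy - x||^2 / (2 sigma^2) + const.
    The quadratic term localizes the conditional, making it far easier to
    sample than the marginal p(x)."""
    return -energy(x) - np.sum((x_noisy - x) ** 2) / (2.0 * sigma ** 2)
```
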
    ICE-BeeM: Identifiable Conditional Energy-Based Deep Models Based on Nonlinear ICA
    Ilyes Khemakhem
    Ricardo Monti
    Aapo Hyvarinen
    Advances in Neural Information Processing Systems 33 (NeurIPS 2020)
    We consider the identifiability theory of probabilistic models and establish sufficient conditions under which the representations learnt by a very broad family of conditional energy-based models are unique in function space, up to a simple transformation. In our model family, the energy function is the dot product between two feature extractors, one for the dependent variable and one for the conditioning variable. We show that under mild conditions, the features are unique up to scaling and permutation. Our results extend recent developments in nonlinear ICA, and in fact lead to an important generalization of ICA models. In particular, we show that our model can be used for the estimation of the components in the framework of Independently Modulated Component Analysis (IMCA), a new generalization of nonlinear ICA that relaxes the independence assumption. A thorough empirical study shows that representations learnt by our model from real-world image datasets are identifiable, and improve performance in transfer learning and semi-supervised learning tasks.
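
The model family is easy to state in code. A minimal sketch, with placeholder feature networks `f` and `g` standing in for arbitrary expressive maps; the conditional density is p(x | y) proportional to exp(-E(x, y)):

```python
import numpy as np

def energy(x, y, f, g):
    """Conditional energy E(x, y) = <f(x), g(y)>: the dot product of a
    feature extractor for the dependent variable x and one for the
    conditioning variable y."""
    return float(np.dot(f(x), g(y)))

# Example with toy feature extractors (illustrative only)
f = lambda x: np.tanh(x)
g = lambda y: y
print(energy(np.array([0.5, -1.0]), np.array([1.0, 2.0]), f, g))
```
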
    VideoFlow: A Conditional Flow-Based Model for Stochastic Video Generation
    Mohammad Babaeizadeh
    Chelsea Finn
    Sergey Levine
    Laurent Dinh
    ICLR (2020)
    Generative models that can model and predict sequences of future events can, in principle, learn to capture complex real-world phenomena, such as physical interactions. However, a central challenge in video prediction is that the future is highly uncertain: a sequence of past observations of events can imply many possible futures. Although a number of recent works have studied probabilistic models that can represent uncertain futures, such models are either extremely expensive computationally as in the case of pixel-level autoregressive models, or do not directly optimize the likelihood of the data. To our knowledge, our work is the first to propose multi-frame video prediction with normalizing flows, which allows for direct optimization of the data likelihood, and produces high-quality stochastic predictions. We describe an approach for modeling the latent space dynamics, and demonstrate that flow-based generative models offer a viable and competitive approach to generative modeling of video.
    Flow Contrastive Estimation of Energy-Based Models
    Ruiqi Gao
    Erik Nijkamp
    Zhen Xu
    Andrew M Dai
    Ying Nian Wu
    Proceedings of CVPR'20 (2020)
    This paper studies a training method to jointly estimate an energy-based model and a flow-based model, in which the two models are iteratively updated based on a shared adversarial value function. This joint training method has the following traits. (1) The update of the energy-based model is based on noise contrastive estimation, with the flow model serving as a strong noise distribution. (2) The update of the flow model approximately minimizes the Jensen-Shannon divergence between the flow model and the data distribution. (3) Unlike generative adversarial networks (GANs), which estimate an implicit probability distribution defined by a generator model, our method estimates two explicit probability distributions on the data. Using the proposed method we demonstrate a significant improvement in the synthesis quality of the flow model, and show the effectiveness of unsupervised feature learning by the learned energy-based model. Furthermore, the proposed training method can be easily adapted to semi-supervised learning. We achieve results competitive with state-of-the-art semi-supervised learning methods.
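
A minimal sketch of the shared value function behind trait (1), assuming `ebm_logp` returns the EBM's unnormalized log-density, `flow_logp` the flow's exact log-density, and equal batch sizes of data and flow samples. The EBM ascends this objective; the flow, acting as an adaptive noise distribution, descends it.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def value_function(x_data, x_flow, ebm_logp, flow_logp):
    """Noise-contrastive log-likelihood of classifying data vs. flow samples,
    where the classifier's logit is the log-density ratio of the two models."""
    logit_data = ebm_logp(x_data) - flow_logp(x_data)
    logit_flow = ebm_logp(x_flow) - flow_logp(x_flow)
    return (np.mean(np.log(sigmoid(logit_data)))
            + np.mean(np.log(1.0 - sigmoid(logit_flow))))
```
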
    Variational Autoencoders and Nonlinear ICA: A Unifying Framework
    Proceedings of AISTATS'20 (2020)
    The framework of variational autoencoders allows us to efficiently learn deep latent-variable models, such that the model's marginal distribution over observed variables fits the data. Often, we are interested in going a step further, and want to approximate the true joint distribution over observed and latent variables, including the true prior and posterior distributions over latent variables. This is known to be generally impossible due to unidentifiability of the model. We address this issue by showing that for a broad family of deep latent-variable models, identification of the true joint distribution over observed and latent variables is actually possible up to very simple transformations, thus achieving a principled and powerful form of disentanglement. Our result requires a factorized prior distribution over the latent variables that is conditioned on an additionally observed variable, such as a class label or almost any other observation. We build on recent developments in nonlinear ICA, which we extend to the case with noisy, undercomplete or discrete observations, integrated in a maximum likelihood framework. The result also trivially contains identifiable flow-based generative models as a special case.
    An Introduction to Variational Autoencoders
    Max Welling
    Foundations and Trends in Machine Learning (2019)
    Variational autoencoders provide a principled framework for learning deep latent-variable models and corresponding inference models. In this work, we provide an introduction to variational autoencoders and some important extensions.
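
At the core of the framework is the evidence lower bound (ELBO), optimized with the reparameterization trick. A minimal sketch, assuming a diagonal-Gaussian encoder q(z|x) = N(mu, diag(exp(log_var))), a standard-normal prior, and a placeholder `decoder_logp` for log p(x|z):

```python
import numpy as np

def elbo(x, mu, log_var, decoder_logp, seed=0):
    """Single-sample ELBO estimate: E_q[log p(x|z)] - KL(q(z|x) || N(0, I))."""
    rng = np.random.default_rng(seed)
    # Reparameterized sample z = mu + sigma * eps, so gradients flow through
    z = mu + np.exp(0.5 * log_var) * rng.standard_normal(mu.shape)
    # Analytic KL divergence between diagonal Gaussians
    kl = 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)
    return decoder_logp(x, z) - kl
```
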
    Glow: Generative Flow with Invertible 1x1 Convolutions
    Prafulla Dhariwal
    Proceedings of NeurIPS'18 (2018)
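
The invertible 1x1 convolution in the title is a learned channel permutation generalized to an arbitrary invertible matrix; its log-determinant is cheap because the same small matrix acts at every spatial position. A minimal illustrative sketch (not the paper's implementation):

```python
import numpy as np

def invertible_1x1_conv(x, W):
    """Apply an invertible 1x1 convolution to x of shape (h, w, c).

    Every spatial position's channel vector is multiplied by the same
    invertible (c, c) matrix W, so the flow's log-determinant is simply
    h * w * log|det W|.
    """
    h, w, c = x.shape
    y = x.reshape(-1, c) @ W.T
    log_det = h * w * np.log(np.abs(np.linalg.det(W)))
    return y.reshape(h, w, c), log_det
```
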
    PixelCNN++: Improving the PixelCNN with Discretized Logistic Mixture Likelihood and Other Modifications
    Andrej Karpathy
    Xi Chen
    Proceedings of ICLR'17 (2017)
    Learning Sparse Neural Networks through L0 Regularization
    Christos Louizos
    Max Welling
    Proceedings of ICLR'18 (2018)
    Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks
    Proceedings of NIPS'16 (2016)
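
The reparameterization named in the title expresses each weight vector as w = g * v / ||v||, decoupling the direction of w from its magnitude g. A minimal illustrative sketch (the names here are not the paper's code):

```python
import numpy as np

def weight_norm(v, g):
    """Return w = g * v / ||v||; by construction ||w|| = |g|."""
    return g * v / np.linalg.norm(v)

# Example: gradients w.r.t. g and v can be derived from the gradient
# w.r.t. w, as in the paper; here we just form w itself.
w = weight_norm(np.array([3.0, 4.0]), g=2.0)  # ||w|| == 2.0
```
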
    Improved Variational Inference with Inverse Autoregressive Flow
    Rafal Jozefowicz
    Xi Chen
    Ilya Sutskever
    Max Welling
    Proceedings of NIPS'16 (2016)
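
One step of the inverse autoregressive flow in the title transforms z as z' = mu + sigma * z, with mu and sigma produced by an autoregressive network so that the Jacobian is triangular and the log-determinant reduces to the sum of log sigma. A minimal sketch with a placeholder autoregressive network `ar_net`:

```python
import numpy as np

def iaf_step(z, ar_net):
    """One IAF transform; ar_net's i-th outputs depend only on z[:i]."""
    mu, log_sigma = ar_net(z)
    z_new = mu + np.exp(log_sigma) * z
    log_det = np.sum(log_sigma)  # log|det dz_new/dz| for a triangular Jacobian
    return z_new, log_det
```
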
    Markov Chain Monte Carlo and Variational Inference: Bridging the Gap
    Max Welling
    Proceedings of ICML'15 (2015)
    Adam: A Method for Stochastic Optimization
    Jimmy Ba
    Proceedings of ICLR'15 (2015)
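
The update rule itself fits in a few lines: bias-corrected exponential moving averages of the gradient and its elementwise square yield a per-parameter adaptive step size. A minimal sketch of one step (hyperparameter names follow the paper; `t` is the 1-based step count):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update for parameters theta given gradient grad."""
    m = b1 * m + (1 - b1) * grad          # first-moment (mean) estimate
    v = b2 * v + (1 - b2) * grad ** 2     # second-moment estimate
    m_hat = m / (1 - b1 ** t)             # bias corrections
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```
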
    Variational Dropout and the Local Reparameterization Trick
    Max Welling
    Proceedings of NIPS'15 (2015)
    Auto-Encoding Variational Bayes
    Max Welling
    Proceedings of ICLR'14 (2014)
    Semi-Supervised Learning with Deep Generative Models
    Shakir Mohamed
    Danilo Jimenez Rezende
    Max Welling
    Proceedings of NIPS'14 (2014)