The Beta VAE's Implicit Prior

Carlos Riquelme
Matthew Johnson
NIPS Workshop (2017)


Variational autoencoders are a popular and powerful class of deep generative models. They resemble a classical autoencoder, except that the latent code $z=f(x)$ is replaced with a \emph{distribution} $q(z\mid x)$ over latent codes, and this distribution is regularized to have small KL divergence to a (usually pre-specified) marginal distribution $p(z)$. If the reconstruction log-likelihood $\mathbb{E}_q[\log p(x\mid z)]$ has the same weight as the KL-divergence penalty $\mathbb{E}_q[\log \frac{p(z)}{q(z\mid x)}]$, then the training procedure can be interpreted as maximizing a lower bound on the marginal log-likelihood $\log p(x)$ (this bound is often called the evidence lower bound, or ELBO). However, recent work has explored applying different weights to the KL-divergence term, either to alleviate optimization issues during training or to exert greater control over the sorts of latent spaces that get learned. Following Higgins et al. (2017), we will call models fit with this approach "beta-VAEs". Below, we will analyze beta-VAEs where the KL-divergence weight $\beta<1$. We will argue that optimizing this partially regularized ELBO is equivalent to doing approximate variational EM with an implicit prior $r(z)$ that depends on the marginal posterior $q(z)\triangleq\frac{1}{N}\sum_n q(z\mid x_n)$, with one main difference: it ignores the normalizing constant of this implicit distribution. We show how to estimate this missing normalizing constant, and demonstrate that beta-VAEs with $\beta<1$ can actually achieve higher held-out likelihoods than standard VAEs.
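As a concrete illustration of the quantities named in the abstract, the following is a minimal sketch of the beta-weighted objective and of the marginal posterior $q(z)$. It assumes a diagonal-Gaussian encoder $q(z\mid x)$ and a standard normal $p(z)$, neither of which is specified in the abstract, and all function names are hypothetical. It does not implement the paper's estimator of the missing normalizing constant.

```python
import math


def gaussian_kl_to_standard_normal(mean, log_var):
    """Closed-form KL( N(mean, diag(exp(log_var))) || N(0, I) ).

    Assumption: the encoder is a diagonal Gaussian and p(z) is standard normal.
    """
    return 0.5 * sum(
        math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mean, log_var)
    )


def beta_elbo(recon_log_lik, mean, log_var, beta=1.0):
    """Per-example objective E_q[log p(x|z)] - beta * KL(q(z|x) || p(z)).

    recon_log_lik is assumed to be an estimate of E_q[log p(x|z)] supplied
    by the caller; beta=1 recovers the standard ELBO.
    """
    return recon_log_lik - beta * gaussian_kl_to_standard_normal(mean, log_var)


def log_marginal_posterior(z, means, log_vars):
    """log q(z), where q(z) = (1/N) * sum_n q(z | x_n).

    means and log_vars hold one diagonal-Gaussian parameter vector per
    data point x_n. Uses the log-sum-exp trick for numerical stability.
    """
    log_terms = []
    for mean, log_var in zip(means, log_vars):
        log_q = 0.0
        for zi, m, lv in zip(z, mean, log_var):
            log_q += -0.5 * (
                math.log(2.0 * math.pi) + lv + (zi - m) ** 2 / math.exp(lv)
            )
        log_terms.append(log_q)
    top = max(log_terms)
    log_sum = top + math.log(sum(math.exp(t - top) for t in log_terms))
    return log_sum - math.log(len(log_terms))
```

With `beta < 1` the KL term is down-weighted relative to the reconstruction term, which is the partially regularized regime the abstract analyzes.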