Alex Alemi
Authored Publications
Preview abstract
The paper introduces a new method that attempts to learn variational approximations to
Bayesian posterior predictive distributions without requiring (1) the posterior predictive
distribution itself, (2) the posterior distribution, (3) exact samples from the posterior,
or (4) any test-time marginalization.
Density of States Estimation for Out of Distribution Detection
Cusuh Suh Ham
Josh Dillon
Warren Morningstar
AISTATS (2021)
Preview abstract
Perhaps surprisingly, recent studies have shown that probabilistic model likelihoods have poor specificity for out-of-distribution (OOD) detection and often assign higher likelihoods to OOD data than to in-distribution data. To ameliorate this issue we propose DoSE, the density of states estimator. Drawing on the statistical physics notion of "density of states," the DoSE decision rule avoids direct comparison of model probabilities, and instead utilizes the "probability of the model probability," or indeed the frequency of any reasonable statistic. The frequency is calculated using nonparametric density estimators (e.g., KDE and one-class SVM), which measure the typicality of various model statistics given the training data and from which we can flag test points with low typicality as anomalous. Unlike many other methods, DoSE requires neither labeled data nor OOD examples. DoSE is modular and can be trivially applied to any existing, trained model. We demonstrate DoSE's state-of-the-art performance against other unsupervised OOD detectors on previously established "hard" benchmarks.
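As a rough illustration of the decision rule described in this abstract, the following minimal sketch fits a one-dimensional Gaussian KDE to a single model statistic (here, a log-likelihood) computed on training data and flags test points whose statistic is atypical. The statistic values, the bandwidth, the 5% threshold, and all function names are assumptions made for illustration. This is not the authors' implementation, which may combine several statistics and other density estimators.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a function that estimates the density of 1-D samples."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


def fit_typicality_scorer(train_stats, bandwidth=0.5):
    """Score a statistic by the log-density of that statistic under training data."""
    density = gaussian_kde(train_stats, bandwidth)

    def score(stat):
        # The small constant avoids log(0) when the kernel sum underflows.
        return math.log(density(stat) + 1e-300)

    return score


def flag_anomalous(score, train_stats, test_stats, quantile=0.05):
    """Flag test points whose typicality falls below a training-set quantile."""
    train_scores = sorted(score(s) for s in train_stats)
    threshold = train_scores[int(quantile * len(train_scores))]
    return [score(s) < threshold for s in test_stats]


# Hypothetical log-likelihoods a trained model assigns to its training data.
train_loglik = [-102.0, -101.5, -100.8, -100.2, -100.0,
                -99.7, -99.5, -99.1, -98.6, -98.0]
# Hypothetical test points: one typical, one unusually high, one unusually low.
test_loglik = [-100.1, -80.0, -130.0]

score = fit_typicality_scorer(train_loglik)
flags = flag_anomalous(score, train_loglik, test_loglik)
print(flags)
```

In this toy setup the test point with log-likelihood -80.0 is flagged even though its likelihood is higher than that of every training point, which is the case a direct likelihood comparison would miss.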
Preview abstract
Knowledge distillation is a popular technique for training a small student network to match a larger teacher model, such as an ensemble of networks. In this paper, we show that while knowledge distillation has a useful regularizing effect, it does not typically work as it is commonly understood: there often remains a surprisingly large discrepancy between the predictive distributions of the teacher and the student, even in cases when the student has the capacity to perfectly match the teacher. We show that the dataset used for distillation and the amount of temperature scaling applied to the logits play a crucial role in how closely the student matches the teacher, and discuss optimal ways of setting these hyper-parameters in practice.
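For readers unfamiliar with the temperature scaling mentioned in this abstract, here is a minimal sketch of the standard distillation objective: a KL divergence between temperature-softened teacher and student distributions. The logits, temperatures, and function names are hypothetical, and the paper's actual training setup may differ.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax of logits / temperature; higher temperatures flatten the output."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def distillation_loss(teacher_logits, student_logits, temperature=1.0):
    """Mismatch between softened teacher and student predictions."""
    p_teacher = softmax_with_temperature(teacher_logits, temperature)
    q_student = softmax_with_temperature(student_logits, temperature)
    return kl_divergence(p_teacher, q_student)


# Hypothetical logits for a single three-class example.
teacher_logits = [4.0, 1.0, 0.5]
student_logits = [2.5, 1.5, 0.2]

loss_t1 = distillation_loss(teacher_logits, student_logits, temperature=1.0)
loss_t4 = distillation_loss(teacher_logits, student_logits, temperature=4.0)
print(loss_t1, loss_t4)
```

A loss of zero means the student reproduces the teacher's softened distribution exactly on that input; the abstract's point is that this is often far from achieved in practice.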
Neural Tangents: Fast and Easy Infinite Neural Networks in Python
Roman Novak
Jiri Hron
Jaehoon Lee
Jascha Sohl-Dickstein
Sam Schoenholz
ICLR (2020)
Preview abstract
Neural Tangents is a library designed to enable research into infinite-width neural networks. It provides a high-level API for specifying complex and hierarchical neural network architectures. These networks can then be trained and evaluated either at finite-width as usual or in their infinite-width limit. Infinite-width networks can be trained analytically using exact Bayesian inference or using gradient descent via the Neural Tangent Kernel. Additionally, Neural Tangents provides tools to study gradient descent training dynamics of wide but finite networks in either function space or weight space.
The entire library runs out-of-the-box on CPU, GPU, or TPU. All computations can be automatically distributed over multiple accelerators with near-linear scaling in the number of devices.
Neural Tangents is available at
https://github.com/google/neural-tangents
We also provide an accompanying interactive Colab notebook at
https://colab.sandbox.google.com/github/neural-tangents/neural-tangents/blob/master/notebooks/neural_tangents_cookbook.ipynb
Preview abstract
In order to gain insights into the generalization properties of deep neural networks, in this preliminary work we suggest studying the generalization properties of infinite ensembles of infinitely wide neural networks. Amazingly, this model family admits tractable calculations for many information theoretic quantities. Below we both derive these quantities and report some initial empirical investigations in the search for signals that correlate with generalization on both toy and real datasets.
Preview abstract
In classic papers, Zellner [1988, 2002] demonstrated that Bayesian inference could
be derived as the solution to an information theoretic functional. Below we derive
a generalized form of this functional as a variational lower bound of a predictive
information bottleneck objective. This generalized functional encompasses most
modern inference procedures and suggests novel ones.
Preview abstract
Estimating and optimizing Mutual Information (MI) is core to many problems in machine learning; however, bounding MI in high dimensions is challenging. To establish tractable and scalable objectives, recent work has turned to variational bounds parameterized by neural networks, but the relationships and tradeoffs between these bounds remain unclear. In this work, we unify these recent developments in a single framework. We find that the existing variational lower bounds degrade when the MI is large, exhibiting either high bias or high variance. To address this problem, we introduce a continuum of lower bounds that encompasses previous bounds and flexibly trades off bias and variance. On high-dimensional, controlled problems, we empirically characterize the bias and variance of the bounds and their gradients and demonstrate the effectiveness of our new bounds for estimation and representation learning.
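To make the "high bias when the MI is large" claim concrete, the sketch below evaluates one widely used variational lower bound, InfoNCE, on a batch of critic scores. Choosing InfoNCE is an assumption for illustration, since the abstract does not name specific bounds, and the score matrix is hypothetical. The estimate can never exceed the log of the batch size, however strongly the critic separates matched from mismatched pairs.

```python
import math


def info_nce(scores):
    """InfoNCE estimate from a K x K critic score matrix.

    scores[i][j] is the critic value for the pair (x_i, y_j); the diagonal
    holds the matched (jointly sampled) pairs.
    """
    k = len(scores)
    total = 0.0
    for i in range(k):
        row = scores[i]
        peak = max(row)  # stabilise the log-mean-exp
        log_mean_exp = peak + math.log(sum(math.exp(s - peak) for s in row) / k)
        total += row[i] - log_mean_exp
    return total / k


batch = 8
# Hypothetical critic that separates matched pairs almost perfectly.
sharp_scores = [[50.0 if i == j else 0.0 for j in range(batch)]
                for i in range(batch)]
estimate = info_nce(sharp_scores)
print(estimate, math.log(batch))
```

With eight samples the estimate saturates near log 8 (about 2.08 nats), so a true MI larger than that would be underestimated regardless of critic quality.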
Preview abstract
Without explicitly being designed to do so, VIB (Alemi et al., 2017) gives two natural metrics for handling and quantifying uncertainty in neural networks. In this work we present a simple case study, demonstrating that VIB can improve a network's classification calibration as well as its ability to detect out-of-sample data.
Preview abstract
In this paper, we investigate the degree to which the encoding of a β-VAE captures label information across multiple architectures on Binary Static MNIST and Omniglot. Even though they are trained in a completely unsupervised manner, we demonstrate that a β-VAE can retain a large amount of label information, even when asked to learn a highly compressed representation.
Preview abstract
Graph embedding methods represent nodes in a continuous vector space, preserving information from the graph (e.g. by sampling random walks). These methods have many hyper-parameters (such as random walk length) that must be manually tuned for every graph. In this paper, we replace random walk hyper-parameters with trainable parameters that we automatically learn via backpropagation. In particular, we learn a novel attention model on the power series of the transition matrix, which guides the random walk to optimize an upstream objective. Unlike previous approaches to attention models, the method that we propose applies attention parameters exclusively to the data (e.g. to the random walk); they are not used by the model for inference. We experiment on link prediction tasks, as we aim to produce embeddings that best preserve the graph structure, generalizing to unseen information. We improve on the state of the art on a comprehensive suite of real-world datasets including social, collaboration, and biological networks. Adding attention to random walks can reduce the error by 20% to 45% on the datasets we evaluated. Further, our learned attention parameters are different for every graph, and our automatically found values agree with the optimal choice of hyper-parameter if we manually tune existing methods.
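The idea of attention over the power series of the transition matrix can be sketched as a softmax-weighted sum of the first few powers of the row-normalized adjacency matrix. The code below is a simplified illustration under stated assumptions: the toy graph, the number of powers, and the function names are hypothetical, and the attention logits are fixed constants here, whereas the abstract describes them as learned by backpropagation.

```python
import math


def softmax(values):
    """Turn unconstrained attention logits into weights that sum to one."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def row_normalize(adjacency):
    """Random-walk transition matrix; assumes every node has at least one edge."""
    return [[a / sum(row) for a in row] for row in adjacency]


def matmul(a, b):
    """Plain dense matrix product for small square matrices."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def attention_weighted_walk(adjacency, attention_logits):
    """Sum over k of weight_k * T^k, for k = 1 .. len(attention_logits)."""
    transition = row_normalize(adjacency)
    weights = softmax(attention_logits)
    n = len(adjacency)
    expected = [[0.0] * n for _ in range(n)]
    power = transition  # starts at T^1
    for w in weights:
        for i in range(n):
            for j in range(n):
                expected[i][j] += w * power[i][j]
        power = matmul(power, transition)
    return expected


# Hypothetical path graph on four nodes: 0 - 1 - 2 - 3.
adjacency = [[0, 1, 0, 0],
             [1, 0, 1, 0],
             [0, 1, 0, 1],
             [0, 0, 1, 0]]
# Equal logits give uniform attention over walk lengths 1, 2, and 3.
attention_logits = [0.0, 0.0, 0.0]
expected = attention_weighted_walk(adjacency, attention_logits)
for row in expected:
    print([round(v, 3) for v in row])
```

Because each power of the transition matrix is row-stochastic and the attention weights sum to one, every row of the result is itself a probability distribution over nodes. Shifting the logits toward the first entry recovers the one-step transition matrix, which corresponds to a very short walk length.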