Alex Alemi


Authored Publications
    Abstract: The paper introduces a new method for attempting to learn variational approximations to Bayesian posterior predictive distributions that does not require (1) the posterior predictive distribution itself, (2) the posterior distribution, (3) exact samples from the posterior, or (4) any test-time marginalization.
    Does Knowledge Distillation Really Work?
    Samuel Stanton
    Pavel Izmailov
    Polina Kirichenko
    Andrew Gordon Wilson
    NeurIPS (2021)
    Abstract: Knowledge distillation is a popular technique for training a small student network to match a larger teacher model, such as an ensemble of networks. In this paper, we show that while knowledge distillation has a useful regularizing effect, it does not typically work as it is commonly understood: there often remains a surprisingly large discrepancy between the predictive distributions of the teacher and the student, even in cases when the student has the capacity to perfectly match the teacher. We show that the dataset used for distillation and the amount of temperature scaling applied to the logits play a crucial role in how closely the student matches the teacher, and discuss optimal ways of setting these hyper-parameters in practice.
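As a concrete point of reference for the setup studied above, here is a minimal sketch of the standard temperature-scaled distillation loss (Hinton et al., 2015); the function names and the NumPy phrasing are illustrative assumptions, not the paper's code:

```python
import numpy as np

def softmax(logits, T):
    """Temperature-scaled softmax over the last axis."""
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)  # for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=4.0):
    """Cross-entropy between temperature-scaled teacher and student
    distributions (the teacher-entropy part of the KL is constant in
    the student, so it is dropped). The T**2 factor keeps gradient
    magnitudes comparable across temperatures; the paper shows the
    choice of T and of the transfer dataset strongly affects how
    closely the student ends up matching the teacher."""
    p = softmax(teacher_logits, T)                      # teacher targets
    log_q = np.log(softmax(student_logits, T) + 1e-12)  # student log-probs
    return -(T ** 2) * np.mean((p * log_q).sum(axis=-1))
```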
    Abstract: Perhaps surprisingly, recent studies have shown that probabilistic model likelihoods have poor specificity for out-of-distribution (OOD) detection and often assign higher likelihoods to OOD data than to in-distribution data. To ameliorate this issue we propose DoSE, the density of states estimator. Drawing on the statistical physics notion of "density of states," the DoSE decision rule avoids direct comparison of model probabilities and instead utilizes the "probability of the model probability," or indeed the frequency of any reasonable statistic. The frequency is calculated using nonparametric density estimators (e.g., KDE and one-class SVM) which measure the typicality of various model statistics given the training data, and from which we can flag test points with low typicality as anomalous. Unlike many other methods, DoSE requires neither labeled data nor OOD examples. DoSE is modular and can be trivially applied to any existing, trained model. We demonstrate DoSE's state-of-the-art performance against other unsupervised OOD detectors on previously established "hard" benchmarks.
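To make the decision rule concrete, here is a minimal sketch of the DoSE idea using a single model statistic (per-example log-likelihood) and a scikit-learn kernel density estimator; reducing to one statistic, and the function names below, are simplifying assumptions on my part rather than the paper's full recipe:

```python
import numpy as np
from sklearn.neighbors import KernelDensity

def fit_typicality_model(train_stats, bandwidth=0.5):
    """Fit a nonparametric density over a model statistic (here,
    per-example log-likelihoods) computed on the training data."""
    kde = KernelDensity(bandwidth=bandwidth)
    return kde.fit(np.asarray(train_stats).reshape(-1, 1))

def typicality_scores(kde, stats):
    """Log 'probability of the model probability': low scores mark
    statistically atypical inputs, which DoSE flags as OOD."""
    return kde.score_samples(np.asarray(stats).reshape(-1, 1))

# Usage sketch: no labels or OOD examples are needed.
# kde = fit_typicality_model(train_log_likelihoods)
# threshold = np.quantile(typicality_scores(kde, train_log_likelihoods), 0.05)
# is_ood = typicality_scores(kde, test_log_likelihoods) < threshold
```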
    Neural Tangents: Fast and Easy Infinite Neural Networks in Python
    Roman Novak
    Jiri Hron
    Jaehoon Lee
    Jascha Sohl-dickstein
    Sam Schoenholz
    ICLR (2020)
    Abstract: Neural Tangents is a library designed to enable research into infinite-width neural networks. It provides a high-level API for specifying complex and hierarchical neural network architectures. These networks can then be trained and evaluated either at finite width as usual or in their infinite-width limit. Infinite-width networks can be trained analytically using exact Bayesian inference or using gradient descent via the Neural Tangent Kernel. Additionally, Neural Tangents provides tools to study the gradient descent training dynamics of wide but finite networks in either function space or weight space. The entire library runs out-of-the-box on CPU, GPU, or TPU. All computations can be automatically distributed over multiple accelerators with near-linear scaling in the number of devices. Neural Tangents is available at https://github.com/google/neural-tangents, and an accompanying interactive Colab notebook is available at https://colab.sandbox.google.com/github/neural-tangents/neural-tangents/blob/master/notebooks/neural_tangents_cookbook.ipynb
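A minimal usage sketch built on the library's public stax API; the tiny architecture and placeholder random data below are my own assumptions for illustration:

```python
import neural_tangents as nt
from neural_tangents import stax
from jax import random

# Define a network; kernel_fn is its analytic infinite-width kernel.
init_fn, apply_fn, kernel_fn = stax.serial(
    stax.Dense(512), stax.Relu(),
    stax.Dense(1),
)

key1, key2, key3 = random.split(random.PRNGKey(0), 3)
x_train = random.normal(key1, (20, 8))  # placeholder data
y_train = random.normal(key2, (20, 1))
x_test = random.normal(key3, (5, 8))

# Closed-form predictions of the infinite-width network: exact
# Bayesian inference ('nngp') or gradient descent on MSE trained
# to convergence via the Neural Tangent Kernel ('ntk').
predict_fn = nt.predict.gradient_descent_mse_ensemble(
    kernel_fn, x_train, y_train, diag_reg=1e-4)
mean_nngp = predict_fn(x_test=x_test, get='nngp')
mean_ntk = predict_fn(x_test=x_test, get='ntk')
```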
    Abstract: Estimating and optimizing Mutual Information (MI) is core to many problems in machine learning; however, bounding MI in high dimensions is challenging. To establish tractable and scalable objectives, recent work has turned to variational bounds parameterized by neural networks, but the relationships and tradeoffs between these bounds remain unclear. In this work, we unify these recent developments in a single framework. We find that the existing variational lower bounds degrade when the MI is large, exhibiting either high bias or high variance. To address this problem, we introduce a continuum of lower bounds that encompasses previous bounds and flexibly trades off bias and variance. On high-dimensional, controlled problems, we empirically characterize the bias and variance of the bounds and their gradients and demonstrate the effectiveness of our new bounds for estimation and representation learning.
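For concreteness, one member of the family unified above is the InfoNCE lower bound; this sketch (taking scores from a hypothetical learned critic f(x, y)) sits at the low-variance, high-bias end of the tradeoff, since the estimate can never exceed the log of the batch size:

```python
import numpy as np
from scipy.special import logsumexp

def infonce_lower_bound(scores):
    """InfoNCE bound on I(X;Y) from a [K, K] matrix of critic scores
    f(x_i, y_j), where the diagonal holds the K paired (joint) samples
    and off-diagonal entries pair x_i with independent y_j. The bound
    saturates at log K, illustrating the bias/variance tradeoff the
    paper characterizes."""
    K = scores.shape[0]
    row_lse = logsumexp(scores, axis=1)  # per-row normalizer
    return np.mean(np.diag(scores) - row_lse) + np.log(K)
```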
    Abstract: In classic papers, Zellner [1988, 2002] demonstrated that Bayesian inference could be derived as the solution to an information-theoretic functional. Below we derive a generalized form of this functional as a variational lower bound of a predictive information bottleneck objective. This generalized functional encompasses most modern inference procedures and suggests novel ones.
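For reference, Zellner's result can be stated compactly: over all distributions q(θ), the variational free-energy functional below is minimized exactly by Bayes' rule (standard notation, not necessarily the paper's):

```latex
\min_{q(\theta)}\;
\mathbb{E}_{q(\theta)}\!\big[-\log p(x \mid \theta)\big]
+ \mathrm{KL}\!\big(q(\theta)\,\|\,p(\theta)\big)
\quad\Longrightarrow\quad
q^{*}(\theta) = p(\theta \mid x),
```

with minimum value equal to the negative log evidence, $-\log p(x)$; the paper derives a generalization of this functional rather than assuming it.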
    Abstract: In order to gain insights into the generalization properties of deep neural networks, in this preliminary work we suggest studying the generalization properties of infinite ensembles of infinitely wide neural networks. Amazingly, this model family admits tractable calculations for many information-theoretic quantities. Below we both derive these quantities and report some initial empirical investigations in the search for signals that correlate with generalization on both toy and real datasets.
    Abstract: We propose a simple, tractable lower bound on the mutual information contained in the joint generative density of any latent variable generative model: the GILBO (Generative Information Lower BOund). It offers a data-independent measure of the complexity of the learned latent variable description, giving the log of the effective description length. It is well-defined for both VAEs and GANs. We compute the GILBO for 800 GANs and VAEs trained on MNIST and discuss the results.
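The bound itself is short to state. With the generator's joint density p(z)p(x|z) and an auxiliary tractable encoder e(z|x) trained to maximize the objective, the standard Barber–Agakov argument gives (my reconstruction of the notation, not copied from the paper):

```latex
\mathrm{GILBO}
= \max_{e}\; \mathbb{E}_{p(z)\,p(x \mid z)}\!\big[\log e(z \mid x) - \log p(z)\big]
\;\le\; I(X; Z),
```

so the quantity lower-bounds the generative mutual information and depends only on the trained generator, not on any data, which is what makes it data-independent.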
    Abstract: Without explicitly being designed to do so, VIB (Alemi et al., 2017) gives two natural metrics for handling and quantifying uncertainty in neural networks. In this work we present a simple case study, demonstrating that VIB can improve a network's classification calibration as well as its ability to detect out-of-sample data.
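A minimal sketch of one such quantity, the per-example rate R = KL(q(z|x) ‖ p(z)), in closed form for a diagonal-Gaussian encoder against a standard-normal prior; treating the rate as the relevant score here, and the encoder interface, are my assumptions rather than a statement of the paper's exact metrics:

```python
import numpy as np

def vib_rate(mu, log_var):
    """Per-example rate R = KL(q(z|x) || p(z)) with
    q(z|x) = N(mu, diag(exp(log_var))) and prior p(z) = N(0, I).
    A large rate means the encoder places this input far from the
    prior, a natural flag for out-of-sample data."""
    var = np.exp(log_var)
    return 0.5 * np.sum(var + mu ** 2 - 1.0 - log_var, axis=-1)

# Usage (hypothetical encoder returning (mu, log_var)):
# r = vib_rate(*encoder(x)); flag inputs with unusually large r.
```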
    Abstract: In this preliminary and speculative work, we offer a unique perspective and framework to think about a wide class of existing objectives in Machine Learning. We discuss its implications, and identify some formal connections to Thermodynamics.