Balaji Lakshminarayanan

I'm a research scientist at Google Brain. My recent research is focused on probabilistic deep learning, specifically uncertainty estimation, out-of-distribution robustness, and applications. Before joining Google Brain, I was a research scientist at DeepMind. I received my PhD from the Gatsby Unit, University College London, where I worked with Yee Whye Teh. Please see my webpage for more info: http://www.gatsby.ucl.ac.uk/~balaji/
Authored Publications
    Abstract: We focus on the challenge of out-of-distribution (OOD) detection in deep learning models, a crucial aspect in ensuring reliability. Despite considerable effort, the problem remains significantly challenging in deep learning models due to their propensity to output over-confident predictions for OOD inputs. We propose a novel one-class open-set OOD detector that leverages text-image pre-trained models in a zero-shot fashion and incorporates various descriptions of in-domain and OOD. Our approach is designed to detect anything not in-domain and offers the flexibility to detect a wide variety of OOD, defined via fine- or coarse-grained labels, or even in natural language. We evaluate our approach on challenging benchmarks including large-scale datasets containing fine-grained, semantically similar classes, distributionally shifted images, and multi-object images containing a mixture of in-domain and OOD objects. Our method shows superior performance over previous methods on all benchmarks.
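Below is a minimal sketch of the zero-shot scoring idea described in this abstract, assuming a CLIP-style model: an image embedding is compared against text embeddings of both in-domain and OOD descriptions, and the probability mass assigned to the OOD descriptions serves as the detection score. The encode_text callable, the softmax temperature, and the scoring rule are illustrative assumptions, not the paper's exact detector.

```python
import numpy as np

def ood_score(image_embedding, in_domain_texts, ood_texts, encode_text, temperature=0.01):
    """Return the probability mass assigned to OOD descriptions (higher = more OOD)."""
    texts = list(in_domain_texts) + list(ood_texts)
    text_emb = np.stack([encode_text(t) for t in texts])   # (T, D), assumed unit-norm
    sims = text_emb @ image_embedding / temperature        # scaled cosine similarities
    probs = np.exp(sims - sims.max())
    probs /= probs.sum()                                    # softmax over all descriptions
    return probs[len(in_domain_texts):].sum()
```

An input would then be flagged as OOD when this score exceeds a threshold chosen on held-out data.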
    Morse Neural Networks for Uncertainty Quantification
    Clara Huiyi Hu
    ICML 2023 Workshop on Structured Probabilistic Inference & Generative Modeling (2023)
    Abstract: We introduce a new deep generative model useful for uncertainty quantification: the Morse neural network, which generalizes unnormalized Gaussian densities to have modes that are high-dimensional submanifolds instead of just discrete points. Fitting the Morse neural network via a KL-divergence loss yields 1) an (unnormalized) generative density, 2) an OOD detector, 3) a calibration temperature, 4) a generative sampler, and, in the supervised case, 5) a distance-aware classifier. The Morse network can be used on top of a pre-trained network to bring distance-aware calibration w.r.t. the training data. Because of its versatility, the Morse neural network unifies many techniques: e.g., the Entropic Out-of-Distribution Detector of Macêdo et al. (2021) in OOD detection, the one-class Deep Support Vector Data Description method of Ruff et al. (2018) in anomaly detection, and the Contrastive One-Class classifier of Sun et al. (2021) in continual learning. The Morse neural network has connections to support vector machines, kernel methods, and Morse theory in topology.
    Abstract: Improving the accuracy-fairness frontier of deep neural network (DNN) models is an important problem. Uncertainty-based active learning (AL) can potentially improve the frontier by preferentially sampling underrepresented subgroups to create a more balanced training dataset. However, the quality of uncertainty estimates from modern DNNs tends to degrade in the presence of spurious correlations and dataset bias, compromising the effectiveness of AL for sampling tail groups. In this work, we propose Introspective Self-play (ISP), a simple approach to improve the uncertainty estimation of a deep neural network under dataset bias by adding an auxiliary introspection task that requires the model to predict the bias for each data point in addition to the label. We show that ISP provably improves the bias-awareness of the model representation and the resulting uncertainty estimates. On two real-world tabular and language tasks, ISP serves as a simple “plug-in” for AL model training, consistently improving both the tail-group sampling rate and the final accuracy-fairness trade-off frontier of popular AL methods.
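As a rough illustration of the auxiliary introspection task, the sketch below adds a second head that predicts a per-example bias label next to the usual class head, assuming PyTorch; the two-head architecture, the loss weight alpha, and the way bias labels are obtained are assumptions rather than the paper's exact recipe.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class IntrospectiveClassifier(nn.Module):
    def __init__(self, encoder, feat_dim, num_classes, num_bias_classes):
        super().__init__()
        self.encoder = encoder                          # any feature extractor
        self.label_head = nn.Linear(feat_dim, num_classes)
        self.bias_head = nn.Linear(feat_dim, num_bias_classes)

    def forward(self, x):
        h = self.encoder(x)
        return self.label_head(h), self.bias_head(h)

def isp_loss(model, x, y, bias_y, alpha=1.0):
    # Joint objective: predict the label and, as an auxiliary task, the bias label.
    label_logits, bias_logits = model(x)
    return F.cross_entropy(label_logits, y) + alpha * F.cross_entropy(bias_logits, bias_y)
```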
    Plex: Towards Reliability using Pretrained Large Model Extensions
    Du Phan
    Mark Patrick Collier
    Zi Wang
    Zelda Mariet
    Clara Huiyi Hu
    Neil Band
    Tim G. J. Rudner
    Joost van Amersfoort
    Andreas Christian Kirsch
    Rodolphe Jenatton
    Honglin Yuan
    Kelly Buchanan
    Yarin Gal
    ICML 2022 Pre-training Workshop (2022)
    Abstract: A recent trend in artificial intelligence (AI) is the use of pretrained models for language and vision tasks, which has achieved extraordinary performance but also puzzling failures. Examining tasks that probe the model’s abilities in diverse ways is therefore critical to the field. In this paper, we explore the reliability of models, where we define a reliable model as one that not only achieves strong predictive performance but also performs well consistently over many decision-making tasks such as uncertainty (e.g., selective prediction, open set recognition), robust generalization (e.g., accuracy and scoring rules such as log-likelihood on in- and out-of-distribution datasets), and adaptation (e.g., active learning, few-shot learning). We devise 11 types of tasks over 36 datasets in order to evaluate different aspects of reliability on both vision and language domains. To improve reliability, we developed ViT-Plex and T5-Plex, pretrained large-model extensions (henceforth abbreviated as Plex) for vision and language modalities. Plex greatly improves the state-of-the-art across tasks, and as a pretrained model Plex unifies the traditional protocol of designing and tuning one model for each reliability task. We demonstrate scaling effects over model sizes and pretraining dataset sizes up to 4 billion examples. We also demonstrate Plex’s capabilities on new tasks including zero-shot open set recognition, few-shot uncertainty, and uncertainty in conversational language understanding.
    Abstract: Accurate uncertainty quantification is a major challenge in deep learning, as neural networks can make overconfident errors and assign high-confidence predictions to out-of-distribution (OOD) inputs. The most popular approaches to estimate predictive uncertainty in deep learning are methods that combine predictions from multiple neural networks, such as Bayesian neural networks (BNNs) and deep ensembles. However, their practicality in real-time, industrial-scale applications is limited due to the high memory and computational cost. Furthermore, ensembles and BNNs do not necessarily fix all the issues with the underlying member networks. In this work, we study principled approaches to improve the uncertainty properties of a single network, based on a single, deterministic representation. By formalizing uncertainty quantification as a minimax learning problem, we first identify distance awareness, i.e., the model's ability to quantify the distance of a test example from the training data, as a necessary condition for a DNN to achieve high-quality (i.e., minimax optimal) uncertainty estimation. We then propose Spectral-normalized Neural Gaussian Process (SNGP), a simple method that improves the distance-awareness ability of modern DNNs with two simple changes: (1) applying spectral normalization to hidden weights to enforce bi-Lipschitz smoothness in representations and (2) replacing the last output layer with a Gaussian process layer. On a suite of vision and language understanding benchmarks, SNGP outperforms other single-model approaches in prediction, calibration and out-of-domain detection. Furthermore, SNGP provides complementary benefits to popular techniques such as deep ensembles and data augmentation, making it a simple and scalable building block for probabilistic deep learning. Code is open-sourced at https://github.com/google/uncertainty-baselines.
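A minimal sketch of the two changes named above, assuming PyTorch: spectral normalization applied to hidden layers, and a random-Fourier-feature Gaussian-process output layer whose posterior precision is accumulated from training features. The feature count, ridge prior, and simplified Laplace-style variance computation are illustrative choices, not the open-sourced implementation.

```python
import math
import torch
import torch.nn as nn

def spectral_linear(in_dim, out_dim):
    # Spectral normalization bounds each layer's Lipschitz constant.
    return nn.Sequential(nn.utils.spectral_norm(nn.Linear(in_dim, out_dim)), nn.ReLU())

class RFFGaussianProcessHead(nn.Module):
    """Random-Fourier-feature approximation of a GP output layer."""
    def __init__(self, in_dim, num_classes, num_features=1024, ridge=1.0):
        super().__init__()
        self.register_buffer("W", torch.randn(num_features, in_dim))
        self.register_buffer("b", 2 * math.pi * torch.rand(num_features))
        self.register_buffer("precision", ridge * torch.eye(num_features))
        self.out = nn.Linear(num_features, num_classes)

    def features(self, h):
        # phi(h) = sqrt(2 / M) * cos(W h + b)
        return math.sqrt(2.0 / self.W.shape[0]) * torch.cos(h @ self.W.T + self.b)

    def forward(self, h, update_precision=False):
        phi = self.features(h)
        if update_precision:                      # accumulate over the final training epoch
            with torch.no_grad():
                self.precision += phi.T @ phi
        logits = self.out(phi)
        # Laplace-style predictive variance; in practice the inverse would be cached.
        cov = torch.linalg.inv(self.precision)
        variance = (phi @ cov * phi).sum(dim=-1, keepdim=True)
        return logits, variance
```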
    Abstract: Supervised deep learning models have proven to be highly effective in the classification of dermatological conditions. These models rely on the availability of abundant labeled training examples. However, in the real world, many dermatological conditions are individually too infrequent for per-condition classification with supervised learning. Although individually infrequent, these conditions may collectively be common and therefore are clinically significant in aggregate. To avoid models generating erroneous outputs on such examples, there remains a considerable unmet need for deep learning systems that can better detect such infrequent conditions. These infrequent 'outlier' conditions are seen very rarely (or not at all) during training. In this paper, we frame this task as an out-of-distribution (OOD) detection problem. We set up a benchmark ensuring that outlier conditions are disjoint between model train, validation, and test sets. Unlike most traditional OOD benchmarks, which detect dataset distribution shift, we aim at detecting semantic differences, often referred to as near-OOD detection, which is a more difficult task. We propose a novel hierarchical outlier detection (HOD) approach, which assigns multiple abstention classes for each training outlier class and jointly performs a coarse classification of inliers vs. outliers, along with fine-grained classification of the individual classes. We demonstrate that the proposed HOD outperforms existing techniques for outlier-exposure-based OOD detection. We also use different state-of-the-art representation learning approaches (BiT-JFT, SimCLR, MICLe) to improve OOD performance and demonstrate the effectiveness of the HOD loss for them. Further, we explore different ensembling strategies for OOD detection and propose a diverse ensemble selection process for the best result. We also perform a subgroup analysis over conditions of varying risk levels and different skin types to investigate how OOD performance changes over each subgroup, demonstrating the gains of our framework in comparison to baselines. Furthermore, we go beyond traditional performance metrics and introduce a cost metric to approximate downstream clinical impact. We use this cost metric to compare the proposed method against the baseline, thereby making a stronger case for its effectiveness in real-world deployment scenarios.
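The scoring rule implied by the hierarchical outlier detection setup can be sketched as follows: the classifier outputs logits over K inlier classes plus M abstention (outlier) classes, and the OOD score is the total probability mass on the abstention classes. This is a simplified illustration; the paper's training losses and ensembling strategies are not reproduced.

```python
import numpy as np

def hod_ood_score(logits, num_inlier_classes):
    """logits: (N, K + M) array; returns per-example OOD scores in [0, 1]."""
    z = logits - logits.max(axis=1, keepdims=True)          # stabilized softmax
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    return probs[:, num_inlier_classes:].sum(axis=1)        # mass on abstention classes
```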
    Abstract: The concern of overconfident mis-predictions under distributional shift demands extensive reliability research on graph neural networks (GNNs) used in critical tasks in drug discovery. Here we first introduce CardioTox, a real-world benchmark on drug cardio-toxicity, to facilitate such efforts. Our exploratory study shows that overconfident mis-predictions are often distant from the training data. That leads us to develop a distance-aware GNN: GNN-SNGP. Through evaluation on CardioTox and three established benchmarks, we demonstrate GNN-SNGP's effectiveness in increasing distance-awareness, reducing overconfident mis-predictions and making better-calibrated predictions without sacrificing accuracy. Our ablation study further reveals that the representation learned by GNN-SNGP improves distance-preservation over its base architecture and is a major factor in the improvements.
    Deep Classifiers with Label Noise Modeling and Distance Awareness
    Vincent Fortuin
    Mark Patrick Collier
    Florian Wenzel
    James Urquhart Allingham
    Jesse Berent
    Rodolphe Jenatton
    NeurIPS 2021 Workshop on Bayesian Deep Learning (2021) (to appear)
    Abstract: Uncertainty estimation in deep learning has recently emerged as a crucial area of interest for advancing the reliability and robustness of deep learning models, especially in safety-critical applications. While there have been many proposed methods that either focus on distance-aware model uncertainties for out-of-distribution detection or on input-dependent label uncertainties for in-distribution calibration, combining these two approaches has been less well explored. In this work, we propose to combine these two ideas to achieve a joint modeling of model (epistemic) and data (aleatoric) uncertainty. We show that our combined model affords a favorable combination of these two complementary types of uncertainty and thus achieves good performance in-distribution and out-of-distribution on different benchmark datasets.
    Combining Ensembles and Data Augmentation Can Harm Your Calibration
    Yeming Wen
    Ghassen Jerfel
    Rafael Rios Müller
    International Conference on Learning Representations (2021)
    Abstract: Ensemble methods which average over multiple neural network predictions are a simple approach to improving a model’s calibration and robustness. Similarly, data augmentation techniques, which encode prior information in the form of invariant feature transformations, are effective for improving calibration and robustness. In this paper, we show a surprising pathology: combining ensembles and data augmentation can harm model calibration. This leads to a trade-off in practice, whereby improved accuracy from combining the two techniques comes at the expense of calibration. On the other hand, selecting only one of the techniques ensures good uncertainty estimates at the expense of accuracy. We investigate this pathology and identify a compounding under-confidence among methods which marginalize over sets of weights and data augmentation techniques which soften labels. Finally, we propose a simple correction, achieving the best of both worlds with significant accuracy and calibration gains over using only ensembles or data augmentation individually. Applying the correction produces new state-of-the-art results in uncertainty calibration and robustness across CIFAR-10, CIFAR-100, and ImageNet.
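For reference, the calibration pathology above is measured with an expected calibration error (ECE) estimator of the following form; 15 equal-width bins is a common default rather than necessarily the paper's exact setting.

```python
import numpy as np

def expected_calibration_error(probs, labels, num_bins=15):
    """probs: (N, C) predicted probabilities; labels: (N,) integer labels."""
    confidences = probs.max(axis=1)
    predictions = probs.argmax(axis=1)
    accuracies = (predictions == labels).astype(float)
    bins = np.linspace(0.0, 1.0, num_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            # Weight each bin's |accuracy - confidence| gap by its fraction of examples.
            ece += mask.mean() * abs(accuracies[mask].mean() - confidences[mask].mean())
    return ece
```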
    Abstract: Near out-of-distribution (OOD) detection is a major challenge for deep neural networks. We demonstrate that large-scale pre-training can significantly improve the state-of-the-art (SOTA) on a range of near OOD tasks across different data modalities. For instance, on CIFAR-100 vs CIFAR-10 OOD detection, we improve the AUROC from 85% (current SOTA) to more than 96% using Vision Transformers pre-trained on ImageNet21k. On a challenging genomics OOD detection benchmark, we improve the AUROC from 66% (current SOTA) to 77%. To further improve performance, we explore the few-shot outlier exposure setting where a few examples from outlier classes may be available; we show that pre-trained models are well-suited to outlier exposure, and that the AUROC of OOD detection on CIFAR-100 vs CIFAR-10 can be improved to 98.7% with just 1 image per OOD class, and 99.46% with 10 images per OOD class. We observe similar trends on genomics, achieving 85% with just 1 example per OOD class. For multi-modal image-text pre-trained models such as CLIP, we explore a new way of using just the names of outlier classes as a sole source of information (without any accompanying images) and show that this outperforms previous SOTA on several standard OOD benchmark tasks.
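One standard way to turn pre-trained features into a near-OOD score, as studied in this line of work, is a class-conditional Gaussian (Mahalanobis) detector; the sketch below is illustrative, and details such as covariance shrinkage and the choice of feature layer are assumptions.

```python
import numpy as np

def fit_mahalanobis(features, labels, eps=1e-6):
    """Fit one Gaussian per in-distribution class with a shared covariance."""
    classes = np.unique(labels)
    means = np.stack([features[labels == c].mean(axis=0) for c in classes])
    centered = features - means[np.searchsorted(classes, labels)]
    cov = centered.T @ centered / len(features) + eps * np.eye(features.shape[1])
    return means, np.linalg.inv(cov)

def mahalanobis_ood_score(x_feat, means, precision):
    """Distance to the nearest class mean; larger = more OOD."""
    diffs = x_feat[None, :] - means                            # (C, D)
    dists = np.einsum("cd,de,ce->c", diffs, precision, diffs)  # per-class quadratic form
    return dists.min()
```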
    Abstract: Perhaps surprisingly, recent studies have shown that probabilistic model likelihoods have poor specificity for out-of-distribution (OOD) detection and often assign higher likelihoods to OOD data than to in-distribution data. To ameliorate this issue we propose DoSE, the density of states estimator. Drawing on the statistical physics notion of "density of states," the DoSE decision rule avoids direct comparison of model probabilities and instead utilizes the "probability of the model probability," or indeed the frequency of any reasonable statistic. The frequency is calculated using nonparametric density estimators (e.g., KDE and one-class SVM) which measure the typicality of various model statistics given the training data, and from which we can flag test points with low typicality as anomalous. Unlike many other methods, DoSE requires neither labeled data nor OOD examples. DoSE is modular and can be trivially applied to any existing, trained model. We demonstrate DoSE's state-of-the-art performance against other unsupervised OOD detectors on previously established "hard" benchmarks.
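A compact sketch of the DoSE recipe, assuming SciPy: compute one or more model statistics on training data, fit a nonparametric density to each, and flag test inputs whose statistics have low total log-typicality. The choice of statistic (e.g., a log-likelihood) and of Gaussian KDE here is illustrative.

```python
import numpy as np
from scipy.stats import gaussian_kde

def fit_dose(train_statistics):
    """train_statistics: list of 1-D arrays, one per model statistic on training data."""
    return [gaussian_kde(s) for s in train_statistics]

def dose_score(kdes, test_statistics):
    """Sum of log-densities across statistics; lower total log-typicality = more anomalous."""
    return sum(kde.logpdf(s) for kde, s in zip(kdes, test_statistics))
```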
    Soft Calibration Objectives for Neural Networks
    Archit Karandikar
    Nick Cain
    Jon Shlens
    Michael C. Mozer
    Becca Roelofs
    Advances in Neural Information Processing Systems (NeurIPS) (2021)
    Abstract: Optimal decision making requires that classifiers produce uncertainty estimates consistent with their empirical accuracy. However, deep neural networks are often under- or over-confident in their predictions. Consequently, methods have been developed to improve the calibration of their predictive uncertainty, both during training and post-hoc. In this work, we propose differentiable losses to improve calibration based on a soft (continuous) version of the binning operation underlying popular calibration-error estimators. When incorporated into training, these soft calibration losses achieve state-of-the-art single-model ECE across multiple datasets with less than 1% decrease in accuracy. For instance, we observe an 82% reduction in ECE (70% relative to the post-hoc rescaled ECE) in exchange for a 0.7% relative decrease in accuracy relative to the cross-entropy baseline on CIFAR-100. When incorporated post-training, the soft-binning-based calibration error objective improves upon temperature scaling, a popular recalibration method. Overall, experiments across losses and datasets demonstrate that using calibration-sensitive procedures yields better uncertainty estimates under dataset shift than the standard practice of using a cross-entropy loss and post-hoc recalibration methods.
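A sketch of a differentiable soft-binned calibration penalty in the spirit of this abstract, assuming PyTorch: hard bin assignment is replaced by a softmax over distances to bin centers so that the ECE-style term admits gradients. The temperature, bin count, and exact weighting are assumptions, not the paper's objective.

```python
import torch
import torch.nn.functional as F

def soft_ece_loss(logits, labels, num_bins=15, temperature=0.01):
    probs = F.softmax(logits, dim=1)
    confidences, predictions = probs.max(dim=1)
    correct = (predictions == labels).float()
    centers = torch.linspace(0.0, 1.0, num_bins, device=logits.device)
    # Soft membership of each confidence to each bin: (N, B).
    membership = F.softmax(-(confidences.unsqueeze(1) - centers) ** 2 / temperature, dim=1)
    bin_weight = membership.mean(dim=0)                               # fraction of mass per bin
    bin_conf = (membership * confidences.unsqueeze(1)).sum(0) / (membership.sum(0) + 1e-12)
    bin_acc = (membership * correct.unsqueeze(1)).sum(0) / (membership.sum(0) + 1e-12)
    return (bin_weight * (bin_conf - bin_acc).abs()).sum()

# Typical use: loss = F.cross_entropy(logits, labels) + beta * soft_ece_loss(logits, labels)
```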
    Training independent subnetworks for robust prediction
    Marton Havasi
    Rodolphe Jenatton
    Stanislav Fort
    International Conference on Learning Representations (2021)
    Abstract: Recent approaches to efficiently ensemble neural networks have shown that strong robustness and uncertainty performance can be achieved with a negligible gain in parameters over the original network. However, these methods still require multiple forward passes for prediction, leading to a significant runtime cost. In this work, we show a surprising result: the benefits of using multiple predictions can be achieved 'for free' under a single model's forward pass. In particular, we show that, using a multi-input multi-output (MIMO) configuration, one can utilize a single model's capacity to train multiple subnetworks that independently learn the task at hand. By ensembling the predictions made by the subnetworks, we improve model robustness without increasing compute. We observe a significant improvement in negative log-likelihood, accuracy, and calibration error on CIFAR10, CIFAR100, ImageNet, and their out-of-distribution variants compared to previous methods.
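A minimal MIMO sketch, assuming PyTorch: the network ingests M concatenated inputs and produces M classification heads; during training each head sees an independently shuffled copy of the batch, and at test time the same input is repeated M times and the head predictions are averaged. The MLP body and sizes are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MIMOMLP(nn.Module):
    def __init__(self, in_dim, num_classes, num_members=3, hidden=512):
        super().__init__()
        self.m, self.c = num_members, num_classes
        self.body = nn.Sequential(nn.Linear(in_dim * num_members, hidden), nn.ReLU(),
                                  nn.Linear(hidden, hidden), nn.ReLU())
        self.heads = nn.Linear(hidden, num_classes * num_members)

    def forward(self, xs):                                   # xs: (batch, M, in_dim)
        return self.heads(self.body(xs.flatten(1))).view(-1, self.m, self.c)

def mimo_train_loss(model, x, y):
    # Each subnetwork sees an independently shuffled copy of the batch.
    perms = [torch.randperm(x.shape[0]) for _ in range(model.m)]
    xs = torch.stack([x[p] for p in perms], dim=1)           # (batch, M, in_dim)
    ys = torch.stack([y[p] for p in perms], dim=1)           # (batch, M)
    logits = model(xs)
    return F.cross_entropy(logits.reshape(-1, model.c), ys.reshape(-1))

def mimo_predict(model, x):
    # At test time the same input is fed to every member and predictions are averaged.
    logits = model(x.unsqueeze(1).expand(-1, model.m, -1))
    return F.softmax(logits, dim=-1).mean(dim=1)
```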
    Abstract: Bayesian neural networks (BNNs) demonstrate promising success in improving the robustness and uncertainty quantification of modern neural networks. However, they generally struggle with underfitting at scale and parameter efficiency. On the other hand, deep ensembles have emerged as an alternative for uncertainty quantification that, while outperforming BNNs on certain problems, also suffers from efficiency issues. It remains unclear how to combine the strengths of these two approaches and remediate their common issues. To tackle this challenge, we propose a rank-1 parameterization of BNNs, where each weight matrix involves only a distribution on a rank-1 subspace. We also revisit the use of mixture approximate posteriors to capture multiple modes, where, unlike typical mixtures, this approach admits a significantly smaller memory increase (e.g., only a 0.4% increase for a ResNet-50 mixture of size 10). We perform a systematic empirical study on the choices of prior, variational posterior, and methods to improve training. For ResNet-50 on ImageNet and Wide ResNet 28-10 on CIFAR-10/100, rank-1 BNNs outperform baselines across log-likelihood, accuracy, and calibration on the test set and out-of-distribution variants.
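The rank-1 parameterization can be sketched as a shared weight matrix modulated elementwise by the outer product of two Gaussian vectors, so only two vectors per layer carry posterior uncertainty. The sketch below, assuming PyTorch, shows the parameterization only; the priors, mixture posterior, and variational objective of the paper are omitted.

```python
import torch
import torch.nn as nn

class Rank1Linear(nn.Module):
    """Shared weight matrix modulated by Gaussian rank-1 factors r and s."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.weight = nn.Parameter(0.02 * torch.randn(out_dim, in_dim))
        self.bias = nn.Parameter(torch.zeros(out_dim))
        self.r_mu = nn.Parameter(torch.ones(out_dim))
        self.r_logsig = nn.Parameter(torch.full((out_dim,), -3.0))
        self.s_mu = nn.Parameter(torch.ones(in_dim))
        self.s_logsig = nn.Parameter(torch.full((in_dim,), -3.0))

    def forward(self, x):
        # Reparameterized sample of the rank-1 factors.
        r = self.r_mu + self.r_logsig.exp() * torch.randn_like(self.r_mu)
        s = self.s_mu + self.s_logsig.exp() * torch.randn_like(self.s_mu)
        effective_weight = self.weight * torch.outer(r, s)    # elementwise rank-1 modulation
        return x @ effective_weight.T + self.bias
```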
    Abstract: Bayesian neural networks (BNNs) and Deep Ensembles are principled approaches to estimate the predictive uncertainty of a deep learning model. However, their practicality in real-time, industrial-scale applications is limited due to their heavy memory and inference cost. This motivates us to study principled approaches to high-quality uncertainty estimation that require only a single deep neural network (DNN). By formalizing uncertainty quantification as a minimax learning problem, we first identify input distance awareness, i.e., the model’s ability to quantify the distance of a testing example from the training data in the input space, as a necessary condition for a DNN to achieve high-quality (i.e., minimax optimal) uncertainty estimation. We then propose Spectral-normalized Neural Gaussian Process (SNGP), a simple method that improves the distance-awareness ability of modern DNNs by adding a weight normalization step during training and replacing the activation of the penultimate layer. We visually illustrate the property of the proposed method on two-dimensional datasets, and benchmark its performance against Deep Ensembles and other single-model approaches across both vision and language understanding tasks and on modern architectures (ResNet and BERT). Despite its simplicity, SNGP is competitive with Deep Ensembles in prediction, calibration and out-of-domain detection, and significantly outperforms the other single-model approaches.
    Abstract: Accurate estimation of predictive uncertainty in modern neural networks is critical to achieve well calibrated predictions and detect out-of-distribution inputs. The most promising approaches have been predominantly focused on improving model uncertainty (e.g. deep ensembles and Bayesian neural networks) and post-processing techniques for out-of-distribution detection (e.g. ODIN and Mahalanobis distance). However, there has been relatively little investigation into how the parametrization of the probabilities in discriminative classifiers affects the uncertainty estimates, and the dominant method, softmax cross-entropy, results in misleadingly high confidences on out-of-distribution data and under covariate shift. We investigate alternative ways of formulating probabilities using (1) a one-vs-all formulation to capture the notion of “none of the above”, and (2) a distance-based logit representation to encode uncertainty as a function of distance to the training manifold. We show that one-vs-all formulations can match the predictive performance of softmax without incurring any additional training or test-time complexity, and improve calibration on image classification tasks.
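The two alternative parameterizations investigated here can be sketched directly, assuming PyTorch: (1) a one-vs-all head trained with independent per-class sigmoids, and (2) logits defined as negative squared distances to learned per-class centers. Initialization and scaling are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def one_vs_all_loss(logits, labels):
    # Independent sigmoid per class; "none of the above" = all sigmoids low.
    targets = F.one_hot(labels, num_classes=logits.shape[1]).float()
    return F.binary_cross_entropy_with_logits(logits, targets)

class DistanceLogitHead(nn.Module):
    def __init__(self, feat_dim, num_classes):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_classes, feat_dim))

    def forward(self, h):
        # logit_c = -||h - center_c||^2, so confidence decays with distance from the training manifold.
        return -torch.cdist(h, self.centers).pow(2)
```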
    Abstract: Modern machine learning methods including deep learning have achieved great success in predictive accuracy for supervised learning tasks, but may still fall short in giving useful estimates of their predictive uncertainty. Quantifying uncertainty is especially critical in real-world settings, which often involve distributions that are skewed from the training distribution due to a variety of factors including sample bias and non-stationarity. In such settings, well-calibrated uncertainty estimates convey information about when a model's output should (or should not) be trusted. Many probabilistic deep learning methods, including Bayesian and non-Bayesian methods, have been proposed in the literature for quantifying predictive uncertainty, but to our knowledge there has not previously been a rigorous large-scale empirical comparison of these methods under conditions of distributional skew. We present a large-scale benchmark of existing state-of-the-art methods on classification problems and investigate the effect of distributional skew on accuracy and calibration. We find that traditional post-hoc calibration falls short and some Bayesian methods are intractable for very large data. However, methods that marginalize over models give surprisingly strong results across a broad spectrum.
    Abstract: Discriminative neural networks offer little or no performance guarantees when deployed on data not generated by the same process as the training distribution. On such out-of-distribution (OOD) inputs, the prediction may not only be erroneous, but confidently so, limiting the safe deployment of classifiers in real-world applications. One such challenging application is bacteria identification based on genomic sequences, which holds the promise of early detection of diseases, but requires a model that can output low confidence predictions on OOD genomic sequences from new bacteria that were not present in the training data. We introduce a genomics dataset for OOD detection that allows other researchers to benchmark progress on this important problem. We investigate deep generative model based approaches for OOD detection and observe that the likelihood score is heavily affected by population level background statistics. We propose a likelihood ratio method for deep generative models which effectively corrects for these confounding background statistics. We benchmark the OOD detection performance of the proposed method against existing approaches on the genomics dataset and show that our method achieves state-of-the-art performance. We demonstrate the generality of the proposed method by showing that it significantly improves OOD detection when applied to deep generative models of images.
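A schematic of the likelihood-ratio idea: score an input by the difference between its log-likelihood under the full generative model and under a background model trained on randomly perturbed inputs, which cancels population-level background statistics. The model interface (callables returning per-example log-likelihoods) and the mutation rate below are assumptions.

```python
import numpy as np

def perturb_sequences(sequences, vocab_size, mutation_rate=0.1, rng=None):
    """Randomly mutate tokens (integer-coded sequences) to train the background model."""
    if rng is None:
        rng = np.random.default_rng(0)
    mask = rng.random(sequences.shape) < mutation_rate
    noise = rng.integers(0, vocab_size, size=sequences.shape)
    return np.where(mask, noise, sequences)

def likelihood_ratio_score(log_prob_full, log_prob_background, x):
    """Both arguments are callables returning per-example log-likelihoods; larger = more in-distribution."""
    return log_prob_full(x) - log_prob_background(x)
```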
    Abstract: Generative adversarial networks (GANs) are a family of generative models that do not minimize a single training criterion. Unlike other generative models, the data distribution is learned via a game between a generator (the generative model) and a discriminator (a teacher providing training signal) that each minimize their own cost. GANs are designed to reach a Nash equilibrium at which each player cannot reduce their cost without changing the other players’ parameters. One useful approach for the theory of GANs is to show that a divergence between the training distribution and the model distribution obtains its minimum value at equilibrium. Several recent research directions have been motivated by the idea that this divergence is the primary guide for the learning process and that every step of learning should decrease the divergence. We show that this view is overly restrictive. During GAN training, the discriminator provides learning signal in situations where the gradients of the divergences between distributions would not be useful. We provide empirical counterexamples to the view of GAN training as divergence minimization. Specifically, we demonstrate that GANs are able to learn distributions in situations where the divergence minimization point of view predicts they would fail. We also show that gradient penalties motivated from the divergence minimization perspective are equally helpful when applied in other contexts in which the divergence minimization perspective does not predict they would be helpful. This contributes to a growing body of evidence that GAN training may be more usefully viewed as approaching Nash equilibria via trajectories that do not necessarily minimize a specific divergence at each step.
    Clinically applicable deep learning for diagnosis and referral in retinal optical coherence tomography
    Jeffrey De Fauw
    Bernardino Romera Paredes
    Stanislav Nikolov
    Nenad Tomašev
    Sam Julian Blackwell
    Harry Askham
    Xavier Glorot
    Brendan O'Donoghue
    Daniel James Visentin
    George van den Driessche
    Clemens Meyer
    Faith Mackinder
    Simon Bouton
    Kareem Ayoub
    Reena Chopra
    Dominic King
    Cían Hughes
    Rosalind Raine
    Julian Hughes
    Dawn Sim
    Catherine Egan
    Adnan Tufail
    Hugh Montgomery
    Demis Hassabis
    Geraint Rees
    Trevor John Back
    Peng Khaw
    Mustafa Suleyman
    Julien Cornebise
    Pearse Keane
    Olaf Ronneberger
    Nature (2018)
    Abstract: The volume and complexity of diagnostic imaging is increasing at a pace faster than the availability of human expertise to interpret it. Artificial intelligence has shown great promise in classifying two-dimensional photographs of some common diseases and typically relies on databases of millions of annotated images. Until now, the challenge of reaching the performance of expert clinicians in a real-world clinical pathway with three-dimensional diagnostic scans has remained unsolved. Here, we apply a novel deep learning architecture to a clinically heterogeneous set of three-dimensional optical coherence tomography scans from patients referred to a major eye hospital. We demonstrate performance in making a referral recommendation that reaches or exceeds that of experts on a range of sight-threatening retinal diseases after training on only 14,884 scans. Moreover, we demonstrate that the tissue segmentations produced by our architecture act as a device-independent representation; referral accuracy is maintained when using tissue segmentations from a different type of device. Our work removes previous barriers to wider clinical use without prohibitive training data requirements across multiple pathologies in a real-world setting.