Balaji Lakshminarayanan

I'm a research scientist at Google Brain. My recent research focuses on probabilistic deep learning, specifically uncertainty estimation, out-of-distribution robustness, and their applications. Before joining Google Brain, I was a research scientist at DeepMind. I received my PhD from the Gatsby Unit, University College London, where I worked with Yee Whye Teh. Please see my webpage for more information: http://www.gatsby.ucl.ac.uk/~balaji/
Authored Publications
    We focus on the challenge of out-of-distribution (OOD) detection in deep learning models, a crucial aspect of ensuring reliability. Despite considerable effort, the problem remains significantly challenging in deep learning models due to their propensity to output over-confident predictions for OOD inputs. We propose a novel one-class open-set OOD detector that leverages text-image pre-trained models in a zero-shot fashion and incorporates various descriptions of in-domain and OOD content. Our approach is designed to detect anything not in-domain and offers the flexibility to detect a wide variety of OOD inputs, defined via fine- or coarse-grained labels, or even in natural language. We evaluate our approach on challenging benchmarks, including large-scale datasets containing fine-grained, semantically similar classes, distributionally shifted images, and multi-object images containing a mixture of in-domain and OOD objects. Our method shows superior performance over previous methods on all benchmarks.
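    As a rough illustration of the zero-shot recipe described in this abstract (not the paper's code; the embed_text helper and the image embedding are hypothetical stand-ins for any text-image model such as CLIP), the decision rule might look like:

    import numpy as np

    def zero_shot_ood_score(image_emb, in_domain_texts, ood_texts, embed_text):
        """Score an image as OOD by comparing its similarity to in-domain
        vs. OOD text descriptions (illustrative sketch only)."""
        in_embs = np.stack([embed_text(t) for t in in_domain_texts])
        ood_embs = np.stack([embed_text(t) for t in ood_texts])

        def cos(a, b):
            return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

        # Best match against each set of natural-language descriptions.
        s_in = max(cos(image_emb, e) for e in in_embs)
        s_out = max(cos(image_emb, e) for e in ood_embs)
        # Higher score -> more likely OOD: the image matches the OOD
        # descriptions better than any in-domain description.
        return s_out - s_in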
    Morse Neural Networks for Uncertainty Quantification
    Clara Huiyi Hu
    ICML 2023 Workshop on Structured Probabilistic Inference & Generative Modeling (2023)
    We introduce a new deep generative model useful for uncertainty quantification: the Morse neural network, which generalizes unnormalized Gaussian densities to have modes on high-dimensional submanifolds instead of just discrete points. Fitting the Morse neural network via a KL-divergence loss yields 1) an (unnormalized) generative density, 2) an OOD detector, 3) a calibration temperature, 4) a generative sampler, and, in the supervised case, 5) a distance-aware classifier. The Morse network can be used on top of a pre-trained network to bring distance-aware calibration w.r.t. the training data. Because of its versatility, the Morse neural network unifies many techniques: e.g., the Entropic Out-of-Distribution Detector of (Macêdo et al., 2021) in OOD detection, the one-class Deep Support Vector Data Description method of (Ruff et al., 2018) in anomaly detection, and the Contrastive One Class classifier in continual learning (Sun et al., 2021). The Morse neural network has connections to support vector machines, kernel methods, and Morse theory in topology.
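    A minimal toy sketch of this construction (my reading of the abstract, not the authors' implementation): a network f defines an unnormalized Gaussian-kernel density whose mode is the zero set of f, so the density value itself doubles as a distance-aware OOD score.

    import torch
    import torch.nn as nn

    class MorseNetworkSketch(nn.Module):
        """Toy unnormalized density with modes on the submanifold f(x) = 0."""
        def __init__(self, dim, hidden=64, temperature=1.0):
            super().__init__()
            self.f = nn.Sequential(
                nn.Linear(dim, hidden), nn.ReLU(),
                nn.Linear(hidden, dim))
            self.temperature = temperature

        def forward(self, x):
            # Points with f(x) ~ 0 get density ~ 1; faraway points decay
            # to 0, which is what makes the score distance-aware.
            sq_dist = (self.f(x) ** 2).sum(dim=-1)
            return torch.exp(-sq_dist / (2 * self.temperature))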
    Improving the accuracy-fairness frontier of deep neural network (DNN) models is an important problem. Uncertainty-based active learning (AL) can potentially improve the frontier by preferentially sampling underrepresented subgroups to create a more balanced training dataset. However, the quality of uncertainty estimates from modern DNNs tends to degrade in the presence of spurious correlations and dataset bias, compromising the effectiveness of AL for sampling tail groups. In this work, we propose Introspective Self-play (ISP), a simple approach to improve the uncertainty estimation of a deep neural network under dataset bias by adding an auxiliary introspection task that requires the model to predict the bias for each data point in addition to the label. We show that ISP provably improves the bias-awareness of the model representation and the resulting uncertainty estimates. On two real-world tabular and language tasks, ISP serves as a simple "plug-in" for AL model training, consistently improving both the tail-group sampling rate and the final accuracy-fairness trade-off frontier of popular AL methods.
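    A sketch of the auxiliary-introspection idea as described (hypothetical and simplified, assuming a binary per-example bias indicator is available as a training target; this is not the paper's code):

    import torch.nn as nn
    import torch.nn.functional as F

    class IntrospectiveModel(nn.Module):
        """Shared encoder with a label head plus an introspection head
        that predicts the per-example bias indicator."""
        def __init__(self, encoder, feat_dim, num_classes):
            super().__init__()
            self.encoder = encoder
            self.label_head = nn.Linear(feat_dim, num_classes)
            self.bias_head = nn.Linear(feat_dim, 1)

        def forward(self, x):
            h = self.encoder(x)
            return self.label_head(h), self.bias_head(h).squeeze(-1)

    def isp_loss(label_logits, bias_logit, y, bias_target, alpha=1.0):
        # Standard classification loss plus the auxiliary introspection
        # loss; alpha trades off the two objectives.
        main = F.cross_entropy(label_logits, y)
        aux = F.binary_cross_entropy_with_logits(bias_logit, bias_target.float())
        return main + alpha * aux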
    Accurate uncertainty quantification is a major challenge in deep learning, as neural networks can make overconfident errors and assign high-confidence predictions to out-of-distribution (OOD) inputs. The most popular approaches to estimating predictive uncertainty in deep learning combine predictions from multiple neural networks, such as Bayesian neural networks (BNNs) and deep ensembles. However, their practicality in real-time, industrial-scale applications is limited due to their high memory and computational cost. Furthermore, ensembles and BNNs do not necessarily fix all the issues with the underlying member networks. In this work, we study principled approaches to improving the uncertainty properties of a single network, based on a single, deterministic representation. By formalizing uncertainty quantification as a minimax learning problem, we first identify distance awareness, i.e., the model's ability to quantify the distance of a test example from the training data, as a necessary condition for a DNN to achieve high-quality (i.e., minimax optimal) uncertainty estimation. We then propose the Spectral-normalized Neural Gaussian Process (SNGP), a simple method that improves the distance-awareness ability of modern DNNs with two simple changes: (1) applying spectral normalization to the hidden weights to enforce bi-Lipschitz smoothness in the representations, and (2) replacing the last output layer with a Gaussian process layer. On a suite of vision and language understanding benchmarks, SNGP outperforms other single-model approaches in prediction, calibration, and out-of-domain detection. Furthermore, SNGP provides complementary benefits to popular techniques such as deep ensembles and data augmentation, making it a simple and scalable building block for probabilistic deep learning. Code is open-sourced at https://github.com/google/uncertainty-baselines.
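    The two changes are concrete enough to sketch. A minimal, illustrative PyTorch version follows (the official implementation lives in https://github.com/google/uncertainty-baselines; this sketch approximates the GP layer with fixed random Fourier features and omits the predictive covariance):

    import math
    import torch
    import torch.nn as nn
    from torch.nn.utils import spectral_norm

    class SNGPSketch(nn.Module):
        def __init__(self, in_dim, hidden=128, num_classes=10, num_rff=1024):
            super().__init__()
            # (1) Spectral normalization bounds each layer's Lipschitz
            # constant, helping keep the representation distance-preserving.
            self.body = nn.Sequential(
                spectral_norm(nn.Linear(in_dim, hidden)), nn.ReLU(),
                spectral_norm(nn.Linear(hidden, hidden)), nn.ReLU())
            # (2) GP output layer via fixed random Fourier features.
            self.register_buffer("W", torch.randn(hidden, num_rff))
            self.register_buffer("b", 2 * math.pi * torch.rand(num_rff))
            self.beta = nn.Linear(num_rff, num_classes, bias=False)

        def forward(self, x):
            h = self.body(x)
            phi = math.sqrt(2.0 / self.W.shape[1]) * torch.cos(h @ self.W + self.b)
            return self.beta(phi)  # logits from the approximate GP layer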
    Plex: Towards Reliability using Pretrained Large Model Extensions
    Du Phan
    Mark Patrick Collier
    Zi Wang
    Zelda Mariet
    Clara Huiyi Hu
    Neil Band
    Tim G. J. Rudner
    Karan Singhal
    Joost van Amersfoort
    Andreas Christian Kirsch
    Rodolphe Jenatton
    Honglin Yuan
    Kelly Buchanan
    Yarin Gal
    ICML 2022 Pre-training Workshop (2022)
    A recent trend in artificial intelligence (AI) is the use of pretrained models for language and vision tasks, which have achieved extraordinary performance but also exhibit puzzling failures. Examining tasks that probe the model's abilities in diverse ways is therefore critical to the field. In this paper, we explore the reliability of models, where we define a reliable model as one that not only achieves strong predictive performance but also performs consistently well over many decision-making tasks such as uncertainty (e.g., selective prediction, open set recognition), robust generalization (e.g., accuracy and scoring rules such as log-likelihood on in- and out-of-distribution datasets), and adaptation (e.g., active learning, few-shot learning). We devise 11 types of tasks over 36 datasets in order to evaluate different aspects of reliability on both vision and language domains. To improve reliability, we developed ViT-Plex and T5-Plex, pretrained large-model extensions (henceforth abbreviated as Plex) for the vision and language modalities. Plex greatly improves the state of the art across tasks, and as a pretrained model Plex unifies the traditional protocol of designing and tuning one model for each reliability task. We demonstrate scaling effects over model sizes and pretraining dataset sizes up to 4 billion examples. We also demonstrate Plex's capabilities on new tasks, including zero-shot open set recognition, few-shot uncertainty, and uncertainty in conversational language understanding.
    Perhaps surprisingly, recent studies have shown that probabilistic model likelihoods have poor specificity for out-of-distribution (OOD) detection and often assign higher likelihoods to OOD data than to in-distribution data. To ameliorate this issue we propose DoSE, the density of states estimator. Drawing on the statistical physics notion of "density of states," the DoSE decision rule avoids direct comparison of model probabilities and instead utilizes the "probability of the model probability," or indeed the frequency of any reasonable statistic. The frequency is calculated using nonparametric density estimators (e.g., KDE and one-class SVM) which measure the typicality of various model statistics given the training data, allowing us to flag test points with low typicality as anomalous. Unlike many other methods, DoSE requires neither labeled data nor OOD examples. DoSE is modular and can be trivially applied to any existing, trained model. We demonstrate DoSE's state-of-the-art performance against other unsupervised OOD detectors on previously established "hard" benchmarks.
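    Since DoSE only needs model statistics and a nonparametric density estimator, a minimal sketch is straightforward (illustrative only, using scikit-learn's KDE; the paper also considers other estimators such as one-class SVMs):

    import numpy as np
    from sklearn.neighbors import KernelDensity

    def fit_dose(train_stats, bandwidth=0.5):
        """Fit one KDE per model statistic, where train_stats is an
        (n_examples, n_statistics) array computed on in-distribution
        training data (e.g., per-example log-likelihoods)."""
        return [KernelDensity(bandwidth=bandwidth).fit(train_stats[:, j:j + 1])
                for j in range(train_stats.shape[1])]

    def dose_score(kdes, test_stats):
        # Sum of log-typicalities across statistics: low values flag a
        # test point as OOD, even if its raw model likelihood is high.
        return sum(kde.score_samples(test_stats[:, j:j + 1])
                   for j, kde in enumerate(kdes))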
    Supervised deep learning models have proven to be highly effective in the classification of dermatological conditions. These models rely on the availability of abundant labeled training examples. However, in the real world, many dermatological conditions are individually too infrequent for per-condition classification with supervised learning. Although individually infrequent, these conditions may collectively be common and are therefore clinically significant in aggregate. To avoid models generating erroneous outputs on such examples, there remains a considerable unmet need for deep learning systems that can better detect such infrequent conditions. These infrequent 'outlier' conditions are seen very rarely (or not at all) during training. In this paper, we frame this task as an out-of-distribution (OOD) detection problem. We set up a benchmark ensuring that outlier conditions are disjoint between the model's train, validation, and test sets. Unlike most traditional OOD benchmarks, which detect dataset distribution shift, we aim at detecting semantic differences, often referred to as near-OOD detection, which is a more difficult task. We propose a novel hierarchical outlier detection (HOD) approach, which assigns multiple abstention classes for each training outlier class and jointly performs a coarse classification of inliers vs. outliers, along with fine-grained classification of the individual classes. We demonstrate that the proposed HOD outperforms existing techniques for outlier-exposure-based OOD detection. We also use different state-of-the-art representation learning approaches (BiT-JFT, SimCLR, MICLe) to improve OOD performance and demonstrate the effectiveness of the HOD loss for them. Further, we explore different ensembling strategies for OOD detection and propose a diverse ensemble selection process for the best result. We also perform a subgroup analysis over conditions of varying risk levels and different skin types to investigate how OOD performance changes over each subgroup, demonstrating the gains of our framework in comparison to baselines. Furthermore, we go beyond traditional performance metrics and introduce a cost metric to approximate downstream clinical impact. We use this cost metric to compare the proposed method against the baseline, making a stronger case for its effectiveness in real-world deployment scenarios.
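    A simplified sketch of such a hierarchical loss (my own illustration of the coarse-plus-fine idea, not the paper's exact HOD loss): the first num_inlier logits correspond to inlier classes and the remainder to abstention classes, and the coarse term supervises the total probability mass on each side.

    import torch.nn.functional as F

    def hod_loss_sketch(logits, y, num_inlier, coarse_weight=0.1):
        # Fine-grained term: ordinary cross-entropy over all classes
        # (training outliers carry abstention-class labels).
        fine = F.cross_entropy(logits, y)
        probs = F.softmax(logits, dim=-1)
        # Coarse term: inlier-vs-outlier classification from the total
        # probability mass assigned to the inlier classes.
        p_in = probs[:, :num_inlier].sum(dim=-1).clamp(1e-6, 1 - 1e-6)
        is_inlier = (y < num_inlier).float()
        coarse = F.binary_cross_entropy(p_in, is_inlier)
        return fine + coarse_weight * coarse

    # At test time the OOD score is the mass on the abstention classes,
    # i.e., probs[:, num_inlier:].sum(dim=-1).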
    Concern about overconfident mis-predictions under distributional shift demands extensive reliability research on graph neural networks (GNNs) used in critical tasks in drug discovery. Here we first introduce CardioTox, a real-world benchmark on drug cardio-toxicity, to facilitate such efforts. Our exploratory study shows that overconfident mis-predictions are often distant from the training data. That leads us to develop a distance-aware GNN: GNN-SNGP. Through evaluation on CardioTox and three established benchmarks, we demonstrate GNN-SNGP's effectiveness in increasing distance-awareness, reducing overconfident mis-predictions, and making better-calibrated predictions without sacrificing accuracy. Our ablation study further reveals that the representation learned by GNN-SNGP improves distance-preservation over its base architecture and is a major factor in the improvements.
    Deep Classifiers with Label Noise Modeling and Distance Awareness
    Vincent Fortuin
    Mark Patrick Collier
    Florian Wenzel
    James Urquhart Allingham
    Jesse Berent
    Rodolphe Jenatton
    NeurIPS 2021 Workshop on Bayesian Deep Learning (2021) (to appear)
    Uncertainty estimation in deep learning has recently emerged as a crucial area of interest for advancing the reliability and robustness of deep learning models, especially in safety-critical applications. While many proposed methods focus either on distance-aware model uncertainties for out-of-distribution detection or on input-dependent label uncertainties for in-distribution calibration, combining these two approaches has been less well explored. In this work, we propose to combine these two ideas to achieve a joint modeling of model (epistemic) and data (aleatoric) uncertainty. We show that our combined model affords a favorable combination of these two complementary types of uncertainty and thus achieves good performance both in-distribution and out-of-distribution on different benchmark datasets.
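    A sketch of how the two uncertainty types might be combined in one model (hypothetical and simplified, not the paper's architecture): a distance-aware feature extractor, e.g. the SNGP sketch above, handles the epistemic part, while a heteroscedastic head models input-dependent label noise by predicting a per-example logit variance.

    import torch
    import torch.nn as nn

    class HeteroscedasticHead(nn.Module):
        """Predicts a logit mean and variance per class; Monte Carlo
        averaging over noisy logits captures aleatoric uncertainty."""
        def __init__(self, feat_dim, num_classes, num_samples=10):
            super().__init__()
            self.mean = nn.Linear(feat_dim, num_classes)
            self.log_var = nn.Linear(feat_dim, num_classes)
            self.num_samples = num_samples

        def forward(self, h):
            mu = self.mean(h)
            sigma = torch.exp(0.5 * self.log_var(h))
            eps = torch.randn(self.num_samples, *mu.shape, device=mu.device)
            # Average the softmax over sampled logits, not the logits
            # themselves, so the label noise softens the prediction.
            return torch.softmax(mu + sigma * eps, dim=-1).mean(dim=0)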
    Combining Ensembles and Data Augmentation Can Harm Your Calibration
    Yeming Wen
    Ghassen Jerfel
    Rafael Rios Müller
    International Conference on Learning Representations (2021)
    Ensemble methods, which average over multiple neural network predictions, are a simple approach to improving a model's calibration and robustness. Similarly, data augmentation techniques, which encode prior information in the form of invariant feature transformations, are effective for improving calibration and robustness. In this paper, we show a surprising pathology: combining ensembles and data augmentation can harm model calibration. This leads to a trade-off in practice, whereby the improved accuracy from combining the two techniques comes at the expense of calibration. On the other hand, selecting only one of the techniques ensures good uncertainty estimates at the expense of accuracy. We investigate this pathology and identify a compounding under-confidence among methods that marginalize over sets of weights and data augmentation techniques that soften labels. Finally, we propose a simple correction, achieving the best of both worlds with significant accuracy and calibration gains over using only ensembles or data augmentation individually. Applying the correction produces a new state of the art in uncertainty calibration and robustness across CIFAR-10, CIFAR-100, and ImageNet.
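    The compounding under-confidence suggests re-sharpening the averaged prediction; a generic stand-in for such a correction (a single global temperature, whereas the paper's actual correction adapts the degree of label softening) can be sketched as:

    import numpy as np

    def tempered_ensemble_probs(member_logits, temperature=0.8):
        """Average ensemble member predictions after scaling logits by a
        temperature; temperature < 1 re-sharpens predictions when the
        ensemble/augmentation combination is under-confident.
        member_logits: array of shape (n_members, n_examples, n_classes)."""
        scaled = member_logits / temperature
        # Numerically stable softmax over the class dimension.
        scaled = scaled - scaled.max(axis=-1, keepdims=True)
        probs = np.exp(scaled)
        probs /= probs.sum(axis=-1, keepdims=True)
        return probs.mean(axis=0)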