Mario Lucic

Mario Lucic

Mario Lucic is a Senior Staff Research Scientist at Google DeepMind working on the Gemini team where he is co-leading the efforts on video and audio-video understanding. Since joining Google he worked on GenAI and large-scale vision and multimodal systems. He received his PhD from ETH Zurich where he investigated efficient and theoretically grounded data selection algorithms. He is serving as the Area Chair for NeurIPS and ICLR conferences.
Authored Publications
Google Publications
Other Publications
Sort By
  • Title
  • Title, descending
  • Year
  • Year, descending
    Preview abstract We explore the boundaries of scaling up a multilingual vision and language model, both in terms of size of the components and the breadth of its training task mixture. Our model achieves new levels of performance on a wide-range of varied and complex tasks, including multiple image-based captioning and question-answering tasks, image-based document understanding and few-shot (in-context) learning, as well as object detection, video question answering, and video captioning. Our model advances the state-of-the-art on most vision-and-language benchmarks considered (20+ of them). Finally, we observe emerging capabilities, such as complex counting and multilingual object detection, tasks that are not explicitly in the training mix. View details
    Preview abstract The scaling of Transformers has driven breakthrough capabilities for language models. At present, the largest large language models (LLMs) contain upwards of 100B parameters. Vision Transformers (ViT) have introduced the same architecture to image and video modeling, but these have not yet been successfully scaled to nearly the same degree; the largest dense ViT contains 4B parameters. We present a recipe for highly efficient training of a 22B-parameter ViT and perform a wide variety of experiments on the resulting model. When evaluated on downstream tasks (often with a lightweight linear model on frozen features) ViT22B demonstrates increasing performance with scale. We further observe other interesting benefits of scale, including an improved tradeoff between bias and performance, an improved alignment to human visual perception in terms of shape/texture bias, and improved robustness. ViT22B demonstrates the potential for "LLM-like'' scaling in vision, and provides key steps towards getting there. View details
    Preview abstract Inferring the structure of 3D scenes from 2D observations is a fundamental challenge in computer vision. Recently popularized approaches based on neural scene representations have achieved tremendous impact and have been applied across a variety of applications. One of the major remaining challenges in this space is training a single model which can provide latent representations which effectively generalize beyond a single scene. Scene Representation Transformer (SRT) has shown promise in this direction, but scaling it to a larger set of diverse scenes is challenging and necessitates accurately posed ground truth data. To address this problem, we propose RUST (Really Unposed Scene representation Transformer), a pose-free approach to novel view synthesis trained on RGB images alone. Our main insight is that one can train a Pose Encoder that peeks at the target image and learns a latent pose embedding which is used by the decoder for view synthesis. We perform an empirical investigation into the learned latent pose structure and show that it allows meaningful test-time camera transformations and accurate explicit pose readouts. Perhaps surprisingly, RUST achieves similar quality as methods which have access to perfect camera pose, thereby unlocking the potential for large-scale training of amortized neural scene representations. View details
    Preview abstract We show how transformers can be used to vastly simplify neural video compression. Previous methods have been relying on an increasing number of architectural biases and priors, including motion prediction and warping operations, resulting in complex models. Instead, we independently map input frames to representations and use a transformer to model their dependencies, letting it predict the distribution of future representations given the past. The resulting video compression transformer outperforms previous methods on standard video compression data sets. Experiments on synthetic data show that our model learns to handle complex motion patterns such as panning, blurring and fading purely from data. Our approach is easy to implement, and we release code to facilitate future research. View details
    Which Model to Transfer? Finding the Needle in the Growing Haystack
    Cedric Renggli
    André Susano Pinto
    Luka Rimanic
    Carlos Riquelme
    Ce Zhang
    Conference on Computer Vision and Pattern Recognition(2022) (to appear)
    Preview abstract Transfer learning has been recently popularized as a data-efficient alternative to training models from scratch, in particular for computer vision tasks where it provides a remarkably solid baseline. The emergence of rich model repositories, such as TensorFlow Hub, enables the practitioners and researchers to unleash the potential of these models across a wide range of downstream tasks. As these repositories keep growing exponentially, efficiently selecting a good model for the task at hand becomes paramount. We provide a formalization of this problem through a familiar notion of regret and introduce the predominant strategies, namely task-agnostic (e.g. ranking models by their ImageNet performance) and task-aware search strategies (such as linear or kNN evaluation). We conduct a large-scale empirical study and show that both task-agnostic and task-aware methods can yield high regret. We then propose a simple and computationally efficient hybrid search strategy which outperforms the existing approaches. We highlight the practical benefits of the proposed solution on a set of 19 diverse vision tasks. View details
    Scene Representation Transformer: Geometry-Free Novel View Synthesis Through Set-Latent Scene Representations
    Henning Meyer
    Urs Bergmann
    Klaus Greff
    Noha Radwan
    Alexey Dosovitskiy
    Jakob Uszkoreit
    Tom Funkhouser
    Conference on Computer Vision and Pattern Recognition (CVPR)(2022)
    Preview abstract A classical problem in computer vision is to infer a 3D scene representation from few images that can be used to render novel views at interactive rates. Previous work focuses on reconstructing pre-defined 3D representations, e.g. textured meshes, or implicit representations, e.g. radiance fields, and often requires input images with precise camera poses and long processing times for each novel scene. In this work, we propose the Scene Representation Transformer (SRT), a method which processes posed or unposed RGB images of a new area, infers a "set-latent scene representation", and synthesises novel views, all in a single feed-forward pass. To calculate the scene representation, we propose a generalization of the Vision Transformer to sets of images, enabling global information integration, and hence 3D reasoning. An efficient decoder transformer parameterizes the light field by attending into the scene representation to render novel views. Learning is supervised end-to-end by minimizing a novel-view reconstruction error. We show that this method outperforms recent baselines in terms of PSNR and speed on synthetic datasets, including a new dataset created for the paper. Further, we demonstrate that SRT scales to support interactive visualization and semantic segmentation of real-world outdoor environments using Street View imagery. View details
    Object Scene Representation Transformer
    Filip Pavetić
    Leonidas Guibas
    Klaus Greff
    Thomas Kipf
    Advances in Neural Information Processing Systems(2022), pp. 9512-9524
    Preview abstract A compositional understanding of the world in terms of objects and their geometry in 3D space is considered a cornerstone of human cognition. Facilitating the learning of such a representation in neural networks holds promise for substantially improving labeled data efficiency. As a key step in this direction, we make progress on the problem of learning 3D-consistent decompositions of complex scenes into individual objects in an unsupervised fashion. We introduce Object Scene Representation Transformer (OSRT), a 3D-centric model in which individual object representations naturally emerge through novel view synthesis. OSRT scales to significantly more complex scenes with larger diversity of objects and backgrounds than existing methods. At the same time, it is multiple orders of magnitude faster at compositional rendering thanks to its light field parametrization and the novel Slot Mixer decoder. View details
    Preview abstract We present an efficient and scalable algorithm for debiasing trained models, including deep neural networks (DNNs), which we prove to be near-optimal by bounding its excess Bayes risk. Unlike previous black-box reduction methods to cost-sensitive classification rules, the proposed algorithm operates on models that have been trained without having to retrain the model. Furthermore, as the algorithm is based on projected stochastic gradient descent (SGD), it is particularly attractive for deep learning applications. We empirically validate the proposed algorithm on standard benchmark datasets across both classical algorithms and modern DNN architectures and demonstrate that it outperforms previous post-processing approaches for unbiased classification. View details
    Preview abstract We propose a method to learn image representations from uncurated videos. We combine a supervised loss from off-the-shelf object detectors and self-supervised losses which naturally arise from the video-shot-frame-object hierarchy present in each video. We report competitive results on 19 transfer learning tasks of the Visual Task Adaptation Benchmark (VTAB), and on 8 out-of-distribution-generalization tasks, and discuss the benefits and shortcomings of the proposed approach. In particular, it improves over the baseline on all 18/19 few-shot learning tasks and 8/8 out-of-distribution generalization tasks. Finally, we perform several ablation studies and analyze the impact of the pretrained object detector on the performance across this suite of tasks. View details
    On Robustness and Transferability of Convolutional Neural Networks
    Josip Djolonga
    Jessica Yung
    Michael Tschannen
    Rob Romijnders
    Alexander Nicholas D'Amour
    Dan Moldovan
    Sylvain Gelly
    Conference on Computer Vision and Pattern Recognition(2021)
    Preview abstract Modern deep convolutional networks (CNNs) are often criticized for their failure to generalize under distributional shifts. However, several recent breakthroughs in transfer learning suggest that these networks can cope with severe distribution shifts and successfully adapt to new tasks from a few training examples. In this work we revisit the out-of-distribution and transfer performance of modern image classification CNNs and investigate the impact of the pre-training data scale, the model scale, and the data preprocessing pipeline. We find that increasing both the training set and model sizes significantly improve the robustness to distribution shifts. Furthermore, we show that, perhaps surprisingly, simple changes in the preprocessing such as modifying the image resolution can significantly mitigate robustness issues in some cases. Finally, we outline the shortcomings of existing robustness evaluation datasets and introduce a synthetic dataset for fine-grained robustness analysis. View details
    Revisiting the Calibration of Modern Neural Networks
    Josip Djolonga
    Rob Romijnders
    Frances Ann Hubis
    Dustin Tran
    Neural Information Processing Systems(2021) (to appear)
    Preview abstract Accurate estimation of predictive uncertainty (model calibration) is essential for the safe application of neural networks. Many instances of miscalibration in modern neural networks have been reported, suggesting a trend that newer, more accurate models produce poorly calibrated predictions. Here, we revisit this question for recent state-of-the-art image classification models. We systematically relate model calibration and accuracy, and find that the most recent models, notably those not using convolutions, are among the best calibrated. Trends observed in prior model generations, such as decay of calibration with distribution shift or model size, are less pronounced in recent architectures. We also show that model size and amount of pretraining do not fully explain these differences, suggesting that architecture is a major determinant of calibration properties. View details
    Preview abstract Convolutional Neural Networks (CNNs) are the go-to model for computer vision. Recently, attention-based networks, such as the Vision Transformer, have also become popular. In this paper we show that while convolutions and attention are both sufficient for good performance, neither of them are necessary. We present MLP-Mixer, an architecture based exclusively on multi-layer perceptrons (MLPs). MLP-Mixer contains two types of layers: one with MLPs applied independently to image patches (i.e. "mixing" the per-location features), and one with MLPs applied across patches (i.e. "mixing" spatial information). When trained on large datasets, or with modern regularization schemes, MLP-Mixer attains competitive scores on image classification benchmarks with comparable pre-training and inference cost. We hope that these results spark further research beyond the realms of well established CNNs and Transformers. View details
    Self-Supervised Learning of Video-Induced Visual Invariances
    Michael Tobias Tschannen
    Josip Djolonga
    Sylvain Gelly
    Conference on Computer Vision and Pattern Recognition(2020)
    Preview abstract We propose a general framework for self-supervised learning of transferable visual representations based on Video-Induced Visual Invariances (VIVI). We make use of the natural hierarchy consisting of (i) frame level invariances (e.g. color and contrast robustness), (ii) shot/clip level invariances (e.g. robustness to changes in object orientation and lighting conditions), and (iii) video level invariances (semantic relationships of scenes across shots/clips) to define a holistic self-supervised loss. We train the proposed model on the YouTube-8M dataset and show that this approach leads to state-of-the-art self-supervised results on the 19 diverse downstream tasks of the Visual Task Adaptation Benchmark (VTAB). We then show how to co-train the model jointly with labeled images, outperforming an ImageNet-pretrained ResNet-50 with $10x$ fewer labeled images. View details
    Evaluating Generative Models using Divergence Frontiers
    Josip Djolonga
    Marco Cuturi
    Sylvain Gelly
    International Conference on Artificial Intelligence and Statistics(2020)
    Preview abstract Despite the tremendous progress in the estimation of generative models, the development of tools for diagnosing their failures and assessing their performance has advanced at a much slower pace. Very recent developments have investigated metrics that quantify which parts of the true distribution is well modeled, and, on the contrary, what the model fails to capture, akin to precision and recall in information retrieval. In this paper we present a general evaluation framework for generative models that measures the trade-off between precision and recall using R\'enyi divergences. Our framework provides a novel perspective on existing techniques and extends them to more general domains. As a key advantage, it allows for efficient algorithms that are directly applicable to continuous distributions directly without discretization. We further showcase the proposed techniques on a set of image synthesis models. View details
    Underspecification Presents Challenges for Credibility in Modern Machine Learning
    Alexander Nicholas D'Amour
    Dan Moldovan
    Babak Alipanahi
    Alex Beutel
    Christina Chen
    Jon Deaton
    Jacob Eisenstein
    Shaobo Hou
    Ghassen Jerfel
    Yian Ma
    Akinori Mitani
    Andrea Montanari
    Christopher Nielsen
    Thomas Osborne
    Rajiv Raman
    Kim Ramasamy
    Rory Abbott Sayres
    Martin Gamunu Seneviratne
    Shannon Sequeira
    Harini Suresh
    Victor Veitch
    Journal of Machine Learning Research(2020)
    Preview abstract ML models often exhibit unexpectedly poor behavior when they are deployed in real-world domains. We identify underspecification as a key reason for these failures. An ML pipeline is underspecified when it can return many predictors with equivalently strong held-out performance in the training domain. Underspecification is common in modern ML pipelines, such as those based on deep learning. Predictors returned by underspecified pipelines are often treated as equivalent based on their training domain performance, but we show here that such predictors can behave very differently in deployment domains. This ambiguity can lead to instability and poor model behavior in practice, and is a distinct failure mode from previously identified issues arising from structural mismatch between training and deployment domains. We show that this problem appears in a wide variety of practical ML pipelines, using examples from computer vision, medical imaging, natural language processing, clinical risk prediction based on electronic health records, and medical genomics. Our results show the need to explicitly account for underspecification in modeling pipelines that are intended for real-world deployment in any domain. View details
    A Commentary on the Unsupervised Learning of Disentangled Representations
    Francesco Locatello
    Stefan Bauer
    Gunnar Rätsch
    Sylvain Gelly
    Bernhard Scholkopf
    AAAI Conference on Artificial Intelligence(2020)
    Preview abstract The goal of the unsupervised learning of disentangled representations is to separate the independent explanatory factors of variation in the data without access to supervision. In this paper, we summarize the results of (Locatello et al. 2019b) and focus on their implications for practitioners. We discuss the theoretical result showing that the unsupervised learning of disentangled representations is fundamentally impossible without inductive biases and the practical challenges it entails. Finally, we comment on our experimental findings, highlighting the limitations of state-of-the-art approaches and directions for future research. View details
    On Mutual Information Maximization for Representation Learning
    Michael Tobias Tschannen
    Josip Djolonga
    Paul Kishan Rubenstein
    Sylvain Gelly
    International Conference on Learning Representations(2020)
    Preview abstract Many recent methods for unsupervised or self-supervised representation learning train feature extractors by maximizing an estimate of the mutual information (MI) between different views of the data. This comes with several immediate problems: For example, MI is notoriously hard to estimate, and using it as an objective for representation learning may lead to highly entangled representations due to its invariance under arbitrary invertible transformations. Nevertheless, these methods have been repeatedly shown to excel in practice. In this paper we argue, and provide empirical evidence, that the success of these methods might be only loosely attributed to the properties of MI, and that they strongly depend on the inductive bias in both the choice of feature extractor architectures and the parametrization of the employed MI estimators. Finally, we establish a connection to deep metric learning and argue that this interpretation may be a plausible explanation for the success of the recently introduced methods. View details
    Challenging Common Assumptions in the Unsupervised Learning of Disentangled Representations
    Francesco Locatello
    Stefan Bauer
    Gunnar Rätsch
    Sylvain Gelly
    Bernhard Schölkopf
    International Conference on Machine Learning(2019)
    Preview abstract In recent years, the interest in unsupervised learning of disentangled representations has significantly increased. The key assumption is that real-world data is generated by a few explanatory factors of variation and that these factors can be recovered by unsupervised learning algorithms. A large number of unsupervised learning approaches based on auto-encoding and quantitative evaluation metrics of disentanglement have been proposed; yet, the efficacy of the proposed approaches and utility of proposed notions of disentanglement has not been challenged in prior work. In this paper, we provide a sober look on recent progress in the field and challenge some common assumptions. We first theoretically show that the unsupervised learning of disentangled representations is fundamentally impossible without inductive biases on both the models and the data. Then, we train more than 12000 models covering the six most prominent methods, and evaluate them across six disentanglement metrics in a reproducible large-scale experimental study on seven different data sets. On the positive side, we observe that different methods successfully enforce properties "encouraged" by the corresponding losses. On the negative side, we observe in our study that well-disentangled models seemingly cannot be identified without access to ground-truth labels even if we are allowed to transfer hyperparameters across data sets. Furthermore, increased disentanglement does not seem to lead to a decreased sample complexity of learning for downstream tasks. These results suggest that future work on disentanglement learning should be explicit about the role of inductive biases and (implicit) supervision, investigate concrete benefits of enforcing disentanglement of the learned representations, and consider a reproducible experimental setup covering several data sets. View details
    Preview abstract Deep generative models are becoming a cornerstone of modern machine learning. Recent work on conditional generative adversarial net-works has shown that learning complex, high-dimensional distributions over natural images is within reach. While the latest models are able to generate high-fidelity, diverse natural images at high resolution, they rely on a vast quantity of labeled data. In this work we demonstrate how one can benefit from recent work on self- and semi-supervised learning to outperform state-of-the-art on both unsupervised ImageNet synthesis,as well as in the conditional setting. In particular, the proposed approach is able to match the sample quality (as measured by FID) of the current state-of-the art conditional model BigGAN on ImageNet using only 10% of the labels and outperform it using 20% of the labels. View details
    Semantic Bottleneck Scene Generation
    Samaneh Azadi
    Michael Tobias Tschannen
    Eric Tzeng
    Sylvain Gelly
    Trevor Darrell
    arXiv(2019)
    Preview abstract We propose an end-to-end GAN-based model to perform an unconditional synthesis of complex scenes. Our model first synthesizes a realistic segmentation layout, and then synthesizes a realistic scene conditioned on that layout. For the former, we use an unsupervised progressive segmentation generation network which captures the distribution of the realistic semantic scene layouts. For the latter, we use a conditional segmentation-to-image synthesis network which captures the distribution of photo-realistic images conditioned on the semantic layout. We show that end-to-end outperforms state-of-the-art generative models in unsupervised image synthesis on two challenging domains in terms of the Frechet Inception Distance and user-study evaluations. Moreover, we demonstrate the generated segmentation maps can be used as additional training data to strongly improve the recent segmentation-to-image synthesis networks. View details
    A Large-Scale Study on Regularization and Normalization in GANs
    Karol Kurach
    Marcin Michalski
    Sylvain Gelly
    International Conference on Machine Learning(2019)
    Preview abstract Generative Adversarial Networks (GANs) are a class of deep generative models which aim to learn a target distribution in an unsupervised fashion. While they were successfully applied to many problems, training a GAN is a notoriously challenging task and requires a significant amount of hyperparameter tuning, neural architecture engineering, and a non-trivial amount of "tricks". The success in many practical applications coupled with the lack of a measure to quantify the failure modes of GANs resulted in a plethora of proposed losses, regularization and normalization schemes, and neural architectures. In this work we take a sober view of the current state of GANs from a practical perspective. We reproduce the current state of the art and go beyond fairly exploring the GAN landscape. We discuss common pitfalls and reproducibility issues, open-source our code on Github, and provide pre-trained models on TensorFlow Hub. View details
    On Self-Modulation for Generative Adversarial Networks
    Ting Chen
    Sylvain Gelly
    International Conference on Learning Representations(2019)
    Preview abstract Training Generative Adversarial Networks (GANs) is a notoriously challenging task. In this work we propose and study an architectural modification, deemed self-modulation, which improves GAN performance across different data sets, architectures, losses, regularizers, and hyperparameter settings. Intuitively, with self-modulation one allows the intermediate feature maps to change as a function of the input z. While reminiscent of other conditioning techniques, it requires no labeled data. Nevertheless, this simple yet effective approach can be readily applied in the conditional setting if side information is available. In a large-scale empirical study we observe a relative decrease of 5%-35% in FID. Furthermore, everything else being equal, just adding this modification to the generator leads improved performance in ~86% of the studied settings which suggest that it can be applied without extensive hyperparameter optimization. View details
    Scalable k-Means Clustering via Lightweight Coresets
    Andreas Krause
    International Conference on Knowledge Discovery and Data Mining(2018)
    Preview abstract Coresets are compact representations of datasets such that models trained on a coreset are provably competitive with models trained on the full data set. As such, they have been successfully used to scale up clustering models to massive data sets. While existing approaches generally only allow for multiplicative approximation errors, we propose a novel notion of lightweight coresets that allows for both multiplicative and additive errors. We provide a single algorithm to construct lightweight coresets for k-Means clustering as well as soft and hard Bregman clustering. The algorithm is substantially faster than existing constructions, embarrassingly parallel and the resulting coresets are smaller. We demonstrate that the proposed method outperforms existing coreset constructions in practice. View details
    Preview abstract Learning useful representations with little or no supervision is a key challenge in artificial intelligence. We provide an in-depth review of recent advances in representation learning with a focus on autoencoder-based models. To organize these results we make use of meta-priors believed useful for downstream tasks, such as disentanglement and hierarchical organization of features. In particular, we uncover three main mechanisms to enforce such properties, namely (i) regularizing the (approximate or aggregate) posterior distribution, (ii) factorizing the encoding and decoding distribution, or (iii) introducing a structured prior distribution. While there are some promising results, implicit or explicit supervision remains a key enabler and all current methods use strong inductive biases and modeling assumptions. Finally, we provide an analysis of autoencoder-based representation learning through the lens of rate-distortion theory and identify a clear tradeoff between the amount of prior knowledge available about the downstream tasks, and how useful the representation is for this task. View details
    Preview abstract Conditional GANs are at the forefront of natural image synthesis. The main drawback of such models is the necessity for labelled data. In this work we exploit two popular unsupervised learning techniques, adversarial training and self-supervision, to close the gap between conditional and unconditional GANs. In particular, we allow the networks to collaborate on the task of representation learning, while being adversarial with respect to the classic GAN game. The role of self-supervision is to encourage the discriminator to learn meaningful feature representations which are not forgotten during training. We test empirically both the quality of the learned image representations, and the quality of the synthesized images. Under the same conditions, the self-supervised GAN attains a similar performance to state-of-the-art conditional counterparts. Finally, we show that this approach to fully unsupervised learning can be scaled to attain an FID of 33 on unconditional ImageNet generation. View details
    Are GANs Created Equal? A Large-Scale Study
    Karol Kurach
    Marcin Michalski
    Sylvain Gelly
    Advances in Neural Information Processing Systems(2018)
    Preview abstract Generative adversarial networks (GAN) are a powerful subclass of generative models. Despite a very rich research activity leading to numerous interesting new GAN algorithms, it is still very hard to assess which algorithm(s) perform better than others. We conduct a neutral, multi-faceted large-scale empirical study encompassing the state-of-the art models and evaluation measures. We find that most models can reach similar scores with enough hyperparameter optimization and random restarts. This suggests that improvements can come from computational budget and tuning more than fundamental algorithmic changes. To overcome some limitations of the current metrics, we also propose several data sets on which precision and recall can be computed. Our experimental results suggest that future GAN research should be based on more systematic and objective evaluation procedures. Also, we did not find evidence that any of the tested algorithms consistently outperform the original one. View details
    One-Shot Coresets: The Case of k-Clustering
    International Conference on Artificial Intelligence and Statistics(2018)
    Preview abstract Scaling clustering algorithms to massive data sets is a challenging task. Recently, several successful approaches based on data summarization methods, such as coresets and sketches, were proposed. While these techniques provide provably good and small summaries, they are inherently problem dependent --- the practitioner has to commit to a fixed clustering objective before even exploring the data. However, can one construct small data summaries for a wide range of clustering problems simultaneously? In this work, we affirmatively answer this question by proposing an efficient algorithm that constructs such one-shot summaries for k-clustering problems while retaining strong theoretical guarantees. View details
    Deep Generative Models for Distribution-Preserving Lossy Compression
    Michael Tschannen
    Advances in Neural Information Processing Systems (NeurIPS)(2018)
    Preview abstract We propose and study the problem of distribution-preserving lossy compression. Motivated by the recent advances in extreme image compression which allow to maintain artifact-free reconstructions even at very low bitrates, we propose to optimize the rate-distortion tradeoff under the constraint that the reconstructed samples follow the distribution of the training data. Such a compression system recovers both ends of the spectrum: On one hand, at zero bitrate it learns a generative model of the data, and at high enough bitrates it achieves perfect reconstruction. Furthermore, for intermediate bitrates it smoothly interpolates between matching the distribution of the training data and perfectly reconstructing the training samples. We study several methods to approximately solve the proposed optimization problem, including a novel combination of Wasserstein GAN and Wasserstein Autoencoder, and present strong theoretical and empirical results for the proposed compression system. View details
    Preview abstract Recent advances in generative modeling have led to an increased interest in the study of statistical divergences as means of model comparison. Commonly used evaluation methods, such as Fr\'echet Inception Distance (FID), correlate well with the perceived quality of samples and are sensitive to mode dropping. However, these metrics are unable to distinguish between different failure cases since they yield one-dimensional scores. We propose a novel definition of precision and recall for distributions which disentangles the divergence into two separate dimensions. The proposed notion is intuitive, retains desirable properties, and naturally leads to an efficient algorithm that can be used to evaluate generative models. We relate this notion to total variation as well as to recent evaluation metrics such as Inception Score and FID. To demonstrate the practical utility of the proposed approach we perform an empirical study on several variants of Generative Adversarial Networks and the Variational Autoencoder. In an extensive set of experiments we show that the proposed metric is able to disentangle the quality of generated samples from the coverage of the target distribution. View details
    Preview abstract A variety of large-scale machine learning problems can be cast as instances of constrained submodular maximization. Existing approaches for distributed submodular maximization have a critical drawback: The capacity – the number of instances that can fit in memory of each machine – must grow with the data set size. In practice, while one can provision many machines, the capacity of each machine is limited by physical constraints. We propose a truly scalable approach for distributed submodular maximization when the capacity of each machine is fixed. The proposed framework applies to a broad class of algorithms and a variety of constraints. We provide theoretical guarantees on the approximation factor for any available capacity. We empirically evaluate the proposed algorithm on a variety of data sets and demonstrate that the algorithm achieves performance competitive with the centralized greedy solution. View details
    Training Gaussian Mixture Models at Scale via Coresets
    Matthew Faulkner
    Andreas Krause
    Dan Feldman
    Journal of Machine Learning Research, 18(2018), pp. 1-25
    Preview abstract How can we train a statistical mixture model on a massive data set? In this work we show how to construct \emph{coresets} for mixtures of Gaussians. A coreset is a weighted subset of the data, which guarantees that models fitting the coreset also provide a good fit for the original data set. We show that, perhaps surprisingly, Gaussian mixtures admit coresets of size polynomial in dimension and the number of mixture components, while being \emph{independent} of the data set size. Hence, one can harness computationally intensive algorithms to compute a good approximation on a significantly smaller data set. More importantly, such coresets can be efficiently constructed both in distributed and streaming settings and do not impose restrictions on the data generating process. Our results rely on a novel reduction of statistical estimation to problems in computational geometry and new combinatorial complexity results for mixtures of Gaussians. Empirical evaluation on several real- world data sets suggests that our coreset-based approach enables significant reduction in training-time with negligible approximation error. View details
    Distributed and Provably Good Seedings for k-Means in Constant Rounds
    Andreas Krause
    International Conference on Machine Learning(2017)
    Preview abstract The k-Means++ algorithm is the state of the art algorithm to solve k-Means clustering problems as the computed clusterings are O(log k) competitive in expectation. However, its seeding step requires k inherently sequential passes through the full data set making it hard to scale to massive data sets. The standard remedy is to use the k-Means|| algorithm which reduces the number of sequential rounds and is thus suitable for a distributed setting. In this paper, we provide a novel analysis of the k-Means|| algorithm that bounds the expected solution quality for any number of rounds and oversampling factors greater than k, the two parameters one needs to choose in practice. In particular, we show that k-Means|| provides *provably good* clusterings even for a small, constant number of iterations. This theoretical finding explains the common observation that k-Means|| performs extremely well in practice even if the number of rounds is low. We further provide a hard instance that shows that an *additive* error term as encountered in our analysis is inevitable if less than k-1 rounds are employed. View details
    Uniform Deviation Bounds for Unbounded Loss Functions like k-Means
    S. Hamed Hassani
    Andreas Krause
    International Conference on Machine Learning(2017)
    Preview abstract Uniform deviation bounds limit the difference between a model's expected loss and its loss on an empirical sample *uniformly* for all models in a learning problem. In this paper, we provide a novel framework to obtain uniform deviation bounds for loss functions which are *unbounded*. As a result, we obtain competitive uniform deviation bounds for k-Means clustering under weak assumptions on the underlying distribution. If the fourth moment is bounded, we prove a rate of O(m^(-1/2)) compared to the previously known O(m^(-1/4)) rate. Furthermore, we show that the rate also depends on the kurtosis - the normalized fourth moment which measures the "tailedness" of a distribution. We also provide improved rates under progressively stronger assumptions, namely, bounded higher moments, subgaussianity and bounded support of the underlying distribution. View details
    Stochastic Submodular Maximization: The Case of Coverage Functions
    Andreas Krause
    Hamed Hassani
    Mohammad Reza Karimi
    Advances in Neural Information Processing Systems(2017)
    Preview abstract Stochastic optimization of continuous objectives is at the heart of modern machine learning. However, many important problems are of discrete nature and often involve submodular objectives. We seek to unleash the power of stochastic continuous optimization, namely stochastic gradient descent and its variants, to such discrete problems. We first introduce the problem of stochastic submodular optimization, where one needs to optimize a submodular objective which is given as an expectation. Our model captures situations where the discrete objective arises as an empirical risk (e.g., in the case of exemplar-based clustering), or is given as an explicit stochastic model (e.g., in the case of influence maximization in social networks). By exploiting that common extensions act linearly on the class of submodular functions, we employ projected stochastic gradient ascent and its variants in the continuous domain, and perform rounding to obtain discrete solutions. We focus on the rich and widely used family of weighted coverage functions. We show that our approach yields solutions that are guaranteed to match the optimal approximation guarantees, while reducing the computational cost by several orders of magnitude, as we demonstrate empirically. View details
    Linear-Time Outlier Detection via Sensitivity
    Andreas Krause
    International Joint Conference on Artificial Intelligence(2016)
    Preview abstract Outliers are ubiquitous in modern data sets. Distance-based techniques are a popular non-parametric approach to outlier detection as they require no prior assumptions on the data generating distribution and are simple to implement. Scaling these techniques to massive data sets without sacrificing accuracy is a challenging task. We propose a novel algorithm based on the intuition that outliers have a significant influence on the quality of divergence-based clustering solutions. We propose sensitivity - the worst-case impact of a data point on the clustering objective - as a measure of outlierness. We then prove that influence - a (non-trivial) upper-bound on the sensitivity can be computed by a simple linear time algorithm. To scale beyond a single machine, we propose a communication efficient distributed algorithm. In an extensive experimental evaluation, we demonstrate the effectiveness and establish the statistical significance of the proposed approach. In particular, it outperforms the most popular distance-based approaches while being several orders of magnitude faster. View details
    Fast and Provably Good Seedings for k-Means
    S. Hamed Hassani
    Andreas Krause
    Neural Information Processing Systems(2016)
    Preview abstract Seeding - the task of finding initial cluster centers - is critical in obtaining high-quality clusterings for k-Means. However, k-means++ seeding, the state of the art algorithm, does not scale well to massive datasets as it is inherently sequential and requires k full passes through the data. It was recently shown that Markov chain Monte Carlo sampling can be used to efficiently approximate the seeding step of k-means++. However, this result requires assumptions on the data generating distribution. We propose a simple yet fast seeding algorithm that produces *provably* good clusterings even *without assumptions* on the data. Our analysis shows that the algorithm allows for a favourable trade-off between solution quality and computational cost, speeding up k-means++ seeding by up to several orders of magnitude. We validate our theoretical results in extensive experiments on a variety of real-world data sets. View details
    Approximate K-Means++ in Sublinear Time
    S. Hamed Hassani
    Andreas Krause
    AAAI Conference on Artificial Intelligence(2016)
    Preview abstract The quality of K-Means clustering is extremely sensitive to proper initialization. The classic remedy is to apply k-means++ to obtain an initial set of centers that is provably competitive with the optimal solution. Unfortunately, k-means++ requires k full passes over the data which limits its applicability to massive datasets. We address this problem by proposing a simple and efficient seeding algorithm for K-Means clustering. The main idea is to replace the exact D2-sampling step in k-means++ with a substantially faster approximation based on Markov Chain Monte Carlo sampling. We prove that, under natural assumptions on the data, the proposed algorithm retains the full theoretical guarantees of k-means++ while its computational complexity is only sublinear in the number of data points. For such datasets, one can thus obtain a provably good clustering in sublinear time. Extensive experiments confirm that the proposed method is competitive with k-means++ on a variety of real-world, large-scale datasets while offering a reduction in runtime of several orders of magnitude. View details
    Strong Coresets for Hard and Soft Bregman Clustering with Applications to Exponential Family Mixtures
    Andreas Krause
    International Conference on Artificial Intelligence and Statistics(2016)
    Preview abstract Coresets are efficient representations of data sets such that models trained on the coreset are provably competitive with models trained on the original data set. As such, they have been successfully used to scale up clustering models such as K-Means and Gaussian mixture models to massive data sets. However, until now, the algorithms and the corresponding theory were usually specific to each clustering problem. We propose a single, practical algorithm to construct strong coresets for a large class of hard and soft clustering problems based on Bregman divergences. This class includes hard clustering with popular distortion measures such as the Squared Euclidean distance, the Mahalanobis distance, KL-divergence and Itakura-Saito distance. The corresponding soft clustering problems are directly related to popular mixture models due to a dual relationship between Bregman divergences and Exponential family distributions. Our theoretical results further imply a randomized polynomial-time approximation scheme for hard clustering. We demonstrate the practicality of the proposed algorithm in an empirical evaluation. View details
    Coresets for Nonparametric Estimation - the Case of DP-Means
    Andreas Krause
    International Conference on Machine Learning(2015)
    Preview abstract Scalable training of Bayesian nonparametric models is a notoriously difficult challenge. We explore the use of coresets - a data summarization technique originating from computational geometry - for this task. Coresets are weighted subsets of the data such that models trained on these coresets are provably competitive with models trained on the full dataset. Coresets sublinear in the dataset size allow for fast approximate inference with provable guarantees. Existing constructions, however, are limited to parametric problems. Using novel techniques in coreset construction we show the existence of coresets for DP- Means - a prototypical nonparametric clustering problem - and provide a practical construction algorithm. We empirically demonstrate that our algorithm allows us to efficiently trade off computation time and approximation error and thus scale DP-Means to large datasets. For instance, with coresets we can obtain a computational speedup of 45x at an approximation error of only 2.4% compared to solving on the full data set. In contrast, for the same subsample size, the "naive" approach of uniformly subsampling the data incurs an approximation error of 22.5%. View details