# Olivier Bousquet

Olivier received his PhD in Machine Learning from Ecole Polytechnique, France in 2002. He was then a researcher at the Max Planck Institute in Tuebingen, working on Machine Learning and in particular Statistical Learning Theory and Kernel Methods. In 2004 he joined a startup company where he lead a research team and developed ML software for predicting manufacturing quality.
Olivier joined Google Zurich in 2007 and contributed to many aspects of the search engine, in particular leading an engineering team working on Language Understanding and the Knowledge Graph. In 2016, he joined the Research team and is now working on Deep Learning and Language Understanding, leading the Brain teams in Zurich and Paris.
His research interests include Learning with limited supervision (Semi-supervised or Unsupervised), AutoML (automation of Deep Learning), Learning of world representations and world knowledge.

### Research Areas

Authored Publications

Google Publications

Other Publications

Sort By

Fine-Grained Distribution-Dependent Learning Curves

Ilya Tolstikhin

Jonathan Shafer

Shay Moran

Steve Hanneke

Proceedings of Thirty Sixth Conference on Learning Theory (COLT), PMLR 195:5890-5924, 2023.(2023)

Preview abstract
Learning curves plot the expected error of a learning algorithm as a function of the number of labeled input samples. They are widely used by machine learning practitioners as a measure of an algorithm's performance, but classic PAC learning theory cannot explain their behaviour. In this paper we introduce a new combinatorial characterization called the VCL dimension that improves and refines the recent results of Bousquet et al. (2021). Our characterization sheds new light on the structure of learning curves by providing fine-grained bounds, and showing that for classes with finite VCL, the rate of decay can be decomposed into a linear component that depends only on the hypothesis class and an exponential component that depends also on the target distribution. In particular, the finer nuance of the VCL dimension implies lower bounds that are quantitatively stronger than the bounds of Bousquet et al. (2021) and qualitatively stronger than classic 'no free lunch' lower bounds. The VCL characterization solves an open problem studied by Antos and Lugosi (1998), who asked in what cases such lower bounds exist. As a corollary, we recover their lower bound for half-spaces in ℝd, and we do so in a principled way that should be applicable to other cases as well. Finally, to provide another viewpoint on our work and how it compares to traditional PAC learning bounds, we also present an alternative formulation of our results in a language that is closer to the PAC setting.
View details

Differentially-Private Bayes Consistency

Haim Kaplan

Aryeh Kontorovich

Yishay Mansour

Shay Moran

Menachem Sadigurschi

Archive, Archive, Archive

Preview abstract
We construct a universally Bayes consistent learning rule
which satisfies differential privacy (DP).
We first handle the setting of binary classification
and then extend our rule to the more
general setting of density estimation (with respect to the total variation metric).
The existence of a universally consistent DP learner
reveals a stark difference with the distribution-free PAC model.
Indeed, in the latter DP learning is extremely limited:
even one-dimensional linear classifiers
are not privately learnable in this stringent model.
Our result thus demonstrates that by allowing
the learning rate to depend on the target distribution,
one can circumvent the above-mentioned impossibility result
and in fact learn \emph{arbitrary} distributions by a single DP algorithm.
As an application, we prove that any VC class can be privately learned in a semi-supervised
setting with a near-optimal \emph{labelled} sample complexity of $\tilde O(d/\eps)$ labeled examples
(and with an unlabeled sample complexity that can depend on the target distribution).
View details

What Do Neural Networks Learn When Trained With Random Labels?

Hartmut Maennel

Ilya Tolstikhin

Robert J. N. Baldock

Sylvain Gelly

NeurIPS 2020

Preview abstract
We study deep neural networks (DNNs) trained on natural image data with entirely random labels. Despite its popularity in the literature, where it is often used to study memorization, generalization, and other phenomena, little is known about what DNNs learn in this setting. In this paper, we show analytically for convolutional and fully connected networks that an alignment between the principal components of network parameters and data takes place when training with random labels. We study this alignment effect by investigating neural networks pre-trained on randomly labelled image data and subsequently fine-tuned on disjoint datasets with random or real labels. We show how this alignment produces a positive transfer: networks pre-trained with random labels train faster downstream compared to training from scratch even after accounting for simple effects, such as weight scaling. We analyze how competing effects, such as specialization at later layers, may hide the positive transfer. These effects are studied in several network architectures, including VGG16 and ResNet18, on CIFAR10 and ImageNet.
View details

Measuring Compositional Generalization: A Comprehensive Method on Realistic Data

Nathanael Schärli

Nathan Scales

Hylke Buisman

Daniel Furrer

Nikola Momchev

Danila Sinopalnikov

Lukasz Stafiniak

Tibor Tihon

Dmitry Tsarkov

ICLR(2020)

Preview abstract
State-of-the-art machine learning methods exhibit limited compositional generalization. At the same time, there is a lack of realistic benchmarks that comprehensively measure this ability, which makes it challenging to find and evaluate improvements. We introduce a novel method to systematically construct such benchmarks by maximizing compound divergence while guaranteeing a small atom divergence between train and test sets, and we quantitatively compare this method to other approaches for creating compositional generalization benchmarks. We present a large and realistic natural language question answering dataset that is constructed according to this method, and we use it to analyze the compositional generalization ability of three machine learning architectures. We find that they fail to generalize compositionally and that there is a surprisingly strong negative correlation between compound divergence and accuracy. We also demonstrate how our method can be used to create new compositionality benchmarks on top of the existing SCAN dataset, which confirms these findings.
View details

Evaluating Generative Models using Divergence Frontiers

Josip Djolonga

Marco Cuturi

Sylvain Gelly

International Conference on Artificial Intelligence and Statistics(2020)

Preview abstract
Despite the tremendous progress in the estimation of generative models, the development of tools for diagnosing their failures and assessing their performance has advanced at a much slower pace. Very recent developments have investigated metrics that quantify which parts of the true distribution is well modeled, and, on the contrary, what the model fails to capture, akin to precision and recall in information retrieval. In this paper we present a general evaluation framework for generative models that measures the trade-off between precision and recall using R\'enyi divergences. Our framework provides a novel perspective on existing techniques and extends them to more general domains.
As a key advantage, it allows for efficient algorithms that are directly applicable to continuous distributions directly without discretization. We further showcase the proposed techniques on a set of image synthesis models.
View details

When can unlabeled data improve the learning rate?

Christina Göpfert

Shai Ben-David

Sylvain Gelly

Ilya Tolstikhin

Ruth Urner

COLT 2019

Preview abstract
In semi-supervised classification, one is given access both to labeled and unlabeled data. As unlabeled data is typically cheaper to acquire than labeled data, this setup becomes advantageous as soon as one can exploit the unlabeled data in order to produce a better classifier than with labeled data alone. However, the conditions under which such an improvement is possible are not fully understood yet. Our analysis focuses on improvements in the {\em minimax} learning rate in terms of the number of labeled examples (with the number of unlabeled examples being allowed to depend on the number of labeled ones).
We argue that for such improvements to be realistic and indisputable, certain specific conditions should be satisfied and previous analyses have failed to meet those conditions. We then demonstrate simple toy examples where these conditions can be met, in particular showing rate changes from $1/\sqrt{\l}$ to $e^{-c\l}$ and $1/\sqrt{\l}$ to $1/\l$. These results allow us to better understand what is and isn't possible in semi-supervised learning.
View details

Google Research Football: A Novel Reinforcement Learning Environment

Karol Kurach

Piotr Michal Stanczyk

Michał Zając

Carlos Riquelme

Damien Vincent

Marcin Michalski

Sylvain Gelly

AAAI(2019)

Preview abstract
Recent progress in the field of reinforcement learning has been accelerated by virtual learning environments such as video games, where novel algorithms and ideas can be quickly tested in a safe and reproducible manner. We introduce the Google Research Football Environment, a new reinforcement learning environment where agents are trained to play football in an advanced, physics-based 3D simulator.
The resulting environment is challenging, easy to use and customize, and it is available under a permissive open-source license. We further propose three full-game scenarios of varying difficulty with the Football Benchmarks, we report baseline results for three commonly used reinforcement algorithms (Impala, PPO, and Ape-X DQN), and we also provide a diverse set of simpler scenarios with the Football Academy.
View details

Practical and Consistent Estimation of f-Divergences

Paul Rubenstein

Josip Djolonga

Carlos Riquelme

Ilya Tolstikhin

Submission to Neurips 2019.(2019) (to appear)

Preview abstract
The estimation of an f-divergence between two probability distributions based on samples is a fundamental problem in statistics and machine learning. Most works study this problem under very weak assumptions, in which case it is provably hard. We consider the case of stronger structural assumptions that are commonly satisfied in modern machine learning, including representation learning and generative modelling with autoencoder architectures. Under these assumptions we propose and study an estimator that can be easily implemented, works well in high dimensions, and enjoys faster rates of convergence. We verify the behavior of our estimator empirically in both synthetic and real-data experiments.
View details

Are GANs Created Equal? A Large-Scale Study

Karol Kurach

Marcin Michalski

Sylvain Gelly

Advances in Neural Information Processing Systems(2018)

Preview abstract
Generative adversarial networks (GAN) are a powerful subclass of generative models. Despite a very rich research activity leading to numerous interesting new GAN algorithms, it is still very hard to assess which algorithm(s) perform better than others. We conduct a neutral, multi-faceted large-scale empirical study encompassing the state-of-the art models and evaluation measures. We find that most models can reach similar scores with enough hyperparameter optimization and random restarts. This suggests that improvements can come from computational budget and tuning more than fundamental algorithmic changes. To overcome some limitations of the current metrics, we also propose several data sets on which precision and recall can be computed. Our experimental results suggest that future GAN research should be based on more systematic and objective evaluation procedures. Also, we did not find evidence that any of the tested algorithms consistently outperform the original one.
View details

Assessing Generative Models via Precision and Recall

Sylvain Gelly

Advances in Neural Information Processing Systems(2018)

Preview abstract
Recent advances in generative modeling have led to an increased interest in the study of statistical divergences as means of model comparison. Commonly used evaluation methods, such as Fr\'echet Inception Distance (FID), correlate well with the perceived quality of samples and are sensitive to mode dropping. However, these metrics are unable to distinguish between different failure cases since they yield one-dimensional scores. We propose a novel definition of precision and recall for distributions which disentangles the divergence into two separate dimensions. The proposed notion is intuitive, retains desirable properties, and naturally leads to an efficient algorithm that can be used to evaluate generative models. We relate this notion to total variation as well as to recent evaluation metrics such as Inception Score and FID. To demonstrate the practical utility of the proposed approach we perform an empirical study on several variants of Generative Adversarial Networks and the Variational Autoencoder. In an extensive set of experiments we show that the proposed metric is able to disentangle the quality of generated samples from the coverage of the target distribution.
View details

Better Text Understanding Through Image-To-Text Transfer

Karol Kurach

Sylvain Gelly

Philip Haeusser

Olivier Teytaud

Damien Vincent

arXiv(2017)

Preview abstract
Generic text embeddings are successfully used in a variety of tasks. However, they are often learnt by capturing the co-occurrence structure from pure text corpora, resulting in limitations of their ability to generalize. In this paper, we explore models that incorporate visual information into the text representation. Based on comprehensive ablation studies, we propose a conceptually simple, yet well performing architecture. It outperforms previous multimodal approaches on a set of well established benchmarks. We also improve the state-of-the-art results for image-related text datasets, using orders of magnitude less data.
View details

From optimal transport to generative modeling: the VEGAN cookbook

Sylvain Gelly

Ilya Tolstikhin

Carl-Johann Simon-Gabriel

Bernhard Schoelkopf

arXiv(2017)

Preview abstract
We study unsupervised generative modeling in terms of the optimal transport (OT) problem between true (but unknown) data distribution PX and the latent variable model distribution PG. We show that the OT problem can be equivalently written in terms of probabilistic encoders, which are constrained to match the posterior and prior distributions over the latent space. When relaxed, this constrained optimization problem leads to a penalized optimal transport (POT) objective, which can be efficiently minimized using stochastic gradient descent by sampling from PX and PG. We show that POT for the 2-Wasserstein distance coincides with the objective heuristically employed in adversarial auto-encoders (AAE) (Makhzani et al., 2016), which provides the first theoretical justification for AAEs known to the authors. We also compare POT to other popular techniques like variational auto-encoders (VAE) (Kingma and Welling, 2014). Our theoretical results include (a) a better understanding of the commonly observed blurriness of images generated by VAEs, and (b) establishing duality between Wasserstein GAN (Arjovsky and Bottou, 2017) and POT for the 1-Wasserstein distance.
View details

AdaGAN: Boosting Generative Models

Ilya Tolstikhin

Sylvain Gelly

Carl-Johann Simon-Gabriel

Bernhard Schölkopf

arXiv(2017) (to appear)

Preview abstract
Generative Adversarial Networks (GAN) (Goodfellow et al., 2014) are an effective method for training generative models of complex data such as natural images. However, they are notoriously hard to train and can suffer from the problem of missing modes where the model is not able to produce examples in certain regions of the space. We propose an iterative procedure, called AdaGAN, where at every step we add a new component into a mixture model by running a GAN algorithm on a reweighted sample. This is inspired by boosting algorithms, where many potentially weak individual predictors are greedily aggregated to form a strong composite predictor. We prove that such an incremental procedure leads to convergence to the true distribution in a finite number of steps if each step is optimal, and convergence at an exponential rate otherwise. We also illustrate experimentally that this procedure addresses the problem of missing modes.
View details

Toward Optimal Run Racing: Application to Deep Learning Calibration

Sylvain Gelly

Karol Kurach

Marc Schoenauer

Michele Sebag

Olivier Teytaud

Damien Vincent

arXiv(2017)

Preview abstract
This paper aims at one-shot learning of deep neural nets, where a highly parallel setting is considered to address the algorithm calibration problem - selecting the best neural architecture and learning hyper-parameter values depending on the dataset at hand. The notoriously expensive calibration problem is optimally reduced by detecting and early stopping non-optimal runs. The theoretical contribution regards the optimality guarantees within the multiple hypothesis testing framework. Experimentations on the Cifar10, PTB and Wiki benchmarks demonstrate the relevance of the approach with a principled and consistent improvement on the state of the art with no extra hyper-parameter.
View details

The Tradeoffs of Large Scale Learning

Léon Bottou

Advances in Neural Information Processing Systems, NIPS Foundation (http://books.nips.cc)(2008), pp. 161-168

Preview abstract
This contribution develops a theoretical framework that takes into account the effect of approximate optimization on learning algorithms. The analysis shows distinct tradeoffs for the case of small-scale and large-scale learning problems. Small-scale learning problems are subject to the usual approximation–estimation tradeoff. Large-scale learning problems are subject to a qualitatively different tradeoff involving the computational complexity of the underlying optimization algorithms in non-trivial ways.
View details

Learning with local and global consistency

Dengyong Zhou

Thomas Navin Lal

Jason Weston

Bernhard Schölkopf

Advances in Neural Information Processing Systems(2004), pp. 321-328

Preview abstract
We consider the general problem of learning from labeled and unlabeled data, which is often called semi-supervised learning or transductive inference.
A principled approach to semi-supervised learning is to design a classifying function which is sufficiently smooth with respect to the intrinsic structure collectively revealed by known labeled and unlabeled points. We present a simple algorithm to obtain such a smooth solution.
Our method yields encouraging experimental results on a number of classification problems and demonstrates effective use of unlabeled data.
View details

Stability and generalization

Choosing multiple parameters for support vector machines

Olivier Chapelle

Vladimir Vapnik

Sayan Mukherjee

Machine Learning, 46(2002), pp. 131-159