Hugo Larochelle

I am a Principal Scientist in the Google DeepMind team in Montreal. My main area of expertise is deep learning. My previous work includes unsupervised pretraining with autoencoders, denoising autoencoders, visual attention-based classification, neural autoregressive distribution models and zero-shot learning. More broadly, I’m interested in applications of deep learning to natural language processing, code, computer vision and environmental sustainability problems.

Previously, I was Associate Professor at the Université de Sherbrooke (UdeS). I also co-founded Whetlab, which was acquired in 2015 by Twitter, where I then worked as a Research Scientist in the Twitter Cortex group. From 2009 to 2011, I was also a member of the machine learning group at the University of Toronto, as a postdoctoral fellow under the supervision of Geoffrey Hinton. I obtained my Ph.D. at the Université de Montréal, under the supervision of Yoshua Bengio.

My academic involvement includes being a member of the boards for the International Conference on Machine Learning (ICML) and for the Neural Information Processing Systems (NeurIPS) conference. I also co-founded the journal Transactions on Machine Learning Research.

Finally, I have a popular online course on deep learning and neural networks, freely accessible on YouTube.

Authored Publications
    Abstract: The execution behavior of a program often depends on external resources, such as program inputs or file contents, so the program cannot be run in isolation. Nevertheless, software developers benefit from fast iteration loops where automated tools identify errors as early as possible, even before programs can be compiled and run. This presents an interesting machine learning challenge: can we predict runtime errors in a "static" setting, where program execution is not possible? Here, we introduce a real-world dataset and task for predicting runtime errors, which we show is difficult for generic models like Transformers. As an alternative, we develop an interpreter-inspired architecture with an inductive bias towards mimicking program executions, which models exception handling and "learns to execute" descriptions of the contents of external resources. Surprisingly, we show that the model can also predict the location of the error, despite being trained only on labels indicating the presence/absence and kind of error. In total, we present a practical and difficult-yet-approachable challenge problem related to learning program execution and we demonstrate promising new capabilities of interpreter-inspired machine learning models for code.
    Head2Toe: Utilizing Intermediate Representations for Better Transfer Learning
    Mike Mozer
    Proceedings of the 39th International Conference on Machine Learning, PMLR (2022)
    Abstract: Transfer-learning methods aim to improve performance in a data-scarce target domain using a model pretrained on a source domain. A cost-efficient strategy, linear probing, involves freezing the source model and training a new classification head for the target domain. This strategy is outperformed by a more costly but state-of-the-art method, fine tuning all parameters of the source model to the target domain, possibly because fine tuning allows the model to leverage useful information from intermediate layers which is otherwise discarded. We explore the hypothesis that these intermediate layers might be directly exploited by linear probing. We propose a method, Head2Toe, that selects features from all layers of the source model to train a target-domain classification head. In evaluations on the Visual Task Adaptation Benchmark, Head2Toe matches performance obtained with fine tuning on average, but critically, for out-of-distribution transfer, Head2Toe outperforms fine tuning.
    Abstract: The goal of program synthesis from examples is to find a computer program that is consistent with a given set of input-output examples. Most learning-based approaches try to find a program that satisfies all examples at once. Our work, by contrast, considers an approach that breaks the problem into two stages: (a) find programs that satisfy only one example, and (b) leverage these per-example solutions to yield a program that satisfies all examples. We introduce the Cross Aggregator neural network module, based on a multi-head attention mechanism, that learns to combine the cues present in these per-example solutions to synthesize a global solution. Evaluation across programs of different lengths and under two different experimental settings reveals that, when given the same budget, our technique significantly improves the success rate over PCCoder [Zohar et al. 2018] and other ablation baselines.
    Abstract: Meta and transfer learning are two successful families of approaches to few-shot learning. Despite highly related goals, state-of-the-art advances in each family are measured largely in isolation from each other. As a result of diverging evaluation norms, a direct or thorough comparison of different approaches is challenging. To bridge this gap, we introduce a few-shot classification evaluation protocol named VTAB+MD with the explicit goal of facilitating sharing of insights from each community. We demonstrate its accessibility in practice by performing a cross-family study of the best transfer and meta learners which report on both a large-scale meta-learning benchmark (Meta-Dataset, MD), and a transfer learning benchmark (Visual Task Adaptation Benchmark, VTAB). We find that, on average, large-scale transfer methods (Big Transfer, BiT) outperform competing approaches on MD, even when trained only on ImageNet. In contrast, meta-learning approaches struggle to compete on VTAB when trained and validated on MD. However, BiT is not without limitations, and pushing for scale does not improve performance on highly out-of-distribution MD tasks. We hope that this work contributes to accelerating progress on few-shot learning research.
    Impact of Aliasing on Generalization in Deep Convolutional Networks
    Nicolas Le Roux
    Rob Romijnders
    International Conference on Computer Vision ICCV 2021, IEEE/CVF (2021)
    Abstract: Traditionally, image pre-processing in the frequency domain has played a vital role in computer vision and was even part of the standard pipeline in the early days of deep learning. However, with the advent of large datasets, many practitioners concluded that this was unnecessary due to the belief that these priors can be learned from the data itself if they aid in achieving stronger performance. Frequency aliasing is a phenomenon that may occur when down-sampling (sub-sampling) any signal, such as an image or feature map. We demonstrate that substantial improvements on out-of-distribution (OOD) generalization can be obtained by mitigating the effects of aliasing by placing non-trainable blur filters and using smooth activation functions at key locations in the ResNet family of architectures, helping to achieve new state-of-the-art results on two benchmarks without any hyper-parameter sweeps.
    Abstract: Few-shot dataset generalization is a challenging variant of the well-studied few-shot classification problem where a diverse training set of several datasets is given, for the purpose of training an adaptable model that can then learn classes from new datasets using only a few examples. To this end, we propose to utilize the diverse training set to construct a universal template: a structure that can define a wide array of dataset-specialized models, by plugging in appropriate parameter-light components. For each new few-shot classification problem, our approach therefore only requires inferring a small number of task-specific parameters to insert into the universal template. We design a separate network that produces a carefully-crafted initialization of those parameters for each given task, and we then fine-tune its proposed initialization via a few steps of gradient descent. Our approach is more parameter-efficient, scalable and adaptable compared to previous methods, and achieves state-of-the-art results on the challenging Meta-Dataset benchmark.
    Abstract: Few-shot classification aims to recognize unseen classes given only a few samples. We consider the problem of multi-domain few-shot image classification, where unseen classes and examples come from diverse data sources. This problem has seen growing interest and has inspired the development of benchmarks such as Meta-Dataset. A key challenge in this multi-domain setting is effectively integrating the feature representations from the diverse set of training domains. Here, we propose a Universal Representation Transformer (URT) layer that meta-learns to leverage universal features for few-shot classification by dynamically re-weighting and composing the most appropriate domain-specific representations. In experiments, we show that URT sets a new state-of-the-art result on Meta-Dataset. Specifically, it outperforms the best previous model on 3 data sources and otherwise matches it on the others. We analyze variants of URT and present a visualization of the attention score heatmaps that sheds light on how the model performs cross-domain generalization.
    Abstract: Few-shot classification refers to learning a classifier for new classes given only a few examples. While a plethora of models have emerged to tackle this recently, we find the current procedure and datasets that are used to systematically assess progress in this setting lacking. To address this, we propose META-DATASET: a new benchmark for training and evaluating few-shot classifiers that is large-scale, consists of multiple datasets, and presents more natural and realistic tasks. The aim is to measure the ability of state-of-the-art models to leverage diverse sources of data to achieve higher generalization, and to evaluate that generalization ability in a more challenging setting. We additionally measure robustness of current methods to variations in the number of available examples and the number of classes. Finally, our extensive empirical evaluation leads us to identify weaknesses in Prototypical Networks and MAML, two popular few-shot classification methods, and to propose a new method, ProtoMAML, which achieves improved performance on our benchmark.
    Abstract: Graph neural networks (GNNs) have emerged as a powerful tool for learning software engineering tasks including code completion, bug finding, and program repair. They benefit from leveraging program structure like control flow graphs, but they are not well-suited to tasks like program execution that require far more sequential reasoning steps than the number of GNN propagation steps. Recurrent neural networks (RNNs), on the other hand, are well-suited to long sequential chains of reasoning, but they do not naturally incorporate program structure and generally perform worse on the above tasks. Our aim is to achieve the best of both worlds, and we do so by introducing a novel GNN architecture, the Instruction Pointer Attention Graph Neural Network (IPA-GNN), which achieves systematic generalization on the task of learning to execute programs using control flow graphs. The model arises by developing a spectrum of models between RNNs operating on program traces with branch decisions as latent variables and GNNs. The IPA-GNN can be seen either as a continuous relaxation of the RNN model or as a GNN variant more tailored to execution. To test the models, we propose evaluating systematic generalization on learning to execute using control flow graphs, which tests sequential reasoning and use of program structure. More practically, we evaluate these models on the task of learning to execute partial programs, as might arise if using the model as a value function in program synthesis. Results show that the IPA-GNN outperforms a variety of RNN and GNN baselines on both tasks.
    Abstract: Graph-based neural network models are producing strong results in a number of domains, in part because graphs provide flexibility to encode domain knowledge in the form of relational structure (edges) between nodes in the graph. In practice, edges are used both to represent intrinsic structure (e.g., bonds in chemical molecules or abstract syntax trees of programs) and more abstract relations that aid reasoning for a downstream task (e.g., results of relevant program analyses). In this work, we study the problem of learning to derive abstract relations from the intrinsic graph structure. Motivated by their power in program analyses, we consider relations defined by paths on the base graph accepted by a finite-state automaton. We show how to learn these relations end-to-end by relaxing the problem into learning finite-state automata policies on a graph-based POMDP and then training these policies using implicit differentiation. The result is a differentiable Graph Finite-State Automaton (GFSA) layer that adds a new edge type (expressed as a weighted adjacency matrix) to a base graph. We demonstrate that this layer can find shortcuts in grid-world graphs and reproduce simple static analyses on Python programs. Additionally, we combine the GFSA layer with a larger graph-based model trained end-to-end on the variable misuse program understanding task, and find that this model outperforms baseline methods even without providing the hand-engineered semantic edges that those baselines use.
    The Hanabi Challenge: A New Frontier for AI Research
    Nolan Bard
    Jakob N. Foerster
    Sarath Chandar
    Neil Burch
    Marc Lanctot
    H. Francis Song
    Emilio Parisotto
    Subhodeep Moitra
    Edward Hughes
    Iain Dunning
    Shibl Mourad
    Marc G. Bellemare
    Michael Bowling
    Artificial Intelligence, vol. 280 (2020)
    Abstract: From the early days of computing, games have been important testbeds for studying how well machines can do sophisticated decision making. In recent years, machine learning has made dramatic advances with artificial agents reaching superhuman performance in challenge domains like Go, Atari, and some variants of poker. As with their predecessors of chess, checkers, and backgammon, these game domains have driven research by providing sophisticated yet well-defined challenges for artificial intelligence practitioners. We continue this tradition by proposing the game of Hanabi as a new challenge domain with novel problems that arise from its combination of purely cooperative gameplay with two to five players and imperfect information. In particular, we argue that Hanabi elevates reasoning about the beliefs and intentions of other agents to the foreground. We believe developing novel techniques for such theory of mind reasoning will not only be crucial for success in Hanabi, but also in broader collaborative efforts, especially those with human partners. To facilitate future research, we introduce the open-source Hanabi Learning Environment, propose an experimental framework for the research community to evaluate algorithmic advances, and assess the performance of current state-of-the-art techniques.
    Revisiting Fundamentals of Experience Replay
    Liam B. Fedus
    Mark Rowland
    Prajit Ramachandran
    Will Dabney
    Yoshua Bengio
    International Conference on Machine Learning (2020)
    Abstract: Experience replay is central to off-policy algorithms in deep reinforcement learning (RL), but there remain significant gaps in our understanding. We therefore present a systematic and extensive analysis of experience replay in Q-learning methods, focusing on two fundamental properties: the replay capacity and the ratio of learning updates to experience collected (replay ratio). Our additive and ablative studies upend conventional wisdom around experience replay: greater capacity is found to substantially increase the performance of certain algorithms, while leaving others unaffected. Counterintuitively, we show that theoretically ungrounded, uncorrected n-step returns are uniquely beneficial while other techniques confer limited benefit for sifting through larger memory. Separately, by directly controlling the replay ratio we contextualize previous observations in the literature and empirically measure its importance across a variety of deep RL algorithms. Finally, we conclude by testing a set of hypotheses on the nature of these performance benefits.
    InfoBot: Structured Exploration in Reinforcement Learning Using Information Bottleneck
    Anirudh Goyal
    Riashat Islam
    Daniel Strouse
    Matthew Botvinick
    Yoshua Bengio
    Sergey Levine
    ICLR (2019)
    Abstract: A central challenge in reinforcement learning is discovering effective policies for tasks where rewards are sparsely distributed. We postulate that in the absence of useful reward signals, an effective exploration strategy should seek out decision states. These states lie at critical junctions in the state space from where the agent can transition to new, potentially unexplored regions. We propose to learn about decision states from prior experience. By training a goal-conditioned policy with an information bottleneck, we can identify decision states by examining where the model actually leverages the goal state. We find that this simple mechanism effectively identifies decision states, even in partially observed settings. In effect, the model learns the sensory cues that correlate with potential subgoals. In new environments, this model can then identify novel subgoals for further exploration, guiding the agent through a sequence of potential decision states and through new regions of the state space.
    Recall Traces: Backtracking Models for Efficient Reinforcement Learning
    Anirudh Goyal
    Philemon Brakel
    Liam Fedus
    Soumye Singhal
    Timothy Lillicrap
    Sergey Levine
    Yoshua Bengio
    ICLR (2019)
    Abstract: In many environments only a tiny subset of all states yields high reward. In these cases, few of the interactions with the environment provide a relevant learning signal. Hence, we may want to preferentially train on those high-reward states and the probable trajectories leading to them. To this end, we advocate for the use of a backtracking model that predicts the preceding states that terminate at a given high-reward state. We can train a model which, starting from a high value state (or one that is estimated to have high value), predicts and samples which (state, action)-tuples may have led to that high value state. These traces of (state, action) pairs, which we refer to as Recall Traces, sampled from this backtracking model starting from a high value state, are informative as they terminate in good states, and hence we can use these traces to improve a policy. We provide a variational interpretation for this idea and a practical algorithm in which the backtracking model samples from an approximate posterior distribution over trajectories which lead to large rewards. Our method improves the sample efficiency of both on- and off-policy RL algorithms across several environments and tasks.
    Meta-Learning for Semi-Supervised Few-Shot Classification
    Eleni Triantafillou
    Jake Snell
    Josh Tenenbaum
    Mengye Ren
    Richard Zemel
    Sachin Ravi
    ICLR (2018)
    Abstract: In few-shot classification, we are interested in learning algorithms that train a classifier from only a handful of labeled examples. Recent progress made in few-shot classification has featured meta-learning, in which a parameterized model for a learning algorithm is defined and trained on episodes representing different classification problems, each with a small labeled training set and its corresponding test set. In this work, we advance this few-shot classification paradigm towards a scenario where unlabeled examples are also available within each episode. We consider two situations: one where all unlabeled examples are assumed to belong to the same set of classes as the labeled examples of the episode, as well as the more realistic situation where examples from other distractor classes are also provided. To address this paradigm, we propose novel extensions of prototypical networks (Snell et al. 2017) that are augmented with the ability to use unlabeled examples when producing prototypes. These models are trained in an end-to-end way on episodes, to learn to leverage the unlabeled examples successfully. We evaluate these methods on versions of the Omniglot and mini-ImageNet benchmarks, adapted to this new framework augmented with unlabeled examples. We also propose a new split of ImageNet. Our experiments confirm that our prototypical networks can learn to improve their predictions due to unlabeled examples, much like a semi-supervised algorithm would.
    A Meta-Learning Perspective on Cold-Start Recommendations for Items
    Manasi Vartak
    Arvind Thiagarajan
    Conrado Miranda
    Jeshua Bratman
    NIPS (2017)
    Abstract: Matrix factorization (MF) is one of the most popular techniques for product recommendation, but is known to suffer from serious cold-start problems. Item cold-start problems are particularly acute in settings such as Tweet recommendation where new items arrive continuously. In this paper, we present a meta-learning strategy to address item cold-start when new items arrive continuously. We propose two deep neural network architectures that implement our meta-learning strategy. The first architecture learns a linear classifier whose weights are determined by the item history while the second architecture learns a neural network whose biases are instead adapted based on item history. We evaluate our techniques on the real-world problem of Tweet recommendation. On production data at Twitter, we demonstrate that our proposed techniques significantly beat the MF baseline with lookup-table-based user embeddings and also outperform the state-of-the-art production model for Tweet recommendation.
    Modulating early visual processing by language
    Harm de Vries
    Florian Strub
    Jérémie Mary
    Olivier Pietquin
    Aaron Courville
    Advances in Neural Information Processing Systems 30 (NIPS 2017), pp. 6594-6604
    Abstract: It is commonly assumed that language refers to high-level visual concepts while leaving low-level visual processing unaffected. This view dominates the current literature in computational models for language-vision tasks, where visual and linguistic input are mostly processed independently before being fused into a single representation. In this paper, we deviate from this classic pipeline and propose to modulate the entire visual processing by linguistic input. Specifically, we condition the batch normalization parameters of a pretrained residual network (ResNet) on a language embedding. This approach, which we call MOdulated RESnet (MORES), significantly improves strong baselines on two visual question answering tasks. Our ablation study shows that modulating from the early stages of the visual processing is beneficial. We finally show that ResNet image features are effectively grounded.
    MADE: Masked Autoencoder for Distribution Estimation
    Mathieu Germain
    Karol Gregor
    Iain Murray
    Proceedings of the 32nd International Conference on Machine Learning (2015)
    Guest editors' introduction: Special section on learning deep architectures
    Samy Bengio
    Li Deng
    Honglak Lee
    Ruslan Salakhutdinov
    IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), vol. 35 (2013), pp. 1795-1797
    Domain-Adversarial Training of Neural Networks
    Yaroslav Ganin
    Evgeniya Ustinova
    Hana Ajakan
    Pascal Germain
    François Laviolette
    Mario Marchand
    Victor Lempitsky
    Journal of Machine Learning Research, vol. 17 (2016)
    An autoencoder approach to learning bilingual word representations
    Sarath Chandar A P
    Stanislas Lauly
    Mitesh Khapra
    Balaraman Ravindran
    Vikas C Raykar
    Amrita Saha
    Advances in Neural Information Processing Systems 27 (2014)
    Practical Bayesian optimization of machine learning algorithms
    Jasper Snoek
    Ryan P. Adams
    Advances in Neural Information Processing Systems 25 (2012)
    Conditional Restricted Boltzmann Machines for Structured Output Prediction
    Volodymyr Mnih
    Geoffrey E. Hinton
    UAI (2011), pp. 514-522
    The Neural Autoregressive Distribution Estimator
    Iain Murray
    Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (2011)
    Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion
    Pascal Vincent
    Isabelle Lajoie
    Yoshua Bengio
    Pierre-Antoine Manzagol
    Journal of Machine Learning Research, vol. 11 (2010)
    Learning to combine foveal glimpses with a third-order Boltzmann machine
    Geoffrey E. Hinton
    NIPS (2010), pp. 1243-1251
    Exploring strategies for training deep neural networks
    Yoshua Bengio
    Jérôme Louradour
    Journal of Machine Learning Research, vol. 10 (2009)
    Extracting and composing robust features with denoising autoencoders
    Pascal Vincent
    Yoshua Bengio
    Pierre-Antoine Manzagol
    Proceedings of the 25th International Conference on Machine Learning (2008)
    Zero-data learning of new tasks
    Yoshua Bengio
    Proceedings of the 23rd AAAI Conference on Artificial Intelligence (2008)
    Classification using discriminative restricted Boltzmann machines
    Yoshua Bengio
    Proceedings of the 25th International Conference on Machine Learning (2008)
    Greedy layer-wise training of deep networks
    Yoshua Bengio
    Dan Popovici
    Advances in Neural Information Processing Systems 19 (2007)
    An empirical evaluation of deep architectures on problems with many factors of variation
    Aaron Courville
    James Bergstra
    Yoshua Bengio
    Proceedings of the 24th International Conference on Machine Learning (2007)