Hugo Larochelle

I am a Principal Scientist on the Google DeepMind team in Montreal. My main area of expertise is deep learning. My previous work includes unsupervised pretraining with autoencoders, denoising autoencoders, visual attention-based classification, neural autoregressive distribution models, and zero-shot learning. More broadly, I am interested in applications of deep learning to natural language processing, code, computer vision, and environmental sustainability problems.

Previously, I was an Associate Professor at the Université de Sherbrooke (UdeS). I also co-founded Whetlab, which was acquired by Twitter in 2015; I then worked there as a Research Scientist in the Twitter Cortex group. From 2009 to 2011, I was a member of the machine learning group at the University of Toronto, as a postdoctoral fellow under the supervision of Geoffrey Hinton. I obtained my Ph.D. from the Université de Montréal, under the supervision of Yoshua Bengio.

My academic involvement includes serving on the boards of the International Conference on Machine Learning (ICML) and the Neural Information Processing Systems (NeurIPS) conference. I also co-founded the journal Transactions on Machine Learning Research.

Finally, I have a popular online course on deep learning and neural networks, freely accessible on YouTube.

Authored Publications
    The execution behavior of a program often depends on external resources, such as program inputs or file contents, and so the program cannot be run in isolation. Nevertheless, software developers benefit from fast iteration loops where automated tools identify errors as early as possible, even before programs can be compiled and run. This presents an interesting machine learning challenge: can we predict runtime errors in a "static" setting, where program execution is not possible? Here, we introduce a real-world dataset and task for predicting runtime errors, which we show is difficult for generic models like Transformers. As an alternative, we develop an interpreter-inspired architecture with an inductive bias towards mimicking program executions, which models exception handling and "learns to execute" descriptions of the contents of external resources. Surprisingly, we show that the model can also predict the location of the error, despite being trained only on labels indicating the presence/absence and kind of error. In total, we present a practical and difficult-yet-approachable challenge problem related to learning program execution, and we demonstrate promising new capabilities of interpreter-inspired machine learning models for code.
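    As a loose illustration of the task framing above (not the paper's actual interpreter-inspired architecture, which models control flow and exception handling), the sketch below classifies the kind of runtime error, if any, from program tokens plus a description of external resources; all names and dimensions are hypothetical.

        import torch
        import torch.nn as nn

        NUM_ERROR_KINDS = 8  # hypothetical label space; index 0 = "no error"

        class StaticErrorPredictor(nn.Module):
            def __init__(self, vocab_size, dim=128):
                super().__init__()
                self.embed = nn.Embedding(vocab_size, dim)
                # One recurrent step per token, loosely mimicking execution order.
                self.step = nn.GRU(dim, dim, batch_first=True)
                self.head = nn.Linear(dim, NUM_ERROR_KINDS)

            def forward(self, resource_tokens, program_tokens):
                # Condition on the external-resource description by prepending it.
                x = self.embed(torch.cat([resource_tokens, program_tokens], dim=1))
                _, state = self.step(x)
                return self.head(state[-1])  # logits over error kinds

        model = StaticErrorPredictor(vocab_size=10_000)
        logits = model(torch.randint(0, 10_000, (2, 16)),
                       torch.randint(0, 10_000, (2, 64)))
        print(logits.shape)  # torch.Size([2, 8])
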
    Head2Toe: Utilizing Intermediate Representations for Better Transfer Learning
    Mike Mozer
    Proceedings of the 39th International Conference on Machine Learning, PMLR (2022)
    Transfer-learning methods aim to improve performance in a data-scarce target domain using a model pretrained on a source domain. A cost-efficient strategy, linear probing, involves freezing the source model and training a new classification head for the target domain. This strategy is outperformed by a more costly but state-of-the-art method, fine-tuning all parameters of the source model to the target domain, possibly because fine-tuning allows the model to leverage useful information from intermediate layers which is otherwise discarded. We explore the hypothesis that these intermediate layers might be directly exploited by linear probing. We propose a method, Head2Toe, that selects features from all layers of the source model to train a target-domain classification head. In evaluations on the Visual Task Adaptation Benchmark, Head2Toe matches performance obtained with fine-tuning on average, but critically, for out-of-distribution transfer, Head2Toe outperforms fine-tuning.
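    A minimal sketch of the Head2Toe idea, under simplifying assumptions: the backbone, relevance score, and dimensions below are stand-ins (the paper selects features with a group-lasso-style criterion over a frozen pretrained model), but the core move is the same: probe features from all layers rather than only the last one.

        import torch
        import torch.nn as nn

        class Backbone(nn.Module):  # stand-in for a frozen pretrained source model
            def __init__(self):
                super().__init__()
                self.l1 = nn.Linear(32, 64)
                self.l2 = nn.Linear(64, 64)
                self.l3 = nn.Linear(64, 16)
            def forward(self, x):
                h1 = torch.relu(self.l1(x))
                h2 = torch.relu(self.l2(h1))
                return [h1, h2, self.l3(h2)]  # expose every layer's features

        backbone = Backbone().requires_grad_(False)  # source model stays frozen

        def select_features(x, k=96):
            feats = torch.cat(backbone(x), dim=-1)  # concatenate all layers
            scores = feats.abs().mean(dim=0)        # toy relevance score; in
            idx = scores.topk(k).indices            # practice computed once on
            return feats[:, idx]                    # the target training set

        head = nn.Linear(96, 10)  # new target-domain classification head
        logits = head(select_features(torch.randn(8, 32)))
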
    The goal of program synthesis from examples is to find a computer program that is consistent with a given set of input-output examples. Most learning-based approaches try to find a program that satisfies all examples at once. Our work, by contrast, considers an approach that breaks the problem into two stages: (a) find programs that satisfy only one example, and (b) leverage these per-example solutions to yield a program that satisfies all examples. We introduce the Cross Aggregator, a neural network module based on a multi-head attention mechanism that learns to combine the cues present in these per-example solutions to synthesize a global solution. Evaluation across programs of different lengths and under two different experimental settings reveals that, when given the same budget, our technique significantly improves the success rate over PCCoder [Zohar et al., 2018] and other ablation baselines.
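    The following is a hedged sketch of the cross-aggregation step only: multi-head attention lets positions in the pooled per-example solutions attend to each other, producing a combined cue for a downstream program decoder. Shapes and the pooling choice are illustrative assumptions, not the paper's exact module.

        import torch
        import torch.nn as nn

        dim, n_examples, prog_len = 64, 5, 12
        attn = nn.MultiheadAttention(embed_dim=dim, num_heads=4, batch_first=True)

        # Embeddings of the per-example solution programs, one per I/O example,
        # flattened into a single sequence (produced by some program encoder).
        per_example = torch.randn(1, n_examples * prog_len, dim)

        # Self-attention across all per-example solutions combines their cues.
        combined, _ = attn(per_example, per_example, per_example)
        global_cue = combined.mean(dim=1)  # pooled signal guiding global synthesis
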
    Meta-learning and transfer learning are two successful families of approaches to few-shot learning. Despite highly related goals, state-of-the-art advances in each family are measured largely in isolation of each other. As a result of diverging evaluation norms, a direct or thorough comparison of different approaches is challenging. To bridge this gap, we introduce a few-shot classification evaluation protocol named VTAB+MD with the explicit goal of facilitating the sharing of insights between the two communities. We demonstrate its accessibility in practice by performing a cross-family study of the best transfer and meta learners, which report results on both a large-scale meta-learning benchmark (Meta-Dataset, MD) and a transfer learning benchmark (Visual Task Adaptation Benchmark, VTAB). We find that, on average, large-scale transfer methods (Big Transfer, BiT) outperform competing approaches on MD, even when trained only on ImageNet. In contrast, meta-learning approaches struggle to compete on VTAB when trained and validated on MD. However, BiT is not without limitations, and pushing for scale does not improve performance on highly out-of-distribution MD tasks. We hope that this work contributes to accelerating progress on few-shot learning research.
    Few-shot dataset generalization is a challenging variant of the well-studied few-shot classification problem, where a diverse training set of several datasets is given, for the purpose of training an adaptable model that can then learn classes from new datasets using only a few examples. To this end, we propose to utilize the diverse training set to construct a universal template: a structure that can define a wide array of dataset-specialized models, by plugging in appropriate parameter-light components. For each new few-shot classification problem, our approach therefore only requires inferring a small number of task-specific parameters to insert into the universal template. We design a separate network that produces a carefully crafted initialization of those parameters for each given task, and we then fine-tune its proposed initialization via a few steps of gradient descent. Our approach is more parameter-efficient, scalable, and adaptable than previous methods, and it achieves state-of-the-art results on the challenging Meta-Dataset benchmark.
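    A simplified sketch of the universal-template idea: a shared backbone in which only per-channel FiLM scales and shifts are task-specific, so adapting to a new task touches very few parameters. The block structure and sizes here are assumptions for illustration, not the paper's exact design.

        import torch
        import torch.nn as nn

        class FiLMBlock(nn.Module):
            def __init__(self, c_in, c_out):
                super().__init__()
                self.conv = nn.Conv2d(c_in, c_out, 3, padding=1)  # shared weights
                # Task-specific, parameter-light: one scale and shift per channel.
                self.gamma = nn.Parameter(torch.ones(c_out))
                self.beta = nn.Parameter(torch.zeros(c_out))
            def forward(self, x):
                h = self.conv(x)
                return torch.relu(self.gamma[None, :, None, None] * h
                                  + self.beta[None, :, None, None])

        template = nn.Sequential(FiLMBlock(3, 32), FiLMBlock(32, 64))
        # For each new task: keep the convolutions frozen and adapt only the
        # FiLM parameters (ideally from a learned initialization) for a few steps.
        task_params = [p for n, p in template.named_parameters()
                       if n.endswith(('gamma', 'beta'))]
        optimizer = torch.optim.SGD(task_params, lr=0.01)
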
    Few-shot classification aims to recognize unseen classes given only a few samples. We consider the problem of multi-domain few-shot image classification, where unseen classes and examples come from diverse data sources. This problem has seen growing interest and has inspired the development of benchmarks such as Meta-Dataset. A key challenge in this multi-domain setting is effectively integrating the feature representations from the diverse set of training domains. Here, we propose a Universal Representation Transformer (URT) layer that meta-learns to leverage universal features for few-shot classification by dynamically re-weighting and composing the most appropriate domain-specific representations. In experiments, we show that URT sets a new state-of-the-art result on Meta-Dataset: specifically, it outperforms the best previous model on 3 data sources and matches it on the others. We analyze variants of URT and present a visualization of attention score heatmaps that sheds light on how the model performs cross-domain generalization.
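    Below is an illustrative sketch of a URT-style layer: given features extracted by several domain-specific backbones, a learned query produces attention scores that re-weight and sum those features into a single universal representation. The dimensions and single-head form are assumptions, not the paper's configuration.

        import torch
        import torch.nn as nn

        class URTHead(nn.Module):
            def __init__(self, dim):
                super().__init__()
                self.query = nn.Parameter(torch.randn(dim))  # learned task query
                self.key = nn.Linear(dim, dim)

            def forward(self, domain_feats):  # (batch, n_domains, dim)
                keys = self.key(domain_feats)
                scores = torch.softmax(keys @ self.query, dim=-1)  # domain weights
                return (scores.unsqueeze(-1) * domain_feats).sum(dim=1)

        feats = torch.randn(8, 7, 128)   # features from 7 domain backbones
        universal = URTHead(128)(feats)  # composed representation, (8, 128)
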
    Impact of Aliasing on Generalization in Deep Convolutional Networks
    Nicolas Le Roux
    Rob Romijnders
    International Conference on Computer Vision (ICCV), IEEE/CVF (2021)
    Traditionally, image pre-processing in the frequency domain has played a vital role in computer vision, and it was even part of the standard pipeline in the early days of deep learning. However, with the advent of large datasets, many practitioners concluded that this was unnecessary, believing that these priors can be learned from the data itself if they aid in achieving stronger performance. Frequency aliasing is a phenomenon that may occur when down-sampling (sub-sampling) any signal, such as an image or feature map. We demonstrate that substantial improvements in out-of-distribution (OOD) generalization can be obtained by mitigating the effects of aliasing: placing non-trainable blur filters and using smooth activation functions at key locations in the ResNet family of architectures helps achieve new state-of-the-art results on two benchmarks without any hyper-parameter sweeps.
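    The recipe above reduces to two small architectural changes; a minimal sketch, assuming a 3x3 binomial low-pass kernel and the SiLU activation (the paper's exact filter choices and placements differ by architecture):

        import torch
        import torch.nn.functional as F

        def blur_downsample(x, stride=2):
            # Fixed (non-trainable) 3x3 binomial filter, an approximate Gaussian.
            k = torch.tensor([1., 2., 1.])
            k = (k[:, None] * k[None, :]) / 16.0
            c = x.shape[1]
            k = k.view(1, 1, 3, 3).repeat(c, 1, 1, 1)  # one filter per channel
            return F.conv2d(x, k, stride=stride, padding=1, groups=c)

        x = torch.randn(1, 64, 32, 32)
        y = blur_downsample(F.silu(x))  # smooth activation, then anti-aliased stride
        print(y.shape)                  # torch.Size([1, 64, 16, 16])
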
    Graph-based neural network models are producing strong results in a number of domains, in part because graphs provide flexibility to encode domain knowledge in the form of relational structure (edges) between nodes in the graph. In practice, edges are used both to represent intrinsic structure (e.g., bonds in chemical molecules or abstract syntax trees of programs) and more abstract relations that aid reasoning for a downstream task (e.g., results of relevant program analyses). In this work, we study the problem of learning to derive abstract relations from the intrinsic graph structure. Motivated by their power in program analyses, we consider relations defined by paths on the base graph accepted by a finite-state automaton. We show how to learn these relations end-to-end by relaxing the problem into learning finite-state automaton policies on a graph-based POMDP and then training these policies using implicit differentiation. The result is a differentiable Graph Finite-State Automaton (GFSA) layer that adds a new edge type (expressed as a weighted adjacency matrix) to a base graph. We demonstrate that this layer can find shortcuts in grid-world graphs and reproduce simple static analyses on Python programs. Additionally, we combine the GFSA layer with a larger graph-based model trained end-to-end on the variable-misuse program-understanding task, and find that this model outperforms baseline methods even without the hand-engineered semantic edges that those baselines use.
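    A very loose sketch of the core output of such a layer: a new soft edge type derived from differentiable walks over the existing edge types. The automaton states and POMDP policy learning of the actual GFSA layer are omitted here; the transition and halting weights are placeholders for learned parameters.

        import torch

        n_nodes, n_edge_types, max_steps = 6, 3, 4
        base = torch.rand(n_edge_types, n_nodes, n_nodes)  # per-type adjacency
        base = base / base.sum(-1, keepdim=True)           # row-normalize

        mix = torch.softmax(torch.randn(max_steps, n_edge_types), dim=-1)
        halt = torch.sigmoid(torch.randn(max_steps))       # learned halting probs

        state = torch.eye(n_nodes)             # a walk started at every node
        new_adj = torch.zeros(n_nodes, n_nodes)
        for t in range(max_steps):
            step = (mix[t, :, None, None] * base).sum(0)  # soft edge-type choice
            state = state @ step                          # advance all walks
            new_adj = new_adj + halt[t] * state           # emit edges on halting
        # new_adj is a weighted adjacency matrix usable as an extra edge type.
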
    Revisiting Fundamentals of Experience Replay
    Liam B. Fedus
    Mark Rowland
    Prajit Ramachandran
    Will Dabney
    Yoshua Bengio
    International Conference on Machine Learning (2020)
    Experience replay is central to off-policy algorithms in deep reinforcement learning (RL), but there remain significant gaps in our understanding of it. We therefore present a systematic and extensive analysis of experience replay in Q-learning methods, focusing on two fundamental properties: the replay capacity and the ratio of learning updates to experience collected (the replay ratio). Our additive and ablative studies upend conventional wisdom around experience replay: greater capacity is found to substantially increase the performance of certain algorithms, while leaving others unaffected. Counterintuitively, we show that theoretically ungrounded, uncorrected n-step returns are uniquely beneficial, while other techniques confer limited benefit for sifting through larger memory. Separately, by directly controlling the replay ratio, we contextualize previous observations in the literature and empirically measure its importance across a variety of deep RL algorithms. Finally, we conclude by testing a set of hypotheses on the nature of these performance benefits.
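    The two quantities studied above are easy to pin down in code; a generic sketch (not the paper's agents or hyper-parameters), where replay capacity bounds the buffer and the replay ratio schedules gradient updates per collected transition:

        import random
        from collections import deque

        REPLAY_CAPACITY = 100_000  # oldest transitions evicted beyond this bound
        REPLAY_RATIO = 0.25        # learning updates per transition collected

        buffer = deque(maxlen=REPLAY_CAPACITY)
        updates_owed = 0.0

        def on_transition(transition, learner_update, min_fill=1_000):
            """Store one transition and run the updates the replay ratio owes."""
            global updates_owed
            buffer.append(transition)
            updates_owed += REPLAY_RATIO
            while updates_owed >= 1.0 and len(buffer) >= min_fill:
                batch = random.sample(buffer, k=32)  # uniform replay sampling
                learner_update(batch)                # one Q-learning gradient step
                updates_owed -= 1.0
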
    The Hanabi Challenge: A New Frontier for AI Research
    Nolan Bard
    Jakob N. Foerster
    Sarath Chandar
    Neil Burch
    Marc Lanctot
    H. Francis Song
    Emilio Parisotto
    Subhodeep Moitra
    Edward Hughes
    Iain Dunning
    Shibl Mourad
    Marc G. Bellemare
    Michael Bowling
    Artificial Intelligence, 280 (2020)
    From the early days of computing, games have been important testbeds for studying how well machines can do sophisticated decision making. In recent years, machine learning has made dramatic advances with artificial agents reaching superhuman performance in challenge domains like Go, Atari, and some variants of poker. As with their predecessors chess, checkers, and backgammon, these game domains have driven research by providing sophisticated yet well-defined challenges for artificial intelligence practitioners. We continue this tradition by proposing the game of Hanabi as a new challenge domain, with novel problems that arise from its combination of purely cooperative gameplay among two to five players and imperfect information. In particular, we argue that Hanabi elevates reasoning about the beliefs and intentions of other agents to the foreground. We believe that developing novel techniques for such theory-of-mind reasoning will be crucial for success not only in Hanabi but also in broader collaborative efforts, especially those with human partners. To facilitate future research, we introduce the open-source Hanabi Learning Environment, propose an experimental framework for the research community to evaluate algorithmic advances, and assess the performance of current state-of-the-art techniques.
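    For reference, a minimal random-play loop against the open-source Hanabi Learning Environment; the calls below follow my reading of the project's README (pip package hanabi-learning-environment) and may differ across versions.

        import random
        from hanabi_learning_environment import rl_env

        env = rl_env.make('Hanabi-Full', num_players=2)
        obs = env.reset()
        done = False
        while not done:
            me = obs['current_player']
            # Choose uniformly among the legal moves in the current observation.
            action = random.choice(obs['player_observations'][me]['legal_moves'])
            obs, reward, done, _ = env.step(action)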