Pablo Samuel Castro


Authored Publications
    Offline Reinforcement Learning with On-Policy Q-Function Regularization
    Laixi Shi
    Yuejie Chi
    Matthieu Geist
    European Conference on Machine Learning (ECML) (2023)
    The core challenge of offline reinforcement learning (RL) is dealing with the (potentially catastrophic) extrapolation error induced by the distribution shift between the history dataset and the desired policy. A large portion of prior work tackles this challenge by implicitly/explicitly regularizing the learning policy towards the behavior policy, which is hard to estimate reliably in practice. In this work, we propose to regularize towards the Q-function of the behavior policy instead of the behavior policy itself, under the premise that the Q-function can be estimated more reliably and easily by a SARSA-style estimate, and that it handles the extrapolation error more straightforwardly. We propose two algorithms that take advantage of the estimated Q-function through regularization, and demonstrate that they exhibit strong performance on the D4RL benchmarks.
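The paper's two algorithms are not reproduced here, but the core idea lends itself to a compact illustration. Below is a minimal tabular sketch (the toy dataset, hyperparameters, and additive penalty are illustrative assumptions, not the paper's algorithms): first fit the behavior policy's Q-function with a SARSA-style update, then regularize the learned Q-function toward that estimate rather than toward the behavior policy.

```python
import numpy as np

# Minimal tabular sketch: SARSA-style estimate of the behavior Q-function,
# then Q-learning regularized toward it (all specifics are illustrative).
num_states, num_actions = 10, 4
gamma, alpha, reg_weight = 0.99, 0.1, 1.0
rng = np.random.default_rng(0)

# Offline dataset of (s, a, r, s', a') tuples from some unknown behavior policy.
dataset = [(rng.integers(num_states), rng.integers(num_actions), rng.normal(),
            rng.integers(num_states), rng.integers(num_actions))
           for _ in range(5000)]

# Step 1: SARSA-style estimate of the behavior policy's Q-function.
q_behavior = np.zeros((num_states, num_actions))
for s, a, r, s2, a2 in dataset:
    q_behavior[s, a] += alpha * (r + gamma * q_behavior[s2, a2] - q_behavior[s, a])

# Step 2: Q-learning on the same data, with a penalty pulling the learned
# Q-function toward the behavior Q-function instead of the behavior policy.
q = np.zeros((num_states, num_actions))
for s, a, r, s2, a2 in dataset:
    td_error = r + gamma * q[s2].max() - q[s, a]
    reg_error = q_behavior[s, a] - q[s, a]
    q[s, a] += alpha * (td_error + reg_weight * reg_error)
```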
    Bigger, Better, Faster: Human-level Atari with human-level efficiency
    Max Schwarzer
    Johan Obando Ceron
    Aaron Courville
    Marc Bellemare
    ICML (2023)
    We introduce a value-based RL agent, which we call BBF, that achieves super-human performance in the Atari 100K benchmark. BBF relies on scaling the neural networks used for value estimation, as well as a number of other design choices that enable this scaling in a sample-efficient manner. We conduct extensive analyses of these design choices and provide insights for future work. We end with a discussion about moving the goalpost for sample-efficient RL research on the ALE.
    Proto-Value Networks: Scaling Representation Learning with Auxiliary Tasks
    Jesse Farebrother
    Joshua Greaves
    Charline Le Lan
    Marc Bellemare
    International Conference on Learning Representations (ICLR) (2023)
    Auxiliary tasks improve the representations learned by deep reinforcement learning agents. Analytically, their effect is reasonably well understood; in practice, however, their primary use remains in support of a main learning objective, rather than as a method for learning representations. This is perhaps surprising given that many auxiliary tasks are defined procedurally, and hence can be treated as an essentially infinite source of information about the environment. Based on this observation, we study the effectiveness of auxiliary tasks for learning rich representations, focusing on the setting where the number of tasks and the size of the agent's network are simultaneously increased. For this purpose, we derive a new family of auxiliary tasks based on the successor measure. These tasks are easy to implement and have appealing theoretical properties. Combined with a suitable off-policy learning rule, the result is a representation learning algorithm that can be understood as extending Mahadevan & Maggioni (2007)'s proto-value functions to deep reinforcement learning – accordingly, we call the resulting object proto-value networks. Through a series of experiments on the Arcade Learning Environment, we demonstrate that proto-value networks produce rich features that may be used to obtain performance comparable to established algorithms, using only linear approximation and a small number (~4M) of interactions with the environment's reward function.
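As a rough illustration of the quantity behind these auxiliary tasks, the snippet below computes exact successor measures of randomly sampled state sets in a tiny tabular MDP and stacks them as prediction targets; the toy MDP, the number of tasks, and the indicator-set construction are illustrative assumptions, not the paper's procedure.

```python
import numpy as np

# Toy illustration: the successor measure of a set B under a fixed policy is
# the expected discounted number of visits to B. We compute it exactly in a
# small random MDP and use randomly sampled sets B as auxiliary targets.
num_states, gamma, num_tasks = 8, 0.9, 16
rng = np.random.default_rng(1)

# Random transition matrix P[s, s'] under a fixed policy.
P = rng.random((num_states, num_states))
P /= P.sum(axis=1, keepdims=True)

# Successor representation: rows of (I - gamma * P)^{-1} are discounted visitations.
sr = np.linalg.inv(np.eye(num_states) - gamma * P)

# Each auxiliary task is the successor measure of a random subset of states.
targets = []
for _ in range(num_tasks):
    indicator = rng.integers(0, 2, size=num_states).astype(float)
    targets.append(sr @ indicator)   # per-state discounted visitation of the set
targets = np.stack(targets, axis=1)  # shape: (num_states, num_tasks)
print(targets.shape)
```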
    Behavioural metrics have been shown to be an effective mechanism for constructing representations in reinforcement learning. We present a novel perspective on behavioural metrics for Markov decision processes via the use of positive definite kernels. We leverage this new perspective to define a new metric that is provably equivalent to the recently introduced MICo distance (Castro et al., 2021). The kernel perspective further enables us to provide new theoretical results, which have so far eluded prior work. These include bounding value function differences by means of our metric, and the demonstration that our metric can be provably embedded into a finite-dimensional Euclidean space with low distortion error. These are two crucial properties when using behavioural metrics for reinforcement learning representations. We complement our theory with strong empirical results that demonstrate the effectiveness of these methods in practice.
    Learning tabula rasa, that is without any prior knowledge, is the prevalent workflow in reinforcement learning (RL) research. However, RL systems, when applied to large-scale settings, rarely operate tabula rasa. Such large-scale systems undergo multiple design or algorithmic changes during their development cycle and use ad hoc approaches for incorporating these changes without re-training from scratch, which would have been prohibitively expensive. Additionally, the inefficiency of deep RL typically excludes researchers without access to industrial-scale resources from tackling computationally demanding problems. To address these issues, we present reincarnating RL as an alternative workflow or class of problem settings, where prior computational work (e.g., learned policies) is reused or transferred between design iterations of an RL agent, or from one RL agent to another. As a step towards enabling reincarnating RL from any agent to any other agent, we focus on the specific setting of efficiently transferring an existing sub-optimal policy to a standalone value-based RL agent. We find that existing approaches fail in this setting and propose a simple algorithm to address their limitations. Equipped with this algorithm, we demonstrate reincarnating RL's gains over tabula rasa RL on Atari 2600 games, a challenging locomotion task, and the real-world problem of navigating stratospheric balloons. Overall, this work argues for an alternative approach to RL research, which we believe could significantly improve real-world RL adoption and help democratize it further. Open-sourced code and trained agents are available online.
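A minimal tabular sketch of the studied setting is given below: a value-based student learns with ordinary Q-learning while an annealed term nudges it toward a sub-optimal teacher policy. The toy MDP, the annealing schedule, and the form of the distillation nudge are illustrative assumptions, not the paper's algorithm.

```python
import numpy as np

# Sketch of policy-to-value transfer (illustrative only): the student runs a
# standard TD update, plus a decaying bias toward the teacher's action.
num_states, num_actions, gamma, alpha = 10, 4, 0.99, 0.1
rng = np.random.default_rng(2)
teacher_policy = rng.integers(num_actions, size=num_states)  # sub-optimal prior
q = np.zeros((num_states, num_actions))

for step in range(20000):
    s, a = rng.integers(num_states), rng.integers(num_actions)
    r, s2 = rng.normal(), rng.integers(num_states)
    q[s, a] += alpha * (r + gamma * q[s2].max() - q[s, a])   # usual TD update
    # Distillation-style term: prefer the teacher's action early on, then anneal.
    teacher_weight = max(0.0, 1.0 - step / 10000)
    q[s, teacher_policy[s]] += alpha * 0.1 * teacher_weight
```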
    The State of Sparse Training in Deep Reinforcement Learning
    Erich Elsen
    Proceedings of the 39th International Conference on Machine Learning, PMLR (2022)
    The use of sparse neural networks has seen rapid growth in recent years, particularly in computer vision; their appeal stems largely from the reduced number of parameters required to train and store them, as well as from an increase in learning efficiency. Somewhat surprisingly, there have been very few efforts exploring their use in deep reinforcement learning (DRL). In this work we perform a systematic investigation into applying a number of existing sparse training techniques on a variety of DRL agents and environments. Our results highlight the overall challenge that reinforcement learning poses for sparse training methods, complemented by detailed analyses on how the various components in DRL are affected by the use of sparse networks. We conclude by suggesting some promising avenues for improving the effectiveness of general sparse training methods, as well as for advancing their use in DRL.
    A general class of surrogate functions for stable and efficient reinforcement learning
    Sharan Vaswani
    Simone Totaro
    Robert Müller
    Shivam Garg
    Matthieu Geist
    Marlos C. Machado
    Nicolas Le Roux
    AISTATS (2022)
    Common policy gradient methods rely on the maximization of a sequence of surrogate functions. In recent years, many such surrogate functions have been proposed, most without strong theoretical guarantees, leading to algorithms such as TRPO, PPO, or MPO. Rather than design yet another surrogate function, we instead propose a general framework (FMA-PG) based on functional mirror ascent that gives rise to an entire family of surrogate functions. We construct surrogate functions that enable policy improvement guarantees, a property not shared by most existing surrogate functions. Crucially, these guarantees hold regardless of the choice of policy parameterization. Moreover, a particular instantiation of FMA-PG recovers important implementation heuristics (e.g., using forward vs reverse KL divergence) resulting in a variant of TRPO with additional desirable properties. Via experiments on simple reinforcement learning problems, we evaluate the algorithms instantiated by FMA-PG. The proposed framework also suggests an improved variant of PPO, whose robustness and efficiency we empirically demonstrate on the MuJoCo suite.
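To make the surrogate-function idea concrete, here is a toy sketch for a softmax policy in a single-state problem: the surrogate combines an advantage term with a KL term toward the previous policy (reverse KL in this sketch, one of the heuristics the framework recovers), and one ascent step is taken by finite differences. All quantities and hyperparameters are illustrative assumptions, not the FMA-PG construction itself.

```python
import numpy as np

# Toy surrogate for a softmax policy: advantage term plus a reverse-KL
# "mirror" term toward the previous policy (illustrative only).
rng = np.random.default_rng(3)
advantages = rng.normal(size=5)          # toy advantage estimates
logits_old = np.zeros(5)

def softmax(z):
    z = z - z.max()
    p = np.exp(z)
    return p / p.sum()

def surrogate(logits, eta=0.1):
    pi, pi_old = softmax(logits), softmax(logits_old)
    kl = np.sum(pi * np.log(pi / pi_old))            # reverse KL(pi || pi_old)
    return np.dot(pi, advantages) - kl / eta

# One ascent step on the surrogate via finite differences.
logits = logits_old.copy()
grad = np.zeros_like(logits)
for i in range(len(logits)):
    bump = np.zeros_like(logits); bump[i] = 1e-5
    grad[i] = (surrogate(logits + bump) - surrogate(logits - bump)) / 2e-5
logits += 0.5 * grad
print(softmax(logits))
```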
    Revisiting Rainbow: More inclusive deep reinforcement learning research
    Johan Samir Obando-Cerón
    Proceedings of the 38th International Conference on Machine Learning, PMLR (2021)
    Since the introduction of DQN by Mnih et al. (2015), the vast majority of reinforcement learning research has focused on the use of deep neural networks. New methods are typically evaluated on a set of environments that have now become standard, such as the Arcade Learning Environment (ALE) (Bellemare et al., 2013). While these benchmarks help standardize evaluation, their computational cost has the unfortunate side effect of widening the gap between those with ample access to computational resources, and those without. In this work we argue that, despite the community's emphasis on large-scale environments, the traditional "small-scale" environments can still yield valuable scientific insights and can help reduce the barriers to entry for newcomers from underserved communities. To substantiate our claims, we empirically revisit the paper which introduced the Rainbow algorithm (Hessel et al., 2018) and present some new insights into the algorithms used by Rainbow.
    Metrics and continuity in reinforcement learning
    Charline Le Lan
    Marc G. Bellemare
    AAAI 2021 (2021)
    Reinforcement learning techniques are being applied to increasingly larger systems where it becomes untenable to maintain direct estimates for individual states, in particular for continuous-state systems. Instead, researchers often leverage state similarity (whether implicitly or explicitly) to build models that can generalize well from a limited set of samples. The notion of state similarity used is thus of crucial importance, as it will directly affect the quality of the approximations and performance of the algorithms. Indeed, there have been a number of works that investigate – both on a theoretical and an empirical basis – how best to construct these neighborhoods and topologies. However, the choice of metric is not always clear and is often not fully specified when new algorithms are introduced. In this paper we aim to clarify the landscape of existing metrics and provide guidelines for the choice of metric when designing or implementing algorithms. We do this by first introducing a unified formalism for specifying these topologies, through the lens of metrics or distance measures, and clarify the relationship between them. We establish a hierarchy amongst the different metrics and their theoretical implications on the Markov Decision Process (MDP) specifying the reinforcement learning problem. We complement our theoretical results with empirical evaluations showcasing the differences between the metrics considered.
    Deep Reinforcement Learning at the Edge of the Statistical Precipice
    Max Schwarzer
    Aaron Courville
    Marc G. Bellemare
    Advances in Neural Information Processing Systems (2021)
    Deep reinforcement learning (RL) algorithms are predominantly evaluated by comparing their relative performance on a large suite of tasks. Most published results on deep RL benchmarks compare point estimates of aggregate performance such as mean and median scores across tasks, ignoring the statistical uncertainty implied by the use of a finite number of training runs. Beginning with the Arcade Learning Environment (ALE), the shift towards computationally-demanding benchmarks has led to the practice of evaluating only a small number of runs per task, exacerbating the statistical uncertainty in point estimates. In this paper, we argue that reliable evaluation in the few-run deep RL regime cannot ignore the uncertainty in results without running the risk of slowing down progress in the field. We illustrate this point using a case study on the Atari 100k benchmark, where we find substantial discrepancies between conclusions drawn from point estimates alone versus a more thorough statistical analysis. With the aim of increasing the field's confidence in results reported with a handful of runs, we advocate for reporting interval estimates of aggregate performance and propose performance profiles to account for the variability in results, as well as present more robust and efficient aggregate metrics, such as interquartile mean scores, to achieve small uncertainty in results. Using such statistical tools, we scrutinize performance evaluations of existing algorithms on other widely used RL benchmarks including the ALE, Procgen, and the DeepMind Control Suite, again revealing discrepancies in prior comparisons. Our findings call for a change in how we evaluate performance in deep RL, for which we present a more rigorous evaluation methodology, accompanied by an open-source library, rliable, to prevent unreliable results from stagnating the field. See agarwl.github.io/rliable for more details.
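The open-source rliable library implements these tools; purely as an illustration, the standalone sketch below computes an interquartile mean over normalized scores and a simple percentile-bootstrap confidence interval over runs (the score matrix and resampling scheme are assumptions, not the library's exact stratified procedure).

```python
import numpy as np

# Standalone sketch: interquartile mean (IQM) of normalized scores plus a
# percentile-bootstrap 95% confidence interval over runs (illustrative only).
rng = np.random.default_rng(4)
scores = rng.random((10, 26))   # (num_runs, num_tasks) normalized scores

def iqm(x):
    """Mean of the middle 50% of all run/task scores."""
    flat = np.sort(x.flatten())
    n = len(flat)
    return flat[n // 4: (3 * n) // 4].mean()

def bootstrap_ci(x, num_resamples=2000, alpha=0.05):
    stats = []
    for _ in range(num_resamples):
        runs = rng.integers(0, x.shape[0], size=x.shape[0])  # resample runs
        stats.append(iqm(x[runs]))
    return np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])

print("IQM:", iqm(scores), "95% CI:", bootstrap_ci(scores))
```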
    We present a new behavioural distance over the state space of a Markov decision process, and demonstrate the use of this distance as an effective means of shaping the learnt representations of deep reinforcement learning agents. While existing notions of state similarity are typically difficult to learn at scale due to high computational cost and lack of sample-based algorithms, our newly-proposed distance addresses both of these issues. In addition to providing detailed theoretical analysis, we provide empirical evidence that learning this distance alongside the value function yields structured and informative representations, including strong results on the Arcade Learning Environment benchmark.
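A tabular, sample-based sketch of a behavioural-distance update in this spirit is shown below; the toy MDP and the exact operator are illustrative assumptions, and the deep-RL version described in the abstract learns the distance alongside the value function rather than in a lookup table.

```python
import numpy as np

# Sample-based fixed-point iteration for a behavioural distance: mix a reward
# difference with the distance between independently sampled next states.
num_states, gamma, alpha = 6, 0.9, 0.1
rng = np.random.default_rng(5)
rewards = rng.normal(size=num_states)
P = rng.random((num_states, num_states))
P /= P.sum(axis=1, keepdims=True)

U = np.zeros((num_states, num_states))
for _ in range(50000):
    x, y = rng.integers(num_states), rng.integers(num_states)
    x2 = rng.choice(num_states, p=P[x])   # next states sampled independently
    y2 = rng.choice(num_states, p=P[y])
    target = abs(rewards[x] - rewards[y]) + gamma * U[x2, y2]
    U[x, y] += alpha * (target - U[x, y])
print(np.round(U, 2))
```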
    Offline reinforcement learning, which uses observational data instead of active environmental interaction, has been shown to be a challenging problem. Recent solutions typically involve constraints on the learner's policy, preventing strong deviations from the state-action distribution of the dataset. Although the suggested methods are evaluated using non-linear function approximation, their theoretical justifications are mostly limited to the tabular or linear cases. Given the impressive results of deep reinforcement learning, we argue for a clearer understanding of the challenges in this setting. In the vein of Held & Hein's classic 1963 experiment, we propose "tandem learning", an experimental paradigm which facilitates our in-depth empirical analysis of the difficulties in offline reinforcement learning. We identify function approximation in conjunction with inadequate data distributions as the strongest factors, thereby extending but also challenging certain assumptions made in past work. Our results provide a more principled view, and new insights on potential directions for future work on offline reinforcement learning.
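The tandem protocol itself is easy to sketch: an active agent collects data with its own policy while a passive twin trains on exactly the same transition stream. The tabular toy below (an illustrative MDP, not the paper's deep-RL setup) shows the protocol; the failure modes reported in the paper appear with function approximation, not in the tabular case.

```python
import numpy as np

# Tandem protocol sketch: the "active" agent acts epsilon-greedily and learns;
# the "passive" twin learns from the identical stream but never acts.
num_states, num_actions, gamma, alpha, eps = 6, 3, 0.95, 0.1, 0.1
rng = np.random.default_rng(6)
R = rng.normal(size=(num_states, num_actions))
P = rng.random((num_states, num_actions, num_states))
P /= P.sum(axis=2, keepdims=True)

q_active = rng.normal(scale=0.01, size=(num_states, num_actions))
q_passive = rng.normal(scale=0.01, size=(num_states, num_actions))
s = 0
for _ in range(50000):
    a = rng.integers(num_actions) if rng.random() < eps else int(q_active[s].argmax())
    r, s2 = R[s, a], rng.choice(num_states, p=P[s, a])
    for q in (q_active, q_passive):   # identical data, independent updates
        q[s, a] += alpha * (r + gamma * q[s2].max() - q[s, a])
    s = s2

print("max Q gap between twins:", np.abs(q_active - q_passive).max())
```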
    Contrastive Behavioural Similarity Embeddings for Generalization in Reinforcement Learning
    Marlos C. Machado
    Marc G. Bellemare
    International Conference on Learning Representations (2021)
    Reinforcement learning methods trained on few environments rarely learn policies that generalize to unseen environments. To improve generalization, we incorporate the inherent sequential structure in reinforcement learning for learning better representations. This approach is orthogonal to recent approaches, which rarely exploit this structure explicitly. Specifically, we introduce a theoretically motivated policy similarity metric (PSM) for measuring behavioural similarity between states. PSM assigns high similarity to states for which the optimal policies in those states as well as in future states are similar. We also present a contrastive representation learning procedure to embed any state similarity metric, which we instantiate with PSM to obtain policy similarity embeddings (PSEs). We demonstrate that PSEs improve generalization on diverse benchmarks, including LQR with spurious correlations, a jumping task from pixels, and the Distracting DM Control Suite. Source code will be made available at agarwl.github.io/pse.
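As a rough sketch of the contrastive step, the snippet below treats a toy matrix of policy-similarity distances as given, picks each state's most similar counterpart as its positive, and computes an InfoNCE-style loss between embeddings of states from two environments; the shapes, temperature, and toy PSM values are assumptions, not the paper's implementation.

```python
import numpy as np

# Contrastive step sketch: positives are chosen by (toy) PSM distances,
# all other cross-environment states act as negatives (illustrative only).
rng = np.random.default_rng(7)
n, dim, temperature = 8, 16, 0.1
z_x = rng.normal(size=(n, dim))          # embeddings of states from env 1
z_y = rng.normal(size=(n, dim))          # embeddings of states from env 2
psm = rng.random((n, n))                 # toy policy-similarity distances

def normalize(z):
    return z / np.linalg.norm(z, axis=1, keepdims=True)

sim = normalize(z_x) @ normalize(z_y).T / temperature   # cosine similarities
positives = psm.argmin(axis=1)           # most behaviourally similar state

# InfoNCE-style loss: pull each state toward its PSM-positive, push from the rest.
log_probs = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
loss = -log_probs[np.arange(n), positives].mean()
print("contrastive loss:", loss)
```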
    Scalable methods for computing state similarity in deterministic Markov Decision Processes
    Proceedings of the Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20) (2020)
    We present new algorithms for computing and approximating bisimulation metrics in Markov Decision Processes (MDPs). Bisimulation metrics are an elegant way to capture behavioral equivalence between states and provide strong theoretical guarantees. Unfortunately, their computation is expensive and requires a tabular representation of the states; this has so far rendered them impractical for large problems. In this paper we present two new algorithms for approximating bisimulation metrics in deterministic MDPs. The first does so via sampling and is guaranteed to converge to the true metric. The second is a differentiable loss which allows us to learn an approximation, even for continuous-state MDPs, which prior to this work had not been possible. The methods we introduce enable the use of bisimulation metrics in problems of much larger scale than what was previously possible.
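For deterministic MDPs the underlying fixed point has a particularly simple form, which the sampling-based and differentiable methods approximate. The sketch below iterates that operator exactly on a small random MDP (the MDP itself is an illustrative assumption).

```python
import numpy as np

# Exact fixed-point iteration of the deterministic bisimulation-style operator:
#   d(s, t) <- max_a |r(s,a) - r(t,a)| + gamma * d(next(s,a), next(t,a)).
num_states, num_actions, gamma = 6, 3, 0.9
rng = np.random.default_rng(8)
R = rng.normal(size=(num_states, num_actions))
next_state = rng.integers(num_states, size=(num_states, num_actions))

d = np.zeros((num_states, num_states))
for _ in range(200):                     # iterate the operator to convergence
    new_d = np.zeros_like(d)
    for s in range(num_states):
        for t in range(num_states):
            new_d[s, t] = max(
                abs(R[s, a] - R[t, a]) + gamma * d[next_state[s, a], next_state[t, a]]
                for a in range(num_actions))
    d = new_d
print(np.round(d, 2))
```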
    Rigging The Lottery: Making All Tickets Winners
    Jacob Menick
    Erich Elsen
    International Conference of Machine Learning (2020)
    Recent work (Kalchbrenner et al., 2018) has demonstrated that sparsity in the parameters of neural networks leads to more parameter- and floating-point-operation (FLOP) efficient networks and that these gains also translate into inference-time reductions. There is a large body of work (Molchanov et al., 2017; Zhu & Gupta, 2017; Louizos et al., 2017; Li et al., 2016; Guo et al., 2016) on various ways of pruning networks that require dense training but yield sparse networks for inference. This limits the size of the largest trainable model to the largest trainable dense model. Concurrently, other work (Mocanu et al., 2018; Mostafa & Wang, 2019; Bellec et al., 2017) has introduced dynamic sparse reparameterization training methods that allow a network to be trained while always sparse. However, they either do not reach the accuracy of pruning, or do not have a fixed FLOP cost due to parameter re-allocation during training. This work introduces a new method that does not require parameter re-allocation for end-to-end sparse training and that matches and even exceeds the accuracy of dense-to-sparse methods. We show that this method requires fewer FLOPs to achieve a given level of accuracy than previous methods. We also provide some insights into why static sparse training fails to find good minima and dynamic reparameterization succeeds.
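A minimal sketch of a drop-and-grow mask update in this spirit is shown below: drop the smallest-magnitude active weights, grow the same number of inactive connections with the largest gradient magnitude, and keep the parameter count (and hence FLOP cost) fixed. The layer shape, sparsity level, and update fraction are illustrative assumptions.

```python
import numpy as np

# Drop-and-grow mask update sketch (illustrative): drop low-magnitude active
# weights, grow high-gradient inactive connections, keep sparsity constant.
rng = np.random.default_rng(9)
weights = rng.normal(size=(64, 64))
grads = rng.normal(size=(64, 64))        # stand-in for dense gradients
mask = rng.random((64, 64)) < 0.1        # 10% of connections active
update_fraction = 0.3

k = int(update_fraction * mask.sum())
active = np.argwhere(mask)
inactive = np.argwhere(~mask)

# Drop: the k active weights with the smallest magnitude.
drop_scores = np.abs(weights[active[:, 0], active[:, 1]])
drop_idx = active[np.argsort(drop_scores)[:k]]
# Grow: the k inactive connections with the largest gradient magnitude.
grow_scores = np.abs(grads[inactive[:, 0], inactive[:, 1]])
grow_idx = inactive[np.argsort(-grow_scores)[:k]]

mask[drop_idx[:, 0], drop_idx[:, 1]] = False
mask[grow_idx[:, 0], grow_idx[:, 1]] = True
weights[grow_idx[:, 0], grow_idx[:, 1]] = 0.0   # newly grown weights start at zero
print("active connections:", mask.sum())        # unchanged: drops == grows
```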
    Since the introduction of Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) there has been a regular stream of both technical advances (e.g., Arjovsky et al., 2017) and creative uses of these generative models (e.g., Karras et al., 2019; Zhu et al., 2017; Jin et al., 2017). In this work we propose an approach for using the power of GANs to automatically generate videos to accompany audio recordings by aligning to spectral properties of the recording. This allows musicians to explore new forms of multi-modal creative expression, where musical performance can induce an AI-generated music video that is guided by said performance, as well as a medium for creating a visual narrative to follow a storyline.
    Shaping the Narrative Arc: Information-Theoretic Collaborative Dialogue
    George Foster
    Marc G. Bellemare
    International Conference on Computational Creativity (2020)
    We consider the challenge of designing an artificial agent capable of interacting with humans in collaborative dialogue to produce creative, engaging narratives. Collaborative dialogue is distinct from chit-chat in that it is knowledge building: each utterance provides just enough information to add specificity and reduce ambiguity without limiting the conversation. We use concepts from information theory to define a narrative arc function which models dialogue progression. We demonstrate that this function can be used to modulate a generative conversation model and make it produce more interesting dialogues, compared to baseline outputs. We focus on two antithetical modes of modulation: reveal and conceal. Empirically, we show how the narrative arc function can model existing dialogues and shape conversation models towards either mode. We conclude with quantitative evidence suggesting that these modulated models provide interesting and engaging dialogue partners for improvisational theatre performers.
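Purely as a toy illustration of the reveal/conceal distinction (not the paper's narrative arc function), the snippet below keeps a belief over what a dialogue is about and scores candidate utterances by how much they reduce the belief's entropy; the belief model and candidate likelihoods are assumptions.

```python
import numpy as np

# Toy reveal/conceal scoring: reveal prefers utterances that reduce the
# entropy of a topic belief, conceal prefers those that preserve it.
rng = np.random.default_rng(10)
num_topics, num_candidates = 5, 4
belief = np.full(num_topics, 1.0 / num_topics)
likelihoods = rng.random((num_candidates, num_topics))  # P(utterance | topic)

def entropy(p):
    return -(p * np.log(p + 1e-12)).sum()

def posterior(belief, like):
    post = belief * like
    return post / post.sum()

scores = [entropy(belief) - entropy(posterior(belief, likelihoods[c]))
          for c in range(num_candidates)]
reveal_choice = int(np.argmax(scores))    # most informative utterance
conceal_choice = int(np.argmin(scores))   # least informative utterance
print(reveal_choice, conceal_choice)
```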
    Performing Structured Improvisations with Pre-existing Generative Musical Models
    Proceedings of the 10th International Conference on Computational Creativity (2019)
    The quality of outputs produced by generative models for music has seen a dramatic improvement in the last few years. However, most models perform in "offline" mode, with few restrictions on the processing time. Integrating these types of models into a live structured performance poses a challenge because of the necessity to respect the beat and harmony. In this paper we propose a system which enables the integration of out-of-the-box generative models into such performances by leveraging the musician's creativity and expertise.
    Geometry-sensitive metrics between probability distributions play an important role in distributional reinforcement learning (Bellemare et al., 2017). In particular, the C51 algorithm can be partially explained in terms of one such metric, the Cramér distance (Rowland et al., 2018). The explanation is partial, however, because C51 uses a softmax to guarantee that its output is a proper distribution, and subsequently a cross-entropy loss, neither of which is related to the Cramér metric nor even geometry-sensitive. In this paper we extend the work of Rowland et al. (2018) for the tabular setting and ask the question: can a fully Cramér-based, theoretically sound algorithm be derived in the presence of function approximation? We replace C51's softmax with a simple linear transfer function and derive an algorithm based solely on the Cramér loss. We show that minimizing a variant of the Cramér loss implicitly yields proper distributions in the absence of approximation constraints. We derive a new metric tailored to this transfer function, and provide the first proof of convergence of a distributional algorithm combined with function approximation, in the context of policy evaluation. We find a surprising negative result showing that Cramér-based methods, including the original C51 algorithm, should perform worse than directly approximating the value function. As a whole, our results provide new tools for understanding what drives the superior performance of the distributional approach in the approximate setting.
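For reference, the Cramér distance between two distributions on a fixed support grid can be computed directly from their CDFs; the snippet below evaluates the squared distance that is typically minimized as a loss (the support grid and the two distributions are illustrative).

```python
import numpy as np

# Squared Cramér distance between two distributions on a fixed grid of atoms:
# the (squared) L2 distance between their cumulative distribution functions.
atoms = np.linspace(-10.0, 10.0, 51)
delta = atoms[1] - atoms[0]

def cramer_distance_sq(p, q):
    return delta * np.sum((np.cumsum(p) - np.cumsum(q)) ** 2)

rng = np.random.default_rng(11)
p = rng.random(51); p /= p.sum()
q = rng.random(51); q /= q.sum()
print(cramer_distance_sq(p, q))
```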
    A Comparative Analysis of Expected and Distributional Reinforcement Learning
    Clare Lyle
    Marc G. Bellemare
    Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence (2019)
    Since their introduction a year ago, distributional approaches to reinforcement learning (distributional RL) have produced strong results relative to the standard, expectation-based approach (expected RL). However, aside from theoretical convergence guarantees, there have been few theoretical results investigating the reasons behind the improvements distributional RL provides. In this paper we begin the investigation into this fundamental question by analyzing the differences in the tabular, linear approximation, and non-linear approximation settings. We prove theoretically that in the tabular and linear approximation settings, distributional RL does not provide an advantage over expected RL, and can in fact hurt performance. We then continue with an empirical analysis comparing distributional and expected RL methods in control settings with non-linear approximators to tease apart where the improvements from distributional RL methods are coming from.
    A Geometric Perspective on Optimal Representations for Reinforcement Learning
    Marc G. Bellemare
    Will Dabney
    Adrien Ali Taïga
    Nicolas Le Roux
    Tor Lattimore
    Clare Lyle
    NeurIPS (2019)
    We propose a new perspective on representation learning in reinforcement learning based on geometric properties of the space of value functions. We leverage this perspective to provide formal evidence regarding the usefulness of value functions as auxiliary tasks. Our formulation considers adapting the representation to minimize the (linear) approximation of the value function of all stationary policies for a given environment. We show that this optimization reduces to making accurate predictions regarding a special class of value functions which we call adversarial value functions (AVFs). We demonstrate that using value functions as auxiliary tasks corresponds to an expected-error relaxation of our formulation, with AVFs a natural candidate, and identify a close relationship with proto-value functions (Mahadevan, 2005). We highlight characteristics of AVFs and their usefulness as auxiliary tasks in a series of experiments on the four-room domain.
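The definition of an AVF can be illustrated directly in a tiny MDP: sample a sign vector delta, enumerate deterministic policies, and return the value function of the policy maximizing delta^T V^pi. The sketch below does exactly that (the random MDP is an illustrative assumption); such value functions can then serve as auxiliary prediction targets.

```python
import numpy as np
from itertools import product

# Toy AVF computation: for a random sign vector delta, the AVF is the value
# function of the deterministic policy maximizing delta @ V^pi.
num_states, num_actions, gamma = 4, 2, 0.9
rng = np.random.default_rng(12)
R = rng.normal(size=(num_states, num_actions))
P = rng.random((num_states, num_actions, num_states))
P /= P.sum(axis=2, keepdims=True)

def value(policy):
    Ppi = P[np.arange(num_states), policy]           # (S, S) under the policy
    rpi = R[np.arange(num_states), policy]
    return np.linalg.solve(np.eye(num_states) - gamma * Ppi, rpi)

delta = rng.choice([-1.0, 1.0], size=num_states)
policies = list(product(range(num_actions), repeat=num_states))
avf = max((value(np.array(pi)) for pi in policies), key=lambda v: delta @ v)
print("delta:", delta, "AVF:", np.round(avf, 2))
```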