 
Pablo Samuel Castro

Authored Publications
        
        
    
    
        
          
            
Proto-Value Networks: Scaling Representation Learning with Auxiliary Tasks
Jesse Farebrother, Joshua Greaves, Rishabh Agarwal, Charline Le Lan, Marc Bellemare
International Conference on Learning Representations (ICLR) (2023)
          
          
        
        
        
          
Abstract:
          
          
Auxiliary tasks improve the representations learned by deep reinforcement learning agents. Analytically, their effect is reasonably well understood; in practice, however, their primary use remains in support of a main learning objective, rather than as a method for learning representations. This is perhaps surprising given that many auxiliary tasks are defined procedurally, and hence can be treated as an essentially infinite source of information about the environment. Based on this observation, we study the effectiveness of auxiliary tasks for learning rich representations, focusing on the setting where the number of tasks and the size of the agent’s network are simultaneously increased. For this purpose, we derive a new family of auxiliary tasks based on the successor measure. These tasks are easy to implement and have appealing theoretical properties. Combined with a suitable off-policy learning rule, the result is a representation learning algorithm that can be understood as extending Mahadevan & Maggioni (2007)’s proto-value functions to deep reinforcement learning; accordingly, we call the resulting objects proto-value networks. Through a series of experiments on the Arcade Learning Environment, we demonstrate that proto-value networks produce rich features that may be used to obtain performance comparable to established algorithms, using only linear approximation and a small number (~4M) of interactions with the environment’s reward function.
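As a rough, hedged illustration of the idea described above — auxiliary predictions of the discounted occupancy of random state sets, used as features for linear value estimation — here is a minimal tabular numpy sketch. All names, constants, and the exact solves are illustrative assumptions; the paper itself learns these predictions with deep networks and an off-policy rule on Atari.

```python
# Tabular sketch: successor-measure-style auxiliary predictions as features
# for linear value estimation. Illustrative only; the paper scales this idea
# to deep networks ("proto-value networks") on the Arcade Learning Environment.
import numpy as np

rng = np.random.default_rng(0)
num_states, gamma, num_tasks = 50, 0.9, 16

# Random Markov chain (behaviour policy already folded in) and reward vector.
P = rng.dirichlet(np.ones(num_states), size=num_states)      # (S, S) transitions
r = rng.normal(size=num_states)                               # true reward
v_true = np.linalg.solve(np.eye(num_states) - gamma * P, r)   # exact value function

# Auxiliary cumulants: indicator functions of random state subsets. Each
# auxiliary "value" is the expected discounted visitation of that subset,
# i.e. the successor measure evaluated on the subset.
cumulants = (rng.random((num_states, num_tasks)) < 0.2).astype(float)
features = np.linalg.solve(np.eye(num_states) - gamma * P, cumulants)  # (S, K)

# Linear value estimation on top of the auxiliary-task features.
w, *_ = np.linalg.lstsq(features, v_true, rcond=None)
v_hat = features @ w
print("relative value error:", np.linalg.norm(v_hat - v_true) / np.linalg.norm(v_true))
```

Increasing `num_tasks` in this toy setting plays the role of scaling the number of auxiliary tasks in the paper: with more random cumulants, the feature matrix spans more of the value function's subspace and the linear fit improves.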
              
  
          
        
      
    
        
        
          
Abstract:
          
          
Behavioural metrics have been shown to be an effective mechanism for constructing representations in reinforcement learning. We present a novel perspective on behavioural metrics for Markov decision processes via the use of positive definite kernels. We leverage this new perspective to define a new metric that is provably equivalent to the recently introduced MICo distance (Castro et al., 2021). The kernel perspective further enables us to provide new theoretical results, which have so far eluded prior work. These include bounding value function differences by means of our metric, and the demonstration that our metric can be provably embedded into a finite-dimensional Euclidean space with low distortion error. These are two crucial properties when using behavioural metrics for reinforcement learning representations. We complement our theory with strong empirical results that demonstrate the effectiveness of these methods in practice.
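For reference, the standard way a positive definite kernel induces a (pseudo)metric — the generic construction this kernel perspective builds on — is the feature-space distance below; the specific kernel used in the paper is not reproduced here.

```latex
% Standard (pseudo)metric induced by a positive definite kernel k over states x, y:
d_k(x, y) \;=\; \sqrt{\, k(x, x) \;+\; k(y, y) \;-\; 2\, k(x, y) \,},
% i.e. the distance between the RKHS embeddings \phi(x) and \phi(y) associated with k.
```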
              
  
          
        
      
    
        
          
            
Bigger, Better, Faster: Human-level Atari with human-level efficiency
Max Schwarzer, Johan Obando Ceron, Aaron Courville, Marc Bellemare, Rishabh Agarwal
ICML (2023)
          
          
        
        
        
          
Abstract:
          
          
We introduce a value-based RL agent, which we call BBF, that achieves super-human performance on the Atari 100K benchmark. BBF relies on scaling the neural networks used for value estimation, as well as a number of other design choices that enable this scaling in a sample-efficient manner. We conduct extensive analyses of these design choices and provide insights for future work. We end with a discussion about moving the goalpost for sample-efficient RL research on the ALE.
              
  
          
        
      
    
        
          
            
Offline Reinforcement Learning with On-Policy Q-Function Regularization
Laixi Shi, Yuejie Chi, Matthieu Geist
European Conference on Machine Learning (ECML) (2023)
          
          
        
        
        
          
Abstract:
          
          
The core challenge of offline reinforcement learning (RL) is dealing with the (potentially catastrophic) extrapolation error induced by the distribution shift between the history dataset and the desired policy. A large portion of prior work tackles this challenge by implicitly/explicitly regularizing the learning policy towards the behavior policy, which is hard to estimate reliably in practice. In this work, we propose to regularize towards the Q-function of the behavior policy instead of the behavior policy itself, under the premise that the Q-function can be estimated more reliably and easily by a SARSA-style estimate, and handles the extrapolation error more straightforwardly. We propose two algorithms taking advantage of the estimated Q-function through regularization, and demonstrate they exhibit strong performance on the D4RL benchmarks.
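As a rough sketch of the mechanism described above — and not the paper's two actual algorithms — one can estimate the behaviour policy's Q-function with a SARSA-style update on the offline transitions, then add a penalty keeping the learned Q-function close to that estimate. Everything below (the synthetic dataset, the penalty form, the constants) is an illustrative assumption.

```python
# Sketch: Q-learning on offline data, regularized toward a SARSA estimate of the
# behaviour policy's Q-function. Tabular and synthetic; illustrative only.
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma, alpha, lam = 10, 3, 0.95, 0.1, 1.0

def sample_transition():
    # Synthetic (s, a, r, s', a') tuple standing in for a real offline dataset.
    return rng.integers(S), rng.integers(A), rng.normal(), rng.integers(S), rng.integers(A)

dataset = [sample_transition() for _ in range(20_000)]

# 1) SARSA-style estimate of the behaviour policy's Q-function.
q_beta = np.zeros((S, A))
for s, a, r, s_next, a_next in dataset:
    target = r + gamma * q_beta[s_next, a_next]
    q_beta[s, a] += alpha * (target - q_beta[s, a])

# 2) Q-learning on the same data, penalized toward the SARSA estimate so that
#    the learned values do not extrapolate far from the behaviour policy's values.
q = np.zeros((S, A))
for s, a, r, s_next, a_next in dataset:
    td_error = r + gamma * q[s_next].max() - q[s, a]
    reg_grad = q[s, a] - q_beta[s, a]          # pulls Q toward Q^beta
    q[s, a] += alpha * (td_error - lam * reg_grad)
```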
              
  
          
        
      
    
        
          
            
The State of Sparse Training in Deep Reinforcement Learning
Erich Elsen
Proceedings of the 39th International Conference on Machine Learning, PMLR (2022)
          
          
        
        
        
          
Abstract:
          
          
The use of sparse neural networks has seen rapid growth in recent years, particularly in computer vision; their appeal stems largely from the reduced number of parameters required to train and store them, as well as from an increase in learning efficiency. Somewhat surprisingly, there have been very few efforts exploring their use in deep reinforcement learning (DRL). In this work we perform a systematic investigation into applying a number of existing sparse training techniques to a variety of DRL agents and environments. Our results highlight the overall challenge that reinforcement learning poses for sparse training methods, complemented by detailed analyses of how the various components in DRL are affected by the use of sparse networks. We conclude by suggesting some promising avenues for improving the effectiveness of general sparse training methods, as well as for advancing their use in DRL.
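One representative member of the family of sparse training techniques such a study covers is static magnitude pruning with a fixed binary mask. The sketch below is a minimal, generic illustration of that technique, not a reproduction of any agent or configuration from the paper; the shapes, sparsity level, and training loop are assumptions.

```python
# Minimal sketch of static magnitude pruning: keep only the largest-magnitude
# weights and preserve that sparsity pattern throughout training.
import numpy as np

rng = np.random.default_rng(0)

def magnitude_mask(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Binary mask keeping the (1 - sparsity) fraction of largest-magnitude weights."""
    k = int(round(weights.size * (1.0 - sparsity)))
    threshold = np.sort(np.abs(weights), axis=None)[-k] if k > 0 else np.inf
    return (np.abs(weights) >= threshold).astype(weights.dtype)

weights = rng.normal(size=(256, 128))
mask = magnitude_mask(weights, sparsity=0.9)
weights *= mask                               # prune once at initialization

for _ in range(100):                          # stand-in for an RL training loop
    grad = rng.normal(size=weights.shape)     # placeholder gradient
    weights -= 0.01 * grad * mask             # masked update preserves sparsity

print("achieved sparsity:", 1.0 - np.count_nonzero(weights) / weights.size)
```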
              
  
          
        
      
    
        
          
            
Reincarnating Reinforcement Learning: Reusing Prior Computation to Accelerate Progress
Rishabh Agarwal, Max Allen Schwarzer, Aaron Courville, Marc G. Bellemare
NeurIPS (2022)
          
          
        
        
        
          
Abstract:
          
          
Learning tabula rasa, that is, without any prior knowledge, is the prevalent workflow in reinforcement learning (RL) research. However, RL systems, when applied to large-scale settings, rarely operate tabula rasa. Such large-scale systems undergo multiple design or algorithmic changes during their development cycle and use ad hoc approaches for incorporating these changes without re-training from scratch, which would have been prohibitively expensive. Additionally, the inefficiency of deep RL typically excludes researchers without access to industrial-scale resources from tackling computationally demanding problems. To address these issues, we present reincarnating RL as an alternative workflow or class of problem settings, where prior computational work (e.g., learned policies) is reused or transferred between design iterations of an RL agent, or from one RL agent to another. As a step towards enabling reincarnating RL from any agent to any other agent, we focus on the specific setting of efficiently transferring an existing sub-optimal policy to a standalone value-based RL agent. We find that existing approaches fail in this setting and propose a simple algorithm to address their limitations. Equipped with this algorithm, we demonstrate reincarnating RL's gains over tabula rasa RL on Atari 2600 games, a challenging locomotion task, and the real-world problem of navigating stratospheric balloons. Overall, this work argues for an alternative approach to RL research, which we believe could significantly improve real-world RL adoption and help democratize it further. Open-sourced code and trained agents at
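The abstract does not spell out the proposed algorithm, so the following is only a hedged sketch of the general policy-to-value transfer setting it describes: a value-based student trained with a TD update plus a distillation-style term nudging it toward a suboptimal teacher policy. The distillation term, constants, and data-collection scheme below are assumptions, not the paper's method.

```python
# Hedged sketch of policy-to-value transfer: a tabular value-based student learns
# from transitions collected with a fixed, suboptimal teacher policy, with an
# extra term pulling softmax(Q) toward the teacher's action distribution.
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma, alpha, distill = 8, 4, 0.95, 0.1, 0.5

teacher = rng.dirichlet(np.ones(A), size=S)      # (S, A) suboptimal prior policy
P = rng.dirichlet(np.ones(S), size=(S, A))       # (S, A, S) transition kernel
R = rng.normal(size=(S, A))                      # reward table

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

q = np.zeros((S, A))
for _ in range(20_000):
    s = rng.integers(S)
    a = rng.choice(A, p=teacher[s])              # collect data with the teacher
    s_next = rng.choice(S, p=P[s, a])
    # Standard Q-learning update on the sampled transition.
    td_target = R[s, a] + gamma * q[s_next].max()
    q[s, a] += alpha * (td_target - q[s, a])
    # Distillation-style term: gradient of the cross-entropy between the teacher's
    # action distribution and softmax(Q) is softmax(Q) - teacher.
    q[s] -= alpha * distill * (softmax(q[s]) - teacher[s])
```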
  
          
        
      
    
        
          
            
A general class of surrogate functions for stable and efficient reinforcement learning
Sharan Vaswani, Simone Totaro, Robert Müller, Shivam Garg, Matthieu Geist, Marlos C. Machado, Nicolas Le Roux
AISTATS (2022)
          
          
        
        
        
          
Abstract:
          
          
              Common policy gradient methods rely on the maximization of a sequence of surrogate functions. In recent years, many such surrogate functions have been proposed, most without strong theoretical guarantees, leading to algorithms such as TRPO, PPO, or MPO. Rather than design yet another surrogate function, we instead propose a general framework (FMA-PG) based on functional mirror ascent that gives rise to an entire family of surrogate functions. We construct surrogate functions that enable policy improvement guarantees, a property not shared by most existing surrogate functions. Crucially, these guarantees hold regardless of the choice of policy parameterization. Moreover, a particular instantiation of FMA-PG recovers important implementation heuristics (e.g., using forward vs reverse KL divergence) resulting in a variant of TRPO with additional desirable properties. Via experiments on simple reinforcement learning problems, we evaluate the algorithms instantiated by FMA-PG. The proposed framework also suggests an improved variant of PPO, whose robustness and efficiency we empirically demonstrate on the MuJoCo suite.
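For context, the generic (functional) mirror ascent template from which such surrogates are derived takes, at iteration t, a step of the following form, with step size η and a Bregman divergence D_Φ induced by a mirror map Φ. The specific surrogates instantiated by FMA-PG (obtained by parameterizing the policy and choosing Φ) are not reproduced here.

```latex
% Generic functional mirror ascent step on the policy-performance objective J:
\pi_{t+1} \;\in\; \arg\max_{\pi}\;
  \big\langle \nabla_{\pi} J(\pi_t),\, \pi - \pi_t \big\rangle
  \;-\; \frac{1}{\eta}\, D_{\Phi}(\pi, \pi_t),
\qquad
D_{\Phi}(\pi, \pi') \;=\; \Phi(\pi) - \Phi(\pi') - \big\langle \nabla \Phi(\pi'),\, \pi - \pi' \big\rangle .
```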
              
  
          
        
      
    
        
        
          
Abstract:
          
          
              We present a new behavioural distance over the state space of a Markov decision process, and demonstrate the use of this distance as an effective means of shaping the learnt representations of deep reinforcement learning agents. While existing notions of state similarity are typically difficult to learn at scale due to high computational cost and lack of sample-based algorithms, our newly-proposed distance addresses both of these issues. In addition to providing detailed theoretical analysis, we provide empirical evidence that learning this distance alongside the value function yields structured and informative representations, including strong results on the Arcade Learning Environment benchmark.
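A sketch of the kind of behavioural recursion involved, assuming the distance is defined as the fixed point of an operator that compares immediate rewards and then draws successor states independently under the policy π; it is this independent sampling that makes a simple TD-like, sample-based update possible.

```latex
% Assumed form of the behavioural recursion (independent-coupling variant):
U^{\pi}(x, y) \;=\; \big|\, r^{\pi}_{x} - r^{\pi}_{y} \,\big|
  \;+\; \gamma\, \mathbb{E}_{x' \sim P^{\pi}_{x},\; y' \sim P^{\pi}_{y}}
        \big[\, U^{\pi}(x', y') \,\big].
```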
              
  
          
        
      
    
        
          
            
Contrastive Behavioural Similarity Embeddings for Generalization in Reinforcement Learning
Rishabh Agarwal, Marlos C. Machado, Marc G. Bellemare
International Conference on Learning Representations (2021)
          
          
        
        
        
          
Abstract:
          
          
Reinforcement learning methods trained on few environments rarely learn policies that generalize to unseen environments. To improve generalization, we incorporate the inherent sequential structure in reinforcement learning for learning better representations. This approach is orthogonal to recent approaches, which rarely exploit this structure explicitly. Specifically, we introduce a theoretically motivated policy similarity metric (PSM) for measuring behavioural similarity between states. PSM assigns high similarity to states for which the optimal policies in those states, as well as in future states, are similar. We also present a contrastive representation learning procedure to embed any state similarity metric, which we instantiate with PSM to obtain policy similarity embeddings (PSEs). We demonstrate that PSEs improve generalization on diverse benchmarks, including LQR with spurious correlations, a jumping task from pixels, and the Distracting DM Control Suite. Source code is available at agarwl.github.io/pse.
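A hedged sketch of the policy similarity metric's recursive form, assuming the bisimulation-style template suggested by the abstract: a term comparing optimal action distributions now, plus a discounted Wasserstein term coupling next-state distributions (so that future behaviour matters as well). The choice of probability metric DIST is left open here.

```latex
% Assumed recursive form of the policy similarity metric (PSM):
d(x, y) \;=\; \mathrm{DIST}\!\big(\pi^{*}(\cdot \mid x),\, \pi^{*}(\cdot \mid y)\big)
  \;+\; \gamma\, \mathcal{W}_{1}(d)\big(P^{\pi^{*}}(\cdot \mid x),\, P^{\pi^{*}}(\cdot \mid y)\big),
% where DIST is a metric between action distributions (e.g. total variation).
```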
              
  
          
        
      
    
        
        
          
Abstract:
          
          
Reinforcement learning techniques are being applied to increasingly larger systems where it becomes untenable to maintain direct estimates for individual states, in particular for continuous-state systems. Instead, researchers often leverage state similarity (whether implicitly or explicitly) to build models that can generalize well from a limited set of samples. The notion of state similarity used is thus of crucial importance, as it will directly affect the quality of the approximations and the performance of the algorithms. Indeed, there have been a number of works that investigate – both on a theoretical and an empirical basis – how best to construct these neighborhoods and topologies. However, the choice of metric is not always clear and is often not fully specified when new algorithms are introduced. In this paper we aim to clarify the landscape of existing metrics and provide guidelines for the choice of metric when designing or implementing algorithms. We do this by first introducing a unified formalism for specifying these topologies, through the lens of metrics or distance measures, and clarifying the relationship between them. We establish a hierarchy amongst the different metrics and their theoretical implications on the Markov Decision Process (MDP) specifying the reinforcement learning problem. We complement our theoretical results with empirical evaluations showcasing the differences between the metrics considered.
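One representative member of the kind of hierarchy discussed above is the classic bisimulation metric of Ferns et al., shown here (up to the choice of reward and transition weights) as the fixed point of the following recursion.

```latex
% Bisimulation metric (Ferns et al.), the unique fixed point of:
d^{\sim}(x, y) \;=\; \max_{a \in \mathcal{A}}
  \Big(\, \big|\mathcal{R}(x, a) - \mathcal{R}(y, a)\big|
  \;+\; \gamma\, \mathcal{W}_{1}(d^{\sim})\big(\mathcal{P}(\cdot \mid x, a),\, \mathcal{P}(\cdot \mid y, a)\big) \Big).
```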
              
  
          
        
      
    