Offline Reinforcement Learning with Pseudometric Learning
Abstract
Offline Reinforcement Learning methods seek to learn a policy from logged transitions of an environment, without any interaction. In the presence of function approximation, and under the assumption of limited coverage of the state-action space of the environment, it is necessary to enforce the policy to visit state-action pairs "close" to the support of logged transitions. In this work, we propose an iterative procedure to learn a pseudometric from logged transitions, and use it to define this notion of closeness. We show its convergence guarantees and extend it to the sampled function approximation setting. We then use this pseudometric to define a new look-up based malus in an actor-critic algorithm: this encourages the actor to stay close, in terms of the defined pseudometric, to the support of logged transitions. Finally, we evaluate the method against hand manipulation and locomotion tasks.