BRPO: Batch Residual Policy Optimization

Sungryull Sohn; Yinlam Chow; Jayden Ooi; Ofir Nachum; Honglak Lee; Ed Chi; Craig Boutilier

Proceedings of the Twenty-ninth International Joint Conference on Artificial Intelligence (IJCAI-20), Yokohama, Japan (2020), pp. 2824-2830

Abstract

In batch reinforcement learning (RL), one often constrains a learned policy to be close to the behavior (data-generating) policy, e.g., by constraining the learned action distribution to differ from the behavior policy by some maximum degree that is the same at each state. This can cause batch RL to be overly conservative, unable to exploit large policy changes at frequently-visited, high-confidence states without risking poor performance at sparsely-visited states. To remedy this, we propose residual policies, where the allowable deviation of the learned policy is state-action-dependent. We derive a new RL method, BRPO, which learns both the policy and the allowable deviation that jointly maximize a lower bound on policy performance. We show that BRPO achieves state-of-the-art performance in a number of tasks.
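
To make the idea of a state-dependent allowable deviation concrete, the following is a minimal tabular sketch and not the BRPO algorithm itself. It assumes the learned policy is a per-state mixture of the behavior policy and a candidate policy, with a mixing weight that depends only on the state (a simplification of the state-action-dependent case). The names `mix_policy`, `total_variation`, and `confidence_from_count`, as well as the count-based weight, are hypothetical and chosen only for illustration; the abstract says BRPO learns the deviation rather than setting it by a heuristic.

```python
def mix_policy(beta, rho, lam):
    """Mix two action distributions for one state: (1 - lam) * beta + lam * rho."""
    return [(1.0 - lam) * b + lam * r for b, r in zip(beta, rho)]


def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(x - y) for x, y in zip(p, q))


def confidence_from_count(n, k=10.0):
    """Hypothetical heuristic (not from the paper): more visits, larger weight."""
    return n / (n + k)


# Two states with the same behavior policy and the same candidate policy,
# but very different visit counts in the batch (all values are made up).
beta = {"frequent": [0.5, 0.5], "rare": [0.5, 0.5]}
rho = {"frequent": [1.0, 0.0], "rare": [1.0, 0.0]}
visits = {"frequent": 990, "rare": 10}

policy = {}
deviation = {}
for s in beta:
    lam = confidence_from_count(visits[s])
    policy[s] = mix_policy(beta[s], rho[s], lam)
    # Distance of the mixed policy from the behavior policy at this state.
    deviation[s] = total_variation(policy[s], beta[s])
```

In this sketch the frequently-visited state ends up further from the behavior policy than the sparsely-visited one, which is the behavior a single state-independent constraint cannot express.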

Research Areas

Machine intelligence
