Black-box Off-policy Estimation for Infinite-Horizon Reinforcement Learning

Ali Mousavi
Lihong Li
Qiang Liu
ICLR (2020)


Off-policy estimation for long-horizon problems is important in many real-life applications, such as healthcare and robotics, where high-fidelity simulators may not be available and on-policy evaluation is expensive or impossible. Recently, \citet{liu18breaking} proposed an approach that avoids the \emph{curse of horizon} suffered by typical importance-sampling-based methods, but it is limited in practice because it requires that data be collected by a \emph{single} and \emph{known} behavior policy. In this work, we propose a novel approach that eliminates these limitations. In particular, we formulate the problem as solving for the fixed point of a ``backward flow'' operator, whose solution gives the desired importance ratios between the stationary distributions of the target and behavior policies. Experiments on benchmarks verify the effectiveness of the approach.
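To make the stationary-distribution reweighting idea concrete, here is a minimal tabular sketch. It is not the paper's estimator (which learns the ratios from off-policy data without knowing the behavior policy); it simply shows, on a hypothetical 3-state Markov chain with made-up transition matrices, that the stationary distribution is the fixed point of a backward-flow equation d = Pᵀd, and that the resulting ratios w(s) = d_target(s) / d_behavior(s) reweight behavior-distributed rewards into the target policy's long-run average reward.

```python
import numpy as np

# Illustrative 3-state Markov chains induced by a behavior policy and a
# target policy (transition matrices are made up for this sketch).
P_b = np.array([[0.7, 0.2, 0.1],
                [0.3, 0.4, 0.3],
                [0.2, 0.3, 0.5]])
P_t = np.array([[0.1, 0.6, 0.3],
                [0.2, 0.2, 0.6],
                [0.5, 0.1, 0.4]])
rewards = np.array([0.0, 1.0, 2.0])

def stationary(P, iters=1000):
    """Fixed point of the backward-flow equation d = P^T d, by power iteration."""
    d = np.full(P.shape[0], 1.0 / P.shape[0])
    for _ in range(iters):
        d = P.T @ d  # flow probability mass backward through the chain
    return d

d_b = stationary(P_b)  # long-run state distribution under the behavior policy
d_t = stationary(P_t)  # long-run state distribution under the target policy
w = d_t / d_b          # importance ratios of stationary distributions

# Off-policy estimate: reweight behavior-distributed rewards by w.
est = np.sum(d_b * w * rewards)
true_val = np.sum(d_t * rewards)  # on-policy average reward of the target
```

With the exact ratios, the reweighted estimate matches the target policy's average reward; the paper's contribution is recovering such ratios from logged data alone, without access to the behavior policy.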
