Google Research

Black-box Off-policy Estimation for Infinite-Horizon Reinforcement Learning

ICLR (2020)


Off-policy estimation for long-horizon problems is important in many real-life applications such as healthcare and robotics, where high-fidelity simulators may not be available and on-policy evaluation is expensive or impossible. Recently, Liu et al. (2018) proposed an approach that avoids the curse of horizon suffered by typical importance-sampling-based methods, but it is limited in practice because it requires that data be collected by a single and known behavior policy. In this work, we propose a novel approach that eliminates these limitations. In particular, we formulate the problem as one of solving for the fixed point of a "backward flow" operator, the solution of which gives the desired importance ratios of stationary distributions between the target and behavior policies. Experiments on benchmarks verify the effectiveness of the approach.
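To make the notion of stationary-distribution importance ratios concrete, the following is a minimal tabular sketch, not the estimator proposed in the paper. It assumes the state-transition matrices induced by the target and behavior policies are known (the paper's setting works from sampled transitions instead), and the matrices `P_target`, `P_behavior`, the vector `reward`, and the helper `stationary_distribution` are all hypothetical names chosen for illustration. Each stationary distribution is found as the fixed point of d = Pᵀd by power iteration, and their ratio reweights the behavior distribution to recover the target policy's average reward.

```python
import numpy as np


def stationary_distribution(P, iters=10_000, tol=1e-12):
    """Power iteration for the fixed point d = P^T d of a row-stochastic matrix P."""
    d = np.full(P.shape[0], 1.0 / P.shape[0])  # start from the uniform distribution
    for _ in range(iters):
        d_next = P.T @ d
        converged = np.abs(d_next - d).sum() < tol
        d = d_next
        if converged:
            break
    return d / d.sum()


# Hypothetical 3-state Markov chains induced by each policy (rows sum to 1).
P_target = np.array([[0.1, 0.9, 0.0],
                     [0.0, 0.2, 0.8],
                     [0.7, 0.0, 0.3]])
P_behavior = np.array([[0.5, 0.5, 0.0],
                       [0.0, 0.5, 0.5],
                       [0.5, 0.0, 0.5]])
reward = np.array([0.0, 1.0, 2.0])  # hypothetical per-state reward

d_target = stationary_distribution(P_target)
d_behavior = stationary_distribution(P_behavior)

# Importance ratio of stationary distributions, one weight per state.
w = d_target / d_behavior

# Average reward of the target policy, written as a reweighted
# expectation under the behavior policy's stationary distribution.
value_estimate = np.sum(d_behavior * w * reward)
```

With sampled data, the last line would typically become a weighted average of observed rewards, with each sample weighted by the ratio at its state. Because the weights are per-state rather than per-trajectory products, their variance does not compound with the horizon length.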
