Breaking the curse of horizon: Infinite-horizon off-policy estimation

Qiang Liu; Lihong Li; Ziyang Tang; Denny Zhou

NeurIPS (Spotlight) (2018)

Abstract

We consider off-policy estimation of the expected reward of a target policy using
samples collected by a different behavior policy. Importance sampling (IS) has
been a key technique for deriving (nearly) unbiased estimators, but it is known to
suffer from excessively high variance in long-horizon problems. In the extreme
case of infinite-horizon problems, the variance of an IS-based estimator may even
be unbounded. In this paper, we propose a new off-policy estimator that applies
IS directly on the stationary state-visitation distributions to avoid the exploding
variance faced by existing methods. Our key contribution is a novel approach to
estimating the density ratio of two stationary state distributions, with trajectories
sampled from only the behavior distribution. We develop a minimax loss function
for the estimation problem and derive a closed-form solution for the case of a
reproducing kernel Hilbert space (RKHS).
We support our method with both theoretical and empirical analyses.
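
To make the idea in the abstract concrete, here is a minimal tabular sketch in Python. It is not the paper's method: the 3-state MDP, the two policies, the rewards, and all variable names are hypothetical, and a small least-squares step over indicator test functions stands in for the minimax/RKHS estimator, which is not reproduced here. The sketch assumes the average-reward setting, where the ratio w(s) of target to behavior stationary state distributions can be characterized by requiring the average of (w(s) * beta(a|s) - w(s')) * f(s') over behavior transitions to vanish for every test function f, with beta(a|s) the ratio of target to behavior action probabilities.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 3, 2

# Hypothetical toy MDP: P[s, a] is the next-state distribution, r[s, a] the reward.
P = np.array([
    [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]],
    [[0.2, 0.6, 0.2], [0.5, 0.1, 0.4]],
    [[0.3, 0.3, 0.4], [0.1, 0.8, 0.1]],
])
r = np.array([[1.0, 0.0], [0.0, 2.0], [0.5, 1.5]])
pi0 = np.full((n_states, n_actions), 0.5)            # behavior policy
pi = np.array([[0.8, 0.2], [0.3, 0.7], [0.6, 0.4]])  # target policy
beta = pi / pi0                                      # per-step action ratio


def stationary(T):
    """Stationary distribution of a row-stochastic matrix T."""
    vals, vecs = np.linalg.eig(T.T)
    v = np.real(vecs[:, np.argmax(np.real(vals))])
    return v / v.sum()


# Ground truth, computed from the model (used only to check the estimate).
P_pi = np.einsum("sa,sat->st", pi, P)
d_pi = stationary(P_pi)
true_value = float(d_pi @ (pi * r).sum(axis=1))

# One long trajectory collected under the behavior policy only.
n_steps = 20000
states = np.empty(n_steps + 1, dtype=int)
acts = np.empty(n_steps, dtype=int)
states[0] = 0
for t in range(n_steps):
    acts[t] = rng.choice(n_actions, p=pi0[states[t]])
    states[t + 1] = rng.choice(n_states, p=P[states[t], acts[t]])

s, s_next = states[:-1], states[1:]
b = beta[s, acts]

# Empirical stationarity condition with indicator test functions f = 1{s' = k}:
#   sum_t (w[s_t] * b_t - w[s_{t+1}]) * 1{s_{t+1} = k} = 0   for each k.
M = np.zeros((n_states, n_states))
np.add.at(M, (s_next, s), b)                  # M[k, j] = sum of b_t over j -> k
N = np.bincount(s_next, minlength=n_states)   # visit counts of next states
A = (M - np.diag(N)) / n_steps

# Least-squares solution of A w = 0: right singular vector for the smallest
# singular value, rescaled so that w averages to 1 over the behavior samples.
_, _, Vt = np.linalg.svd(A)
p_hat = np.bincount(s, minlength=n_states) / n_steps
w = Vt[-1] / (p_hat @ Vt[-1])

# Self-normalized estimate: one state ratio and one action ratio per sample,
# with no product of ratios along the trajectory.
weights = w[s] * b
estimate = float(np.sum(weights * r[s, acts]) / np.sum(weights))

print("estimated:", estimate, "model-based reference:", true_value)
```

Each sample is reweighted by a single state ratio and a single action ratio, so the weights do not compound with trajectory length the way a trajectory-wise product of action ratios does.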

Research Areas

Machine intelligence
