Google Research

The Mirage of Action-Dependent Baselines in Reinforcement Learning

ICML (2018)


Model-free reinforcement learning with flexible function approximators has shown recent success for solving goal-directed sequential decision-making problems. Policy gradient methods are a promising class of model-free algorithms, but they have high variance, which necessitates large batches resulting in low sample efficiency. Typically, a state-dependent control variate is used to reduce variance. Recently, several papers have introduced the idea of state and action-dependent control variates and showed that they significantly reduce variance and improve sample efficiency on continuous control tasks. We theoretically and numerically evaluate biases and variances of these policy gradient methods, and show that action-dependent control variates do not appreciably reduce variance in the tested domains. We show that seemingly insignificant implementation details enable these prior methods to achieve good empirical improvements, but at the cost of introducing further bias to the gradient. Our analysis indicates that biased methods tend to improve the performance significantly more than unbiased ones.

Research Areas

Learn more about how we do research

We maintain a portfolio of research projects, providing individuals and teams the freedom to emphasize specific types of work