Google Research

The difficulty of passive learning in deep reinforcement learning

NeurIPS 2021


Offline reinforcement learning, which uses observational data instead of active environmental interaction, has been shown to be a challenging problem. Recent solutions typically involve constraints on the learner’s policy, preventing strong deviations from the state-action distribution of the dataset. Although the suggested methods are evaluated using non-linear function approximation, their theoretical justifications are mostly limited to the tabular or linear cases. Given the impressive results of deep reinforcement learning, we argue for a clearer understanding of the challenges in this setting.

In the vein of Held & Hein's classic 1963 experiment, we propose “tandem learning”, an experimental paradigm that facilitates in-depth empirical analysis of the difficulties of offline reinforcement learning. We identify function approximation in conjunction with inadequate data distributions as the strongest factors contributing to these difficulties, thereby extending but also challenging certain assumptions made in past work. Our results provide a more principled view and new insights into potential directions for future work on offline reinforcement learning.
