Distributional reinforcement learning with linear function approximation

Marc G. Bellemare
Nicolas Le Roux
Subhodeep Moitra
ICML (2019)
Google Scholar

Abstract

Geometry-sensitive metrics between probability distributions play an important role in distributional reinforcement learning \citep{bellemare2017distributional}. In particular, the C51 algorithm can be partially explained in terms of one such metric, the Cram\'er distance (Rowland et al., 2018). The explanation is partial, however, because C51 uses a softmax to guarantee that its output is a proper distribution, and subsequently a cross-entropy loss, neither of which are related to the Cram\'er metric nor even geometry-sensitive. In this paper we extend the work of \citet{rowland2018} for the tabular setting and ask the question: can a fully Cram\'er-based, theoretically-sound algorithm be derived in the presence of function approximation? We replace C51's softmax with a simple linear transfer function and derive an algorithm solely based on the Cram\'er loss. We show that minimizing a variant of the Cram\'er loss implicitly yields proper distributions in the absence of approximation constraints. We derive a new metric tailored to this transfer function, and provide the first proof of convergence of a distributional algorithm combined with function approximation, in the context of policy evaluation. We find a surprising negative result showing that Cram\'er-based methods, including the original C51 algorithm, should perform worse than directly approximating the value function.
As a whole, our results provide new tools for understanding what drives the superior performance of the distributional approach in the approximate setting.

Research Areas