More Benefits of Being Distributional: Second-Order Bounds for Reinforcement Learning

Kaiwen Wang
Owen Oertell
Nathan Kallus
Wen Sun
International Conference on Machine Learning (2024)

Abstract

In this paper, we prove that Distributional RL (DistRL), which learns the return distribution, can
obtain second-order bounds in both online and offline RL for general MDPs. Second-order bounds are
instance-dependent bounds that scale with the variance of the policy's return; we prove these bounds are
strictly tighter than the previously known small-loss (first-order) bounds of Wang et al. (2023b). To the best
of our knowledge, our results are the first second-order bounds for low-rank MDPs and offline RL with
single-policy coverage. We highlight that our analysis with DistRL is relatively simple and does not
require any weighted regression techniques. Our results establish DistRL as a promising framework for
obtaining second-order bounds in general settings and thus further reinforce the benefits of DistRL.
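As a minimal sketch of why variance-based bounds refine small-loss ones (assuming returns normalized to [0, 1], with the illustrative notation $Z^\pi$ for the random return of policy $\pi$): since $(Z^\pi)^2 \le Z^\pi$ on $[0,1]$,
\[
  \mathrm{Var}(Z^\pi) \;\le\; \mathbb{E}[Z^\pi]\,\bigl(1-\mathbb{E}[Z^\pi]\bigr) \;\le\; \min\bigl\{\mathbb{E}[Z^\pi],\, 1-\mathbb{E}[Z^\pi]\bigr\},
\]
so a bound scaling with $\sqrt{\mathrm{Var}(Z^\pi)}$ is never larger (up to constants) than one scaling with $\sqrt{\mathbb{E}[Z^\pi]}$ or $\sqrt{1-\mathbb{E}[Z^\pi]}$, the small-loss rates.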