More Benefits of Being Distributional: Second-Order Bounds for Reinforcement Learning
Abstract
In this paper, we prove that Distributional RL (DistRL), which learns the return distribution, can
obtain second-order bounds in both online and offline RL for general MDPs. Second-order bounds are
instance-dependent bounds that scale with the variance of the policy’s return, which we prove are strictly
tighter than the previously known small-loss (first-order) bounds of Wang et al. (2023b). To the best
of our knowledge, our results are the first second-order bounds for low-rank MDPs and offline RL with
single-policy coverage. We highlight that our analysis with DistRL is relatively simple and does not
require any weighted regression techniques. Our results establish DistRL as a promising framework for
obtaining second-order bounds in general settings and thus further reinforce the benefits of DistRL.