Online Target Q-learning with Reverse Experience Replay: Efficiently finding the Optimal Policy for Linear MDPs

Dheeraj Nagaraj
Syomantak Chaudhari

Abstract

Q-learning is a popular Reinforcement Learning (RL) algorithm that is widely used in practice with function approximation \citep{mnih2015human}. In contrast, existing theoretical results are pessimistic about Q-learning. For example, Q-learning was shown to diverge even with linear function approximation for linear MDPs \citep{baird1995residual}. Furthermore, even for tabular MDPs with synchronous updates, Q-learning was shown to have sub-optimal sample complexity \citep{li2021q,azar2013minimax}. The goal of this work is to bridge the gap between the practical success of Q-learning and these relatively pessimistic theoretical results. The starting point of our work is the observation that, in practice, Q-learning is used with two important modifications: (i) training with two networks, called the online network and the target network, simultaneously (online target learning, or OTL), and (ii) experience replay (ER) \citep{mnih2015human}. While both modifications have been observed to play a significant role in the practical success of Q-learning, a thorough theoretical understanding of how they improve the convergence behavior of Q-learning has been missing from the literature. By carefully combining Q-learning with OTL and \emph{reverse} experience replay (RER), a form of experience replay in which transitions are replayed in reverse temporal order, we present the novel methods \qrex~and \qrexdr~(\qrex~+ data reuse). We show that \qrex~efficiently finds the optimal policy for linear MDPs and provide non-asymptotic bounds on its sample complexity. Furthermore, we show that \qrexdr~achieves near-optimal sample complexity in the tabular setting, improving upon the existing results for vanilla Q-learning.
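To make the two modifications concrete, below is a minimal sketch of tabular Q-learning that combines a periodically synced target table (OTL) with reverse experience replay. The environment interface (`env.reset`, `env.step`, `env.sample_action`) and all hyperparameters are illustrative assumptions; this is not the paper's exact \qrex~or \qrexdr~procedure.

```python
import numpy as np

def otl_rer_sketch(env, n_states, n_actions, episodes=500,
                   buffer_size=64, lr=0.1, gamma=0.99, eps=0.1):
    """Illustrative Q-learning with a target table and reverse replay."""
    q_online = np.zeros((n_states, n_actions))   # online table, updated every step
    q_target = q_online.copy()                   # target table, synced after each block
    buffer = []                                  # transitions of the current RER block

    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # Epsilon-greedy exploration using the online table.
            if np.random.rand() < eps:
                a = env.sample_action()
            else:
                a = int(np.argmax(q_online[s]))
            s_next, r, done = env.step(a)
            buffer.append((s, a, r, s_next, done))
            s = s_next

            if len(buffer) == buffer_size or done:
                # Reverse experience replay: process the block backwards so that
                # reward information propagates quickly along the trajectory.
                for (st, at, rt, st1, dt) in reversed(buffer):
                    bootstrap = 0.0 if dt else gamma * np.max(q_target[st1])
                    td_error = rt + bootstrap - q_online[st, at]
                    q_online[st, at] += lr * td_error
                buffer.clear()
                # Online target learning: refresh the target table only after
                # the whole block has been replayed.
                q_target = q_online.copy()

    return q_online
```

The design choice to bootstrap from the frozen `q_target` while updating `q_online`, and to replay each block in reverse, mirrors the two modifications discussed above; a data-reuse variant in the spirit of \qrexdr~would replay each stored block more than once.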
