Reinforcement Learning with Feedback Graphs
Abstract
We study reinforcement learning in tabular MDPs where the agent receives additional side observations per step in the form of several transition samples, e.g., from data augmentation. We formalize this setting using a feedback graph over state-action pairs and show that model-based algorithms can leverage side observations for more sample-efficient learning. We give a regret bound that predominantly depends on the size of the maximum acyclic subgraph of the feedback graph, in contrast to the polynomial dependence on the number of states and actions that arises in the absence of side observations. Finally, we highlight challenges in leveraging a small dominating set of the feedback graph, compared to the well-studied bandit setting, and propose a new algorithm that can use such a dominating set to learn a near-optimal policy faster.
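As a rough illustration of the objects the abstract refers to (not part of the paper itself), the following Python sketch represents a feedback graph over state-action pairs as an adjacency map, where an edge from pair v to pair w means that taking v also yields a transition sample for w, and greedily approximates a small dominating set. The vertex set, the edge map, and the greedy heuristic are all hypothetical choices made here for exposition, not the algorithm proposed in the paper.

    # Minimal sketch, assuming a directed feedback graph over state-action pairs:
    # taking pair v also reveals a transition sample for every out-neighbor of v.
    def greedy_dominating_set(vertices, edges):
        """Greedily approximate a dominating set: a set of pairs whose closed
        out-neighborhoods ({v} plus its out-neighbors) cover all pairs.

        vertices: iterable of hashable state-action pairs, e.g. (s, a)
        edges:    dict mapping each pair v to the set of pairs observed when v is taken
        """
        uncovered = set(vertices)
        dominating = []
        while uncovered:
            # Pick the pair whose closed out-neighborhood covers the most uncovered pairs.
            best = max(vertices, key=lambda v: len(({v} | edges.get(v, set())) & uncovered))
            dominating.append(best)
            uncovered -= {best} | edges.get(best, set())
        return dominating

    # Toy example: 3 states x 2 actions; taking (s, a) also reveals a sample for (s, 1 - a),
    # so one pair per state suffices to dominate the graph.
    V = [(s, a) for s in range(3) for a in range(2)]
    E = {(s, a): {(s, 1 - a)} for s, a in V}
    print(greedy_dominating_set(V, E))  # e.g. [(0, 0), (1, 0), (2, 0)]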