Jump to Content

Online Markov Decision Processes with Aggregate Bandit Feedback

Haim Kaplan
Yishay Mansour
Conference on Learning Theory (2021)


We study online learning of finite-horizon Markov Decision Processes (MDPs) with adversarially changing loss functions and unknown dynamics. In each episode, the learner observes a trajectory realized by her policy chosen for this episode. In addition, the learner suffers and observes the loss accumulated along the trajectory which we call aggregate bandit feedback. The learner, however, never observes any additional information about the loss; in particular, the individual losses suffered along the trajectory. Our main result is a computationally-efficient algorithm with \sqrt{K} regret for this setting, where K is the number of episodes. We efficiently reduce \emph{Online MDPs with Aggregate Bandit Feedback} to a novel setting: Distorted Linear Bandits (DLB). This setting is a robust generalization of linear bandits in which selected actions are adversarially perturbed. We give a computationally-efficient online learning algorithm for DLB and prove a \sqrt{T} regret bound, where T is the number of time steps. Our algorithm is based on a schedule of increasing learning rates used in Online Mirror Descent with a self-concordant barrier regularization. We use the DLB algorithm to derive our main result of \sqrt{K} regret.

Research Areas