Online Convex Optimization in Adversarial Markov Decision Processes
Abstract
We consider online learning in episodic loop-free Markov decision processes (MDPs), where the loss function can change arbitrarily between episodes, and the transition function is not known to the learner.
We show $\tilde{O}(L|X|\sqrt{|A|T})$ regret bound, where $T$ is the number of episodes, $X$ is the state space, $A$ is the action space, and $L$ is the length of each episode.
Our online algorithm is implemented using entropic regularization methodology, which allows to extend the original adversarial MDP model to handle convex performance criteria\footnote{A {\em performance criterion} aggregates all the losses of a single episode to a single objective we would like to minimize.}, as well as improve previous regret bounds.