Google Research

Online Convex Optimization in Adversarial Markov Decision Processes

ICML 2019 (2019) (to appear)


We consider online learning in episodic loop-free Markov decision processes (MDPs), where the loss function can change arbitrarily between episodes, and the transition function is not known to the learner. We show $\tilde{O}(L|X|\sqrt{|A|T})$ regret bound, where $T$ is the number of episodes, $X$ is the state space, $A$ is the action space, and $L$ is the length of each episode. Our online algorithm is implemented using entropic regularization methodology, which allows to extend the original adversarial MDP model to handle convex performance criteria\footnote{A {\em performance criterion} aggregates all the losses of a single episode to a single objective we would like to minimize.}, as well as improve previous regret bounds.

Research Areas

Learn more about how we do research

We maintain a portfolio of research projects, providing individuals and teams the freedom to emphasize specific types of work