Near-Optimal Regret for Adversarial MDP with Delayed Bandit Feedback

Haipeng Luo; tal lancewicki; Tiancheng Jin; Yishay Mansour; Aviv Rosenberg

Near-Optimal Regret for Adversarial MDP with Delayed Bandit Feedback

Haipeng Luo

tal lancewicki

Tiancheng Jin

Yishay Mansour

Aviv Rosenberg

NeurIPS 2022

Google Scholar

Abstract

The standard assumption in reinforcement learning (RL) is that agents observe feedback for their actions immediately.
However, in practice feedback is often observed in delay.
This paper studies online learning in episodic Markov decision process (MDP) with unknown transitions, adversarially changing costs, and unrestricted delayed bandit feedback.
More precisely, the feedback for the agent in episode $k$ is revealed only in the end of episode $k + d^k$, where the delay $d^k$ can be changing over episodes and chosen by an oblivious adversary.
We present the first algorithms that achieve near-optimal $\sqrt{K + D}$ regret, where $K$ is the number of episodes and $D = \sum_{k=1}^K d^k$ is the total delay, significantly improving upon the best known regret bound of $(K + D)^{2/3}$.

Research Areas

Machine intelligence

Explore our many areas of focus

Building a collaborative ecosystem

Shaping the future together

Translating discovery into real-world impact

Near-Optimal Regret for Adversarial MDP with Delayed Bandit Feedback

Abstract

Research Areas

Meet the teams driving innovation

Google AI

Google Cloud

Google DeepMind

Google Labs