Perturbed-History Exploration in Stochastic Multi-Armed Bandits

Branislav Kveton; Csaba Szepesvari; Mohammad Ghavamzadeh; Craig Boutilier

Proceedings of the Twenty-eighth International Joint Conference on Artificial Intelligence (IJCAI-19), Macau, China (2019), pp. 2786-2793

Abstract

We propose an online algorithm for cumulative regret minimization in a stochastic multi-armed bandit. The algorithm adds $O(t)$ i.i.d. pseudo-rewards to its history in round $t$ and then pulls the arm with the highest average reward in its perturbed history. We therefore call it perturbed-history exploration (PHE). The pseudo-rewards are designed to offset, with high probability, mean rewards of arms that may have been underestimated. We derive near-optimal gap-dependent and gap-free bounds on the $n$-round regret of PHE. The key step in our analysis is a novel argument showing that randomized Bernoulli rewards lead to optimism. Finally, we empirically evaluate PHE and show that it is competitive with state-of-the-art baselines.
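
The loop described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' reference implementation: it assumes rewards in [0, 1], Bernoulli(1/2) pseudo-rewards, and a hypothetical perturbation scale `a` that sets how many pseudo-rewards are added per observed reward. The names `phe_choose_arm` and `run_phe` are invented for this sketch. Because each arm with `s` pulls receives about `a * s` pseudo-rewards, the total added in round $t$ grows as $O(t)$.

```python
import math
import random


def phe_choose_arm(pulls, reward_sums, a=2.0, rng=random):
    """Return the arm with the highest mean in a freshly perturbed history.

    pulls[i] is the number of times arm i was pulled and reward_sums[i] is
    the sum of its observed rewards. The scale `a` is an assumed parameter.
    """
    best_arm, best_value = 0, -math.inf
    for arm, (s, v) in enumerate(zip(pulls, reward_sums)):
        if s == 0:
            # Assumption: an arm with no history is pulled before any other.
            value = math.inf
        else:
            m = math.ceil(a * s)  # number of pseudo-rewards for this arm
            # Sum of m i.i.d. Bernoulli(1/2) pseudo-rewards.
            pseudo = sum(rng.random() < 0.5 for _ in range(m))
            # Average over real and pseudo-rewards together.
            value = (v + pseudo) / (s + m)
        if value > best_value:
            best_arm, best_value = arm, value
    return best_arm


def run_phe(means, n_rounds, a=2.0, seed=0):
    """Simulate the sketch on a Bernoulli bandit and return pull counts."""
    rng = random.Random(seed)
    pulls = [0] * len(means)
    reward_sums = [0.0] * len(means)
    for _ in range(n_rounds):
        arm = phe_choose_arm(pulls, reward_sums, a, rng)
        reward = 1.0 if rng.random() < means[arm] else 0.0
        pulls[arm] += 1
        reward_sums[arm] += reward
    return pulls
```

Calling `run_phe([0.1, 0.9], 2000)` returns the number of pulls per arm. The pseudo-rewards are redrawn in every round, so the randomness in the perturbed averages is what drives exploration; no explicit confidence bound or posterior is maintained.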

Research Areas

Machine intelligence
