- Branislav Kveton
- Michael Konobeev
- Manzil Zaheer
- Martin Mladenov
- Craig Boutilier
- Chih-wei Hsu
- Csaba Szepesvari
Abstract
Efficient exploration in multi-armed bandits is a fundamental online learning problem. In this work, we propose a variant of Thompson sampling that learns to explore over time by interacting with problem instances sampled from an unknown prior distribution. Because this algorithm meta-learns the prior, we call it Meta-TS. We propose efficient implementations of Meta-TS and analyze it in Gaussian bandits. Our analysis captures the improvement due to learning the prior and is of broader interest, because we derive the first prior-dependent upper bound on the Bayes regret. Our regret bound is complemented by an empirical evaluation, which shows that Meta-TS quickly adapts to the unknown prior.
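The abstract describes Meta-TS only at a high level. As a rough illustration of the idea, here is a minimal Python sketch for a K-armed Gaussian bandit with known noise: the unknown prior mean is given a Gaussian meta-prior, a candidate prior mean is sampled from the meta-posterior before each task, Thompson sampling runs within the task under that prior, and the meta-posterior is then updated from the task's rewards. All names, constants, and the simplified per-arm meta-update below are assumptions made for illustration, not the paper's algorithm or analysis.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical constants, chosen for illustration only.
K = 5          # number of arms
n_tasks = 200  # number of bandit instances sampled from the prior
horizon = 100  # rounds per task
sigma = 1.0    # known reward-noise std
sigma0 = 0.5   # known std of the instance prior N(mu_star, sigma0^2)
sigma_q = 2.0  # std of the meta-prior over the unknown prior mean

# Unknown per-arm prior mean that the meta-learner tries to recover.
mu_star = rng.normal(0.0, sigma_q, K)

# Meta-posterior over mu_star, one independent Gaussian per arm.
q_mean = np.zeros(K)
q_var = np.full(K, sigma_q ** 2)

for task in range(n_tasks):
    theta = rng.normal(mu_star, sigma0)            # instance sampled from the prior
    mu_tilde = rng.normal(q_mean, np.sqrt(q_var))  # prior mean sampled from meta-posterior

    # Thompson sampling on this task with prior N(mu_tilde, sigma0^2).
    post_mean, post_var = mu_tilde.copy(), np.full(K, sigma0 ** 2)
    sums, counts = np.zeros(K), np.zeros(K)
    for t in range(horizon):
        a = int(np.argmax(rng.normal(post_mean, np.sqrt(post_var))))
        r = rng.normal(theta[a], sigma)
        sums[a] += r
        counts[a] += 1
        # Conjugate Gaussian posterior update for the pulled arm.
        post_var[a] = 1.0 / (1.0 / sigma0 ** 2 + counts[a] / sigma ** 2)
        post_mean[a] = post_var[a] * (mu_tilde[a] / sigma0 ** 2 + sums[a] / sigma ** 2)

    # Simplified meta-update: with theta marginalized out, the mean reward of
    # arm k over n pulls is N(mu_star[k], sigma0^2 + sigma^2 / n), so it acts
    # as a Gaussian observation of mu_star (ignoring that pull counts are
    # policy-dependent, which the exact update would account for).
    pulled = counts > 0
    obs = sums[pulled] / counts[pulled]
    obs_var = sigma0 ** 2 + sigma ** 2 / counts[pulled]
    new_var = 1.0 / (1.0 / q_var[pulled] + 1.0 / obs_var)
    q_mean[pulled] = new_var * (q_mean[pulled] / q_var[pulled] + obs / obs_var)
    q_var[pulled] = new_var

print("true prior mean:     ", np.round(mu_star, 2))
print("meta-posterior mean: ", np.round(q_mean, 2))
```

As tasks accumulate, the meta-posterior concentrates around the true prior mean, so the sampled priors used by Thompson sampling become increasingly accurate; this is the adaptation to the unknown prior that the abstract refers to.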