Meta-Thompson Sampling

Branislav Kveton
Michael Konobeev
Martin Mladenov
Proceedings of the 38th International Conference on Machine Learning (ICML 2021), pp. 5884-5893


Efficient exploration in multi-armed bandits is a fundamental online learning problem. In this work, we propose a variant of Thompson sampling that learns to explore over time by interacting with problem instances sampled from an unknown prior distribution. This algorithm meta-learns the prior and therefore we call it Meta-TS. We propose efficient implementations of Meta-TS and analyze it in Gaussian bandits. Our analysis captures the improvement due to learning the prior and is of a broader interest, because we derive the first prior-dependent upper bound on the Bayes regret. Our regret bound is complemented by empirical evaluation, which shows that Meta-TS quickly adapts to the unknown prior.

Research Areas