Online Learning and Bandits with Queried Hints
Abstract
We consider the classic stochastic multi-armed bandit (MAB) problem in the setting where, at each step, the online policy is given the extra power to probe (i.e., observe) the rewards of $k$ arms before playing one of them. We derive new algorithms whose regret bounds in this model have exponentially better dependence on the time horizon than the classic regret bounds. In particular, we show that $k=3$ probes suffice to achieve parameter-independent constant regret of $O(n^2)$, independent of the time horizon, where $n$ is the number of arms. Such regret bounds cannot be achieved even with full feedback {\em after} the play (that is, in the experts model), showcasing the power of even a few probes {\em before} making the play. We also present simulations that show the benefit of our policies even on benign instances.
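To make the interaction protocol concrete, the following is a minimal simulation sketch of the probe-then-play model described above: in each round the learner probes $k=3$ arms, observes their realized rewards, and then plays one of the probed arms. The greedy probe-selection rule and all names (\texttt{simulate}, \texttt{arm\_means}, etc.) are illustrative assumptions for this sketch, not the policies analyzed in the paper.

\begin{verbatim}
# Illustrative sketch (not the paper's algorithm): a stochastic MAB
# simulation in which, each round, the policy probes k arms, observes
# their realized rewards, and then plays one of the probed arms.
import numpy as np

rng = np.random.default_rng(0)

def simulate(means, T=10000, k=3):
    """Naive probe-then-play policy on Bernoulli arms.

    Each round: probe the k arms with the highest empirical means
    (unseen arms are probed first), observe their realized rewards for
    this round, then play the probed arm with the largest observed
    reward.  Returns cumulative regret against the best arm's mean.
    """
    means = np.asarray(means, dtype=float)
    n = len(means)
    counts = np.zeros(n)
    sums = np.zeros(n)
    best_mean = means.max()
    total_reward = 0.0
    for _ in range(T):
        # Empirical means; unseen arms get +inf so they are probed first.
        emp = np.where(counts > 0, sums / np.maximum(counts, 1), np.inf)
        probe_set = np.argsort(-emp)[:k]
        # Rewards realized this round for the probed arms (observed for free).
        probed_rewards = rng.binomial(1, means[probe_set])
        # Play the probed arm whose observed reward is largest.
        idx = int(np.argmax(probed_rewards))
        choice = probe_set[idx]
        reward = probed_rewards[idx]
        counts[choice] += 1
        sums[choice] += reward
        total_reward += reward
    return T * best_mean - total_reward

if __name__ == "__main__":
    arm_means = [0.5, 0.45, 0.4, 0.35, 0.3]   # hypothetical instance
    print("cumulative regret:", simulate(arm_means))
\end{verbatim}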