Google Research

Online Learning and Bandits with Queried Hints

ITCS (2023)


We consider the classic stochastic multi-armed bandit (MAB) problem, when at each step, the online policy is given the extra power to probe (or observe) the rewards of $k$ arms before playing one of them. We derive new algorithms whose regret bounds in this model have exponentially better dependence on the time horizon when compared to the classic regret bounds. In particular, we show that $k=3$ probes suffice to achieve parameter-independent constant regret of $O(n^2)$, where $n$ is the number of arms. Such regret bounds cannot be achieved even with full feedback {\em after} the play (that is, in the experts model), showcasing the power of even a few probes {\em before} making the play. We present simulations that show benefit of our policies even on benign instances.

Learn more about how we do research

We maintain a portfolio of research projects, providing individuals and teams the freedom to emphasize specific types of work