Online Learning and Bandits with Queried Hints

Aditya Bhaskara

Sreenivas Gollapudi

Sungjin Im

Kostas Kollias

Kamesh Munagala

ITCS (2023)

Google Scholar

Abstract

We consider the classic stochastic multi-armed bandit (MAB) problem, when at each step, the online policy is given the extra power to probe (or observe) the rewards of $k$ arms before playing one of them. We derive new algorithms whose regret bounds in this model have exponentially better dependence on the time horizon when compared to the classic regret bounds. In particular, we show that $k=3$ probes suffice to achieve parameter-independent constant regret of $O(n^2)$, where $n$ is the number of arms. Such regret bounds cannot be achieved even with full feedback {\em after} the play (that is, in the experts model), showcasing the power of even a few probes {\em before} making the play. We present simulations that show benefit of our policies even on benign instances.

Research Areas

Machine Intelligence
Algorithms and Theory

Defining the technology of today and tomorrow.

Philosophy

People

Teams

AI/ML Foundations  & Capabilities

Algorithms & Optimization

Computing Paradigms

Responsible Human-Centric Technology

Science & Societal Impact

Projects

Publications

Resources

Shaping the future, together.

Student programs

Faculty programs

Conferences & events

Online Learning and Bandits with Queried Hints

Abstract

Research Areas

Learn more about how we conduct our research

Defining the technology of today and tomorrow.

Philosophy

People

Teams

AI/ML Foundations & Capabilities

Algorithms & Optimization

Computing Paradigms

Responsible Human-Centric Technology

Science & Societal Impact

Projects

Publications

Resources

Shaping the future, together.

Student programs

Faculty programs

Conferences & events

Online Learning and Bandits with Queried Hints

Abstract

Research Areas

Learn more about how we conduct our research

AI/ML Foundations  & Capabilities