Optimistic Whittle Index Policy: Online Learning for Restless Bandits
Abstract
Restless multi-armed bandits (RMABs) extend multi-armed bandits (MABs) by associating a state with each arm; the states evolve restlessly, with different transition probabilities depending on whether the arm is pulled. This additional state information lets RMABs capture a broader range of state-dependent applications, including digital marketing and healthcare recommendation. However, solving RMABs requires knowledge of the transition dynamics, which is often unavailable upfront. This paper considers learning the transition probabilities in an RMAB setting while maintaining small regret. We use confidence bounds on the transition probabilities to define an optimistic Whittle index policy that solves the RMAB problem with sub-linear regret relative to the benchmark. Our algorithm, UCWhittle, leverages the structure of RMABs and the Whittle index policy to achieve better performance than online learning baselines that lack this structural information. We evaluate UCWhittle on real-world healthcare data to help reduce maternal mortality.
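For concreteness, the sketch below illustrates one way an optimistic Whittle index could be computed for a single two-state arm: empirical transition estimates are shifted within a confidence radius in the direction that favors pulling, and the index is then found by binary search on the passive-action subsidy. This is a simplified illustration under assumed conventions, not the paper's exact optimization; the two-state setup, the confidence-box heuristic, and the helper names (`whittle_index`, `optimistic_whittle`) are assumptions made for the sketch, and indexability of the arm is taken for granted.

```python
import numpy as np

def whittle_index(P, s, reward=(0.0, 1.0), gamma=0.9, lo=-1.0, hi=2.0, tol=1e-4):
    """Whittle index of state s for a two-state, two-action arm (a sketch).

    P[state][a] is the probability of moving to state 1 from `state` under
    action a (a=0 passive, a=1 active). The index is the passive-action
    subsidy lambda at which both actions are equally valuable in state s.
    Indexability is assumed, so the crossing point is unique.
    """
    def q_values(lam):
        V = np.zeros(2)
        for _ in range(1000):  # value iteration on the subsidized single-arm MDP
            Q = np.zeros((2, 2))
            for state in range(2):
                for a in range(2):
                    ev = P[state][a] * V[1] + (1.0 - P[state][a]) * V[0]
                    Q[state, a] = reward[state] + (lam if a == 0 else 0.0) + gamma * ev
            V_new = Q.max(axis=1)
            if np.abs(V_new - V).max() < 1e-9:
                V = V_new
                break
            V = V_new
        return Q[s]

    while hi - lo > tol:  # binary search on the subsidy
        lam = 0.5 * (lo + hi)
        q_passive, q_active = q_values(lam)
        if q_active > q_passive:
            lo = lam  # acting is still preferred, so the subsidy must rise
        else:
            hi = lam
    return 0.5 * (lo + hi)

def optimistic_whittle(P_hat, radius, s):
    """Heuristic optimism (an assumption of this sketch): within the
    confidence box, push active-action transitions to the good state up
    and passive ones down, inflating the apparent benefit of pulling."""
    P_opt = np.clip(
        [[P_hat[0][0] - radius, P_hat[0][1] + radius],
         [P_hat[1][0] - radius, P_hat[1][1] + radius]],
        0.0, 1.0)
    return whittle_index(P_opt, s)

# Hypothetical usage: rank two arms by optimistic index and pull the top one.
P_hats = [np.array([[0.3, 0.6], [0.5, 0.8]]), np.array([[0.2, 0.4], [0.6, 0.9]])]
states = [1, 0]
indices = [optimistic_whittle(P, 0.05, s) for P, s in zip(P_hats, states)]
pull = int(np.argmax(indices))
```

In a full UCWhittle-style loop, one would presumably recompute these indices each episode from updated transition counts (with confidence radii shrinking as data accumulates) and pull the arms with the largest optimistic indices.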