Google Research

Contextual Blocking Bandits

  • Constantine Caramanis
  • Orestis Papadigenopoulas
  • Sanjay Shakkottai
  • Soumya Basu
Advances in Neural Information Processing Systems 33 (2021) (to appear)

Abstract

We study a novel variant of the multi-armed bandit problem, where at each time step, the player observes a context that determines the arms' mean rewards. However, playing an arm blocks it (across all contexts) for a fixed number of future time steps. This model extends the blocking bandits model (Basu et al., NeurIPS19) to a contextual setting, and captures important scenarios such as recommendation systems or ad placement with diverse users, and processing diverse pool of jobs. This contextual setting, however, invalidates greedy solution techniques that are effective for its non-contextual counterpart. Assuming knowledge of the mean reward for each arm-context pair, we design a randomized LP-based algorithm which is α-optimal in (large enough) T time steps, where α=dmax(1−epsilon)/(2dmax−1) for any epsilon>0, and dmax is the maximum delay of the arms. In the bandit setting, we show that a UCB based variant of the above online policy guarantees \O(logT) regret w.r.t. the α-optimal strategy in T time steps, which matches the \Omega(log(T)) regret lower bound in this setting. Due to the time correlation caused by the blocking of arms, existing techniques for upper bounding regret fail. As a first, in the presence of such temporal correlations, we combine ideas from coupling of non-stationary Markov chains and opportunistic sub-sampling with sub-optimality charging techniques from combinatorial bandits to prove our regret upper bounds.

Learn more about how we do research

We maintain a portfolio of research projects, providing individuals and teams the freedom to emphasize specific types of work