Contextual Blocking Bandits
Abstract
We study a novel variant of the multi-armed bandit problem, where at each time step, the
player observes an independently sampled context that determines the arms’ mean rewards.
However, playing an arm blocks it (across all contexts) for a fixed and known number of future time
steps. The above contextual setting, which captures important scenarios such as recommendation
systems or ad placement with diverse users, invalidates greedy solution techniques that are
effective for its non-contextual counterpart (Basu et al., NeurIPS 2019). Assuming knowledge of the
context distribution and the mean reward of each arm-context pair, we cast the problem as an
online bipartite matching problem, where the right-vertices (contexts) arrive stochastically and
the left-vertices (arms) are blocked for a finite number of rounds each time they are matched.
This problem has been recently studied in the full-information case, where competitive ratio
bounds have been derived. We focus on the bandit setting, where the reward distributions
are initially unknown; we propose a UCB-based variant of the full-information algorithm that
guarantees O(log T) regret w.r.t. an α-optimal strategy in T time steps, matching the Ω(log T)
regret lower bound in this setting. Due to the time correlations caused by blocking, existing
techniques for upper bounding regret fail. To prove our regret bounds, we introduce the novel
concepts of delayed exploitation and opportunistic sub-sampling, and combine them with ideas
from combinatorial bandits and the coupling of non-stationary Markov chains.
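To make the interaction protocol concrete, the following is a minimal, hypothetical Python sketch (not the paper's algorithm, which combines UCB estimates with a full-information matching-based strategy, delayed exploitation, and opportunistic sub-sampling). It only simulates the setting described above: an i.i.d. context is observed each round, each played arm is blocked for D future rounds across all contexts, and a naive per-context UCB index is used among the currently available arms. All variable names (K, C, D, T, mu, ctx_probs) are assumptions introduced here for illustration.

import numpy as np

# Hypothetical toy simulation of the contextual blocking bandit protocol.
rng = np.random.default_rng(0)
K, C, D, T = 5, 3, 2, 10_000              # arms, contexts, blocking delay, horizon (assumed values)
mu = rng.uniform(0.1, 0.9, size=(K, C))   # unknown mean reward of arm k under context c
ctx_probs = rng.dirichlet(np.ones(C))     # known context distribution (used only to sample contexts)

pulls = np.zeros((K, C))                  # number of times each arm-context pair was played
means = np.zeros((K, C))                  # empirical mean reward of each arm-context pair
free_at = np.zeros(K, dtype=int)          # first round at which each arm becomes available again

for t in range(T):
    c = rng.choice(C, p=ctx_probs)                    # observe the sampled context
    avail = np.flatnonzero(free_at <= t)              # arms not currently blocked
    if avail.size == 0:
        continue                                      # all arms blocked: skip this round
    # Per-context UCB index; unexplored arm-context pairs are tried first.
    ucb = np.where(pulls[avail, c] > 0,
                   means[avail, c]
                   + np.sqrt(2 * np.log(t + 1) / np.maximum(pulls[avail, c], 1)),
                   np.inf)
    k = avail[np.argmax(ucb)]
    r = rng.binomial(1, mu[k, c])                     # Bernoulli reward for the played pair
    pulls[k, c] += 1
    means[k, c] += (r - means[k, c]) / pulls[k, c]    # running-mean update
    free_at[k] = t + D + 1                            # playing blocks the arm for D future rounds

The sketch is only meant to show why greedy per-context play interacts poorly with blocking: a myopically attractive arm can be unavailable exactly when its best context arrives, which is the difficulty the paper's matching-based approach addresses.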