# A New Theoretical Framework for Fast and Accurate Online Decision-Making

(2021, to appear)

## Abstract

We study a setting in which a learner faces a sequence of A/B tests and has to make as many good decisions as possible within a given budget constraint. Each A/B test $n=1,2,\ldots$ is associated with an unknown (and potentially negative) reward $\mu_n \in [-1,1]$, drawn i.i.d.\ from a fixed and unknown distribution. For each A/B test $n$, the learner sequentially draws i.i.d.\ samples of a $\{-1,1\}$-valued random variable with mean $\mu_n$ until a halting criterion is met. The learner then decides either to accept the reward $\mu_n$ or to reject it and get $0$ instead. We measure the learner's performance as the average reward per time step, i.e., the sum of the expected rewards of the accepted $\mu_n$ divided by the total number of time steps; note that this differs from the average reward per $\mu_n$. We design algorithms and prove data-dependent regret bounds against any set of policies based on arbitrary halting criteria and decision rules. Although our algorithms borrow ideas from multi-armed bandits, the two settings are significantly different and not comparable: the value of $\mu_n$ is never observed directly in our setting, unlike rewards in stochastic bandits. Moreover, the particular structure of our problem allows our regret bounds to be independent of the number of policies.
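To make the interaction protocol concrete, the following is a minimal simulation sketch. The halting criterion (a fixed sample size `halting_n`) and the decision rule (accept when the empirical mean exceeds `accept_threshold`) are hypothetical placeholders, not the paper's algorithms; they only illustrate how the performance measure divides by the total number of time steps rather than the number of tests.

```python
import random

def simulate(mu_list, halting_n=100, accept_threshold=0.0, seed=0):
    """Sketch of the abstract's interaction protocol.

    For each A/B test with unknown mean mu in [-1, 1], the learner draws
    i.i.d. {-1, +1}-valued samples with mean mu until a halting criterion
    is met (here, simply a fixed number of samples), then accepts the
    reward mu if the empirical mean clears a threshold. Performance is
    the sum of accepted expected rewards divided by the TOTAL number of
    time steps used, not the average reward per test.
    """
    rng = random.Random(seed)
    total_steps = 0
    accepted_reward = 0.0
    for mu in mu_list:
        # A {-1, +1} sample with mean mu has P(+1) = (1 + mu) / 2.
        samples = [1 if rng.random() < (1 + mu) / 2 else -1
                   for _ in range(halting_n)]
        total_steps += halting_n
        # Hypothetical decision rule: accept when the estimate is positive.
        if sum(samples) / halting_n > accept_threshold:
            accepted_reward += mu  # expected reward of accepting mu
    return accepted_reward / total_steps
```

Note that the learner never observes $\mu_n$ itself, only the $\{-1,1\}$-valued samples; the return value uses the true $\mu_n$ only because it measures *expected* performance.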