A New Theoretical Framework for Fast and Accurate Online Decision-Making

Nicolo Cesa-Bianchi
Tommaso Cesari
Vianney Perchet
(2021) (to appear)

Abstract

We study a setting in which a learner faces a sequence of A/B tests and has to make as many good decisions as possible within a given budget constraint. Each A/B test $n=1,2,\ldots$ is associated with an unknown (and potentially negative) reward $\mu_n \in [-1,1]$, drawn i.i.d.\ from an unknown and fixed distribution. For each A/B test $n$, the learner draws i.i.d.\ samples of a $\{-1,1\}$-valued random variable with mean $\mu_n$ sequentially until a halting criterion is met. The learner then decides to either accept the reward $\mu_n$ or to reject it and get $0$ instead. We measure the learner's performance as the average reward per time step; more precisely, as the sum of the expected rewards of the accepted $\mu_n$ divided by the total number of time steps. Note that this is different from the average reward per $\mu_n$. We design algorithms and prove data-dependent regret bounds against any set of policies based on arbitrary halting criteria and decision rules. Though our algorithms borrow ideas from multi-armed bandits, the two settings are significantly different and not comparable. In fact, the value of $\mu_n$ is never observed directly in our setting---unlike rewards in stochastic bandits. Moreover, the particular structure of our problem allows our regret bounds to be independent of the number of policies.
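To make the setting and the performance measure concrete, here is a minimal simulation sketch in Python. It assumes a hypothetical fixed-sample-size halting criterion and a simple accept-if-positive decision rule, neither of which is the paper's policy class; the point is only to illustrate how the average reward per time step (rather than per A/B test) is computed.

```python
import numpy as np

rng = np.random.default_rng(0)

def run_ab_tests(num_tests=1000, samples_per_test=50, threshold=0.0):
    """Simulate the setting: each A/B test n carries an unknown reward mu_n in [-1, 1];
    the learner draws {-1, +1} samples with mean mu_n, then accepts mu_n or takes 0.
    Hypothetical choices for illustration: a fixed-sample-size halting rule and a
    decision rule that accepts whenever the empirical mean exceeds `threshold`."""
    total_reward = 0.0
    total_steps = 0
    for _ in range(num_tests):
        mu = rng.uniform(-1.0, 1.0)  # unknown (possibly negative) reward of this A/B test
        samples = rng.choice([-1, 1], size=samples_per_test,
                             p=[(1 - mu) / 2, (1 + mu) / 2])
        total_steps += samples_per_test       # every drawn sample consumes one time step
        if samples.mean() > threshold:        # accept the test ...
            total_reward += mu                # ... collecting its expected reward mu_n
        # otherwise reject and collect 0
    # performance measure: sum of accepted rewards divided by total time steps
    return total_reward / total_steps

print(run_ab_tests())
```

Note that drawing more samples per test gives a more reliable accept/reject decision but inflates the denominator, which is exactly the trade-off the budget constraint captures.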

Research Areas