This is an extended version of a paper by the same title that appeared in the Proceedings of the 32nd Conference on Uncertainty in Artificial Intelligence (UAI-16), pp.52--61, New York, 2016.
We consider the problem of budget (or other resource) allocation in sequential decision problems involving a large number of concurrently running sub-processes, whose only interaction is through their gradual consumption of budget (or the resource in question). We use the case of an advertiser interacting with a large population of target customers as a primary motivation. We consider two complementary problems that comprise our key contributions.
Our first contribution addresses the problem of computing MDP value functions as a function of the available budget. In contrast to standard constrained MDPs—which find optimal value functions given a fixed expected budget constraint—our aim is to assess the tradeoff between expected budget spent and the value attained when using that budget optimally. We show that optimal value functions are concave in budget. More importantly, in the finite-horizon case, we show there are a finite number of useful budget levels. This gives rise to piecewise-linear, concave value functions (piecewise-constant if we restrict to deterministic policies) with an representation that can be computed readily via dynamic programming. This representation also supports natural approximations. Our model not only allows the assessment of budget/value tradeoffs (e.g., to find the “sweet spot” in spend), but plays an important role in the allocation of budget across competing subprocesses.
Our second contribution is a method for constructing a policy that prescribes the joint policy to be taken across all sub-processes given the joint state of the system, subject to a global budget constraint. We cast the problem as a weakly coupled MDP in which budget is allocated online to the individual subprocesses based on its observed (initial) state and the subprocess-specific value function. We show that the budget allocation process can be cast as a multi-item, multiple-choice knapsack problem (MCKP), which admits an efficient greedy algorithm to determine optimal allocations. We also discuss the possibility of online, per-stage re-allocation of budget to adaptively satisfy strict rather than expected budget constraints.