This is an extended version of the paper Logistic Markov Decision Processes that appeared in the Proceedings of the Twenty-sixth International Joint Conference on Artificial Intelligence (IJCAI-17), pp.2486-2493, Melbourne (2017).
Online and mobile interactions with users, in areas such as advertising and product or content recommendation, have been transformed by machine learning techniques. However, such methods have largely focused on myopic prediction, i.e., predicting immediate user response to system actions (e.g., ads or recommendations), without explicitly accounting for the long-term impact on user behavior, nor the potential need for planning action sequences. In this work, we propose the use of Markov decision processes (MDPs) to formulate the long-term decision problem and address two key questions that emerge in their application to user interaction.
The first focuses on model formulation, specifically, how best to construct MDP models of user interaction in a way that exploits the great successes of myopic prediction models. To this end, we propose a new model called logistic MDPs, an MDP formulation that allows the concise specification of transition dynamics. It does so by augmenting the natural factored form of dynamic Bayesian networks (DBNs) with user response variables that are captured by a logistic regression model (the latter being precisely the model used for myopic user interaction).
The second question we address is how best to solve large logistic MDPs of this type. A variety of methods have been proposed for solving MDPs that exploit the conditional independence reflected in the DBN representations, including approximate linear programming (ALP). Despite their compact form, logistic MDPs do not admit the same conditional independence as DBNs, nor do they satisfy the linearity requirements for standard ALP. We propose a constraint generation approach to ALP for logistic MDPs that circumvents these problems by: (a) recovering compactness by conditioning on the logistic response variable; and (b) devising two procedures, one exact and one approximate, that linearize the search for violated constraints in the master LP. For the approximation procedure, we also derive error bounds on the quality of the induced policy. We demonstrate the effectiveness of our approach on advertising problems with up to several thousand sparse binarized features (up to 2^54 and 2^39 actions).