Jump to Content
Craig Boutilier

Craig Boutilier

Craig Boutilier is Principal Scientist at Google. He works on various aspects of decision making under uncertainty, with a current focus on sequential decision models: reinforcement learning, Markov decision processes, temporal models, etc.

Positions and Appointments:
He was a Professor in the Department of Computer Science at the University of Toronto (on leave) and Canada Research Chair in Adaptive Decision Making for Intelligent Systems. He received his Ph.D. in Computer Science from the University of Toronto in 1992, and worked as an Assistant and Associate Professor at the University of British Columbia from 1991 until his return to Toronto in 1999. He served as Chair of the Department of Computer Science at Toronto from 2004-2010. He was co-founder (with Tyler Lu) of Granata Decision Systems from 2012-2015, until his move to Google in 2015.

Boutilier was a consulting professor at Stanford University from 1998-2000, an adjunct professor at the University of British Columbia from 1999-2010, and a visiting professor at Brown University in 1998, at the University of Toronto in 1997-98, at Carnegie Mellon University in 2008-09, and at Université Paris-Dauphine (Paris IX) in the spring of 2011. He served on the Technical Advisory Board of CombineNet, Inc. from 2001 to 2010.

Research:
Boutilier's current research efforts focus on various aspects of decision making under uncertainty, including the use of generative models and LLMs, in areas such as: recommender systems, preference modeling and elicitation, mechanism design, game theory and multiagent decision processes, economic models, social choice, computational advertising, Markov decision processes, reinforcement learning and probabilistic inference. His research interests have spanned a wide range of topics, from knowledge representation, belief revision, default reasoning, and philosophical logic, to probabilistic reasoning, decision making under uncertainty, multiagent systems, and machine learning.

Research & Academic Service:
Boutilier is a past Editor-in-Chief of the Journal of Artificial Intelligence Research (JAIR). He was a past Associate Editor with the ACM Transactions on Economics and Computation (TEAC), the Journal of Artificial Intelligence Research (JAIR), the Journal of Machine Learning Research (JMLR), and Autonomous Agents and Multiagent Systems (AAMAS); and he has sat on the editorial/advisory boards of several other journals. Boutilier has organized several international conferences and workshops, including his work as Program Chair of the Twenty-first International Joint Conference on Artificial Intelligence (IJCAI-09) and Program Chair of the Sixteenth Conference on Uncertainty in Artificial Intelligence (UAI-2000). He has also served on the conference program committees of roughly 75 leading international conferences.

Awards and Honors:
Boutilier is a Fellow of the Royal Society of Canada (RSC), the Association for Computing Machinery (ACM) and the Association for the Advancement of Artificial Intelligence (AAAI). He was the recipient of the 2018 ACM/SIGAI Autonomous Agents Research Award, He was awarded a Tier I Canada Research Chair, an Isaac Walton Killam Research Fellowship, and an IBM Faculty Award. He received the Killam Teaching Award from the University of British Columbia in 1997. He has also received a number of Best Paper awards including:

Authored Publications
Google Publications
Other Publications
Sort By
  • Title
  • Title, descending
  • Year
  • Year, descending
    Model-Free Preference Elicitation
    Carlos Martin
    Tuomas Sandholm
    Ofer Meshi
    Proceedings of the 33rd International Joint Conference on Artificial Intelligence (IJCAI-24), Jeju, South Korea (2024) (to appear)
    Preview abstract Elicitation of user preferences is becoming an important approach for improving the qualityof recommendations, especially when there is little or no user history. In this setting, arecommender system interacts with the user by iteratively presenting elicitation questionsand recording their responses. Various criteria have been proposed for optimizing thesequence of queries in order to improve user understanding and thereby the quality ofdownstream recommendations. A compelling approach for preference elicitation is theExpected Value of Information (EVOI), a Bayesian approach which computes the expectedgain in user utility for possible queries. Previous work on EVOI has focused on probabilisticmodels of users for computing posterior utilities. In contrast, in this work we exploremodel-free variants of EVOI which rely on function approximations in order to avoid strongmodeling assumptions. Specifically, we propose to learn a user response model and a userutility model from data which is often available in real-world systems, and to use thesemodels in EVOI in place of the probabilistic models. We show that our approach leads toimproved elicitation performance. View details
    Demystifying Embedding Spaces using Large Language Models
    Yinlam Chow
    Jihwan Jeong
    Lior Shani
    Martin Mladenov
    The Twelfth International Conference on Learning Representations (2024)
    Preview abstract Embeddings have become a pivotal means to represent complex, multi-faceted information about entities, concepts, and relationships in a condensed and useful format. Nevertheless, they often preclude direct interpretation. While downstream tasks make use of these compressed representations, meaningful interpretation usually requires visualization using dimensionality reduction or specialized machine learning interpretability methods. This paper addresses the challenge of making such embeddings more interpretable and broadly useful, by employing large language models (LLMs) to directly interact with embeddings -- transforming abstract vectors into understandable narratives. By injecting embeddings into LLMs, we enable querying and exploration of complex embedding data. We demonstrate our approach on a variety of diverse tasks, including: enhancing concept activation vectors (CAVs), communicating novel embedded entities, and decoding user preferences in recommender systems. Our work couples the immense information potential of embeddings with the interpretative power of LLMs. View details
    Minimizing Live Experiments in Recommender Systems: User Simulation to Evaluate Preference Elicitation Policies
    Martin Mladenov
    Ofer Meshi
    James Pine
    Hubert Pham
    Shane Li
    Xujian Liang
    Anton Polishko
    Ben Scheetz
    Proceedings of he 47th International ACM/SIGIR Conference on Research and Development in Information Retrieval (SIGIR-24), Washington, DC (2024) (to appear)
    Preview abstract Evaluation of policies in recommender systems (RSs) typically involves A/B testing using live experiments on real users to assess a new policy's impact on relevant metrics. This ``gold standard'' comes at a high cost, however, in terms of cycle time, user cost, and potential user retention. In developing policies for onboarding new users, these costs can be especially problematic, since on-boarding occurs only once. In this work, we describe a simulation methodology used to augment (and reduce) the use of live experiments. We illustrate its deployment for the evaluation of preference elicitation algorithms used to onboard new users of the YouTube Music platform. By developing counterfactually robust user behavior models, and a simulation service that couples such models with production infrastructure, we are able to test new algorithms in a way that reliably predicts their performance on key metrics when deployed live, sometimes more reliably than live experiments due to the scale at which simulation can be realized. We describe our domain, our simulation models and platform, results of experiments and deployment, and suggest future steps needed to further realistic simulation as a powerful complement to live experiments. View details
    Discovering Personalized Semantics for Soft Attributes in Recommender Systems using Concept Activation Vectors
    Christina Göpfert
    Alex Haig
    Yinlam Chow
    Ivan Vendrov
    Tyler Lu
    Hubert Pham
    Mohammad Ghavamzadeh
    ACM Transactions on Recommender Systems (2024)
    Preview abstract Interactive recommender systems have emerged as a promising paradigm to overcome the limitations of the primitive user feedback used by traditional recommender systems (e.g., clicks, item consumption, ratings). They allow users to express intent, preferences, constraints, and contexts in a richer fashion, often using natural language (including faceted search and dialogue). Yet more research is needed to find the most effective ways to use this feedback. One challenge is inferring a user's semantic intent from the open-ended terms or attributes often used to describe a desired item, and using it to refine recommendation results. Leveraging concept activation vectors (CAVs) (Kim, et al., 2018) a recently developed approach for model interpretability in machine learning, we develop a framework to learn a representation that captures the semantics of such attributes and connects them to user preferences and behaviors in recommender systems. One novel feature of our approach is its ability to distinguish objective and subjective attributes (both subjectivity of degree and of sense), and associate different senses of subjective attributes with different users. We demonstrate on both synthetic and real-world data sets that our CAV representation not only accurately interprets users' subjective semantics, but can also be used to improve recommendations through interactive item critiquing. View details
    Modeling Recommender Ecosystems: Research Challenges at the Intersection of Mechanism Design, Reinforcement Learning and Generative Models
    Martin Mladenov
    Proceedings of the 38th Annual AAAI Conference on Artificial Intelligence (AAAI-24), Vancouver (2024) (to appear)
    Preview abstract Modern recommender systems lie at the heart of complex ecosystems that couple the behavior of users, content providers, advertisers, and other actors. Despite this, the focus of the majority of recommender research---and most practical recommenders of any import---is on the \emph{local, myopic} optimization of the recommendations made to individual users. This comes at a significant cost to the \emph{long-term utility} that recommenders could generate for its users. We argue that explicitly modeling the incentives and behaviors of all actors in the system---and the interactions among them induced by the recommender's policy---is strictly necessary if one is to maximize the value the system brings to these actors and improve overall ecosystem ``health.'' Doing so requires: optimization over long horizons using techniques such as \emph{reinforcement learning}; making inevitable tradeoffs among the utility that can be generated for different actors using the methods of \emph{social choice}; reducing information asymmetry, while accounting for incentives and strategic behavior, using the tools of \emph{mechanism design}; better modeling of both user and item-provider behaviors by incorporating notions from \emph{behavioral economics and psychology}; and exploiting recent advances in \emph{generative and foundation models} to make these mechanisms interpretable and actionable. We propose a conceptual framework that encompasses these elements, and articulate a number of research challenges that emerge at the intersection of these different disciplines. View details
    Offline Reinforcement Learning for Mixture-of-Expert Dialogue Management
    Dhawal Gupta
    Yinlam Chow
    Mohammad Ghavamzadeh
    Thirty-seventh Conference on Neural Information Processing Systems (NeurIPS-23), New Orleans (2023)
    Preview abstract Reinforcement learning (RL) has offered great promise for developing dialogue management (DM) agents that avoid being short-sighted, conduct rich conversations, and maximize overall user satisfaction. Despite recent developments in deep RL and language models (LMs), using RL to power conversational chatbots remain a formidable challenge. This is because deep RL algorithms require online exploration to learn effectively, but collecting fresh human-bot interactions can be expensive and unsafe. This issue is exacerbated by the combinatorial action space that these algorithms need to handle, as most LM agents generate responses at the word-level. Leveraging the recent advances of Mixture-of-Expert Language Models (MoE-LMs) that capture diverse semantics, generate utterances of different intents, and are amenable for multi-turn DM, we develop a gamut of offline RL algorithms that excel in dialogue planning. Through exploiting the MoE-LM structure, our methods significantly reduce the action space and improve the efficacy of RL DM. We compare that with SOTA methods on open-domain dialogues to demonstrate their effectiveness both in the diversity of generated utterances and the overall DM performance. View details
    Reinforcement Learning with History Dependent Dynamic Contexts
    Nadav Merlis
    Lior Shani
    Martin Mladenov
    Proceedings of the 40th International Conference on Machine Learning (ICML 2023), Honolulu, Hawaii
    Preview abstract We introduce a framework for modeling and solving reinforcement learning problems in non-Markovian, history-dependent environments. Our framework, called the Dynamic Contextual Markov Decision Process (DCMDP), generalizes the contextual MDP framework to handle non-Markov environments where contexts change over time. To overcome the exponential dependence on history, we leverage an aggregated mapping of previous visits to states, actions and contexts to construct an optimistic upper confidence-based algorithm, for which we establish regret bounds. Motivated by our theoretical results, we introduce a practical model-based algorithm that addresses history-dependent contexts, by planing in a latent space and using optimism over history-dependent features. We demonstrate the efficiency and performance of our approach on a recommendation task using the MovieLens dataset, in which the user's behavior is influenced by the agent's recommendations and changes over time. View details
    Fine-tuning Text-to-Image Diffusion Models via Reinforcement Learning from Human Feedback
    Ying Fan
    Olivia Watkins
    Yuqing Du
    Hao Liu
    Moonkyung Ryu
    Pieter Abbeel
    Mohammad Ghavamzadeh
    Kangwook Lee
    Kimin Lee
    Thirty-seventh Conference on Neural Information Processing Systems (NeurIPS-23), New Orleans (2023)
    Preview abstract Despite significant progress in text-to-image synthesis, current models often produce images that do not align well with text prompts. To overcome this challenge, recent works have collected a large dataset of human feedback and trained a reward function that aligns with human evaluations. However, optimizing text-to-image models to maximize this reward function remains a challenging problem. In this work, we investigate reinforcement learning (RL) to fine-tune text-to-image models. Specifically, we define the fine-tuning task as an RL problem, tailored for diffusion models. We then update the pre-trained text-to-image diffusion models using a policy gradient algorithm to maximize the scores of the reward model, based on human feedback. We investigate several design choices, such as KL regularization, value learning, and balancing regularization coefficients, and find that careful consideration of these design choices is crucial for effective RL fine-tuning. Our experiments demonstrate that RL fine-tuning is more effective in improving pre-trained models than supervised fine-tuning, in terms of both alignment and fidelity. View details
    A Mixture-of-Expert Approach to RL-based Dialogue Management
    Yinlam Chow
    Ofir Nachum
    Dhawal Gupta
    Moonkyung Ryu
    Mohammad Ghavamzadeh
    Proceedings of the Eleventh International Conference on Learning Representations (ICLR-23), Kigali, Rwanda (2023)
    Preview abstract Despite recent advancements in language models (LMs), their application to dialogue management (DM) and ability to carry on rich conversations remains a challenge. We use reinforcement learning (RL) to develop a dialogue agent that avoids being short-sighted (often outputting generic utterances) and maximizes overall user satisfaction. However, existing RL approaches focus on training an agent that operates at the word level. Since generating semantically-correct and sensible utterances from a large vocabulary space is combinatorially complex, RL can struggle to produce engaging dialogue, even if warm-started with a pre-trained LM. To address this issue, we develop a RL-based DM using a novel mixture-of-expert (MoE) approach, which consists of (i) a language representation that captures diverse information, (ii) several modulated LMs (or experts) to generate candidate utterances, and (iii) a RL-based DM that performs dialogue planning with the utterances generated by the experts. This MoE approach provides greater flexibility to generate sensible utterances of different intents, and allows RL to focus on conversational-level DM. We compare it with SOTA baselines on open-domain dialogues and demonstrate its effectiveness both in the diversity and sensibility of the generated utterances as well as the overall DM performance. View details
    Building Human Values into Recommender Systems: An Interdisciplinary Synthesis and Open Problems
    Jonathan Stray
    Alon Halevy
    Parisa Assar
    Dylan Hadfield-menell
    Amar Ashar
    Chloe Bakalar
    Lex Beattie
    Michael Ekstrand
    Claire Leibowicz
    Connie Moon Sehat
    Sara Johansen
    Lianne Kerlin
    David Vickrey
    Spandana Singh
    Sanne Vrijenhoek
    Amy Zhang
    Mckane Andrus
    Natali Helberger
    Polina Proutskova
    Tanushree Mitra
    Nina Vasan
    ACM Transactions on Recommender Systems (2023)
    Preview abstract Recommender systems are the algorithms which select, filter, and personalize content across many of the world’s largest platforms and apps. As such, their positive and negative effects on individuals and on societies have been extensively theorized and studied. The overarching question that arises is whether recommender systems align with the values of the individuals and societies that they serve. Addressing this question in a principled fashion requires technical knowledge of recommender design and their practice, but critically depends on insights from diverse fields including social science, ethics, economics, psychology, policy and law. This paper is a multidisciplinary effort to define a common language for addressing questions around human-value alignment for recommender systems, to synthesize the state of practice and insights from different perspectives, and to propose open problems. We propose a set of values that seem most relevant to recommender systems operating across different domains. We look at values from three different perspectives: 1) measurement, which is a key element of operationalizing values, 2) design, reflecting current approaches and open challenges to implementing these values, and 3) policy, the regulatory approaches which could provide appropriate incentives and standards for recommender system operators. View details
    Thompson Sampling with a Mixture Prior
    Joey Hong
    Branislav Kveton
    Mohammad Ghavamzadeh
    Proceedings of The 25th International Conference on Artificial Intelligence and Statistics (AI-Stats-22) (2022), pp. 7565-7586
    Preview abstract We consider posterior sampling in online decision-making problems where the uncertain environment is sampled from a mixture distribution. We incorporate this structure in a natural way by initializing a Thompson sampling algorithm with a mixture prior. We provide a novel, general outline for analyzing the regret of Thompson sampling with a mixture prior. We also use this to derive Bayes regret bounds in both a linear bandit and tabular MDP settings. The regret bounds depend on the confidence widths of each component of the mixture prior, and converge to solely identifying the correct component when confidence widths are small. Finally, we demonstrate the empirical effectiveness of our proposed algorithm in a synthetic and real-world bandit problem involving multi-task image classification. View details
    Subjective Attributes in Conversational Recommendation Systems: Challenges and Opportunities
    Filip Radlinski
    Ivan Vendrov
    Proceedings of the 36th AAAI Conference on Artificial Intelligence (AAAI-22) (2022), pp. 12287-12293
    Preview abstract The ubiquity of recommender systems has increased the need for higher-bandwidth, natural and efficient communication with users. This need is increasingly filled by recommenders that support natural language interaction, often conversationally. Given the inherent semantic subjectivity present in natural language, we argue that modeling subjective attributes in recommenders is a critical, yet understudied, avenue of AI research. We propose a novel framework for understanding different forms of subjectivity, examine various recommender tasks that will benefit from a systematic treatment of subjective attributes, and outline a number of research challenges. View details
    Discovering Personalized Semantics for Soft Attributes in Recommender Systems using Concept Activation Vectors
    Christina Göpfert
    Yinlam Chow
    Ivan Vendrov
    Tyler Lu
    WWW22: The Web Conference 2022, Lyon, France, pp. 2411-2421
    Preview abstract Interactive Recommender Systems (RSs) have emerged as a promising paradigm to overcome the limitations of the primitive user feedback used by traditional RSs (e.g., clicks, item consumption, ratings), allowing users to express intent, preferences, constraints, and contexts in a richer fashion using natural language. Still, more research is needed to find the most effective ways to use this feedback. One major challenge is inferring a user's intended semantic intent from given the open-ended terms (say, attributes or tags) used to describe a desired item, and utilize that to refine recommendation results. Leveraging Concept Activation Vectors (CAVs) [13], we develop a framework to learn a representation that captures the semantics of such attributes and connect them to user preferences and behaviors in RSs. One novel feature of our approach is its ability to distinguish objective and subjective attributes (including subjectivity of degree and of sense) and associate different senses of subjective attributes with different user. We demonstrate on both synthetic and real-world datasets that our CAV representation not only accurately interprets users' subjective semantics, but can also be used to improve recommendations. View details
    An Adversarial Variational Inference Approach for Travel Demand Calibration of Urban Traffic Simulators
    Martin Mladenov
    Proceedings of the 30th ACM SIGSPATIAL Intl. Conf. on Advances in Geographic Information Systems (SIGSPATIAL-22), Seattle, WA (2022) (to appear)
    Preview abstract This paper considers the calibration of travel demand inputs, defined as a set of origin-destination matrices (ODs), for stochastic microscopic urban traffic simulators. The goal of calibration is to find a (set of) travel demand input(s) that replicate sparse field count data statistics. While traditional approaches use only first-order moment information from the field data, it is well known that the OD calibration problem is underdetermined in realistic networks. We study the value of using higher-order statistics from spatially sparse field data to mitigate underdetermination, proposing a variational inference technique that identifies an OD distribution. We apply our approach to a high-dimensional setting in Salt Lake City, Utah. Our approach is flexible—it can be readily extended to account for arbitrary types of field data (e.g., road, path or trip data). View details
    IMO^3: Interactive Multi-Objective Off-Policy Optimization
    Nan Wang
    Hongning Wang
    Branislav Kveton
    Proceedings of the 31st International Joint Conference on Artificial Intelligence (IJCAI-22), Vienna (2022), pp. 3523-3529 (to appear)
    Preview abstract Most real-world optimization problems have multiple objectives. A system designer needs to find a policy that trades off these objectives to reach a desired operating point. This problem has been studied extensively in the setting of known objective functions. However, we consider a more practical but challenging setting of unknown objective functions. In industry, optimization under this setting is mostly approached with online A/B testing, which is often costly and inefficient. As an alternative, we propose Interactive Multi-Objective Off-policy Optimization (IMO3). The key idea of IMO3 is to interact with a system designer using policies evaluated in an off-policy fashion to uncover which policy maximizes her unknown utility function. We theoretically show that IMO3 identifies a near-optimal policy with high probability, depending on the amount of designer feedback and training data for off-policy estimation. We demonstrate its effectiveness empirically on several multi-objective optimization problems. View details
    Meta-Thompson Sampling
    Branislav Kveton
    Michael Konobeev
    Martin Mladenov
    Proceedings of the 38th International Conference on Machine Learning (ICML 2021), pp. 5884-5893
    Preview abstract Efficient exploration in multi-armed bandits is a fundamental online learning problem. In this work, we propose a variant of Thompson sampling that learns to explore over time by interacting with problem instances sampled from an unknown prior distribution. This algorithm meta-learns the prior and therefore we call it Meta-TS. We propose efficient implementations of Meta-TS and analyze it in Gaussian bandits. Our analysis captures the improvement due to learning the prior and is of a broader interest, because we derive the first prior-dependent upper bound on the Bayes regret. Our regret bound is complemented by empirical evaluation, which shows that Meta-TS quickly adapts to the unknown prior. View details
    Preview abstract Most existing recommender systems primarily focus on the users (content consumers), matching users with the most relevant contents, with the goal of maximizing user satisfaction on the platform. However, given that content providers are playing an increasingly critical role through content creation, largely determining the content pool available for recommendation, a natural question that arises is: Can we design recommenders taking into account utilities of both users and content providers? By doing so, we hope to sustain the flourish of more content providers and a diverse content pool for long-term user satisfaction. Understanding the full impact of recommendations on both user and content provider groups is challenging. This paper aims to serve as a research investigation on one approach toward building a content provider-aware recommender, and evaluating its impact under a simulated setup. To characterize the users-recommender-providers interdependence, we complement user modeling by formalizing provider dynamics as a parallel Markov Decision Process of partially observable states transited by recommender actions and user feedback. We then build a REINFORCE recommender agent, coined EcoAgent, to optimize a joint objective of user utility and the counterfactual utility lift of the content provider associated with the chosen content, which we show to be equivalent to maximizing overall user utility and utilities of all content providers on the platform. To evaluate our approach, we also introduce a simulation environment capturing the key interactions among users, providers, and the recommender. We offer a number of simulated experiments that shed light to both the benefits and the limitations of our approach. These results serve to understand how and when a content-provider aware recommender agent is of benefit in building multi-stakeholder recommender systems. View details
    RecSim NG: Toward Principled Uncertainty Modeling for Recommender Ecosystems
    Martin Mladenov
    Vihan Jain
    Christopher Colby
    Nicolas Mayoraz
    Hubert Pham
    Dustin Tran
    Ivan Vendrov
    ArXiv (2021)
    Preview abstract The development of recommender systems that optimize multi-turn interaction with users, and model the interactions of different agents (e.g., users, content providers, vendors) in the recommender ecosystem have drawn increasing attention in recent years. Developing and training models and algorithms for such recommenders can be especially difficult using static datasets, which often fail to offer the types of counterfactual predictions needed to evaluate policies over extended horizons. To address this, we develop RecSim NG, a probabilistic platform for the simulation of multi-agent recommender systems. RecSim NG is a scalable, modular, differentiable simulator implemented in Edward2 and TensorFlow. It offers: a powerful, general probabilistic programming language for agent-behavior specification; tools for probabilistic inference and latent-variable model learning, backed by automatic differentiation and tracing; a TensorFlow-based runtime for running simulations on accelerated hardware. We describe RecSim NG and illustrate how it can be used to create transparent, configurable, end-to-end models of a recommender ecosystem, complemented by a small set of simple use cases that demonstrate how RecSim NG can help both researchers and practitioners easily develop and train novel algorithms for recommender systems. A short version of this paper was published at RecSys 2020. View details
    Preview abstract The use of voting schemes based on rankings of alternatives to solve social choice problems can often impose significant burden on voters, both in terms of communication and cognitive requirements. In this paper, we develop techniques for preference elicitation in voting settings (i.e., vote elicitation) that can alleviate this burden by minimizing the amount of preference information needed to find (approximately or exactly) optimal outcomes. We first describe robust optimization techniques for determining winning alternatives given partial preference information (i.e., partial rankings) using the notion of minimax regret. We show that the corresponding computational problem is tractable for some important voting rules, and intractable for others. We then use the solution to the minimax-regret optimization as the basis for vote elicitation schemes that determine appropriate preference queries for voters to quickly reduce potential regret. We apply these techniques to multi-winner social choice problems as well, in which a slate of alternatives must be selected, developing both exact and greedy robust optimization procedures. Empirical results on several data sets validate the effectiveness of our techniques. View details
    Differentiable Meta-Learning of Bandit Policies
    Branislav Kveton
    Martin Mladenov
    Advances in Neural Information Processing Systems 33 (NeurIPS 2020), pp. 2122-2134
    Preview abstract Exploration policies in Bayesian bandits maximize the average reward over problem instances drawn from some distribution P. In this work, we learn such policies for an unknown distribution P using samples from P. Our approach is a form of meta-learning and exploits properties of P without making strong assumptions about its form. To do this, we parameterize our policies in a differentiable way and optimize them by policy gradients, an approach that is pleasantly general and easy to implement. We derive effective gradient estimators and propose novel variance reduction techniques. We also analyze and experiment with various bandit policy classes, including neural networks and a novel softmax policy. The latter has regret guarantees and is a natural starting point for our optimization. Our experiments show the versatility of our approach. We also observe that neural network policies can learn implicit biases expressed only through the sampled instances. View details
    Demonstrating Principled Uncertainty Modeling for Recommender Ecosystems with RecSim NG
    Martin Mladenov
    Vihan Jain
    Christopher Colby
    Nicolas Mayoraz
    Hubert Pham
    Dustin Tran
    Ivan Vendrov
    RecSys '20: Fourteenth ACM Conference on Recommender Systems (2020), pp. 591-593
    Preview abstract We develop RecSim NG, a probabilistic platform that supports natural, concise specification and learning of models for multi-agent recommender systems simulation. RecSim NG is a scalable, modular, differentiable simulator implemented in Edward2 and TensorFlow. An extended version of this paper is available as arXiv:2103.08057. View details
    BRPO: Batch Residual Policy Optimization
    Sungryull Sohn
    Yinlam Chow
    Ofir Nachum
    Honglak Lee
    Proceedings of the Twenty-ninth International Joint Conference on Artificial Intelligence (IJCAI-20), Yokohama, Japan (2020), pp. 2824-2830
    Preview abstract In batch reinforcement learning (RL), one often constrains a learned policy to be close to the behavior (data-generating) policy, e.g., by constraining the learned action distribution to differ from the behavior policy by some maximum degree that is the same at each state. This can cause batch RL to be overly conservative, unable to exploit large policy changes at frequently-visited, highconfidence states without risking poor performance at sparsely-visited states. To remedy this, we propose residual policies, where the allowable deviation of the learned policy is state-action-dependent. We derive a new for RL method, BRPO, which learns both the policy and allowable deviation that jointly maximize a lower bound on policy performance. We show that BRPO achieves the state-of-the- art performance in a number of tasks. View details
    Preview abstract Preference elicitation is an important component in many AI applications, including decision support and recommender systems. Such systems must assess user preferences, based on interactions with their users, and make recommendations using (possibly incomplete and imprecise) beliefs about those preferences. Mechanisms for explicit preference elicitation---asking users to answer direct queries about their preferences---can be of great value; but due to the cognitive and time cost imposed on users, it is important to minimize the number of queries by asking those that have high (expected) value of information. An alternative approach is to simply make recommendations and have users provide feedback (e.g., accept a recommendation or critique it in some way) and use this more indirect feedback to gradually improve the quality of the recommendations. Due to inherent uncertainty about a user's true preferences, often a set of recommendations is presented to the user at each stage. Conceptually, a set of recommendations can also be viewed as choice query, in which the user indicates which option is most preferred from that set. Because of the potential tension between making a good set recommendation and asking an informative choice query, we explore the connection between the two. We consider two different models of preference uncertainty and optimization: (a) a Bayesian framework in which a posterior over user utility functions is maintained, optimal recommendations are assessed using expected utility, and queries are assessed using expected value of information; and (b) a minimax-regret framework in which user utility uncertainty is strict (represented by a polytope), recommendations are made using the minimax-regret robustness criterion, and queries are assessed using worst-case regret reduction. We show that, somewhat surprisingly, in both cases, there is no tradeoff to be made between good recommendations and good queries: we prove that the optimal recommendation set of size k is also an optimal choice query of size k. We also examine the case where user responses to choice queries are error prone (using both constant and mixed multinomial logit noise models) showing the results are robust to this form of noise. In both frameworks, our theoretical results have practical consequences for the design of interactive recommenders. Our results also allow us to design efficient algorithms to compute optimal query/recommendation sets. We develop several such algorithms (both exact and approximate) for both settings and provide empirical validation of their performance. View details
    ConQUR: Mitigating Delusional Bias in Deep Q-learning
    DiJia (Andy) Su
    Tyler Lu
    Dale Schuurmans
    Proceedings of the Thirty-seventh International Conference on Machine Learning (ICML-20), Vienna, Austria (2020)
    Preview abstract Delusional bias is a fundamental source of error in approximate Q-learning. To date, the only techniques that explicitly address delusion require comprehensive search using tabular value estimates. In this paper, we develop efficient methods to mitigate delusional bias by training Q-approximators with labels that are "consistent" with the underlying greedy policy class. We introduce a simple penalization scheme that encourages Q-labels used across training batches to remain (jointly) consistent with the expressible policy class. We also propose a search framework that allows multiple Q-approximators to be generated and tracked, thus mitigating the effect of premature (implicit) policy commitments. Experimental results demonstrate that these methods can improve the performance of Q-learning in a variety of Atari games, sometimes dramatically. View details
    CaQL: Continuous Action Q-Learning
    Moonkyung Ryu
    Yinlam Chow
    Proceedings of the Eighth International Conference on Learning Representations (ICLR-20), Addis Ababa, Ethiopia (2020)
    Preview abstract In this work we propose CaQL, a value-based reinforcement learning (RL) algorithm that handles continuous actions, whose Q-function is modeled by a generic feed-forward neural network. We show that the problem of calculating Bellman residual can be posed as a mixed-integer linear programming (MILP) problem. Furthermore to reduce the complexity of computing Bellman residual, we propose three techniques (i) dynamic tolerance, (ii) dual filter, (iii) clustering to speed up the computation of max-Q values. Finally, to illustrate the efficiency of CaQL, we compare it with state-of-the-art RL algorithms on benchmark continuous control problems that have various action constraints, and show that CaQL significantly outperforms policy-based methods in heavily constrained environments. View details
    Optimizing Long-term Social Welfare in Recommender Systems:A Constrained Matching Approach
    Martin Mladenov
    Elliot Creager
    Omer Ben-Porat
    Richard Zemel
    Proceedings of the Thirty-seventh International Conference on Machine Learning (ICML-20), Vienna, Austria (2020)
    Preview abstract Most recommender systems (RS) research assumes that a user’s utility can be maximized independently of the utility of the other agents (e.g., other users, content providers). In realistic settings, this is often not true---the dynamics of an RS ecosystem couple the long-term utility of all agents. In this work, we explore settings in which content providers cannot remain viable unless they receive a certain level of user engagement. We formulate the recommendation problem in this setting as one of equilibrium selection in the induced dynamical system, and show that it can be solved as an optimal constrained matching problem. Our model ensures the system reaches an equilibrium with maximal social welfare supported by a sufficiently diverse set of viable providers. We demonstrate that even in a simple, stylized dynamical RS model, the standard myopic approach to recommendation---always matching a user to the best provider---performs poorly. We develop several scalable techniques to solve the matching problem, and also draw connections to various notions of user regret and fairness, arguing that these outcomes are fairer in a utilitarian sense. View details
    Randomized Exploration in Generalized Linear Bandits
    Branislav Kveton
    Lihong Li
    Mohammad Ghavamzadeh
    23rd International Conference on Artificial Intelligence and Statistics (2020)
    Preview abstract We study two randomized algorithms for generalized linear bandits. The first, GLM-TSL, samples a generalized linear model (GLM) from the Laplace approximation to the posterior distribution. The second, GLM-FPL, fits a GLM to a randomly perturbed history of past rewards. We analyze both algorithms and derive $\tilde{O}(d \sqrt{n \log K})$ upper bounds on their $n$-round regret, where $d$ is the number of features and $K$ is the number of arms. The former improves on prior work while the latter is the first for Gaussian noise perturbations in non-linear models. We empirically evaluate both GLM-TSL and GLM-FPL in logistic bandits, and apply GLM-FPL to neural network bandits. Our work showcases the role of randomization, beyond posterior sampling, in exploration. View details
    Latent Bandits Revisited
    Joey Hong
    Branislav Kveton
    Yinlam Chow
    Amr Ahmed
    Advances in Neural Information Processing Systems 33 (NeurIPS 2020), pp. 13423-13433
    Preview abstract A latent bandit is a bandit problem where the learning agent knows reward distributions of arms conditioned on an unknown discrete latent state. The goal of the agent is to identify the latent state, after which it can act optimally. This setting is a natural midpoint between online and offline learning, where complex models can be learned offline and the agent identifies the latent state online. This is of high practical relevance, for instance in recommender systems. In this work, we propose general algorithms for latent bandits, based on both upper confidence bounds and Thompson sampling. The algorithms are contextual, and aware of model uncertainty and misspecification. We provide a unified theoretical analysis of our algorithms, which have lower regret than classic bandit policies when the number of latent states is smaller than actions. A comprehensive empirical study showcases the advantages of our approach. View details
    Gradient-based Optimization for Bayesian Preference Elicitation
    Ivan Vendrov
    Tyler Lu
    Qingqing Huang
    Proceedings of the Thirty-fourth AAAI Conference on Artificial Intelligence (AAAI-20), New York (2020), pp. 10292-10301
    Preview abstract Effective techniques for eliciting user preferences have taken on added importance as recommender systems (RSs) become increasingly interactive and conversational. A common and conceptually appealing Bayesian criterion for selecting queries is expected value of information (EVOI). Unfortunately, it is computationally prohibitive to construct queries with maximum EVOI in RSs with large item spaces. We tackle this issue by introducing a continuous formulation of EVOI as a differentiable network that can be optimized using gradient methods available in modern machine learning computational frameworks (e.g., TensorFlow, PyTorch). We exploit this to develop a novel Monte Carlo method for EVOI optimization, which is much more scalable for large item spaces than methods requiring explicit enumeration of items. While we emphasize the use of this approach for pairwise (or k-wise) comparisons of items, we also demonstrate how our method can be adapted to queries involving subsets of item attributes or “partial items,” which are often more cognitively manageable for users. Experiments show that our gradient-based EVOI technique achieves state-of-the-art performance across several domains while scaling to large item spaces. View details
    Preview abstract Ranking is a central task in machine learning and information retrieval. In this task, it is especially important to present the user with a slate of items that is appealing as a whole. This in turn requires taking into account interactions between items, since intuitively, placing an item on the slate affects the decision of which other items should be placed alongside it. In this work, we propose a sequence-to-sequence model for ranking called seq2slate. At each step, the model predicts the next "best" item to place on the slate given the items already selected. The sequential nature of the model allows complex dependencies between the items to be captured directly in a flexible and scalable way. We show how to learn the model end-to-end from weak supervision in the form of easily obtained click-through data. We further demonstrate the usefulness of our approach in experiments on standard ranking benchmarks as well as in a real-world recommendation system. View details
    Advantage Amplification in Slowly Evolving Latent-State Environments
    Martin Mladenov
    Ofer Meshi
    Dale Schuurmans
    Proceedings of the Twenty-eighth International Joint Conference on Artificial Intelligence (IJCAI-19), Macau, China (2019), pp. 3165-3172
    Preview abstract Latent-state environments with long horizons, such as those faced by recommender systems, pose significant challenges for reinforcement learning (RL). In this work, we identify and analyze several key hurdles for RL in such environments, including belief state error and small action advantage. We develop a general principle called advantage amplification that can overcome these hurdles through the use of temporal abstraction. We propose several aggregation methods and prove they induce amplification in certain settings. We also bound the loss in optimality incurred by our methods in environments where latent state evolves slowly and demonstrate their performance empirically in a stylized user-modeling task. View details
    SlateQ: A Tractable Decomposition for Reinforcement Learning with Recommendation Sets
    Vihan Jain
    Jing Wang
    Sanmit Narvekar
    Ritesh Agarwal
    Rui Wu
    Proceedings of the Twenty-eighth International Joint Conference on Artificial Intelligence (IJCAI-19), Macau, China (2019), pp. 2592-2599
    Preview abstract Reinforcement learning (RL) methods for recommender systems optimize recommendations for long-term user engagement. However, since users are often presented with slates of multiple items---which may have interacting effects on user choice---methods are required to deal with the combinatorics of the RL action space. We develop SlateQ, a decomposition of value-based temporal-difference and Q-learning that renders RL tractable with slates. Under mild assumptions on user choice behavior, we show that the long-term value (LTV) of a slate can be decomposed into a tractable function of its component item-wise LTVs. We demonstrate our methods in simulation, and validate the scalability and effectiveness of decomposed TD-learning on YouTube. View details
    Experiential Preference Elicitation for Autonomous Heating and Cooling Systems
    Andrew Perrault
    Proceedings of the Eighteenth International Conference on Autonomous Agents and Multiagent Systems (AAMAS-19), Montreal (2019), pp. 431-439
    Preview abstract AI systems that act on behalf of users require knowledge of user preferences, which can be acquired by preference elicitation. In many settings, users can respond more easily and accurately to preference queries reflecting their current, or recently experienced, context (e.g., state of the environment), than to those reflecting contexts further removed. We develop and study a formal model of experiential elicitation (EE) in which query costs and response noise are state-dependent. EE settings tightly couple the problems of control and elicitation. We provide some analysis of this abstract model, and illustrate its applicability in household heating/cooling management. We propose the use of relative value queries, asking the user to compare the immediate utility of two states, whose difficulty is related to the degree and recency of a user’s experience with those states. We develop a Gaussian process-based approach for modeling user preferences in dynamic EE domains and show that it accrues higher reward than several natural baselines. View details
    Perturbed-History Exploration in Stochastic Linear Bandits
    Branislav Kveton
    Mohammad Ghavamzadeh
    Proceedings of the Thirty-fifth Conference on Uncertainty in Artificial Intelligence (UAI-19), Tel Aviv, Israel (2019), pp. 176-186
    Preview abstract We propose a new online algorithm for cumulative regret minimization in a stochastic linear bandit. The algorithm pulls the arm with the highest estimated reward in a linear model trained on its perturbed history. Therefore, we call it perturbed-history exploration in a linear bandit (LinPHE). The perturbed history is a mixture of observed rewards and randomly generated i.i.d. pseudo-rewards. We derive a $\tilde{O}(d \sqrt{n})$ gap-free bound on the $n$-round regret of LinPHE, where $d$ is the number of features. The key steps in our analysis are new concentration and anti-concentration bounds on the weighted sum of Bernoulli random variables. To show the generality of our design, we generalize LinPHE to a logistic model. We evaluate our algorithms empirically and show that they are practical. View details
    Preview abstract We propose RecSim, a configurable platform for authoring simulation environments for recommender systems (RSs) that naturally supports sequential interaction with users. RecSim allows the creation of new environments that reflect particular aspects of user behavior and item structure at a level of abstraction well-suited to pushing the limits of current reinforcement learning (RL) and RS techniques in sequential interactive recommendation problems. Environments can be easily configured that vary assumptions about: user preferences and item familiarity; user latent state and its dynamics; and choice models and other user response behavior. We outline how RecSim offers value to RL and RS researchers and practitioners, and how it can serve as a vehicle for academic-industrial collaboration. View details
    Empathetic Decision Making in Social Networks
    Amirali Salehi-Abari
    Kate Larson
    Artificial Intelligence, vol. 275 (2019), pp. 174-203
    Preview abstract Social networks play a central role in the transactions and decision making of individuals by correlating the behaviors and preferences of connected agents. We introduce a notion of empathy in social networks, in which individuals derive utility based on both their own intrinsic preferences, and empathetic preferences determined by the satisfaction of their neighbors in the network. After theoretically analyzing the properties of our empathetic framework, we study the problem of group recommendation, or consensus decision making, within this framework. We show how this problem translates into a weighted form of classical preference aggregation (e.g., social welfare maximization or certain forms of voting), and develop scalable optimization algorithms for this task. Furthermore, we show that our framework can be generalized to encompass other multiagent systems problems, such as constrained resource allocation, and provide scalable iterative algorithms for these generalizations. Our empirical experiments demonstrate the value of accounting for empathetic preferences in group decisions, and the tractability of our algorithms. View details
    Perturbed-History Exploration in Stochastic Multi-Armed Bandits
    Branislav Kveton
    Mohammad Ghavamzadeh
    Proceedings of the Twenty-eighth International Joint Conference on Artificial Intelligence (IJCAI-19), Macau, China (2019), pp. 2786-2793
    Preview abstract We propose an online algorithm for cumulative regret minimization in a stochastic multi-armed bandit. The algorithm adds $O(t)$ i.i.d. pseudo-rewards to its history in round $t$ and then pulls the arm with the highest average reward in its perturbed history. Therefore, we call it perturbed-history exploration (PHE). The pseudo-rewards are carefully designed to offset potentially underestimated mean rewards of arms with a high probability. We derive near-optimal gap-dependent and gap-free bounds on the $n$-round regret of PHE. The key step in our analysis is a novel argument that shows that randomized Bernoulli rewards lead to optimism. Finally, we empirically evaluate PHE and show that it is competitive with state-of-the-art baselines. View details
    Reinforcement Learning for Slate-based Recommender Systems: A Tractable Decomposition and Practical Methodology
    Vihan Jain
    Jing Wang
    Sanmit Narvekar
    Ritesh Agarwal
    Rui Wu
    Morgane Lustman
    Vince Gatto
    Paul Covington
    Jim McFadden
    arXiv (2019)
    Preview abstract Most practical recommender systems focus on estimating immediate user engagement without considering the long-term effects of recommendations on user behavior. Reinforcement learning (RL) methods offer the potential to optimize recommendations for long-term user engagement. However, since users are often presented with slates of multiple items---which may have interacting effects on user choice---methods are required to deal with the combinatorics of the RL action space. In this work, we address the challenge of making slate-based recommendations to optimize long-term value using RL. Our contributions are three-fold. (i) We develop SlateQ, a decomposition of value-based temporal-difference and Q-learning that renders RL tractable with slates. Under mild assumptions on user-choice behavior, we show that the long-term value (LTV) of a slate can be decomposed into a tractable function of its component item-wise LTVs. (ii) We outline a methodology that leverages existing myopic learning-based recommenders to quickly develop a recommender that handles LTV. (iii) We demonstrate our methods in simulation, and validate the scalability of decomposed TD-learning using SlateQ in live experiments on YouTube. View details
    Data Center Cooling using Model-predictive Control
    Tyler Lu
    MK Ryu
    Eehern Jay Wong
    Binz Roy
    Greg Imwalle
    Proceedings of the Thirty-second Conference on Neural Information Processing Systems (NeurIPS-18), Montreal, QC (2018), pp. 3818-3827
    Preview abstract Despite the impressive advances in reinforcement learning (RL) algorithms, their deployment to real-world physical systems is often complicated by unexpected events and the potential for expensive failures. In this paper we describe an application of RL “in the wild” to the task of regulating temperatures and airflow inside a large-scale data center (DC). Adopting a data-driven model-based approach, we demonstrate that an RL agent is able to effectively and safely regulate conditions inside a server floor in just a few hours, while improving operational efficiency relative to existing controllers. View details
    Planning and Learning with Stochastic Action Sets
    Yishay Mansour
    Ofer Meshi
    Martin Mladenov
    Dale Schuurmans
    Proceedings of the Twenty-seventh International Joint Conference on Artificial Intelligence (IJCAI-18), Stockholm (2018), pp. 4674-4682
    Preview abstract This is an extended version of the paper Planning and Learning with Stochastic Action Sets that appeared in the Proceedings of the Twenty-seventh International Joint Conference on Artificial Intelligence (IJCAI-18), pp.4674-4682, Stockholm (2018). In many practical uses of reinforcement learning (RL) the set of actions available at a given state is a random variable, with realizations governed by an exogenous stochastic process. Somewhat surprisingly, the foundations for such sequential decision processes have been unaddressed. In this work, we formalize and investigate MDPs with stochastic action sets (SAS-MDPs) to provide these foundations. We show that optimal policies and value functions in this model have a structure that admits a compact representation. From an RL perspective, we show that Q-learning with sampled action sets is sound. In model-based settings, we consider two important special cases: when individual actions are available with independent probabilities; and a sampling-based model for unknown distributions. We develop poly-time value and policy iteration methods for both cases; and in the first, we offer a poly-time linear programming solution. View details
    Non-delusional Q-learning and value-iteration
    Tyler Lu
    Dale Schuurmans
    Proceedings of the Thirty-second Conference on Neural Information Processing Systems (NeurIPS-18), Montreal, QC (2018), pp. 9971-9981
    Preview abstract We identify a fundamental source of error in Q-learning and other forms of dynamic programming with function approximation. Delusional bias arises when the approximation architecture limits the class of expressible greedy policies. Since standard Q-updates make globally uncoordinated action choices with respect to the expressible policy class, inconsistent or even conflicting Q-value estimates can result, leading to pathological behaviour such as over/under-estimation, instability and even divergence. To solve this problem, we introduce a new notion of policy consistency and define a local backup process that ensures global consistency through the use of information sets---sets that record constraints on policies consistent with backed-up Q-values. We prove that both the model-based and model-free algorithms using this backup remove delusional bias, yielding the first known algorithms that guarantee optimal results under general conditions. These algorithms furthermore only require polynomially many information sets (from a potentially exponential support). Finally, we suggest other practical heuristics for value-iteration and Q-learning that attempt to reduce delusional bias. View details
    Approximate Linear Programming for Logistic Markov Decision Processes
    Martin Mladenov
    Dale Schuurmans
    Ofer Meshi
    Tyler Lu
    Proceedings of the Twenty-sixth International Joint Conference on Artificial Intelligence (IJCAI-17), Melbourne, Australia (2017), pp. 2486-2493
    Preview abstract This is an extended version of the paper Logistic Markov Decision Processes that appeared in the Proceedings of the Twenty-sixth International Joint Conference on Artificial Intelligence (IJCAI-17), pp.2486-2493, Melbourne (2017). Online and mobile interactions with users, in areas such as advertising and product or content recommendation, have been transformed by machine learning techniques. However, such methods have largely focused on myopic prediction, i.e., predicting immediate user response to system actions (e.g., ads or recommendations), without explicitly accounting for the long-term impact on user behavior, nor the potential need for planning action sequences. In this work, we propose the use of Markov decision processes (MDPs) to formulate the long-term decision problem and address two key questions that emerge in their application to user interaction. The first focuses on model formulation, specifically, how best to construct MDP models of user interaction in a way that exploits the great successes of myopic prediction models. To this end, we propose a new model called logistic MDPs, an MDP formulation that allows the concise specification of transition dynamics. It does so by augmenting the natural factored form of dynamic Bayesian networks (DBNs) with user response variables that are captured by a logistic regression model (the latter being precisely the model used for myopic user interaction). The second question we address is how best to solve large logistic MDPs of this type. A variety of methods have been proposed for solving MDPs that exploit the conditional independence reflected in the DBN representations, including approximate linear programming (ALP). Despite their compact form, logistic MDPs do not admit the same conditional independence as DBNs, nor do they satisfy the linearity requirements for standard ALP. We propose a constraint generation approach to ALP for logistic MDPs that circumvents these problems by: (a) recovering compactness by conditioning on the logistic response variable; and (b) devising two procedures, one exact and one approximate, that linearize the search for violated constraints in the master LP. For the approximation procedure, we also derive error bounds on the quality of the induced policy. We demonstrate the effectiveness of our approach on advertising problems with up to several thousand sparse binarized features (up to 2^54 and 2^39 actions). View details
    Multiple-Profile Prediction-of-Use Games
    Andrew Perrault
    Proceedings of the Twenty-sixth International Joint Conference on Artificial Intelligence (IJCAI-17), Melbourne, Australia (2017), pp. 2486-2493
    Preview abstract Prediction-of-use (POU) games (Vinyals et al., 2017) address the mismatch between energy supplier costs and the incentives imposed on consumers by fixed-rate electricity tariffs. However, the framework does not address how consumers should coordinate to maximize social welfare. To address this, we develop MPOU games, an extension of POU games in which agents report multiple acceptable electricity use profiles. We show that MPOU games share many attractive properties with POU games attractive (e.g., convexity). Despite this, MPOU games introduce new incentive issues that prevent the consequences of convexity from being exploited directly, a problem we analyze and resolve. We validate our approach with experimental results using utility models learned from real electricity use data. Note: An extended version of this paper appears as Chapter 17 of Autonomous Agents and Multiagent Systems: AAMAS 2017 Workshops, Best Papers, São Paulo, Brazil, Revised Selected Papers, pp.275-295, Springer Lecture Notes in AI, 2017. View details
    Budget Allocation using Weakly Coupled, Constrained Markov Decision Processes
    Tyler Lu
    Proceedings of the 32nd Conference on Uncertainty in Artificial Intelligence (UAI-16), New York, NY (2016), pp. 52-61
    Preview abstract This is an extended version of a paper by the same title that appeared in the Proceedings of the 32nd Conference on Uncertainty in Artificial Intelligence (UAI-16), pp.52--61, New York, 2016. We consider the problem of budget (or other resource) allocation in sequential decision problems involving a large number of concurrently running sub-processes, whose only interaction is through their gradual consumption of budget (or the resource in question). We use the case of an advertiser interacting with a large population of target customers as a primary motivation. We consider two complementary problems that comprise our key contributions. Our first contribution addresses the problem of computing MDP value functions as a function of the available budget. In contrast to standard constrained MDPs—which find optimal value functions given a fixed expected budget constraint—our aim is to assess the tradeoff between expected budget spent and the value attained when using that budget optimally. We show that optimal value functions are concave in budget. More importantly, in the finite-horizon case, we show there are a finite number of useful budget levels. This gives rise to piecewise-linear, concave value functions (piecewise-constant if we restrict to deterministic policies) with an representation that can be computed readily via dynamic programming. This representation also supports natural approximations. Our model not only allows the assessment of budget/value tradeoffs (e.g., to find the “sweet spot” in spend), but plays an important role in the allocation of budget across competing subprocesses. Our second contribution is a method for constructing a policy that prescribes the joint policy to be taken across all sub-processes given the joint state of the system, subject to a global budget constraint. We cast the problem as a weakly coupled MDP in which budget is allocated online to the individual subprocesses based on its observed (initial) state and the subprocess-specific value function. We show that the budget allocation process can be cast as a multi-item, multiple-choice knapsack problem (MCKP), which admits an efficient greedy algorithm to determine optimal allocations. We also discuss the possibility of online, per-stage re-allocation of budget to adaptively satisfy strict rather than expected budget constraints. View details
    Incomplete Information and Communication in Voting
    Jeff Rosenschein
    Handbook of Computational Social Choice, Cambridge University Press (2016), pp. 223-260
    Optimal social choice functions: A utilitarian view
    Ioannis Caragiannis
    Simi Haber
    Tyler Lu
    Ariel D. Procaccia
    Or Sheffet
    Artif. Intell., vol. 227 (2015), pp. 190-213
    Approximately Stable Pricing for Coordinated Purchasing of Electricity
    Andrew Perrault
    IJCAI (2015), pp. 2624-2631
    Approximately Strategy-proof Mechanisms for (Constrained) Facility Location
    Xin Sui
    AAMAS (2015), pp. 605-613
    Preference-oriented Social Networks: Group Recommendation and Inference
    Amirali Salehi-Abari
    RecSys (2015), pp. 35-42
    Value-Directed Compression of Large-Scale Assignment Problems
    Tyler Lu
    AAAI (2015), pp. 1182-1190
    The Pricing War Continues: On Competitive Multi-Item Pricing
    Omer Lev
    Joel Oren
    Jeffrey S. Rosenschein
    AAAI (2015), pp. 972-978
    Optimal Group Manipulation in Facility Location Problems
    Xin Sui
    ADT (2015), pp. 505-520
    The Pricing War Continues: On Competitive Multi-Item Pricing
    Omer Lev
    Joel Oren
    Jeffrey S. Rosenschein
    CoRR, vol. abs/1408.0258 (2014)
    Efficient coordinated power distribution on private infrastructure
    Andrew Perrault
    AAMAS (2014), pp. 805-812
    Empathetic social choice on social networks
    Amirali Salehi-Abari
    AAMAS (2014), pp. 693-700
    Regret-Based Optimization and Preference Elicitation for Stackelberg Security Games with Uncertainty
    Thanh Hong Nguyen
    Amulya Yadav
    Bo An
    Milind Tambe
    AAAI (2014), pp. 756-762
    Effective sampling and learning for Mallows models with pairwise-preference data
    Tyler Lu
    Journal of Machine Learning Research, vol. 15 (2014), pp. 3783-3829
    A Game-Theoretic Analysis of Catalog Optimization
    Joel Oren
    Nina Narodytska
    AAAI (2014), pp. 1463-1470
    Robust Winners and Winner Determination Policies under Candidate Uncertainty
    Jérôme Lang
    Joel Oren
    Héctor Palacios
    AAAI (2014), pp. 1391-1397
    On the value of using group discounts under price competition
    Reshef Meir
    Tyler Lu
    Moshe Tennenholtz
    Artif. Intell., vol. 216 (2014), pp. 163-178
    Preference Elicitation and Interview Minimization in Stable Matchings
    Joanna Drummond
    AAAI (2014), pp. 645-653
    Vector-space Analysis of Belief-state Approximation for POMDPs
    Pascal Poupart
    CoRR, vol. abs/1301.2304 (2013)
    Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence (2000)
    Moisés Goldszmidt
    CoRR, vol. abs/1304.3842 (2013)
    Integrating Planning and Execution in Stochastic Domains
    Richard Dearden
    CoRR, vol. abs/1302.6799 (2013)
    SPUDD: Stochastic Planning using Decision Diagrams
    Jesse Hoey
    Robert St-Aubin
    Alan J. Hu
    CoRR, vol. abs/1301.6704 (2013)
    Modal Logics for Qualitative Possibility and Beliefs
    CoRR, vol. abs/1303.5393 (2013)
    Multi-Winner Social Choice with Incomplete Preferences
    Tyler Lu
    IJCAI (2013)
    Approximately Optimal Monitoring of Plan Preconditions
    CoRR, vol. abs/1301.3839 (2013)
    Structured Reachability Analysis for Markov Decision Processes
    Ronen I. Brafman
    Christopher W. Geib
    CoRR, vol. abs/1301.7361 (2013)
    Analysis and Optimization of Multi-Dimensional Percentile Mechanisms
    Xin Sui
    Tuomas Sandholm
    IJCAI (2013)
    On the Value of Using Group Discounts under Price Competition
    Reshef Meir
    Tyler Lu
    Moshe Tennenholtz
    AAAI (2013)
    Structured Arc Reversal and Simulation of Dynamic Probabilistic Networks
    Adrian Y. W. Cheuk
    CoRR, vol. abs/1302.1527 (2013)
    Reasoning With Conditional Ceteris Paribus Preference Statem
    Ronen I. Brafman
    Holger H. Hoos
    David Poole
    CoRR, vol. abs/1301.6681 (2013)
    Continuous Value Function Approximation for Sequential Bidding Policies
    Moisés Goldszmidt
    Bikash Sabata
    CoRR, vol. abs/1301.6682 (2013)
    Value-Directed Sampling Methods for POMDPs
    Pascal Poupart
    Luis E. Ortiz
    CoRR, vol. abs/1301.2305 (2013)
    Context-Specific Independence in Bayesian Networks
    Nir Friedman
    Moisés Goldszmidt
    Daphne Koller
    CoRR, vol. abs/1302.3562 (2013)
    Efficient Vote Elicitation under Candidate Uncertainty
    Joel Oren
    Yuval Filmus
    IJCAI (2013)
    Learning Conventions in Multiagent Stochastic Domains using Likelihood Estimates
    CoRR, vol. abs/1302.3561 (2013)
    Multi-Dimensional Single-Peaked Consistency and Its Approximations
    Xin Sui
    Alex Francois-Nienaber
    IJCAI (2013)
    Value-Directed Belief State Approximation for POMDPs
    Pascal Poupart
    CoRR, vol. abs/1301.3887 (2013)
    The Probability of a Possibility: Adding Uncertainty to Default Rules
    CoRR, vol. abs/1303.1509 (2013)
    Elicitation and Approximately Stable Matching with Partial Preferences
    Joanna Drummond
    IJCAI (2013)
    Correlated Action Effects in Decision Theoretic Regression
    CoRR, vol. abs/1302.1522 (2013)
    UCP-Networks: A Directed Graphical Representation of Conditional Utilities
    Fahiem Bacchus
    Ronen I. Brafman
    CoRR, vol. abs/1301.2259 (2013)
    People, sensors, decisions: Customizable and adaptive technologies for assistance in healthcare
    Jesse Hoey
    Pascal Poupart
    Patrick Olivier
    Andrew Monk
    Alex Mihailidis
    TiiS, vol. 2 (2012), pp. 20
    Active Learning for Matching Problems
    Laurent Charlin
    Richard S. Zemel
    CoRR, vol. abs/1206.4647 (2012)
    Optimal social choice functions: a utilitarian view
    Ioannis Caragiannis
    Simi Haber
    Tyler Lu
    Ariel D. Procaccia
    Or Sheffet
    EC (2012), pp. 197-214
    Minimax regret based elicitation of generalized additive utilities
    Darius Braziunas
    CoRR, vol. abs/1206.5255 (2012)
    Sequentially optimal repeated coalition formation under uncertainty
    Georgios Chalkiadakis
    Autonomous Agents and Multi-Agent Systems, vol. 24 (2012), pp. 441-484
    Eliciting forecasts from self-interested experts: scoring rules for decision makers
    AAMAS (2012), pp. 737-744
    Local Utility Elicitation in GAI Models
    Darius Braziunas
    CoRR, vol. abs/1207.1361 (2012)
    Active Collaborative Filtering
    Richard S. Zemel
    Benjamin M. Marlin
    CoRR, vol. abs/1212.2442 (2012)
    Regret-based Reward Elicitation for Markov Decision Processes
    CoRR, vol. abs/1205.2619 (2012)
    Bayesian Vote Manipulation: Optimal Strategies and Impact on Welfare
    Tyler Lu
    Pingzhong Tang
    Ariel D. Procaccia
    UAI (2012), pp. 543-553
    Matching models for preference-sensitive group purchasing
    Tyler Lu
    EC (2012), pp. 723-740
    Cooperative Negotiation in Autonomic Systems using Incremental Utility Elicitation
    Rajarshi Das
    Jeffrey O. Kephart
    Gerald Tesauro
    William E. Walsh
    CoRR, vol. abs/1212.2443 (2012)
    Regret Minimizing Equilibria and Mechanisms for Games with Strict Type Uncertainty
    Nathanael Hyafil
    CoRR, vol. abs/1207.4147 (2012)
    Active Learning for Matching Problems
    Laurent Charlin
    Richard S. Zemel
    ICML (2012)
    Toward Experiential Utility Elicitation for Interface Customization
    Bowen Hui
    CoRR, vol. abs/1206.3258 (2012)
    A Dynamic Rationalization of Distance Rationalizability
    Ariel D. Procaccia
    AAAI (2012)
    Approximate Linear Programming for First-order MDPs
    Scott Sanner
    CoRR, vol. abs/1207.1415 (2012)
    Bayesian Vote Manipulation: Optimal Strategies and Impact on Welfare
    Tyler Lu
    Pingzhong Tang
    Ariel D. Procaccia
    CoRR, vol. abs/1210.4895 (2012)
    A Framework for Optimizing Paper Matching
    Laurent Charlin
    Richard S. Zemel
    CoRR, vol. abs/1202.3706 (2012)
    Partial-Order Planning with Concurrent Interacting Actions
    Ronen I. Brafman
    CoRR, vol. abs/1106.0249 (2011)
    A Bayesian Concept Learning Approach to Crowdsourcing
    Paolo Viappiani
    Sandra Zilles
    Howard J. Hamilton
    Interactive Decision Theory and Game Theory (2011)
    Preference Elicitation and Preference Learning in Social Choice
    CPAIOR (2011), pp. 1
    Recommendation Sets and Choice Queries: There Is No Exploration/Exploitation Tradeoff!
    Paolo Viappiani
    AAAI (2011)
    Budgeted Social Choice: From Consensus to Personalized Decision Making
    Tyler Lu
    IJCAI (2011), pp. 280-286
    Accelerating Reinforcement Learning through Implicit Imitation
    Bob Price
    CoRR, vol. abs/1106.0681 (2011)
    CP-nets: A Tool for Representing and Reasoning withConditional Ceteris Paribus Preference Statements
    Ronen I. Brafman
    Carmel Domshlak
    Holger H. Hoos
    David Poole
    CoRR, vol. abs/1107.0023 (2011)
    Eliciting Forecasts from Self-interested Experts: Scoring Rules for Decision Makers
    CoRR, vol. abs/1106.2489 (2011)
    Vote Elicitation with Probabilistic Preference Models: Empirical Estimation and Cost Tradeoffs
    Tyler Lu
    ADT (2011), pp. 135-149
    Efficiency and Privacy Tradeoffs in Mechanism Design
    Xin Sui
    AAAI (2011)
    Learning Complex Concepts Using Crowdsourcing: A Bayesian Approach
    Paolo Viappiani
    Sandra Zilles
    Howard J. Hamilton
    ADT (2011), pp. 277-291
    A Framework for Optimizing Paper Matching
    Laurent Charlin
    Richard S. Zemel
    UAI (2011), pp. 86-95
    Decision-Theoretic Planning: Structural Assumptions and Computational Leverage
    Thomas L. Dean
    Steve Hanks
    CoRR, vol. abs/1105.5460 (2011)
    Eliciting Additive Reward Functions for Markov Decision Processes
    IJCAI (2011), pp. 2159-2164
    Robust Approximation and Incremental Elicitation in Voting Protocols
    Tyler Lu
    IJCAI (2011), pp. 287-293
    Robust Online Optimization of Reward-Uncertain MDPs
    IJCAI (2011), pp. 2165-2171
    Learning Mallows Models with Pairwise Preferences
    Tyler Lu
    ICML (2011), pp. 145-152
    The unavailable candidate model: a decision-theoretic view of social choice
    Tyler Lu
    EC (2010), pp. 263-274
    Automated Channel Abstraction for Advertising Auctions
    William E. Walsh
    Tuomas Sandholm
    Rob Shields
    George L. Nemhauser
    David C. Parkes
    AAAI (2010)
    Automated handwashing assistance for persons with dementia using video and a partially observable Markov decision process
    Jesse Hoey
    Pascal Poupart
    Axel von Bertoldi
    Tammy Craig
    Alex Mihailidis
    Computer Vision and Image Understanding, vol. 114 (2010), pp. 503-519
    Simultaneous Elicitation of Preference Features and Utility
    Paolo Viappiani
    AAAI (2010)
    Optimal Bayesian Recommendation Sets and Myopically Optimal Choice Query Sets
    Paolo Viappiani
    NIPS (2010), pp. 2352-2360
    Assessing regret-based preference elicitation with the UTPREF recommendation system
    Darius Braziunas
    EC (2010), pp. 219-228
    Robust Policy Computation in Reward-Uncertain MDPs Using Nondominated Policies
    Practical solution techniques for first-order MDPs
    Scott Sanner
    Artif. Intell., vol. 173 (2009), pp. 748-788
    Optimal Set Recommendations Based on Regret
    Paolo Viappiani
    ITWP (2009)
    A probabilistic mental model for estimating disruption
    Bowen Hui
    Grant A. Partridge
    IUI (2009), pp. 287-296
    Preference elicitation with subjective features
    Paolo Viappiani
    RecSys (2009), pp. 341-344
    Regret-based Reward Elicitation for Markov Decision Processes
    UAI (2009), pp. 444-451
    Online feature elicitation in interactive optimization
    Paolo Viappiani
    ICML (2009), pp. 73-80
    Regret-based optimal recommendation sets in conversational recommender systems
    Paolo Viappiani
    RecSys (2009), pp. 101-108
    Intelligent Decision Support in Medicine: Back to Bayes?
    Gitte Lindgaard
    Peter Egan
    Colin N. Jones
    Catherine Pyper
    Monique Frize
    Robin C. Walker
    Bowen Hui
    Sheila Narasimhan
    Janette Folkens
    Bill Winogron
    J. UCS, vol. 14 (2008), pp. 2720-2736
    Sequential decision making in repeated coalition formation under uncertainty
    Georgios Chalkiadakis
    AAMAS (1) (2008), pp. 347-354
    Computing Reserve Prices and Identifying the Value Distribution in Real-world Auctions with Market Disruptions
    William E. Walsh
    David C. Parkes
    Tuomas Sandholm
    AAAI (2008), pp. 1499-1502
    Toward Experiential Utility Elicitation for Interface Customization
    Bowen Hui
    UAI (2008), pp. 298-305
    Elicitation of Factored Utilities
    Darius Braziunas
    AI Magazine, vol. 29 (2008), pp. 79-92
    The need for an interaction cost model in adaptive interfaces
    Bowen Hui
    Sean Gustafson
    Pourang Irani
    AVI (2008), pp. 458-461
    Expressive Banner Ad Auctions and Model-Based Online Optimization for Clearing
    David C. Parkes
    Tuomas Sandholm
    William E. Walsh
    AAAI (2008), pp. 30-37
    Mechanism Design with Partial Revelation
    Nathanael Hyafil
    IJCAI (2007), pp. 1333-1340
    Computing Optimal Subsets
    Maxim Binshtok
    Ronen I. Brafman
    Solomon Eyal Shimony
    Ajay Mani
    AAAI (2007), pp. 1231-1236
    Automated Design of Multistage Mechanisms
    Tuomas Sandholm
    Vincent Conitzer
    IJCAI (2007), pp. 1500-1506
    Partial Revelation Automated Mechanism Design
    Nathanael Hyafil
    AAAI (2007), pp. 72-78
    Minimax regret based elicitation of generalized additive utilities
    Darius Braziunas
    UAI (2007), pp. 25-32
    Coalitional Bargaining with Agent Type Uncertainty
    Georgios Chalkiadakis
    IJCAI (2007), pp. 1227-1232
    Coalition formation under uncertainty: bargaining equilibria and the Bayesian core stability concept
    Georgios Chalkiadakis
    Evangelos Markakis
    AAMAS (2007), pp. 64
    Approximate Solution Techniques for Factored First-Order MDPs
    Scott Sanner
    ICAPS (2007), pp. 288-295
    Practical Linear Value-approximation Techniques for First-order MDPs
    Scott Sanner
    UAI (2006)
    Regret-based Incremental Partial Revelation Mechanisms
    Nathanael Hyafil
    AAAI (2006), pp. 672-678
    Who's asking for help?: a Bayesian approach to intelligent assistance
    Bowen Hui
    IUI (2006), pp. 186-193
    Constraint-based optimization and utility elicitation using the minimax decision criterion
    Relu Patrascu
    Pascal Poupart
    Dale Schuurmans
    Artif. Intell., vol. 170 (2006), pp. 686-713
    Preference Elicitation and Generalized Additive Utility
    Darius Braziunas
    AAAI (2006), pp. 1573-1576
    A Planning System Based on Markov Decision Processes to Guide People With Dementia Through Activities of Daily Living
    Jennifer Boger
    Jesse Hoey
    Pascal Poupart
    Geoff Fernie
    Alex Mihailidis
    IEEE Transactions on Information Technology in Biomedicine, vol. 10 (2006), pp. 323-333
    Approximate Linear Programming for First-order MDPs
    Scott Sanner
    UAI (2005), pp. 509-517
    A Decision-Theoretic Approach to Task Assistance for Persons with Dementia
    Jennifer Boger
    Pascal Poupart
    Jesse Hoey
    Geoff Fernie
    Alex Mihailidis
    IJCAI (2005), pp. 1293-1299
    The Influence of Influence Diagrams on Artificial Intelligence
    Decision Analysis, vol. 2 (2005), pp. 229-231
    Regret-based Utility Elicitation in Constraint-based Decision Problems
    Relu Patrascu
    Pascal Poupart
    Dale Schuurmans
    IJCAI (2005), pp. 929-934
    Local Utility Elicitation in GAI Models
    Darius Braziunas
    UAI (2005), pp. 42-49
    New Approaches to Optimization and Utility Elicitation in Autonomic Computing
    Relu Patrascu
    Rajarshi Das
    Jeffrey O. Kephart
    Gerald Tesauro
    William E. Walsh
    AAAI (2005), pp. 140-145
    Bayesian Reinforcement Learning for Coalition Formation under Uncertainty
    Georgios Chalkiadakis
    AAMAS (2004), pp. 1090-1097
    Regret Minimizing Equilibria and Mechanisms for Games with Strict Type Uncertainty
    Nathanael Hyafil
    UAI (2004), pp. 268-277
    CP-nets: A Tool for Representing and Reasoning with Conditional Ceteris Paribus Preference Statements
    Ronen I. Brafman
    Carmel Domshlak
    Holger H. Hoos
    David Poole
    J. Artif. Intell. Res. (JAIR), vol. 21 (2004), pp. 135-191
    Preference-Based Constrained Optimization with CP-Nets
    Ronen I. Brafman
    Carmel Domshlak
    Holger H. Hoos
    David Poole
    Computational Intelligence, vol. 20 (2004), pp. 137-157
    Eliciting Bid Taker Non-price Preferences in (Combinatorial) Auctions
    Tuomas Sandholm
    Rob Shields
    AAAI (2004), pp. 204-211
    A Study of Limited-Precision, Incremental Elicitation in Auctions
    Alexander Kress
    AAMAS (2004), pp. 1344-1345
    VDCBPI: an Approximate Scalable Algorithm for Large POMDPs
    Pascal Poupart
    NIPS (2004), pp. 1081-1088
    Stochastic Local Search for POMDP Controllers
    Darius Braziunas
    AAAI (2004), pp. 690-696
    Incremental Utility Elicitation with the Minimax Regret Decision Criterion
    Tianhan Wang
    IJCAI (2003), pp. 309-318
    Coordination in multiagent reinforcement learning: a Bayesian approach
    Georgios Chalkiadakis
    AAMAS (2003), pp. 709-716
    Accelerating Reinforcement Learning through Implicit Imitation
    Bob Price
    J. Artif. Intell. Res. (JAIR), vol. 19 (2003), pp. 569-629
    Towards Cooperative Negotiation for Decentralized Resource Allocation in Autonomic Computing Systems
    Rajarshi Das
    Jeffrey O. Kephart
    William E. Walsh
    IJCAI (2003), pp. 1458-1459
    Constraint-Based Optimization with the Minimax Decision Criterion
    Relu Patrascu
    Pascal Poupart
    Dale Schuurmans
    CP (2003), pp. 168-182
    Active Collaborative Filtering
    Richard S. Zemel
    Benjamin M. Marlin
    UAI (2003), pp. 98-106
    Cooperative Negotiation in Autonomic Systems using Incremental Utility Elicitation
    Rajarshi Das
    Jeffrey O. Kephart
    Gerald Tesauro
    William E. Walsh
    UAI (2003), pp. 89-97
    On the Foundations of Expected Expected Utility
    IJCAI (2003), pp. 285-290
    A Bayesian Approach to Imitation in Reinforcement Learning
    Bob Price
    IJCAI (2003), pp. 712-720
    Bounded Finite State Controllers
    Pascal Poupart
    NIPS (2003), pp. 823-830
    An Active Approach to Collaborative Filtering
    Richard S. Zemel
    AISTATS (2003)
    Solving Concisely Expressed Combinatorial Auction Problems
    AAAI/IAAI (2002), pp. 359-366
    Value-Directed Compression of POMDPs
    Pascal Poupart
    NIPS (2002), pp. 1547-1554
    A POMDP Formulation of Preference Elicitation Problems
    AAAI/IAAI (2002), pp. 239-246
    Piecewise Linear Value Function Approximation for Factored MDPs
    Pascal Poupart
    Relu Patrascu
    Dale Schuurmans
    AAAI/IAAI (2002), pp. 292-299
    Greedy Linear Value-Approximation for Factored Markov Decision Processes
    Relu Patrascu
    Pascal Poupart
    Dale Schuurmans
    Carlos Guestrin
    AAAI/IAAI (2002), pp. 285-291
    Symbolic Dynamic Programming for First-Order MDPs
    Raymond Reiter
    Bob Price
    IJCAI (2001), pp. 690-700
    Value-Directed Sampling Methods for POMDPs
    Pascal Poupart
    Luis E. Ortiz
    UAI (2001), pp. 453-461
    Partial-Order Planning with Concurrent Interacting Actions
    Ronen I. Brafman
    J. Artif. Intell. Res. (JAIR), vol. 14 (2001), pp. 105-136
    Vector-space Analysis of Belief-state Approximation for POMDPs
    Pascal Poupart
    UAI (2001), pp. 445-452
    Bidding Languages for Combinatorial Auctions
    Holger H. Hoos
    IJCAI (2001), pp. 1211-1217
    UCP-Networks: A Directed Graphical Representation of Conditional Utilities
    Fahiem Bacchus
    Ronen I. Brafman
    UAI (2001), pp. 56-64
    Imitation and Reinforcement Learning in Agents with Heterogeneous Actions
    Bob Price
    Canadian Conference on AI (2001), pp. 111-120
    Stochastic dynamic programming with factored representations
    Richard Dearden
    Moisés Goldszmidt
    Artif. Intell., vol. 121 (2000), pp. 49-107
    Approximately Optimal Monitoring of Plan Preconditions
    UAI (2000), pp. 54-62
    Value-Directed Belief State Approximation for POMDPs
    Pascal Poupart
    UAI (2000), pp. 497-506
    Decision-Theoretic, High-Level Agent Programming in the Situation Calculus
    Raymond Reiter
    Mikhail Soutchanski
    Sebastian Thrun
    AAAI/IAAI (2000), pp. 355-362
    APRICODD: Approximate Policy Construction Using Decision Diagrams
    Robert St-Aubin
    Jesse Hoey
    NIPS (2000), pp. 1089-1095
    Decision Making under Uncertainty: Operations Research Meets AI (Again)
    AAAI/IAAI (2000), pp. 1145-1150
    Solving Combinatorial Auctions Using Stochastic Local Search
    Holger H. Hoos
    AAAI/IAAI (2000), pp. 22-29
    Implicit Imitation in Multiagent Reinforcement Learning
    Bob Price
    ICML (1999), pp. 325-334
    Reasoning With Conditional Ceteris Paribus Preference Statements
    Ronen I. Brafman
    Holger H. Hoos
    David Poole
    UAI (1999), pp. 71-80
    Resource Allocation Using Sequential Auctions
    Moisés Goldszmidt
    Claire Monteleoni
    Bikash Sabata
    Agent Mediated Electronic Commerce (IJCAI Workshop) (1999), pp. 131-152
    Sequential Auctions for the Allocation of Resources with Complementarities
    Moisés Goldszmidt
    Bikash Sabata
    IJCAI (1999), pp. 527-523
    Decision Theoretic Planning: Structural Assumptions and Computational Leverage
    Thomas Dean
    Steve Hanks
    Journal of Artificial Intelligence Research, vol. 11 (1999), pp. 1-94
    SPUDD: Stochastic Planning using Decision Diagrams
    Jesse Hoey
    Robert St-Aubin
    Alan J. Hu
    UAI (1999), pp. 279-288
    Sequential Optimality and Coordination in Multiagent Systems
    IJCAI (1999), pp. 478-485
    Continuous Value Function Approximation for Sequential Bidding Policies
    Moisés Goldszmidt
    Bikash Sabata
    UAI (1999), pp. 81-90
    Decision-Theoretic Planning: Structural Assumptions and Computational Leverage
    Thomas L. Dean
    Steve Hanks
    J. Artif. Intell. Res. (JAIR), vol. 11 (1999), pp. 1-94
    Knowledge Representation for Stochastic Decision Process
    Artificial Intelligence Today (1999), pp. 111-152
    Multiagent Systems: Challenges and Opportunities for Decision-Theoretic Planning
    AI Magazine, vol. 20 (1999), pp. 35-43
    The Dynamics of Reinforcement Learning in Cooperative Multiagent Systems
    Caroline Claus
    AAAI/IAAI (1998), pp. 746-752
    Structured Reachability Analysis for Markov Decision Processes
    Ronen I. Brafman
    Christopher W. Geib
    UAI (1998), pp. 24-32
    A Unified Model of Qualitative Belief Change: A Dynamical Systems Perspective
    Artif. Intell., vol. 98 (1998), pp. 281-316
    Hierarchical Solution of Markov Decision Processes using Macro-actions
    Milos Hauskrecht
    Nicolas Meuleau
    Leslie Pack Kaelbling
    Thomas Dean
    UAI (1998), pp. 220-229
    Solving Very Large Weakly Coupled Markov Decision Processes
    Nicolas Meuleau
    Milos Hauskrecht
    Kee-Eung Kim
    Leonid Peshkin
    Leslie Pack Kaelbling
    Thomas Dean
    AAAI/IAAI (1998), pp. 165-172
    Hierarchical solution of Markov decision processes using macro-actions
    Milos Hauskrecht
    Nicolas Meuleau
    Leslie Pack Kaelbling
    Thomas Dean
    Proceedings of the 14th Conference on Uncertainty in Artificial Intelligence (UAI-98), Morgan Kaufmann Publishers, San Francisco, California (1998), pp. 220-229
    Solving very large weakly coupled Markov decision processes
    Nicolas Meuleau
    Milos Hauskrecht
    Leslie Kaelbling
    Kee-Eung Kim
    Leonid Peshkin
    Thomas Dean
    Proceedings AAAI-98, MIT Press, Cambridge, Massachusetts (1998), pp. 165-172
    Belief Revision with Unreliable Observations
    Nir Friedman
    Joseph Y. Halpern
    AAAI/IAAI (1998), pp. 127-134
    Structured Arc Reversal and Simulation of Dynamic Probabilistic Networks
    Adrian Y. W. Cheuk
    UAI (1997), pp. 72-79
    Planning with Concurrent Interacting Actions
    Ronen I. Brafman
    AAAI/IAAI (1997), pp. 720-726
    Prioritized Goal Decomposition of Markov Decision Processes: Toward a Synthesis of Classical and Decision Theoretic Planning
    Ronen I. Brafman
    Christopher W. Geib
    IJCAI (1997), pp. 1156-1162
    Correlated Action Effects in Decision Theoretic Regression
    UAI (1997), pp. 30-37
    Structured Solution Methods for Non-Markovian Decision Processes
    Fahiem Bacchus
    Adam J. Grove
    AAAI/IAAI (1997), pp. 112-117
    Economic Principles of Multi-Agent Systems
    Yoav Shoham
    Michael P. Wellman
    Artif. Intell., vol. 94 (1997), pp. 1-6
    Abstraction and Approximate Decision-Theoretic Planning
    Richard Dearden
    Artif. Intell., vol. 89 (1997), pp. 219-283
    Context-Specific Independence in Bayesian Networks
    Nir Friedman
    Moisés Goldszmidt
    Daphne Koller
    UAI (1996), pp. 115-123
    Rewarding Behaviors
    Fahiem Bacchus
    Adam J. Grove
    AAAI/IAAI, Vol. 2 (1996), pp. 1160-1167
    Approximate Value Trees in Structured Dynamic Programming
    Richard Dearden
    ICML (1996), pp. 54-62
    Learning Conventions in Multiagent Stochastic Domains using Likelihood Estimates
    UAI (1996), pp. 106-114
    Planning, Learning and Coordination in Multiagent Decision Processes
    TARK (1996), pp. 195-210
    Abduction to Plausible Causes: An Event-Based model of Belief Update
    Artif. Intell., vol. 83 (1996), pp. 143-166
    Iterated revision and minimal change of conditional beliefs
    J. Philosophical Logic, vol. 25 (1996), pp. 263-305
    Computing Optimal Policies for Partially Observable Decision Processes Using Compact Representations
    David Poole
    AAAI/IAAI, Vol. 2 (1996), pp. 1168-1175
    The Frame Problem and Bayesian Network Action Representation
    Moisés Goldszmidt
    Canadian Conference on AI (1996), pp. 69-83
    Planning Under Uncertainty: Structural Assumptions and Computational Leverage
    Thomas Dean
    Steve Hanks
    Proceedings of the 3rd European Workshop on Planning (1995)
    On the Revision of Probabilistic Belief States
    Notre Dame Journal of Formal Logic, vol. 36 (1995), pp. 158-183
    Process-Oriented Planning and Average-Reward Optimality
    Martin L. Puterman
    IJCAI (1995), pp. 1096-1103
    Abduction as Belief Revision
    Verónica Becher
    Artif. Intell., vol. 77 (1995), pp. 43-94
    Exploiting Structure in Policy Construction
    Richard Dearden
    Moisés Goldszmidt
    IJCAI (1995), pp. 1104-1113
    Generalized Update: Belief Change in Dynamic Settings
    IJCAI (1995), pp. 1550-1556
    Conditional Logics of Normality: A Modal Approach
    Artif. Intell., vol. 68 (1994), pp. 87-154
    Integrating Planning and Execution in Stochastic Domains
    Richard Dearden
    UAI (1994), pp. 162-169
    Believing on the Basis of Qualitative Rules: Commentary on Kyburg
    Computational Intelligence, vol. 10 (1994), pp. 26-32
    Using Abstractions for Decision-Theoretic Planning with Time Constraints
    Richard Dearden
    AAAI (1994), pp. 1016-1022
    Toward a Logic for Qualitative Decision Theory
    KR (1994), pp. 75-86
    Unifying Default Reasoning and Belief Revision in a Modal Framework
    Artif. Intell., vol. 68 (1994), pp. 33-85
    Modal logics for qualitative possibility theory
    Int. J. Approx. Reasoning, vol. 10 (1994), pp. 173-201
    Revision Sequences and Nested Conditionals
    IJCAI (1993), pp. 519-525
    The Probability of a Possibility: Adding Uncertainty to Default Rules
    UAI (1993), pp. 461-468
    On the Semantics of Stable Inheritance Reasoning
    Computational Intelligence, vol. 9 (1993), pp. 73-110
    Revision by Conditional Beliefs
    Moisés Goldszmidt
    AAAI (1993), pp. 649-654
    Abduction As Belief Revision: A Model of Preferred Explanations
    Verónica Becher
    AAAI (1993), pp. 642-648
    Normative, Subjunctive and Autoepistemic Defaults
    ECAI Workshop on Knowledge Representation and Reasoning (1992), pp. 74-97
    Epistemic Entrenchment in autoepistemic logic
    Fundam. Inform., vol. 17 (1992), pp. 5-29
    A Logic for Revision and Subjunctive Queries
    AAAI (1992), pp. 609-615
    Modal Logics for Qualitative Possibility and Beliefs
    UAI (1992), pp. 17-24
    Normative, Subjunctive, and Autoepistemic Defaults: Adopting the Ramsey Test
    KR (1992), pp. 685-696
    Inaccessible Worlds and Irrelevance: Preliminary Report
    IJCAI (1991), pp. 413-418
    Conditional Logics of Normality as Modal Systems
    AAAI (1990), pp. 594-599
    A Semantical Approach to Stable Inheritance Reasoning
    IJCAI (1989), pp. 1134-1139