Most existing recommender systems focus primarily on users (content consumers), matching them with the most relevant content in order to maximize user satisfaction on the platform. However, content providers play an increasingly critical role through content creation and largely determine the content pool available for recommendation, which raises a natural question: can we design recommenders that take into account the utilities of both users and content providers? By doing so, we hope to sustain more content providers and a more diverse content pool for long-term user satisfaction. Understanding the full impact of recommendations on both user and content provider groups is challenging. This paper investigates one approach to building a content provider-aware recommender and evaluates its impact in a simulated setup.
To characterize the user-recommender-provider interdependence, we complement user modeling by formalizing provider dynamics as a parallel Markov Decision Process with partially observable states whose transitions are driven by recommender actions and user feedback. We then build a REINFORCE recommender agent, coined EcoAgent, to optimize a joint objective of user utility and the counterfactual utility lift of the content provider associated with the chosen content, which we show to be equivalent to maximizing overall user utility and the utilities of all content providers on the platform. To evaluate our approach, we also introduce a simulation environment capturing the key interactions among users, providers, and the recommender. We offer a number of simulated experiments that shed light on both the benefits and the limitations of our approach. These results help to clarify how and when a content provider-aware recommender agent is of benefit in building multi-stakeholder recommender systems.
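As a rough illustration of the joint objective described above, the sketch below combines a user's utility with the counterfactual utility lift of the provider whose content was recommended. The additive form, the `provider_weight` parameter, and all function and argument names are assumptions made for illustration; the abstract does not specify how the two terms are combined.

```python
def provider_utility_lift(utility_if_recommended: float,
                          utility_if_not_recommended: float) -> float:
    """Counterfactual lift: provider utility with the recommendation
    minus provider utility had the content not been recommended."""
    return utility_if_recommended - utility_if_not_recommended


def joint_reward(user_utility: float,
                 utility_if_recommended: float,
                 utility_if_not_recommended: float,
                 provider_weight: float = 0.5) -> float:
    """Hypothetical per-recommendation reward for a provider-aware agent.

    provider_weight is an assumed trade-off knob: 0.0 recovers a purely
    user-focused reward, larger values favor provider utility lift.
    """
    lift = provider_utility_lift(utility_if_recommended,
                                 utility_if_not_recommended)
    return user_utility + provider_weight * lift


if __name__ == "__main__":
    # User gets utility 1.0; the provider's utility rises from 2.0 to 3.0.
    print(joint_reward(1.0, 3.0, 2.0, provider_weight=0.5))
```

Setting `provider_weight` to zero reduces this to a conventional user-only reward, which makes the trade-off between the two stakeholder groups explicit.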
Proceedings of the Thirty-seventh International Conference on Machine Learning (ICML-20), Vienna, Austria (2020)
Delusional bias is a fundamental source of error in approximate Q-learning. To date, the only techniques that explicitly address delusion require comprehensive search using tabular value estimates. In this paper, we develop efficient methods to mitigate delusional bias by training Q-approximators with labels that are "consistent" with the underlying greedy policy class. We introduce a simple penalization scheme that encourages Q-labels used across training batches to remain (jointly) consistent with the expressible policy class. We also propose a search framework that allows multiple Q-approximators to be generated and tracked, thus mitigating the effect of premature (implicit) policy commitments. Experimental results demonstrate that these methods can improve the performance of Q-learning in a variety of Atari games, sometimes dramatically.
Proceedings of the Twenty-ninth International Joint Conference on Artificial Intelligence (IJCAI-20), Yokohama, Japan (2020), pp. 2824-2830
In batch reinforcement learning (RL), one often constrains a learned policy to be close to the behavior (data-generating) policy, e.g., by constraining the learned action distribution to differ from the behavior policy by some maximum degree that is the same at each state. This can cause batch RL to be overly conservative, unable to exploit large policy changes at frequently visited, high-confidence states without risking poor performance at sparsely visited states. To remedy this, we propose residual policies, where the allowable deviation of the learned policy is state-action-dependent. We derive a new RL method, BRPO, which learns both the policy and the allowable deviation that jointly maximize a lower bound on policy performance. We show that BRPO achieves state-of-the-art performance in a number of tasks.
Proceedings of the Twenty-eighth International Joint Conference on Artificial Intelligence (IJCAI-19), Macau, China (2019), pp. 3165-3172
Latent-state environments with long horizons, such as those faced by recommender systems, pose significant challenges for reinforcement learning (RL). In this work, we identify and analyze several key hurdles for RL in such environments, including belief state error and small action advantage. We develop a general principle called advantage amplification that can overcome these hurdles through the use of temporal abstraction. We propose several aggregation methods and prove they induce amplification in certain settings. We also bound the loss in optimality incurred by our methods in environments where latent state evolves slowly and demonstrate their performance empirically in a stylized user-modeling task.
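To give a feel for why temporal abstraction can amplify a small action advantage, the toy calculation below holds an action fixed for `k` steps and accumulates the discounted reward gap between two actions. It assumes a constant per-step gap and a fixed discount factor, and the function name and parameters are hypothetical; it is not the aggregation method analyzed in the paper, only a sketch of the underlying intuition.

```python
def held_action_gap(per_step_gap: float, gamma: float, k: int) -> float:
    """Discounted reward gap between two actions when each is held for k steps.

    Assumes (for illustration only) that the better action yields a constant
    extra reward of per_step_gap at every step while it is held.
    """
    return sum(per_step_gap * gamma ** t for t in range(k))


if __name__ == "__main__":
    # A tiny one-step gap grows as the action is held for longer.
    for k in (1, 5, 10, 50):
        print(k, held_action_gap(0.01, 0.99, k))
```

Under these assumptions the gap grows roughly linearly in `k` while `k` is small relative to the effective horizon `1 / (1 - gamma)`, which can make the better action easier to distinguish from noise in value estimates.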