Eliciting User Preferences for Personalized Multi-Objective Reinforcement Learning through Comparative Feedback
Abstract
In classic reinforcement learning (RL) problems, policies are evaluated with respect to a single reward function, and all optimal policies obtain the same expected return. However, in real-world dynamic environments where different users have different preferences, a policy that is optimal for one user might be sub-optimal for another.
In this work, we propose a multi-objective reinforcement learning framework that accommodates different user preferences over objectives, where preferences are learned via policy comparisons.
Our setup consists of a Markov Decision Process with a multi-objective reward function, in which each user corresponds to an (unknown) personal preference vector, and their reward at each state-action pair is the inner product of this preference vector with the multi-objective reward at that state-action pair. Our goal is to efficiently compute a near-optimal policy for a given user. We consider two user feedback models.
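As a minimal formal sketch of this reward model (the symbols $w_u$, $\mathbf{r}$, and $d$ are illustrative notation chosen here, not fixed by the abstract): for a user $u$ with preference vector $w_u \in \mathbb{R}^d$ and a $d$-dimensional reward function $\mathbf{r}(s,a) \in \mathbb{R}^d$, the user's scalar reward at a state-action pair $(s,a)$ is
\[
r_u(s,a) \;=\; \langle w_u,\, \mathbf{r}(s,a) \rangle \;=\; \sum_{i=1}^{d} w_{u,i}\, r_i(s,a),
\]
so the value of a policy for that user is the usual expected return under this scalarized reward.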
We first address the case where a user is presented with two policies and indicates which one they prefer. We then turn to a second feedback model, in which the user is instead presented with two small weighted sets of representative trajectories and selects the preferred set.
In both cases, we present an algorithm that finds a nearly optimal policy for the user using a small number of comparison queries.