Multi-turn Reinforcement Learning with Preference Human Feedback
Abstract
In this paper, we study the multi-turn preference-based RL problem. We start by extending the regularized self-play Nash-MD formulation of preference-based RL to the general multi-turn case and show that it converges to a Nash equilibrium in the online setting, where the transition and preference models are known. We empirically test our algorithm on two environments: one where an explicit reward is available, and another in which only preference data is available, without assuming any reward. Our experiments show that, even when using the weaker preference signal, our algorithm recovers the same performance as a direct reward-based RL algorithm when a reward signal is available. When only direct preference is available, our algorithm improves upon both the supervised and reward-based RLHF baselines.
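As a schematic point of reference for the Nash-MD formulation mentioned above, the following is a minimal sketch of the single-turn mirror-descent step from prior work on Nash learning from human feedback, which the paper extends to the multi-turn setting; the step size $\eta$, regularization coefficient $\tau$, reference policy $\mu$, and preference model $\mathcal{P}$ are notation assumed here for illustration rather than taken from this section.
\[
\pi_t^{\mu}(y \mid x) \;\propto\; \pi_t(y \mid x)^{\,1-\eta\tau}\,\mu(y \mid x)^{\,\eta\tau},
\qquad
\pi_{t+1}(y \mid x) \;\propto\; \pi_t^{\mu}(y \mid x)\,\exp\!\bigl(\eta\,\mathcal{P}(y \succ \pi_t^{\mu} \mid x)\bigr),
\]
where $\mathcal{P}(y \succ \pi' \mid x) = \mathbb{E}_{y' \sim \pi'(\cdot \mid x)}\bigl[\mathcal{P}(y \succ y' \mid x)\bigr]$ denotes the probability that response $y$ is preferred over a response drawn from $\pi'$. Iterating this regularized self-play update is what drives convergence to a Nash equilibrium of the preference game; in the multi-turn case, the single response $y$ is, roughly speaking, replaced by a multi-turn interaction with the environment.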