Design Considerations in Offline Preference-Based RL
Abstract
We conduct a theoretical analysis of techniques, such as Direct Preference Optimization (DPO), for preference-based RL from offline datasets annotated with pairwise preferences. We identify key properties of the learning objective that influence the quality of the learned policy, such as the coverage of the offline dataset, the presence or absence of a normalizing baseline, and the choice of loss function. Informed by the theory, we conduct an empirical analysis of key variants to corroborate our theoretical findings.
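For concreteness, one canonical instance of the class of objectives analyzed here is the DPO loss of Rafailov et al. (2023); in its standard notation (not necessarily this paper's own), with policy $\pi_\theta$, reference policy $\pi_{\mathrm{ref}}$ serving as the normalizing baseline, temperature $\beta$, and preference pairs $(x, y_w, y_l) \sim \mathcal{D}$ where $y_w$ is preferred over $y_l$:
\[
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
= -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
\left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
- \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right],
\]
where $\sigma$ is the logistic function. The design choices named in the abstract correspond to ingredients of this objective: the distribution $\mathcal{D}$ determines dataset coverage, the $\pi_{\mathrm{ref}}$ terms form the baseline that may be present or absent, and the logistic link is one choice of loss among several.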