RMBoost: Reward Model Training With Preference-Conditional Multi-Aspect Synthetic Data Generation

Yennie Jun
Carl Yang
Michael Bendersky
Ran Xu
2025

Abstract

Reward models (RMs) are crucial for aligning large language models (LLMs) with human preferences. They are trained using preference datasets where each example consists of one input prompt, two responses, and a preference label. As curating a high-quality human-labeled preference dataset is both time-consuming and expensive, people often rely on existing powerful LLMs for preference label generation. This can potentially introduce noise and impede RM training. In this work, we present RMBoost, a novel synthetic preference data generation paradigm to boost reward model quality. The core idea of RMBoost is to first select a preference label and then directly generate the second, more (or less) preferred response conditioned on this preference label. Compared to traditional approaches where we first generate two responses and then obtain the preference label, RMBoost has two main advantages. First, RMBoost reduces labeling noise since preference pairs are constructed intentionally. Second, RMBoost allows for the creation of more diverse responses by incorporating various quality aspects (e.g., helpfulness, relevance, completeness) into the prompts. We conduct extensive experiments on three diverse datasets and demonstrate that RMBoost outperforms other synthetic preference data generation techniques and significantly boosts the performance of five distinct reward models.
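As a rough illustration of the paradigm described in the abstract, the sketch below constructs one synthetic preference example by fixing the preference label first and conditioning the second generation on it, along with a few quality aspects. The `call_llm` helper, the aspect list, and the prompt wording are all assumptions for illustration, not the paper's actual prompts or implementation.

```python
import random

# Hypothetical LLM call; any chat-completion client could stand in here.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in an LLM client of your choice")

# Example quality aspects mentioned in the abstract.
QUALITY_ASPECTS = ["helpfulness", "relevance", "completeness"]

def make_preference_example(user_prompt: str) -> dict:
    """Minimal sketch of preference-conditional generation (assumed reading of RMBoost).

    1. Generate a first response as usual.
    2. Choose the preference label up front.
    3. Generate the second response conditioned on that label and on
       sampled quality aspects, so the pair is constructed intentionally.
    """
    response_1 = call_llm(user_prompt)

    # Decide in advance whether the second response should be preferred.
    second_is_preferred = random.random() < 0.5
    direction = "better" if second_is_preferred else "worse"
    aspects = ", ".join(random.sample(QUALITY_ASPECTS, k=2))

    conditional_prompt = (
        f"User prompt:\n{user_prompt}\n\n"
        f"Existing response:\n{response_1}\n\n"
        f"Write a new response that is clearly {direction} than the existing one "
        f"with respect to {aspects}. Return only the new response."
    )
    response_2 = call_llm(conditional_prompt)

    # The preference label is known by construction, not inferred afterwards.
    return {
        "prompt": user_prompt,
        "response_1": response_1,
        "response_2": response_2,
        "preferred": "response_2" if second_is_preferred else "response_1",
    }
```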