Fine-tuning Text-to-Image Diffusion Models via Reinforcement Learning from Human Feedback
Abstract
Despite significant progress in text-to-image synthesis, current models often produce images that do not align well with text prompts. To address this challenge, recent works have collected large datasets of human feedback and trained reward functions that align with human evaluations. However, optimizing text-to-image models to maximize such a reward function remains a challenging problem. In this work, we investigate reinforcement learning (RL) for fine-tuning text-to-image models. Specifically, we formulate the fine-tuning task as an RL problem tailored to diffusion models. We then update the pre-trained text-to-image diffusion models using a policy gradient algorithm to maximize the scores of a reward model trained on human feedback. We study several design choices, such as KL regularization, value learning, and the balancing of regularization coefficients, and find that careful treatment of these choices is crucial for effective RL fine-tuning. Our experiments demonstrate that RL fine-tuning improves pre-trained models more effectively than supervised fine-tuning, in terms of both image-text alignment and image fidelity.
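To make the described setup concrete, below is a minimal sketch of a KL-regularized policy-gradient (REINFORCE-style) update for a diffusion-style denoising policy. It is not the paper's implementation: the toy denoiser, the Gaussian per-step policy with fixed variance `sigma`, the KL coefficient `beta`, and the placeholder `reward_fn` (standing in for a reward model trained on human feedback) are all illustrative assumptions.

```python
# Sketch: RL fine-tuning of a denoising "policy" with a reward-weighted
# log-likelihood objective and a KL penalty toward the frozen pretrained model.
import copy
import torch
import torch.nn as nn


class ToyDenoiser(nn.Module):
    """Tiny stand-in for a text-conditioned diffusion denoiser (predicts the step mean)."""
    def __init__(self, dim=8):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 32), nn.ReLU(), nn.Linear(32, dim))

    def forward(self, x_t, t):
        t_feat = t.float().unsqueeze(-1) / 100.0  # crude timestep embedding
        return self.net(torch.cat([x_t, t_feat], dim=-1))


def rl_finetune_step(denoiser, ref_denoiser, reward_fn, optimizer,
                     batch=4, dim=8, steps=10, sigma=0.5, beta=0.01):
    """One REINFORCE update over a sampled denoising trajectory."""
    # Sample a trajectory with the current policy (no gradients needed here).
    x = torch.randn(batch, dim)
    traj = []  # (x_t, t, x_{t-1}) triples
    with torch.no_grad():
        for t in reversed(range(1, steps + 1)):
            t_batch = torch.full((batch,), t)
            mean = denoiser(x, t_batch)
            x_prev = mean + sigma * torch.randn_like(mean)
            traj.append((x, t_batch, x_prev))
            x = x_prev
    reward = reward_fn(x)  # (batch,) score from the learned reward model (toy here)

    # Policy-gradient loss: reward-weighted log-likelihood of each taken step,
    # plus a per-step KL penalty toward the frozen pretrained denoiser.
    loss = 0.0
    for x_t, t_batch, x_prev in traj:
        mean = denoiser(x_t, t_batch)
        ref_mean = ref_denoiser(x_t, t_batch).detach()
        # Gaussian log-prob of the sampled step (up to an additive constant).
        log_prob = -((x_prev - mean) ** 2).sum(-1) / (2 * sigma ** 2)
        # KL between two Gaussians with shared, fixed variance.
        kl = ((mean - ref_mean) ** 2).sum(-1) / (2 * sigma ** 2)
        loss = loss + (-(reward.detach() * log_prob) + beta * kl).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return reward.mean().item()


if __name__ == "__main__":
    denoiser = ToyDenoiser()
    ref_denoiser = copy.deepcopy(denoiser)  # frozen copy of the pretrained model
    for p in ref_denoiser.parameters():
        p.requires_grad_(False)
    reward_fn = lambda x0: -x0.norm(dim=-1)  # toy reward in place of a human-feedback model
    opt = torch.optim.Adam(denoiser.parameters(), lr=1e-3)
    for _ in range(5):
        print("mean reward:", rl_finetune_step(denoiser, ref_denoiser, reward_fn, opt))
```

In this sketch the reward enters only as a weight on the step log-likelihoods, and the KL term plays the regularization role highlighted in the abstract; a learned value function could additionally be subtracted from the reward as a baseline to reduce variance, corresponding to the value-learning design choice the paper studies.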