DreamSync: Aligning Text-to-Image Generation with Image Understanding Models
Abstract
Text-to-Image (T2I) models still struggle to produce images that are both aesthetically pleasing and faithful to the user's input text prompt. Recent frameworks for evaluating the faithfulness of T2I models, such as TIFA, have shown that large vision-language models (VLMs) can reliably analyze generated images and measure their alignment with the text prompts. Building on this insight, we introduce DreamSync, a model-agnostic training algorithm that uses VLM feedback to improve T2I models. The main idea behind DreamSync is to bootstrap T2I models with their own generations. First, we use the T2I model to generate several candidate images for each prompt. Then, we use two VLMs as data selectors: one is a Visual Question Answering (VQA) model that measures the alignment of generated images with user prompts, and the other measures image aesthetic quality. After selecting the top candidate images, we use LoRA to iteratively fine-tune the T2I model. Despite its simplicity, DreamSync improves both the semantic alignment and aesthetic appeal of two diffusion-based T2I models, as evidenced by multiple benchmarks (+1.77% on TIFA, +2.8% on DSG1K, +3.76% on VILA aesthetic) and human evaluations. DreamSync does not need any additional human annotation, model architecture changes, or reinforcement learning.
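The candidate generation and selection steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the callables `generate`, `faithfulness_score`, and `aesthetic_score`, the threshold defaults, and the tie-breaking rule (highest faithfulness first, then aesthetics) are all assumptions made for illustration. The abstract only states that top candidates are selected using the two VLMs.

```python
from typing import Callable, List, Optional, Tuple, TypeVar

Image = TypeVar("Image")  # stand-in for whatever image type the T2I model returns


def select_best_candidate(
    prompt: str,
    candidates: List[Image],
    faithfulness_score: Callable[[str, Image], float],  # hypothetical VQA-based scorer
    aesthetic_score: Callable[[Image], float],          # hypothetical aesthetic scorer
    min_faithfulness: float = 0.9,  # placeholder threshold (assumption)
    min_aesthetic: float = 0.6,     # placeholder threshold (assumption)
) -> Optional[Image]:
    """Return the best candidate that passes both filters, or None if none pass."""
    scored = [
        (faithfulness_score(prompt, img), aesthetic_score(img), img)
        for img in candidates
    ]
    passing = [
        s for s in scored
        if s[0] >= min_faithfulness and s[1] >= min_aesthetic
    ]
    if not passing:
        return None
    # Assumed ranking: faithfulness first, aesthetics as tie-breaker.
    return max(passing, key=lambda s: (s[0], s[1]))[2]


def build_finetuning_set(
    prompts: List[str],
    generate: Callable[[str], Image],  # hypothetical wrapper around the T2I model
    faithfulness_score: Callable[[str, Image], float],
    aesthetic_score: Callable[[Image], float],
    num_candidates: int = 4,
    min_faithfulness: float = 0.9,
    min_aesthetic: float = 0.6,
) -> List[Tuple[str, Image]]:
    """Collect (prompt, image) pairs from the model's own generations."""
    pairs: List[Tuple[str, Image]] = []
    for prompt in prompts:
        # Sample several candidates per prompt from the current model.
        candidates = [generate(prompt) for _ in range(num_candidates)]
        best = select_best_candidate(
            prompt, candidates, faithfulness_score, aesthetic_score,
            min_faithfulness, min_aesthetic,
        )
        # Prompts with no passing candidate contribute nothing this round.
        if best is not None:
            pairs.append((prompt, best))
    return pairs
```

In an iterative setup, the pairs returned by `build_finetuning_set` would serve as the training data for one round of LoRA fine-tuning, after which the updated model would be passed back in as `generate` for the next round. The fine-tuning step itself is omitted here because it depends on the specific T2I model and training library.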