Preference Adaptive and Sequential Text-to-Image Generation
Abstract
We consider the problem of sequential text-to-image generation, formulating a personalized interactive framework in which an agent iteratively improves a user's prompt through a series of prompt expansions. We cast this process as a sequential decision-making task. Using human raters, we create a dataset of sequential preferences for this problem. We then leverage this sequential data, together with large-scale open-source non-sequential datasets, to construct user-preference and user-choice models; in particular, we employ an EM strategy to learn a personalized sequential user model. We then combine a multi-modal large language model (MM-LLM) with a value-based reinforcement learning (RL) agent to suggest a personalized and diverse slate of prompt expansions to the user. Our Personalized And Sequential Text-to-image Agent (PASTA) endows diffusion models with personalized multi-turn capabilities, fostering collaborative co-creation and addressing uncertainty or under-specification in user intent. We evaluate our agent with human raters, showing significant improvement over baseline methods. We also release our sequential rater dataset, along with additional simulated data of user-agent interactions, to advance future research in personalized multi-turn text-to-image generation.