
A collaborative approach to image generation
October 2, 2025
Guy Tennenholtz, Senior Research Scientist, and Craig Boutilier, Principal Scientist, Google Research
We introduce PASTA, a reinforcement learning agent that refines text-to-image output over multiple turns of interaction with a user by learning their unique preferences. This process is made possible by a novel user simulation technique.
You have a perfect image in your mind. You enter a prompt, hit generate, and the result is close to what you were thinking, but not quite right. You try refining the prompt, adding more detail, but you can't seem to bridge the gap between your idea and the final image.
This is a common experience. While text-to-image (T2I) models are incredibly powerful, they often struggle to capture the nuance and specificity of an individual's unique creative intent given just a single prompt. What if we could turn image generation into a collaborative conversation?
In this post, we describe our research on the “Preference Adaptive and Sequential Text-to-image Agent” (PASTA), a reinforcement learning (RL) agent that collaborates with users to progressively refine T2I results, eliminating the need for trial-and-error prompt refinement to reach a desirable image. Through human evaluations, we created a novel dataset of sequential preferences, which we then used to compare PASTA with a state-of-the-art baseline model. The results demonstrated that PASTA, trained with our mix of real and simulated data, consistently produced images that users rated as more satisfying. We’ve also released our foundational dataset, a collection of over 7,000 human rater interactions with PASTA.
How PASTA works
To effectively train an AI agent to adapt to a user's individual preferences, a large, diverse set of interaction data is needed. However, gathering this data from real users is challenging due to several factors, including user privacy. To address this, we trained PASTA using a two-stage strategy that combines real human feedback with large-scale user simulation.
First, we collected a high-quality foundational dataset comprising over 7,000 raters' sequential interactions. These interactions included prompt expansions generated by a Gemini Flash large multimodal model and corresponding images generated by a Stable Diffusion XL (SDXL) T2I model. This initial seed of authentic preference data was then used to train a user simulator designed to generate additional data that replicate real human choices and preferences.
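To make this pipeline concrete, below is a minimal sketch of one data-collection step, assuming the publicly available google-generativeai and diffusers libraries; the model IDs, the prompt template, and the expand_prompt helper are illustrative choices of ours, not the exact production setup.

```python
# A hypothetical sketch: expand a user prompt with a Gemini Flash model,
# then render each expansion with SDXL via diffusers. Model IDs and the
# prompt template are assumptions, not the authors' exact configuration.
import torch
import google.generativeai as genai
from diffusers import StableDiffusionXLPipeline

genai.configure(api_key="YOUR_API_KEY")  # assumes a valid Gemini API key
llm = genai.GenerativeModel("gemini-1.5-flash")

def expand_prompt(prompt: str, n: int = 8) -> list[str]:
    """Ask the multimodal model for n diverse elaborations of the prompt."""
    response = llm.generate_content(
        f"Write {n} diverse, detailed elaborations of this image prompt, "
        f"one per line, without numbering:\n{prompt}"
    )
    lines = [line.strip() for line in response.text.splitlines() if line.strip()]
    return lines[:n]

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

expansions = expand_prompt("A white cat")
images = [pipe(p).images[0] for p in expansions]  # one candidate image per expansion
```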
At the heart of our method is a user model, comprising two key components: 1) a utility model that predicts the degree to which a user will like any set of images, and 2) a choice model that predicts which set of images they will select when presented with several sets. We constructed the user model using pre-trained CLIP encoders and added user-specific components. We trained the model using an expectation-maximization algorithm that allows us to simultaneously learn the specifics of user preferences while also discovering latent “user types,” that is, clusters of users with similar tastes (e.g., tendencies to prefer images with animals, scenic views, or abstract art).
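As a rough illustration of this structure, here is a minimal PyTorch sketch of a two-part user model, assuming precomputed CLIP image embeddings and a fixed number of latent user types; the layer sizes, names, and the softmax choice rule are our simplifications, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class UserModel(nn.Module):
    """Utility model + choice model over CLIP image embeddings (sketch)."""

    def __init__(self, clip_dim: int = 768, n_user_types: int = 8):
        super().__init__()
        # One learned embedding per latent "user type" (cluster of tastes).
        self.type_embed = nn.Embedding(n_user_types, clip_dim)
        self.utility_head = nn.Sequential(
            nn.Linear(2 * clip_dim, 256), nn.ReLU(), nn.Linear(256, 1)
        )

    def utility(self, image_emb: torch.Tensor, user_type: torch.Tensor) -> torch.Tensor:
        """Predicted liking of each image; image_emb is (batch, slate, clip_dim)."""
        u = self.type_embed(user_type)                        # (batch, clip_dim)
        u = u.unsqueeze(1).expand(-1, image_emb.size(1), -1)  # broadcast over slate
        return self.utility_head(torch.cat([image_emb, u], dim=-1)).squeeze(-1)

    def choice_log_probs(self, image_emb: torch.Tensor, user_type: torch.Tensor) -> torch.Tensor:
        """Choice model: softmax over per-item utilities gives selection probabilities."""
        return torch.log_softmax(self.utility(image_emb, user_type), dim=-1)
```

In an EM loop over such a model, the E-step infers a posterior over user types from each rater's observed choices, and the M-step updates the shared parameters, which is how the latent clusters emerge.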
The trained user simulator can provide feedback and express preferences on generated images, and make selections from sets of proposed images. This allows us to generate over 30,000 simulated interaction trajectories. Our approach does more than just create more data; it gives us a controlled environment in which to explore a vast range of user behaviors, so we can train the PASTA agent to effectively collaborate with users.

Our user simulator learns to identify distinct user types from preference data. Each row shows the top-rated images for an emergent user profile, revealing clear preferences for categories like "Animals" or "Food."
With this robust, data-driven foundation, the PASTA agent is trained to effectively engage with arbitrary users to generate images that match their preferences. The agent itself is a value-based reinforcement learning model that learns to select the best "slate" of prompt expansions (i.e., elaborations of the current prompt used to generate subsequent images) to show the user at each turn. Its goal is to maximize the user's cumulative satisfaction over the entire interaction.
Once PASTA is trained and deployed, a user initiates the engagement with an initial prompt. PASTA first uses a candidate generator (a large multimodal model) to create a diverse set of potential prompt expansions. Then, a candidate selector (our trained RL agent) selects the optimal slate of four such expansions, which are used to generate corresponding images to present to the user. The user selects the image that is closest to their vision, which provides feedback that guides PASTA's next set of suggestions. This collaborative back-and-forth allows the model to learn the user's preferences on the fly, steering the creative process toward their ideal goal with each step.
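Putting those pieces together, the turn-level loop might look like the sketch below. Here generate_expansions, q_value, render, and ask_user are stand-ins of ours for the candidate generator, the trained critic, the T2I model, and the human in the loop, and the exhaustive slate search is a simplification.

```python
import random
from itertools import combinations

SLATE_SIZE = 4

# Placeholder components; in the real system these are the LMM candidate
# generator, the learned Q-function, the SDXL renderer, and a human user.
def generate_expansions(prompt, n=8):
    return [f"{prompt}, variation {i}" for i in range(n)]

def q_value(history, slate):
    return random.random()

def render(expansion):
    return f"<image for: {expansion}>"

def ask_user(images):
    return random.randrange(len(images))

def interaction_turn(history, prompt):
    candidates = generate_expansions(prompt)
    # Candidate selector: score candidate slates with the Q-function and show
    # the user the slate with the highest estimated long-term satisfaction.
    best_slate = max(combinations(candidates, SLATE_SIZE),
                     key=lambda slate: q_value(history, slate))
    chosen = ask_user([render(p) for p in best_slate])
    history.append((best_slate, chosen))   # feedback conditions later turns
    return best_slate[chosen]              # the pick seeds the next turn's prompt

history, prompt = [], "A white cat"
for _ in range(3):  # a few refinement turns
    prompt = interaction_turn(history, prompt)
```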
Starting with a simple prompt for "A white cat", PASTA engages the user in a visually grounded dialogue. The user's selections (highlighted in blue) help the agent quickly learn their preference for a more fantastical and colorful style.
Putting PASTA to the test
To evaluate our approach, we trained PASTA as a value-based reinforcement learning agent using implicit Q-learning (IQL). We specifically wanted to see how the use of different training data impacted performance. We created three versions of the agent: 1) trained only on the real volunteer-rater data, 2) trained only on the simulated data, and 3) trained on a combination of real and simulated datasets.
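For readers unfamiliar with IQL, the sketch below shows its two core offline losses, assuming generic q_net, v_net, and target_q_net networks over whatever state and slate encodings the agent uses; this illustrates the standard algorithm rather than PASTA's exact training code.

```python
import torch
import torch.nn.functional as F

def expectile_loss(diff: torch.Tensor, tau: float = 0.7) -> torch.Tensor:
    """Asymmetric L2 loss; tau > 0.5 pushes V toward an upper expectile of Q."""
    weight = torch.abs(tau - (diff < 0).float())
    return (weight * diff.pow(2)).mean()

def iql_losses(q_net, v_net, target_q_net, batch, gamma: float = 0.99):
    s, a, r, s_next, done = batch
    # Value loss: fit V(s) to an expectile of Q(s, a) over dataset actions,
    # implicitly estimating the value of the best in-distribution action.
    with torch.no_grad():
        q_sa = target_q_net(s, a)
    v_loss = expectile_loss(q_sa - v_net(s))
    # Q loss: one-step TD backup through V, which avoids querying the critic
    # on out-of-distribution actions (crucial for offline RL).
    with torch.no_grad():
        td_target = r + gamma * (1.0 - done) * v_net(s_next)
    q_loss = F.mse_loss(q_net(s, a), td_target)
    return q_loss, v_loss
```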
We then ran a series of human evaluations comparing these agents to a baseline model (i.e., base Gemini Flash and SDXL models with no additional training) across four metrics: accuracy on the Pick-a-Pic dataset, Spearman’s rank correlation, choice model accuracy, and cross-turn accuracy. Pick-a-Pic accuracy and Spearman's rank correlation assess the model's ability to predict user preferences and rankings on existing, large-scale, single-turn datasets. Choice model accuracy and cross-turn accuracy measure the model's ability to predict which image a user will choose at a given turn and whether the selected images are an improvement over the previous turn, respectively.
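As a rough guide to what these metrics compute, here is an illustrative NumPy sketch, assuming arrays of model scores and human labels; the function names and exact conventions are ours, not the paper's.

```python
import numpy as np
from scipy.stats import spearmanr

def pairwise_accuracy(score_a, score_b, human_prefers_a):
    """Pick-a-Pic-style accuracy: does the model rank the human-preferred image higher?"""
    return np.mean((score_a > score_b) == human_prefers_a)

def rank_correlation(model_scores, human_scores):
    """Spearman's rank correlation between model and human rankings."""
    return spearmanr(model_scores, human_scores).correlation

def choice_accuracy(slate_scores, chosen_idx):
    """Does the model's top-scored image in each slate match the user's pick?"""
    return np.mean(np.argmax(slate_scores, axis=1) == chosen_idx)

def cross_turn_accuracy(pred_improved, truly_improved):
    """Does the model correctly predict whether turn t's pick improves on turn t-1?"""
    return np.mean(pred_improved == truly_improved)
```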
The results demonstrated that training PASTA on simulated data alone didn't beat the baseline, and while the agent trained only on real human data showed significant improvement, it still didn't outperform the baseline. The agent trained on the combination of real and simulated data, however, offered the best performance, confirming that our user simulation successfully captures key dynamics of human interaction while providing the scale needed for robust RL training.

The graphs above present the accuracy performance of a trained user model (y-axis) as a function of the number of user types considered (x-axis). The top row displays the model’s accuracy on the Pick-a-Pic test set (left) and its Spearman’s rank correlation on the HPS test set (right). The bottom row shows the model’s choice accuracy (left) and cross-turn preference accuracy (right), both evaluated on our human-rated test set.
When we asked raters to directly compare the final images from our best-performing agent against the baseline, 85% preferred PASTA's generated images. The difference is especially striking with abstract prompts. Starting with a simple idea like "an image of love", PASTA adapted to different user types to create a wide variety of results, from tender portraits to abstract, geometric art.

With the same starting prompt, "An image of happiness", PASTA produces dramatically different results for two distinct user types (User Type A and User Type B), showcasing its ability to adapt to an individual's unique creative style. For example, the result for Type A corresponds to a prompt like “Abstract happy faces, Art Deco inspired geometric shapes, muted jewel-toned background.”
What's next?
PASTA shows that the future of generative AI can be more interactive, preference-adaptive, and collaborative. The methods we developed, particularly the use of robust user simulators, can be applied to many other generative tasks to create AI that better aligns with and adapts to human users.
To help spur further research, we have open-sourced our sequential rater dataset and our simulated user data. We can't wait to see what the community builds with it.
Acknowledgements
This research was conducted by Ofir Nabati, Guy Tennenholtz, ChihWei Hsu, Moonkyung Ryu, Deepak Ramachandran, Yinlam Chow, Xiang Li, and Craig Boutilier. Special thanks to Mark Simborg for his help crafting this blog post and to Kimberly Schwede for creating the figures in this post.