A picture's worth a thousand (private) words: Hierarchical generation of coherent synthetic photo albums

October 20, 2025

Weiwei Kong, Software Engineer, and Umar Syed, Research Scientist, Google Research

We introduce a method for generating differentially private synthetic photo albums that uses an intermediate text representation and produces the albums in a hierarchical fashion.

Differential privacy (DP) provides a powerful, mathematically rigorous assurance that sensitive individual information in a dataset remains protected, even when the dataset is used for analysis. Since DP’s inception nearly two decades ago, researchers have developed differentially private versions of myriad data analysis and machine learning methods, ranging from calculating simple statistics to fine-tuning complex AI models. However, the requirement for organizations to privatize every analytical technique can be complex, burdensome, and error-prone.

Generative AI models like Gemini offer a simpler, more efficient solution. Instead of separately modifying every analysis method, they create a single private synthetic version of the original dataset. This synthetic data is an amalgamation of common data patterns, containing no unique details from any individual user. By using a differentially private training algorithm, such as DP-SGD, to fine-tune the generative model on the original dataset, we ensure the synthetic dataset is both private and highly representative of the real data. Any standard, non-private analytical technique or modeling can then be performed on this safe (and highly representative) substitute dataset, simplifying workflows. DP fine-tuning is a versatile tool that is particularly valuable for generating high-volume, controlled datasets in situations where access to high-quality, representative data is unavailable.
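The core mechanism behind DP-SGD is per-example gradient clipping followed by calibrated Gaussian noise. The sketch below is a minimal illustration of one such update step using NumPy, not the actual training loop used in this work; the function name and hyperparameter values are our own:

```python
import numpy as np

def dp_sgd_step(params, per_example_grads, clip_norm, noise_multiplier, lr, rng):
    """One DP-SGD update: clip each example's gradient, sum, add noise, average.

    per_example_grads has shape (batch_size, dim): one gradient per example.
    """
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    # Scale each gradient down so its L2 norm is at most clip_norm.
    scale = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    clipped = per_example_grads * scale
    # Gaussian noise calibrated to the clipping bound limits what any
    # single example can reveal about itself through the update.
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=params.shape)
    noisy_mean = (clipped.sum(axis=0) + noise) / len(per_example_grads)
    return params - lr * noisy_mean

rng = np.random.default_rng(0)
params = np.zeros(3)
grads = rng.normal(size=(8, 3)) * 5.0  # deliberately large gradients
new_params = dp_sgd_step(params, grads, clip_norm=1.0,
                         noise_multiplier=1.1, lr=0.1, rng=rng)
```

Because every per-example gradient is clipped to the same bound before noise is added, the influence of any one training example on the model is limited, which is what makes the overall privacy accounting possible.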

Most published work on private synthetic data generation has focused on simple outputs like short text passages or individual images, but modern applications using multi-modal data (images, video, etc.) rely on modeling complex, real-world systems and behaviors, which simple, unstructured text data cannot adequately capture.

We introduce a new method for privately generating synthetic photo albums as a way to address this need for synthetic versions of rich, structured image-based datasets. This task presents unique challenges beyond generating individual images, specifically the need to maintain thematic coherence and character consistency across multiple photos within a sequential album. Our method is based on translating complex image data to text and back. Our results show that this process, with rigorous DP guarantees enabled, successfully preserves the high-level semantic information and thematic coherence in datasets necessary for effective analysis and modeling applications.

How (and why) our method works

Our method differs from most other approaches to generating private synthetic image data in two major respects: (1) we use an intermediate text representation and (2) we generate the data hierarchically.

Here’s how it works:

  1. We generate a structured text representation of each original album, replacing each photo in the album with an AI-generated detailed text caption, and also using an AI model to produce a text summary of each album.
  2. We then privately fine-tune a pair of large language models to produce similar structured representations. The first model is trained to generate album summaries, and the second model is trained to generate individual photo captions based on an album summary.
  3. We use the models to generate structured representations of photo albums in a hierarchical manner. For each photo album, we first generate a summary of the album, and then using that summary as context, we generate a detailed text caption of each photo in the album.
  4. The generated structured representations are then converted into sets of images using a text-to-image AI model.
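The control flow of steps 3 and 4 can be sketched end to end. The three `generate_*` functions below are hypothetical stand-ins for the DP fine-tuned summary model, the DP fine-tuned caption model, and the text-to-image model; only the hierarchical structure is the point:

```python
# Sketch of the hierarchical generation flow. All three generate_* functions
# are stand-in stubs, not real model calls.

def generate_album_summary():
    # Real pipeline: sample from the DP fine-tuned album-summary model.
    return "A family's afternoon apple-picking trip at an orchard."

def generate_photo_caption(album_summary, photo_index):
    # Real pipeline: sample from the DP fine-tuned caption model, conditioned
    # on the album summary so all photos stay thematically consistent.
    return f"Photo {photo_index}: a scene from: {album_summary}"

def generate_image(caption):
    # Real pipeline: call a text-to-image model on the caption.
    return {"caption": caption, "pixels": None}

def generate_synthetic_album(num_photos):
    summary = generate_album_summary()              # first, one album summary
    captions = [generate_photo_caption(summary, i)  # then per-photo captions,
                for i in range(num_photos)]         # all sharing that summary
    images = [generate_image(c) for c in captions]  # finally, text -> images
    return {"summary": summary, "photos": images}

album = generate_synthetic_album(num_photos=4)
```

Because every caption in an album is conditioned on the same summary, the generated photos share a theme by construction.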
Illustration of our method for generating synthetic photo albums.

Generating text as an intermediate step towards generating images has a number of advantages. First, text generation is the main strength of a large language model. Second, text summarization is inherently privacy enhancing, since describing an image by text is a lossy operation, so synthetic photos are unlikely to be exact copies of the originals, even when differential privacy is not enabled. Finally, generating images is far more costly than generating text, so by first generating text, we can filter albums based on their content before expending resources to produce the images in which we are most interested.

Our hierarchical generation strategy ensures that the photos in each album are internally consistent, since each photo caption in an album is generated with the same album summary as context. Also, generating the structured representations in two steps (first the album summaries, and then the photo captions) conserves significant computational resources relative to generating each representation in one shot. Since training cost scales quadratically with context length (due to self-attention), training two models with shorter contexts is far less costly than training a single model with a long context.
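To make the cost argument concrete: under the rough approximation that self-attention cost grows with the square of context length, generating a summary and each caption separately in short contexts is substantially cheaper than generating the whole structured representation in one long context. A back-of-the-envelope comparison with illustrative token counts (our own numbers, not from the paper):

```python
def attention_cost(context_length):
    # Self-attention compares every token with every other token, so compute
    # scales roughly quadratically with context length.
    return context_length ** 2

# One-shot: a single model emits the summary plus all 10 captions in one
# long context (illustrative lengths: 200-token summary, 150-token captions).
one_shot = attention_cost(200 + 10 * 150)

# Hierarchical: one short summary generation, then 10 caption generations,
# each conditioned only on the 200-token summary.
hierarchical = attention_cost(200) + 10 * attention_cost(200 + 150)

savings = one_shot / hierarchical  # roughly a 2.3x reduction here
```

The gap widens as albums grow, since the one-shot context length, and hence its quadratic cost, grows with the number of photos.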

It may seem that describing images with words is too lossy an operation to preserve any interesting characteristics of the original images, but a simple demonstration (without differential privacy, to allow for side-by-side comparison) illustrates the power of this approach. In the figure below, we prompted Gemini to describe an image using several hundred words, and then fed the response text back to Gemini, prompting it to generate an image matching the description. While this circular series of transformations does not satisfy differential privacy, it does illustrate the utility of text as an intermediary for synthetic image generation. As the saying goes, a picture is worth a thousand words — and it seems that it is not worth much more than that!

Privately generated synthetic photo example

Left: Original image. Right: Synthetic image.

We asked Gemini to describe the original image in text, and then prompted Gemini to generate the synthetic image based on the text description.

Concurrent work by Wang et al. showed how one can leverage text-based intermediaries to generate differentially private single images using Private Evolution.

Evaluation and results

We tested our method on the YFCC100M dataset, a repository containing nearly 100 million images that have been released under Creative Commons licenses. We formed “albums” from these images by grouping together photos taken by the same user within the same hour. We constructed training sets for the large language models described above, taking care that no user contributes more than one example to any training set (contribution bounding is necessary to ensure the validity of the differential privacy guarantee).
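The album construction and contribution bounding just described can be sketched as follows. The record fields are hypothetical, and real contribution bounding would typically sample an album per user rather than take the first, but the key invariant is the same: each user contributes at most one album to the training set.

```python
from collections import defaultdict

def build_bounded_training_set(photos):
    """Group photos into albums (same user, same hour), then keep at most one
    album per user so no user contributes more than one training example.

    Each photo is a dict with hypothetical fields:
    'user', 'timestamp_hour', 'caption'.
    """
    albums = defaultdict(list)
    for photo in photos:
        albums[(photo["user"], photo["timestamp_hour"])].append(photo["caption"])

    training_set, seen_users = [], set()
    for (user, _hour), captions in sorted(albums.items()):
        if user in seen_users:
            continue  # contribution bounding: at most one album per user
        seen_users.add(user)
        training_set.append({"user": user, "captions": captions})
    return training_set

photos = [
    {"user": "u1", "timestamp_hour": "2008-06-01T14", "caption": "orchard"},
    {"user": "u1", "timestamp_hour": "2008-06-01T14", "caption": "apples"},
    {"user": "u1", "timestamp_hour": "2008-06-02T09", "caption": "beach"},
    {"user": "u2", "timestamp_hour": "2008-06-01T14", "caption": "meadow"},
]
train = build_bounded_training_set(photos)  # u1's second album is dropped
```

Bounding each user to one example is what lets a DP training algorithm translate its per-example guarantee into a per-user guarantee.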

After applying our method to generate synthetic photo albums, we evaluated how well they resemble the original albums. First, we computed the MAUVE score, a neural embedding–based measure of semantic similarity, between the original and synthetic structured representations.

The figure below shows the MAUVE scores between real and synthetic album summaries, as well as real and synthetic photo captions, both before and after fine-tuning.

MAUVE scores between real and synthetic album summaries & captions

Left: MAUVE scores between real and synthetic album summaries. Right: MAUVE scores between real and synthetic photo captions. Higher MAUVE scores indicate greater similarity. Higher values of the privacy parameter ε imply weaker privacy constraints.

Next, we calculated the most common topics in the album summaries, shown in the table below, and found that they were very similar between real and synthetic data.

Real album summaries vs synthetic album summaries

Left: Most common topics in real album summaries. Right: Most common topics in synthetic album summaries.

Finally, direct visual examination of the synthetic photo albums shows that each album is typically centered on a common theme, just like real photo albums, as demonstrated by the examples in the figure below.

Privately generated synthetic photo albums

Two synthetically generated photo albums. Each album maintains a specific theme (top: apple picking trip; bottom: couple visits a meadow).

Conclusion

The challenges of modern AI require data that is not only private, but also structurally and contextually rich, a need that simple, unstructured data can’t meet. By applying our hierarchical, text-as-intermediate method to the demanding task of generating coherent synthetic photo albums, we’ve successfully shown a pathway for extending the benefits of synthetic data beyond simple text or isolated images.

This methodology opens exciting new avenues for privacy-preserving AI innovation. It helps resolve the persistent tension between the need for large, high-quality data and the imperative to protect user privacy, paving the way for safer and more generalized AI development across critical industries.

Acknowledgements

This work is the result of a collaboration between many people at Google Research, including (in alphabetical order by last name): Kareem Amin, Alex Bie, Rudrajit Das, Alessandro Epasto, Weiwei Kong, Alex Kurakin, Natalia Ponomareva, Monica Ribero, Jane Shapiro, Umar Syed, and Sergei Vassilvitskii.