Bringing 3D shoppable products online with generative AI

May 12, 2025

Steve Seitz, Distinguished Scientist, Google Labs

Discover how our latest AI models transform 2D product images into immersive 3D experiences for online shoppers.

Every day, billions of people shop online, hoping to replicate the best parts of in-store shopping. Seeing something that catches your eye, picking it up and inspecting it for yourself can be a key part of how we connect with products. But capturing the intuitive, hands-on nature of the store experience is nuanced and can be challenging to replicate on a screen. We know that technology can help bridge the gap, bringing key details to your fingertips with a quick scroll. But these online tools can be costly and time consuming for businesses to create at scale.

To address this, we developed new generative AI techniques to create high-quality, shoppable 3D product visualizations from as few as three product images. Today, we're excited to share the latest advancement, powered by Google’s state-of-the-art video generation model, Veo. This technology is already enabling the generation of interactive 3D views for a wide range of product categories on Google Shopping.

Examples of 3D product visualizations generated from photos.

First generation: Neural Radiance Fields (NeRFs)

In 2022, researchers from across Google came together to develop technologies to make product visualization more immersive. The initial efforts focused on using Neural Radiance Fields (NeRF) to learn a 3D representation of products to render novel views (i.e., novel view synthesis), like 360° spins, from five or more images of the product. This required solving many sub-problems, including selecting the most informative images, removing unwanted backgrounds, predicting 3D priors, estimating camera positions from a sparse set of object-centric images, and optimizing a 3D representation of the product.
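
To make the core idea concrete, here is a minimal sketch of a NeRF-style model: a small MLP maps a positionally encoded 3D point to color and density, and a pixel is rendered by compositing samples along a camera ray. The tiny network, sample counts, and the `render_ray` helper are illustrative stand-ins under simplified assumptions, not the production pipeline, which additionally handles view selection, background removal, 3D priors, and camera estimation as described above.

```python
import torch

def positional_encoding(x, num_freqs=6):
    """Map 3D points to sin/cos features so a small MLP can fit fine detail."""
    feats = [x]
    for k in range(num_freqs):
        feats += [torch.sin((2.0 ** k) * x), torch.cos((2.0 ** k) * x)]
    return torch.cat(feats, dim=-1)

class TinyRadianceField(torch.nn.Module):
    """A toy radiance field: 3D point -> (r, g, b, density)."""
    def __init__(self, num_freqs=6):
        super().__init__()
        in_dim = 3 * (1 + 2 * num_freqs)
        self.mlp = torch.nn.Sequential(
            torch.nn.Linear(in_dim, 64), torch.nn.ReLU(),
            torch.nn.Linear(64, 4),
        )

    def forward(self, points):
        out = self.mlp(positional_encoding(points))
        rgb = torch.sigmoid(out[..., :3])
        density = torch.nn.functional.softplus(out[..., 3])
        return rgb, density

def render_ray(field, origin, direction, near=0.5, far=3.0, num_samples=64):
    """Volume rendering: alpha-composite color samples along one camera ray."""
    t = torch.linspace(near, far, num_samples)
    points = origin + t[:, None] * direction             # sample points along the ray
    rgb, density = field(points)
    delta = (far - near) / num_samples
    alpha = 1.0 - torch.exp(-density * delta)             # per-sample opacity
    trans = torch.cumprod(torch.cat([torch.ones(1), 1.0 - alpha + 1e-10])[:-1], dim=0)
    weights = alpha * trans                               # contribution of each sample
    return (weights[:, None] * rgb).sum(dim=0)            # final pixel color

# Toy usage: query the (untrained) field along a single ray pointing at the object.
field = TinyRadianceField()
pixel = render_ray(field,
                   origin=torch.tensor([0.0, 0.0, -2.0]),
                   direction=torch.tensor([0.0, 0.0, 1.0]))
print(pixel)
```

In the real system, a field like this is optimized so that rays rendered from the estimated camera poses reproduce the posed product photos.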

That same year, we announced this breakthrough and launched the first milestone, interactive 360° visualizations of shoes on Google Search. While this technology was promising, it suffered from noisy input signals (e.g., inaccurate camera poses) and ambiguity from sparse input views. This challenge became especially apparent with sandals and heels, whose thin structures and more complex geometry were tricky to reconstruct from just a handful of images.

This led us to wonder: could the recent advancements in generative diffusion models help us improve the learned 3D representation?

Our first-generation approach used neural radiance fields (NeRF) to render novel views, combining several 3D techniques like NOCS for XYZ prediction, CamP for camera optimization, and Zip-NeRF for state-of-the-art novel view synthesis from a sparse set of views.

Second generation: Scaling with a view-conditioned diffusion prior

In 2023, we introduced a second-generation approach which used a view-conditioned diffusion prior to address the limitations of the first approach. Being view-conditioned means that you can give it an image of the top of a shoe and ask the model “what does the front of this shoe look like?” In this way, we can use the view-conditioned diffusion model to help predict what the shoe looks like from any viewpoint, even if we only have photos of limited viewpoints.

In practice, we employ a variant of score distillation sampling (SDS), first proposed in DreamFusion. During training, we render the 3D model from a random camera view. We then use the view-conditioned diffusion model and the available posed images to generate a target from the same camera view. Finally, we calculate a score by comparing the rendered image and the generated target. This score directly informs the optimization process, refining the 3D model's parameters and enhancing its quality and realism.
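
To illustrate the mechanics, the sketch below shows what a single SDS-style update could look like, assuming a view-conditioned denoiser that predicts the noise added to a rendered view given a reference product photo. The `ToyDenoiser`, noise schedule, and image sizes are placeholders for exposition; in the actual system the gradient flows through a differentiable renderer into the 3D representation rather than into a raw image.

```python
import torch

class ToyDenoiser(torch.nn.Module):
    """Stand-in for a view-conditioned diffusion model that predicts noise."""
    def __init__(self, channels=3):
        super().__init__()
        self.net = torch.nn.Conv2d(channels * 2, channels, kernel_size=3, padding=1)

    def forward(self, noisy, t, cond):
        # The timestep is ignored by this toy stub; a real model embeds it.
        return self.net(torch.cat([noisy, cond], dim=1))

def sds_step(rendered, cond_view, eps_model, alphas_cumprod, optimizer):
    """One SDS-style update: nudge the rendered view toward the diffusion prior."""
    b = rendered.shape[0]
    t = torch.randint(0, len(alphas_cumprod), (b,))             # random diffusion timestep
    a_t = alphas_cumprod[t].view(b, 1, 1, 1)
    noise = torch.randn_like(rendered)
    noisy = a_t.sqrt() * rendered + (1 - a_t).sqrt() * noise    # forward diffusion
    with torch.no_grad():
        eps_pred = eps_model(noisy, t, cond_view)               # prior's denoising direction
    grad = eps_pred - noise                                     # SDS gradient w.r.t. the render
    loss = (rendered * grad).sum()                              # surrogate: d(loss)/d(rendered) = grad
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)

# Toy usage: a learnable image stands in for the differentiable 3D render.
rendered = torch.nn.Parameter(torch.rand(1, 3, 64, 64))
cond_view = torch.rand(1, 3, 64, 64)                            # a posed product photo
model = ToyDenoiser()
alphas_cumprod = torch.linspace(0.999, 0.01, 1000)
opt = torch.optim.Adam([rendered], lr=1e-2)
print(sds_step(rendered, cond_view, model, alphas_cumprod, opt))
```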

This second-generation approach led to significant scaling advantages, enabling us to generate 3D representations for many of the shoes viewed daily on Google Shopping. Today, you can find interactive 360° visualizations for sandals, heels, boots, and other footwear categories when you shop on Google, the majority of which are created by this technology!

The second-generation approach used a view-conditioned diffusion model based on the TryOn architecture. The diffusion model acts as a learned prior using score distillation sampling proposed in DreamFusion to improve the quality and fidelity of novel views.

Third generation: Generalizing with Veo

Our latest breakthrough builds on Veo, Google's state-of-the-art video generation model. A key strength of Veo is its ability to generate videos that capture complex interactions between light, material, texture, and geometry. Its powerful diffusion-based architecture and its ability to be fine-tuned on a variety of multi-modal tasks enable it to excel at novel view synthesis.

To fine-tune Veo to transform product images into a consistent 360° video, we first curated a dataset of millions of high-quality 3D synthetic assets. We then rendered the 3D assets from various camera angles and lighting conditions. Finally, we created a dataset of paired images and videos and supervised Veo to generate 360° spins conditioned on one or more images.
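
Below is a rough sketch of how such paired training examples might be assembled. Only the spin-camera geometry is concrete; `render` stands in for an offline renderer, and choices like the frame count, number of conditioning views, and the lighting parameterization are illustrative assumptions rather than details of the actual pipeline.

```python
import math
import random

def spin_camera_poses(num_frames=96, radius=2.0, elevation_deg=20.0):
    """Camera positions on a circle around the object, i.e., a 360° spin."""
    elev = math.radians(elevation_deg)
    poses = []
    for i in range(num_frames):
        azim = 2.0 * math.pi * i / num_frames
        poses.append((
            radius * math.cos(elev) * math.cos(azim),   # x
            radius * math.cos(elev) * math.sin(azim),   # y
            radius * math.sin(elev),                    # z; the camera looks at the origin
        ))
    return poses

def make_training_pair(asset, render, num_condition_views=3):
    """One (conditioning images, target 360° spin video) training example."""
    target_poses = spin_camera_poses()
    # Vary lighting per example so the model sees different material responses.
    lighting = {"env_rotation_deg": random.uniform(0.0, 360.0)}
    target_video = [render(asset, pose, lighting) for pose in target_poses]
    # Conditioning views are a sparse subset of poses, mimicking product photos.
    cond_poses = random.sample(target_poses, k=num_condition_views)
    cond_images = [render(asset, pose, lighting) for pose in cond_poses]
    return {"condition": cond_images, "target": target_video}
```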

We discovered that this approach generalized effectively across a diverse set of product categories, including furniture, apparel, electronics, and more. Veo was not only able to generate novel views that adhered to the available product images, but it was also able to capture complex lighting and material interactions (e.g., shiny surfaces), something that was challenging for the first- and second-generation approaches.

The third-generation approach builds on Veo to generate 360° spins from one or more product images.

Furthermore, this third-generation approach avoids the need to estimate precise camera poses from a sparse set of object-centric product images, simplifying the problem and increasing reliability. The fine-tuned Veo approach is powerful: with one image, you can generate a realistic 3D representation of the object. But like any generative 3D technology, Veo needs to hallucinate details for unseen views, for example, the back of the object when only a view of the front is available. As the number of input images increases, so does Veo's ability to generate high-fidelity, high-quality novel views. In practice, we found that as few as three images capturing most object surfaces are sufficient to improve the quality of the 3D images and reduce hallucinations.

Conclusion and future outlook

Over the last several years, there has been tremendous progress in 3D generative AI, from NeRF to view-conditioned diffusion models and now Veo. Each technology has played a key part in making online shopping feel more tangible and interactive. Looking ahead, we are excited to continue to push boundaries in this space and to help make shopping online increasingly delightful, informative, and engaging for our users.

Acknowledgements

This work was made possible by Philipp Henzler, Matthew Burruss, Matthew Levine, Laurie Zhang, Ke Yu, Chung-Yi Weng, Jason Y. Zhang, Changchang Wu, Ira Kemelmacher-Shlizerman, Carlos Hernandez, Keunhong Park, and Ricardo Martin-Brualla. We thank Aleksander Holynski, Ben Poole, Jon Barron, Pratul Srinivasan, Howard Zhou, Federico Tombari, and many more from Google Labs, Google DeepMind, and Google Shopping.