Discriminative Diffusion Models as Few-shot Vision and Language Learners

Xuehai He; Weixi Feng; Tsu-Jui Fu; Varun Jampani; Arjun Akula; Pradyumna Narayana; Sugato Basu; William Yang Wang; Xin Eric Wang

ArXiv (2023)

Abstract

Diffusion models, such as Stable Diffusion, have shown strong performance on text-to-image generation. Because text-to-image generation often requires a model to render visual concepts with the fine-grained details and attributes specified in a text prompt, can we leverage the representations learned by pre-trained diffusion models for discriminative tasks such as image-text matching? To answer this question, we propose a novel approach, Discriminative Stable Diffusion (DSD), which turns pre-trained text-to-image diffusion models into few-shot discriminative learners. Our approach uses the cross-attention score of a Stable Diffusion model to capture the mutual influence between visual and textual information, and fine-tunes the model via attention-based prompt learning to perform image-text matching. By comparing DSD with state-of-the-art methods on several benchmark datasets, we demonstrate the potential of pre-trained diffusion models for discriminative tasks, with superior results on few-shot image-text matching.
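
To make the idea of scoring an image-text pair from cross-attention more concrete, the following is a minimal, illustrative sketch and not the DSD implementation. It assumes that image features act as queries and text-token features act as keys, that the pre-softmax scaled dot products are used as the attention scores, and that they are pooled into a single scalar with LogSumExp. The pooling choice, the function names (`attention_logits`, `matching_score`, `best_caption`), and the toy vectors are all assumptions introduced for illustration; the abstract does not specify these details.

```python
import math


def attention_logits(queries, keys):
    """Scaled dot products between image queries and text-token keys.

    queries: list of image-position feature vectors (hypothetical toy data).
    keys:    list of text-token feature vectors of the same dimension.
    Returns a flat list with one logit per (image position, text token) pair.
    """
    scale = 1.0 / math.sqrt(len(keys[0]))
    return [
        scale * sum(q_i * k_i for q_i, k_i in zip(q, k))
        for q in queries
        for k in keys
    ]


def matching_score(queries, keys):
    """Pool all attention logits into one scalar with LogSumExp.

    LogSumExp is an assumed pooling choice: it behaves like a smooth maximum,
    so a few strongly aligned (position, token) pairs dominate the score.
    """
    logits = attention_logits(queries, keys)
    m = max(logits)  # subtract the max for numerical stability
    return m + math.log(sum(math.exp(x - m) for x in logits))


def best_caption(queries, candidate_keys):
    """Index of the candidate caption with the highest matching score."""
    scores = [matching_score(queries, keys) for keys in candidate_keys]
    return scores.index(max(scores))


# Toy example: two image positions and two captions of two tokens each.
image_queries = [[1.0, 0.0], [0.0, 1.0]]
caption_a = [[1.0, 0.0], [0.0, 1.0]]    # token features aligned with the image
caption_b = [[-1.0, 0.0], [0.0, -1.0]]  # token features opposed to the image

chosen = best_caption(image_queries, [caption_a, caption_b])
```

In this toy setup the aligned caption receives the higher score, so `chosen` is the index of `caption_a`. Because LogSumExp grows with the number of pooled terms, a practical system would need to account for captions of different lengths; the sketch sidesteps this by using candidates with the same number of tokens.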
