Google Research

ViTGAN: Training GANs with Vision Transformers

ICLR (2022)


Recently, Vision Transformers (ViTs) have shown remarkable performance on image recognition while requiring fewer vision-specific inductive biases. In this paper, we investigate whether this observation extends to image generation. To this end, we integrate the ViT architecture into generative adversarial networks (GANs). We observe that existing regularization methods for GANs interact poorly with self-attention, causing serious instability during training. To resolve this issue, we introduce novel regularization methods for training GANs with ViTs. Our approach achieves performance comparable to the state-of-the-art CNN-based StyleGAN2 on the CIFAR-10, CelebA, and LSUN bedroom datasets.
