ViTGAN: Training GANs with Vision Transformers
Abstract
Recently, Vision Transformers (ViTs) have shown remarkable performance on image recognition while requiring fewer vision-specific inductive biases. In this paper, we investigate whether this observation can be extended to image generation. To this end, we integrate the ViT architecture into generative adversarial networks (GANs). We observe that existing regularization methods for GANs interact poorly with self-attention, causing serious instability during training. To resolve this issue, we introduce novel regularization techniques for training GANs with ViTs. Our approach achieves performance comparable to the state-of-the-art CNN-based StyleGAN2 on the CIFAR-10, CelebA, and LSUN bedroom datasets.
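To make the "ViT architecture in a GAN" idea concrete, the sketch below shows a minimal ViT-based discriminator that classifies an image as real or fake from a [CLS] token. This is an illustrative assumption, not the paper's exact model: the patch size, depth, width, and regularization details here are placeholders, and the paper's proposed regularization methods are not included.

```python
# Minimal sketch of a ViT-style GAN discriminator (illustrative hyperparameters,
# not the paper's configuration; proposed regularizers omitted).
import torch
import torch.nn as nn


class ViTDiscriminator(nn.Module):
    def __init__(self, img_size=32, patch_size=4, in_chans=3,
                 embed_dim=192, depth=4, num_heads=3):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # Non-overlapping patch embedding via a strided convolution.
        self.patch_embed = nn.Conv2d(in_chans, embed_dim,
                                     kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, dim_feedforward=4 * embed_dim,
            batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)
        self.norm = nn.LayerNorm(embed_dim)
        self.head = nn.Linear(embed_dim, 1)  # real/fake logit

    def forward(self, x):
        b = x.shape[0]
        x = self.patch_embed(x).flatten(2).transpose(1, 2)   # (B, N, D) patch tokens
        cls = self.cls_token.expand(b, -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed       # prepend [CLS], add positions
        x = self.encoder(x)                                   # self-attention blocks
        return self.head(self.norm(x[:, 0]))                  # score from the [CLS] token


# Usage example: score a batch of 32x32 images (e.g., CIFAR-10 sized inputs).
if __name__ == "__main__":
    disc = ViTDiscriminator()
    logits = disc(torch.randn(8, 3, 32, 32))
    print(logits.shape)  # torch.Size([8, 1])
```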