Face0: Instantaneously Conditioning a Text-to-Image Model on a Face

Dani Valevski
Danny Wasserman
Yaniv Leviathan
SIGGRAPH Asia 2023 Conference Papers (2023)


We present Face0, a novel way to instantaneously condition a text-to-image generation model on a face at sampling time, without any optimization procedures such as fine-tuning or inversions. We augment a dataset of annotated images with embeddings of the included faces and train an image generation model (we use Stable Diffusion) on the augmented dataset. Once trained, our system is practically identical at inference time to the underlying base model, and can therefore generate images from a prompt and a user-supplied face image that was unknown at training time, in just a couple of seconds. While other methods, especially those that receive multiple user-supplied images, suffer less from identity loss, our method still achieves pleasing results, is remarkably simple and extremely fast, and equips the underlying model with new capabilities, such as controlling the generated images both via text and via direct manipulation of the input face embeddings. In addition, when using random vectors instead of face embeddings from a user-supplied image, our method essentially solves the problem of consistent character generation across images. Finally, while further research is required, we hope that our method, which decouples the model’s textual biases from its biases on faces, might be a step towards some mitigation of biases in future text-to-image models.
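The conditioning idea described in the abstract can be illustrated with a toy sketch. It is not the authors' implementation: the abstract does not say how the face embedding enters the model, so the sketch assumes a hypothetical projection (`project_face`) that maps a face embedding to a few token-sized vectors, which then overwrite the last slots of the prompt embedding (`condition_prompt`). All names, dimensions, and the random weights are illustrative assumptions. The sketch shows two properties the abstract mentions: reusing one random vector across prompts keeps the face part of the conditioning fixed, and face embeddings can be manipulated directly, here by linear interpolation.

```python
import random

# Toy sizes, chosen for readability; real models use much larger ones (assumption).
FACE_DIM = 8     # size of the face embedding
TOKEN_DIM = 4    # width of one prompt token embedding
NUM_TOKENS = 6   # number of token slots in an encoded prompt
FACE_SLOTS = 2   # trailing slots reserved for the face (hypothetical choice)

rng = random.Random(0)


def random_vector(n):
    """Draw n standard-normal values from the seeded generator."""
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


# Hypothetical projection weights; random values stand in for trained ones.
W = [random_vector(FACE_SLOTS * TOKEN_DIM) for _ in range(FACE_DIM)]


def project_face(face_emb):
    """Map a face embedding to FACE_SLOTS token-sized vectors (linear map)."""
    flat = [
        sum(face_emb[i] * W[i][j] for i in range(FACE_DIM))
        for j in range(FACE_SLOTS * TOKEN_DIM)
    ]
    return [flat[s * TOKEN_DIM:(s + 1) * TOKEN_DIM] for s in range(FACE_SLOTS)]


def condition_prompt(prompt_emb, face_emb):
    """Copy the prompt embedding and replace its last FACE_SLOTS rows."""
    kept = [row[:] for row in prompt_emb[:-FACE_SLOTS]]
    return kept + project_face(face_emb)


def interpolate(face_a, face_b, t):
    """Blend two face embeddings; t=0 gives face_a, t=1 gives face_b."""
    return [(1.0 - t) * a + t * b for a, b in zip(face_a, face_b)]


# Stand-ins for the text-encoder outputs of two different prompts.
prompt_beach = [random_vector(TOKEN_DIM) for _ in range(NUM_TOKENS)]
prompt_office = [random_vector(TOKEN_DIM) for _ in range(NUM_TOKENS)]

# A random vector used in place of a real face embedding acts as a reusable character.
character = random_vector(FACE_DIM)
other = random_vector(FACE_DIM)

cond_beach = condition_prompt(prompt_beach, character)
cond_office = condition_prompt(prompt_office, character)
blended = condition_prompt(prompt_beach, interpolate(character, other, 0.5))
```

In this sketch, `cond_beach` and `cond_office` differ in their text slots but share identical face slots, which is the toy analogue of generating the same character under different prompts. In a real pipeline the conditioned embedding would be passed to the diffusion model in place of the plain prompt embedding, and the projection would be trained jointly with the model rather than drawn at random.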