- Ben Usman
- Nick Dufour
- Kate Saenko
- Chris Bregler
In this work we propose a model that enables controlled manipulation of visual attributes of real "target" images (e.g., lighting, expression, or pose) using only implicit supervision from synthetic "source" exemplars. Specifically, our model learns a shared low-dimensional representation of input images from both domains in which a property of interest is isolated from other content features of the input. By using triplets of synthetic images that demonstrate modification of the visual attribute we would like to control (for example, mouth opening), we disentangle image representations with respect to this attribute without using explicit attribute labels in either domain. Since our technique relies on triplets instead of explicit labels, it can be applied to shape, texture, lighting, or other properties that are difficult to measure or to represent as explicit conditioners. We quantitatively analyze the degree to which trained models learn to isolate the property of interest from other content features on a proof-of-concept digit dataset, and demonstrate results in a far more difficult setting: learning to manipulate real faces using a synthetic 3D face dataset. We also explore the limitations of our model with respect to differences in the distributions of properties observed in the two domains.