Ira Kemelmacher-Shlizerman


Ira Kemelmacher-Shlizerman is a Principal Scientist and lead for Gen AI / AR for Google Shopping.

Authored Publications
    We present FederNeRF, a method that takes a collection of photos of a subject (e.g. Roger Federer) captured across multiple years with arbitrary body poses and appearances, and enables rendering the subject with arbitrary novel combinations of viewpoint, body pose, and appearance. FederNeRF builds a customized neural volumetric 3D model of the subject that is able to render the entire space spanned by camera viewpoint, body pose, and appearance. A central challenge in this task is dealing with sparse observations: a given body pose is likely observed from only a single viewpoint with a single appearance, and a given appearance is observed under only a handful of different body poses. We address this issue by recovering a canonical T-pose neural volumetric representation of the subject that allows appearance to change across observations, but uses a shared pose-dependent motion field across all observations. We demonstrate that this approach, along with regularization of the recovered volumetric geometry to encourage smoothness, recovers a model that renders compelling images from novel combinations of viewpoint, pose, and appearance from these challenging unstructured photo collections, outperforming prior work on free-viewpoint human rendering.
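The factorization described above can be sketched in a few lines of toy Python. This is a minimal illustration of the idea only, not the paper's networks: a single canonical T-pose volume shared by all observations, a pose-dependent motion field shared across appearances, and a per-observation appearance code. All function names and bodies here are illustrative stand-ins.

```python
def motion_field(point, pose):
    """Shared backward warp: maps an observation-space point into the
    canonical T-pose space as a function of body pose only.
    (Toy stand-in: 'pose' is just a per-axis offset.)"""
    dx, dy, dz = pose
    x, y, z = point
    return (x - dx, y - dy, z - dz)

def canonical_volume(canonical_point, appearance_code):
    """Canonical T-pose radiance: geometry is shared across all
    observations, while color varies with the appearance code.
    (Toy stand-in: a soft sphere with scaled base color.)"""
    x, y, z = canonical_point
    density = max(0.0, 1.0 - (x * x + y * y + z * z))
    color = tuple(min(1.0, c * appearance_code) for c in (0.8, 0.6, 0.4))
    return density, color

def render_point(point, pose, appearance_code):
    """Every (pose, appearance) pair reuses the same canonical model,
    which is what lets sparse observations share information."""
    return canonical_volume(motion_field(point, pose), appearance_code)
```

Note how changing the appearance code alters color but not the recovered geometry, mirroring the shared-geometry, varying-appearance design the abstract describes.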
    DreamPose: Fashion Video Synthesis with Stable Diffusion
    Johanna Karras
    Aleksander Hołyński
    Ting-Chun Wang
    ICCV (2023)
    We present DreamPose, a diffusion model-based method to generate fashion videos from still images. Given an image and a pose sequence, our method realistically animates both human and fabric motion as a function of body pose. Unlike past image-to-video approaches, we transform a pretrained text-to-image (T2I) Stable Diffusion model into a pose-guided video synthesis model, achieving high-quality results at a fraction of the computational cost of traditional video diffusion methods [13]. In our approach, we introduce a novel encoder architecture that enables Stable Diffusion to be conditioned directly on image embeddings, eliminating the need for intermediate text embeddings of any kind. We additionally demonstrate that concatenating target poses with the input noise is a simple yet effective means of conditioning the output frame on pose. Our quantitative and qualitative results show that DreamPose achieves state-of-the-art results on fashion video synthesis.
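The two conditioning paths described above can be sketched schematically. This is a toy sketch, not the paper's API: plain Python lists stand in for tensors, and `make_unet_inputs` is a hypothetical name.

```python
def make_unet_inputs(noise, pose_maps, image_embedding):
    """Sketch of DreamPose-style conditioning as described in the
    abstract (toy lists in place of tensors):
    - target pose maps are concatenated with the input noise,
      standing in for channel-wise concatenation;
    - the image embedding is passed as the cross-attention context,
      replacing the usual text embeddings entirely."""
    x = noise + pose_maps        # toy stand-in for channel-wise concat
    context = image_embedding    # conditions the cross-attention layers
    return x, context
```

The point of the sketch is that pose enters through the UNet input while appearance enters through attention context, so neither path needs a text prompt.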
    TryOnDiffusion: A Tale of Two U-Nets
    Luyang Zhu
    Fitsum Reda
    William Chan
    Chitwan Saharia
    Mohammad Norouzi
    The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2023
    Given two images depicting a person and a garment worn by another person, our goal is to generate a visualization of how the garment might look on the input person. A key challenge is to synthesize a photorealistic, detail-preserving visualization of the garment while warping the garment to accommodate a significant body pose and shape change across the subjects. Previous methods either focus on garment detail preservation without effective pose and shape variation, or allow try-on with the desired shape and pose but lack garment detail. In this paper, we propose a diffusion-based architecture that unifies two UNets (referred to as Parallel-UNet), which allows us to preserve garment details and warp the garment for significant pose and body change in a single network. The key ideas behind Parallel-UNet are: 1) the garment is warped implicitly via a cross-attention mechanism, and 2) garment warping and person blending happen as part of a unified process rather than as a sequence of two separate tasks. Experimental results indicate that TryOnDiffusion achieves state-of-the-art performance both qualitatively and quantitatively.
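The "implicit warp via cross attention" idea can be illustrated with a minimal single-head attention in pure Python. This is a generic cross-attention sketch, not the Parallel-UNet implementation: person features act as queries over garment features, so garment content is moved to wherever attention mass lands rather than through an explicit flow field.

```python
import math

def cross_attention(person_queries, garment_keys, garment_values):
    """Toy single-head cross attention (lists of feature vectors in
    place of tensors). Each person-side query attends over all
    garment-side keys; the softmax-weighted sum of garment values
    'warps' garment content to the person's layout implicitly."""
    out = []
    for q in person_queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(len(q))
                  for k in garment_keys]
        m = max(scores)                       # numerically stable softmax
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        w = [wi / z for wi in w]
        out.append([sum(wi * v[j] for wi, v in zip(w, garment_values))
                    for j in range(len(garment_values[0]))])
    return out
```

When a query matches one key strongly, the output copies that garment feature almost exactly; when it matches all keys equally, it blends them, which is the sense in which attention replaces explicit warping.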
    HumanNeRF: Free-viewpoint Rendering of Moving People from Monocular Video
    Chung-Yi Weng
    Pratul Srinivasan
    CVPR (Computer Vision and Pattern Recognition), IEEE and the Computer Vision Foundation (2022) (to appear)
    We introduce a free-viewpoint rendering method -- HumanNeRF -- that works on a given monocular video of a human performing complex body motions, e.g. a video from YouTube. Our method enables pausing the video at any frame and rendering the subject from arbitrary new camera viewpoints, or even a full 360-degree camera path, for that particular frame and body pose. This task is particularly challenging, as it requires synthesizing photorealistic details of the body as seen from camera angles that may not exist in the input video, as well as synthesizing fine details such as cloth folds and facial appearance. Our method optimizes for a volumetric representation of the person in a canonical T-pose, in concert with a motion field that maps the estimated canonical representation to every frame of the video via backward warps. The motion field is decomposed into skeletal rigid and non-rigid motions, produced by deep networks. We show significant performance improvements over prior work, and compelling examples of free-viewpoint renderings from monocular video of moving humans in challenging uncontrolled capture scenarios.
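The decomposed backward warp described above can be sketched as a composition of two maps. These are toy stand-ins, not the paper's learned networks: in HumanNeRF both components are produced by deep networks, whereas here they are fixed functions chosen only to show the structure.

```python
def skeletal_warp(point):
    """Toy stand-in for the rigid, skeleton-driven part of the warp:
    here, just a fixed per-frame translation."""
    x, y, z = point
    return (x - 1.0, y, z)

def nonrigid_warp(canonical_point):
    """Toy stand-in for the learned non-rigid refinement (e.g. cloth
    folds): a small residual offset from the canonical position."""
    x, y, z = canonical_point
    return (0.01 * x, 0.0, 0.0)

def backward_warp(point):
    """Observation space -> canonical T-pose space: coarse skeletal
    rigid motion first, then a non-rigid residual on top."""
    canonical = skeletal_warp(point)
    offset = nonrigid_warp(canonical)
    return tuple(c + o for c, o in zip(canonical, offset))
```

Separating the coarse rigid motion from the fine residual is what lets the rigid part carry most of the deformation while the non-rigid part stays small and easy to regularize.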
    Given a pair of images, a target person and a garment worn by another person, we automatically generate the target person in the given garment. Previous methods mostly focused on texture transfer via paired-data training, while overlooking body shape deformation, skin color, and seamless blending of the garment with the person. This work focuses on those three components, while also not requiring paired-data training. We designed a pose-conditioned StyleGAN2 architecture with a clothing segmentation branch that is trained on images of people wearing garments. Once trained, we propose a new layered latent space interpolation method that allows us to preserve and synthesize skin color and target body shape while transferring the garment from a different person. We demonstrate results on high-resolution 512x512 images, and extensively compare to the state of the art in try-on on both latent-space-generated and real images.
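The layered latent space idea can be illustrated with a per-layer swap between two latent codes. This is a simplified sketch, not the paper's interpolation method: which layers control garment appearance is chosen arbitrarily here for illustration, and real StyleGAN2 latents are vectors rather than strings.

```python
def layered_latent_swap(person_latents, garment_latents, garment_layers):
    """Per-layer combination of two StyleGAN2-style latent codes
    (one code per synthesis layer). Layers in 'garment_layers' take
    the garment image's code; the remaining layers keep the person's
    code, which is what preserves skin color and body shape while the
    garment transfers."""
    return [g if i in garment_layers else p
            for i, (p, g) in enumerate(zip(person_latents, garment_latents))]
```

A full method would interpolate rather than hard-swap, and would use the segmentation branch to decide spatially where each code applies; the sketch only shows the layered structure.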