Tali Dekel
I'm a Senior Research Scientist at Google, Cambridge, developing algorithms at the intersection of computer vision and computer graphics. Before Google, I was a Postdoctoral Associate at the Computer Science and Artificial Intelligence Lab (CSAIL) at MIT, working with Prof. William T. Freeman. I completed my Ph.D. studies at the School of Electrical Engineering, Tel-Aviv University, Israel, under the supervision of Prof. Shai Avidan and Prof. Yael Moses. My research interests include computational photography, image synthesis, geometry and 3D reconstruction.
Authored Publications
Teaching CLIP to Count to Ten
Michal Irani
Roni Paiss
Shiran Zada
Submission to CVPR 2023 (2023)
Large vision-language models, such as CLIP, learn robust representations of text and images, facilitating advances in many downstream tasks, including zero-shot classification and text-to-image generation. However, these models have several well-documented limitations. They fail to encapsulate compositional concepts, such as counting objects in an image or the relations between objects.
To the best of our knowledge, this work is the first to extend CLIP to handle object counting. We introduce a simple yet effective method to improve the quantitative understanding of vision-language models, while maintaining their overall performance on common benchmarks.
Our method automatically augments image captions to create hard negative samples that differ from the original captions by only the number of objects. For example, an image of three dogs can be contrasted with the negative caption "Six dogs playing in the yard". A dedicated loss encourages discrimination between the correct caption and its negative variant.
We introduce CountBench, a new benchmark for evaluating a model's understanding of object counting, and demonstrate significant improvement over baseline models on this task. Furthermore, we leverage our improved CLIP representations for image generation, and show that our model can produce specific counts of objects more reliably than existing ones.
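The caption augmentation described in this abstract can be illustrated with a minimal sketch. It assumes counts appear as spelled-out English number words and swaps the first one found for a different word; the vocabulary `NUMBER_WORDS` and the helper `make_count_negative` are hypothetical names for illustration, not the authors' implementation, and the dedicated loss is not shown.

```python
import random
import re

# Hypothetical vocabulary of spelled-out counts (an assumption; the abstract
# only says negatives differ from the original caption in the object count).
NUMBER_WORDS = ["two", "three", "four", "five", "six",
                "seven", "eight", "nine", "ten"]
_NUMBER_RE = re.compile(r"\b(" + "|".join(NUMBER_WORDS) + r")\b", re.IGNORECASE)


def make_count_negative(caption, rng):
    """Return `caption` with its first number word swapped, or None if absent."""
    match = _NUMBER_RE.search(caption)
    if match is None:
        return None  # no count in the caption, so no hard negative is produced
    original = match.group(0)
    # Pick any other count so the negative differs only in the number.
    choices = [w for w in NUMBER_WORDS if w != original.lower()]
    replacement = rng.choice(choices)
    if original[0].isupper():
        replacement = replacement.capitalize()
    return caption[:match.start()] + replacement + caption[match.end():]


negative = make_count_negative("Three dogs playing in the yard", random.Random(0))
print(negative)
```

The original caption and the returned string would then form a positive/negative pair for the same image.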
Imagic: Non-Rigid Real Image Editing with Text-Conditioned Diffusion Models
Bahjat Kawar
Huiwen Chang
Michal Irani
Shiran Zada
arXiv (2023) (to appear)
Text-conditioned image editing has recently attracted considerable interest. However, most methods are currently limited to simple edits (e.g., painting something on an object), are applied to synthetically generated images, or require multiple input images of a common object.
In this paper we demonstrate, for the very first time, the ability to apply complex non-rigid edits to a single real image -- i.e., change the pose of an object inside a real image, while preserving the remaining parts of the image. Our method can make a standing dog sit down or jump, cause a bird to spread its wings, etc. -- each within its single high-resolution natural image provided by the user.
Contrary to previous work, our proposed method requires only a single input image and a target text (the desired edit). It operates on real images, and does not require any additional inputs (such as image masks or additional views of the scene/object).
Our method, which we call Imagic, leverages a pre-trained text-to-image diffusion model for this task. It modifies the text embedding to satisfy both the input image and the target text, while fine-tuning the diffusion model to capture the image-specific appearance.
We demonstrate the quality and versatility of our method on numerous inputs from various domains, showcasing high quality complex image edits.
Self-Distilled StyleGAN: Towards Generation from Internet Photos
Ron Mokady
Michal Yarom
Michal Irani
Proceedings of the 49th Annual Conference on Computer Graphics and Interactive Techniques (2022)
StyleGAN is known to produce high-fidelity images, while also offering unprecedented semantic editing. However, these fascinating abilities have been demonstrated only on a limited set of datasets, which are usually structurally aligned and well curated.
In this paper, we show how StyleGAN can be adapted to work on raw uncurated images collected from the Internet. Such image collections pose two main challenges to StyleGAN: they contain many outlier images, and are characterized by a multi-modal distribution. Training StyleGAN on such raw image collections results in degraded image synthesis quality. To meet these challenges, we propose a StyleGAN-based self-distillation approach, which consists of two main components: (i) a generative-based self-filtering of the dataset to eliminate out-of-distribution images, in order to generate an adequate training set, and (ii) perceptual clustering of the generated images to detect the inherent data modalities, which are then employed to improve StyleGAN’s “truncation trick” in the image synthesis process. The presented technique enables the generation of high-quality images, while better preserving the diversity of the data. Through qualitative and quantitative evaluation, we demonstrate the power of our approach on new challenging and diverse domains collected from the Internet. New datasets and pre-trained models will be published upon acceptance.
SpeedNet: Learning the Speediness in Videos
Sagie Benaim
Michal Irani
Proc. CVPR 2020
We wish to automatically predict the "speediness" of moving objects in videos: whether they move faster than, at, or slower than their "natural" speed. The core component in our approach is SpeedNet, a novel deep network trained to detect if a video is playing at normal rate, or if it is sped up. SpeedNet is trained on a large corpus of natural videos in a self-supervised manner, without requiring any manual annotations. We show how this single, binary classification network can be used to detect arbitrary rates of speediness of objects. We demonstrate prediction results by SpeedNet on a wide range of videos containing complex natural motions, and examine the visual cues it utilizes for making those predictions. Importantly, we show that through predicting the speed of videos, the model learns a powerful and meaningful space-time representation that goes beyond simple motion cues. We demonstrate how those learned features can boost the performance of self-supervised action recognition, and can be used for video retrieval. Furthermore, we also apply SpeedNet for generating time-varying, adaptive video speedups, which can allow viewers to watch videos faster, but with less of the jittery, unnatural motions typical of videos that are sped up uniformly.
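As a rough illustration of how labels for a normal-versus-sped-up classifier can be obtained without manual annotation, the sketch below builds one clip of each class from the same frame sequence. The abstract does not say how sped-up clips are produced; temporal sub-sampling by a fixed factor, and the names `make_speed_pair`, `clip_len` and `speedup`, are assumptions made here for illustration.

```python
def make_speed_pair(frames, clip_len, speedup=2):
    """Build two labelled clips of `clip_len` frames from one frame list.

    Label 0 = normal rate (consecutive frames).
    Label 1 = sped up (every `speedup`-th frame, an assumed mechanism).
    """
    needed = clip_len * speedup
    if len(frames) < needed:
        raise ValueError("need at least clip_len * speedup frames")
    normal = frames[:clip_len]          # consecutive frames
    sped_up = frames[:needed:speedup]   # skip frames to simulate a speed-up
    return [(normal, 0), (sped_up, 1)]


# Frame indices stand in for actual video frames.
pairs = make_speed_pair(list(range(16)), clip_len=4, speedup=2)
```

Both clips have the same length, so a classifier cannot tell them apart by frame count and has to rely on the motion content.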
Semantic Pyramid for Image Generation
Assaf Shocher
Yossi Gandelsman
Michal Yarom
Michal Irani
Proc. IEEE Computer Vision and Pattern Recognition (CVPR) (2020)
We present a novel GAN-based model that utilizes the space of deep features learned by a pre-trained classification model. Inspired by classical image pyramid representations, we construct our model as a Semantic Generation Pyramid - a hierarchical framework which leverages the continuum of semantic information encapsulated in such deep features; this ranges from low level information contained in fine features to high level, semantic information contained in deeper features. More specifically, given a set of features extracted from a reference image, our model generates diverse image samples, each with matching features at each semantic level of the classification model. We demonstrate that our model results in a versatile and flexible framework that can be used in various classic and novel image generation tasks. These include: generating images with a controllable extent of semantic similarity to a reference image, and different manipulation tasks such as semantically-controlled inpainting and compositing; all achieved with the same model, with no further training.
Layered Neural Rendering for Retiming People in Video
Erika Lu
Weidi Xie
Andrew Zisserman
ACM Transactions on Graphics (Proc. SIGGRAPH Asia) (2020)
We present a method for retiming people in an ordinary, natural video: manipulating and editing the time in which different motions of individuals in the video occur. We can temporally align different motions, change the speed of certain actions (speeding up/slowing down, or entirely "freezing" people), or "erase" selected people from the video altogether. We achieve these effects computationally via a dedicated learning-based layered video representation, where each frame in the video is decomposed into separate RGBA layers, representing the appearance of different people in the video. A key property of our model is that it not only disentangles the direct motions of each person in the input video, but also correlates each person automatically with the scene changes they generate, e.g., shadows, reflections, and motion of loose clothing. The layers can be individually retimed and recombined into a new video, allowing us to achieve realistic, high-quality renderings of retiming effects for real-world videos depicting complex actions and involving multiple individuals, including dancing, trampoline jumping, or group running.
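To make the idea of retiming and recombining RGBA layers concrete, here is a minimal per-pixel sketch. It assumes layers are blended back to front with the standard "over" operator and that retiming amounts to reading each layer at its own shifted frame index; the abstract does not specify either detail, and `composite_over` and `retimed_pixel` are hypothetical helper names.

```python
def composite_over(layers):
    """Blend RGBA pixels back to front with the standard "over" operator.

    `layers` is a list of (r, g, b, a) tuples with values in [0, 1],
    ordered from the rearmost layer to the frontmost one.
    """
    r = g = b = 0.0
    for lr, lg, lb, la in layers:
        r = la * lr + (1.0 - la) * r
        g = la * lg + (1.0 - la) * g
        b = la * lb + (1.0 - la) * b
    return (r, g, b)


def retimed_pixel(layer_tracks, offsets, t):
    """Composite frame `t`, reading each layer at its own shifted time index."""
    picked = []
    for track, offset in zip(layer_tracks, offsets):
        index = min(max(t + offset, 0), len(track) - 1)  # clamp to valid frames
        picked.append(track[index])
    return composite_over(picked)


# One pixel over three frames: an opaque blue background and a red "person"
# layer that is absent at frame 0 and fully covers the pixel at frame 1.
background = [(0.0, 0.0, 1.0, 1.0)] * 3
person = [(1.0, 0.0, 0.0, 0.0), (1.0, 0.0, 0.0, 1.0), (1.0, 0.0, 0.0, 0.5)]

original = retimed_pixel([background, person], [0, 0], t=1)
delayed = retimed_pixel([background, person], [0, -1], t=1)
```

Shifting only the person layer by one frame delays that person while the background keeps its original timing.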
Learning the Depths of Moving People by Watching Frozen People
Zhengqi Li
Ce Liu
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2019)
We present a method for predicting dense depth in scenarios where both a monocular camera and people in the scene are freely moving. Existing methods for recovering depth for dynamic, non-rigid objects from monocular video impose strong assumptions on the objects' motion and often can recover only sparse depth. In this paper, we take a data-driven approach and learn human depth priors from a large corpus of data. Specifically, we use a new source of data comprised of thousands of Internet videos in which people imitate mannequins, i.e., people freeze in diverse, natural poses, while a hand-held camera is touring the scene. We then create training data using modern Multi-View Stereo (MVS) methods, and design a model that is applied to dynamic scenes at inference time. Our method makes use of motion parallax beyond a single view and shows clear advantages over state-of-the-art monocular depth prediction methods. We demonstrate the applicability of our method on real-world sequences captured by a moving hand-held camera, depicting complex human actions. We show various 3D effects such as re-focusing, creating a stereoscopic video from a monocular one, and inserting virtual objects into the scene, all produced using our predicted depth maps.
Looking to Listen at the Cocktail Party: A Speaker-Independent Audio-Visual Model for Speech Separation
ACM Transactions on Graphics (Proc. SIGGRAPH), 37 (2018)
We present a joint audio-visual model for isolating a single speech signal from a mixture of sounds such as other speakers and background noise. Solving this task using only audio as input is extremely challenging and does not provide an association of the separated speech signals with speakers in the video. In this paper, we present a deep network-based model that incorporates both visual and auditory signals to solve this task. The visual features are used to "focus" the audio on desired speakers in a scene and to improve the speech separation quality. To train our joint audio-visual model, we introduce AVSpeech, a new dataset comprised of thousands of hours of video segments from the Web. We demonstrate the applicability of our method to classic speech separation tasks, as well as real-world scenarios involving heated interviews, noisy bars, and screaming children, only requiring the user to specify the face of the person in the video whose speech they want to isolate. Our method shows a clear advantage over state-of-the-art audio-only speech separation in cases of mixed speech. In addition, our model, which is speaker-independent (trained once, applicable to any speaker), produces better results than recent audio-visual speech separation methods that are speaker-dependent (require training a separate model for each speaker of interest).
Sparse, Smart Contours to Represent and Edit Images
Ce Liu
Chuang Gan
Dilip Krishnan
Computer Vision and Pattern Recognition (2018)
We study the problem of reconstructing an image from information stored at sparse contour locations comprising less than 6% of image pixels. This extremely sparse representation provides an intuitive interface for semantically-aware image manipulation. Local edits in the contour domain translate to long-range and coherent changes in pixel space. We use generative adversarial networks to synthesize texture and structure even in regions where no input information is provided. With our setup, we can perform complex structural changes such as changing facial expression and interpolating animal fur texture by simple edits of contours such as scaling, moving and erasing. Experiments on a variety of datasets verify the versatility and convenience afforded by our models.