Inbar Mosseri
Authored Publications
Teaching CLIP to Count to Ten
Michal Irani
Roni Paiss
Shiran Zada
Submission to CVPR 2023 (2023)
Large vision-language models, such as CLIP, learn robust representations of text and images, facilitating advances in many downstream tasks, including zero-shot classification and text-to-image generation. However, these models have several well-documented limitations. They fail to encapsulate compositional concepts, such as counting objects in an image or the relations between objects.
To the best of our knowledge, this work is the first to extend CLIP to handle object counting. We introduce a simple yet effective method to improve the quantitative understanding of vision-language models, while maintaining their overall performance on common benchmarks.
Our method automatically augments image captions to create hard negative samples that differ from the original captions by only the number of objects. For example, an image of three dogs can be contrasted with the negative caption "Six dogs playing in the yard". A dedicated loss encourages discrimination between the correct caption and its negative variant.
We introduce CountBench, a new benchmark for evaluating a model's understanding of object counting, and demonstrate significant improvement over baseline models on this task. Furthermore, we leverage our improved CLIP representations for image generation, and show that our model can produce specific counts of objects more reliably than existing ones.
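The caption augmentation described in this abstract can be illustrated with a minimal sketch. It assumes that a negative caption is built by swapping the first spelled-out number word for a different one; the word list, the function name `make_count_negative`, and the two-to-ten range are illustrative assumptions, not the authors' implementation.

```python
import random
import re

# Assumed vocabulary: spelled-out counts from two to ten (hypothetical choice).
NUMBER_WORDS = ["two", "three", "four", "five", "six",
                "seven", "eight", "nine", "ten"]
_NUMBER_PATTERN = re.compile(
    r"\b(" + "|".join(NUMBER_WORDS) + r")\b", re.IGNORECASE
)


def make_count_negative(caption, rng):
    """Return `caption` with its first number word swapped for a different one.

    Returns None when the caption contains no number word, in which case
    no count-based negative can be formed.
    """
    match = _NUMBER_PATTERN.search(caption)
    if match is None:
        return None
    original = match.group(1)
    # Pick any number word other than the one already in the caption.
    candidates = [w for w in NUMBER_WORDS if w != original.lower()]
    replacement = rng.choice(candidates)
    # Preserve sentence-initial capitalization.
    if original[0].isupper():
        replacement = replacement.capitalize()
    return caption[:match.start()] + replacement + caption[match.end():]


rng = random.Random(0)
positive = "Three dogs playing in the yard"
negative = make_count_negative(positive, rng)
print(positive, "|", negative)
```

The positive caption and its count-swapped variant would then be scored against the same image, with a loss that favors the positive.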
Imagic: Non-Rigid Real Image Editing with Text-Conditioned Diffusion Models
Bahjat Kawar
Huiwen Chang
Michal Irani
Shiran Zada
arXiv (2023) (to appear)
Text-conditioned image editing has recently attracted considerable interest. However, most methods are currently limited to simple edits (e.g., painting something on an object), are applied to synthetically generated images, or require multiple input images of a common object.
In this paper we demonstrate, for the very first time, the ability to apply complex non-rigid edits to a single real image -- i.e., change the pose of an object inside a real image, while preserving the remaining parts of the image. Our method can make a standing dog sit down or jump, cause a bird to spread its wings, etc. -- each within its single high-resolution natural image provided by the user.
Contrary to previous work, our proposed method requires only a single input image and a target text (the desired edit). It operates on real images, and does not require any additional inputs (such as image masks or additional views of the scene/object).
Our method, which we call Imagic, leverages a pre-trained text-to-image diffusion model for this task. It modifies the text embedding to satisfy both the input image and the target text, while fine-tuning the diffusion model to capture the image-specific appearance.
We demonstrate the quality and versatility of our method on numerous inputs from various domains, showcasing high-quality, complex image edits.
Self-Distilled StyleGAN: Towards Generation from Internet Photos
Ron Mokady
Michal Yarom
Michal Irani
Proceedings of the 49th Annual Conference on Computer Graphics and Interactive Techniques (2022)
StyleGAN is known to produce high-fidelity images, while also offering unprecedented semantic editing. However, these fascinating abilities have been demonstrated only on a limited set of datasets, which are usually structurally aligned and well curated.
In this paper, we show how StyleGAN can be adapted to work on raw uncurated images collected from the Internet. Such image collections pose two main challenges for StyleGAN: they contain many outlier images, and are characterized by a multi-modal distribution. Training StyleGAN on such raw image collections results in degraded image synthesis quality. To meet these challenges, we propose a StyleGAN-based self-distillation approach, which consists of two main components: (i) a generative-based self-filtering of the dataset to eliminate out-of-distribution images, in order to generate an adequate training set, and (ii) perceptual clustering of the generated images to detect the inherent data modalities, which are then employed to improve StyleGAN’s “truncation trick” in the image synthesis process. The presented technique enables the generation of high-quality images, while better preserving the diversity of the data. Through qualitative and quantitative evaluation, we demonstrate the power of our approach on new challenging and diverse domains collected from the Internet. New datasets and pre-trained models will be published upon acceptance.
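As a rough illustration of a cluster-aware truncation trick, the sketch below pulls a latent vector toward the nearest cluster center rather than toward a single global mean. Choosing the nearest center by squared Euclidean distance, and the names `truncate_toward_nearest_center`, `centers`, and `psi`, are assumptions made for this example and are not taken from the paper.

```python
def truncate_toward_nearest_center(w, centers, psi):
    """Interpolate latent `w` toward its nearest cluster center.

    psi = 1.0 leaves `w` unchanged; psi = 0.0 collapses it onto the center.
    `w` and each entry of `centers` are equal-length lists of floats.
    """
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # Assumption: the relevant data mode is the closest center in latent space.
    nearest = min(centers, key=lambda c: squared_distance(w, c))
    return [c + psi * (x - c) for x, c in zip(w, nearest)]


centers = [[0.0, 0.0], [10.0, 10.0]]   # toy cluster centers
w = [8.0, 12.0]                        # toy latent vector
truncated = truncate_toward_nearest_center(w, centers, psi=0.5)
print(truncated)  # [9.0, 11.0]
```

With a single center equal to the global mean latent, this reduces to the standard truncation trick.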
Explaining in Style: Training a GAN to explain a classifier in StyleSpace
Yossi Gandelsman
Michal Yarom
Yoav Itzhak Wald
Phillip Isola
Michal Irani
Proc. ICCV 2021
Image classification models can depend on multiple different semantic attributes of the image. An explanation of the decision of the classifier needs to both discover and visualize these properties. Here we present StylEx, a method for doing this, by training a generative model to specifically explain multiple attributes that underlie classifier decisions. A natural source for such attributes is the S-space of StyleGAN, which is known to generate semantically meaningful dimensions in the image. However, these will typically not correspond to classifier-specific attributes since standard GAN training is not dependent on the classifier. To overcome this, we propose a training procedure for a StyleGAN, which incorporates the classifier model. This results in an S-space that captures distinct attributes underlying classifier outputs. After training, the model can be used to visualize the effect of changing multiple attributes per image, thus providing an image-specific explanation. We apply StylEx to multiple domains, including animals, leaves, faces and retinal images. For these, we show how an image can be changed in different ways to change its classifier prediction.
Our results show that the method finds attributes that align well with semantic ones, generates meaningful image-specific explanations, and is interpretable as measured in user studies.
Semantic Pyramid for Image Generation
Assaf Shocher
Yossi Gandelsman
Michal Yarom
Michal Irani
Proc. IEEE Computer Vision and Pattern Recognition (CVPR) (2020)
We present a novel GAN-based model that utilizes the space of deep features learned by a pre-trained classification model. Inspired by classical image pyramid representations, we construct our model as a Semantic Generation Pyramid: a hierarchical framework which leverages the continuum of semantic information encapsulated in such deep features, ranging from low-level information contained in fine features to high-level, semantic information contained in deeper features. More specifically, given a set of features extracted from a reference image, our model generates diverse image samples, each with matching features at each semantic level of the classification model. We demonstrate that our model results in a versatile and flexible framework that can be used in various classic and novel image generation tasks. These include generating images with a controllable extent of semantic similarity to a reference image, and different manipulation tasks such as semantically-controlled inpainting and compositing, all achieved with the same model, with no further training.
SpeedNet: Learning the Speediness in Videos
Sagie Benaim
Michal Irani
Proc. CVPR 2020
We wish to automatically predict the "speediness" of moving objects in videos: whether they move faster than, at, or slower than their "natural" speed. The core component in our approach is SpeedNet, a novel deep network trained to detect whether a video is playing at its normal rate or is sped up. SpeedNet is trained on a large corpus of natural videos in a self-supervised manner, without requiring any manual annotations. We show how this single, binary classification network can be used to detect arbitrary rates of speediness of objects. We demonstrate prediction results by SpeedNet on a wide range of videos containing complex natural motions, and examine the visual cues it utilizes for making those predictions. Importantly, we show that through predicting the speed of videos, the model learns a powerful and meaningful space-time representation that goes beyond simple motion cues. We demonstrate how those learned features can boost the performance of self-supervised action recognition, and can be used for video retrieval. Furthermore, we also apply SpeedNet for generating time-varying, adaptive video speedups, which can allow viewers to watch videos faster, but with less of the jittery, unnatural motions typical of videos that are sped up uniformly.
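The self-supervised setup in this abstract can be sketched as follows: a normal-speed clip and a sped-up clip are cut from the same video and labeled automatically. Simulating the speedup by keeping every second frame, and the names `make_speed_pair`, `clip_len`, and `speedup`, are assumptions for illustration rather than details confirmed by the abstract.

```python
def make_speed_pair(frames, clip_len, speedup=2):
    """Build one (clip, label) pair per class from a single frame sequence.

    Label 0 marks the normal-speed clip, label 1 the sped-up clip.
    The sped-up clip keeps every `speedup`-th frame (assumed sampling scheme).
    """
    needed = clip_len * speedup
    if len(frames) < needed:
        raise ValueError("video too short for the requested clip length")
    normal = frames[:clip_len]            # consecutive frames
    sped_up = frames[:needed:speedup]     # temporally subsampled frames
    return [(normal, 0), (sped_up, 1)]


frames = list(range(10))  # stand-in for decoded video frames
pairs = make_speed_pair(frames, clip_len=4)
print(pairs)  # [([0, 1, 2, 3], 0), ([0, 2, 4, 6], 1)]
```

No manual annotation is needed because the label follows directly from how each clip was sampled.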
Looking to Listen at the Cocktail Party: A Speaker-Independent Audio-Visual Model for Speech Separation
ACM Transactions on Graphics (Proc. SIGGRAPH), 37 (2018)
We present a joint audio-visual model for isolating a single speech signal from a mixture of sounds such as other speakers and background noise. Solving this task using only audio as input is extremely challenging and does not provide an association of the separated speech signals with speakers in the video. In this paper, we present a deep network-based model that incorporates both visual and auditory signals to solve this task. The visual features are used to "focus" the audio on desired speakers in a scene and to improve the speech separation quality. To train our joint audio-visual model, we introduce AVSpeech, a new dataset composed of thousands of hours of video segments from the Web. We demonstrate the applicability of our method to classic speech separation tasks, as well as real-world scenarios involving heated interviews, noisy bars, and screaming children, only requiring the user to specify the face of the person in the video whose speech they want to isolate. Our method shows a clear advantage over state-of-the-art audio-only speech separation in cases of mixed speech. In addition, our model, which is speaker-independent (trained once, applicable to any speaker), produces better results than recent audio-visual speech separation methods that are speaker-dependent (require training a separate model for each speaker of interest).
Looking to Listen at the Cocktail Party: Audio-visual Speech Separation
IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2018)
We present a model for isolating and enhancing the speech of desired speakers in a video. The input is a video with one or more people speaking, where the speech of interest is interfered with by other speakers and/or background noise. We leverage both audio and visual features for this task, which are fed into a joint audio-visual source separation model we designed and trained using thousands of hours of video segments with clean speech from our new dataset, AVSpeech-90K. We present results for various real, practical scenarios involving heated debates and interviews, noisy bars and screaming children, only requiring users to specify the face of the person in the video whose speech they would like to isolate.
XGAN: Unsupervised Image-to-Image Translation for Many-to-Many Mappings
Amelie Royer
Stephan Gouws
Fred Bertsch
ICML Workshop (2017)
Style transfer usually refers to the task of applying color and texture information from a specific style image to a given content image while preserving the structure of the latter. Here we tackle the more generic problem of semantic style transfer: given two unpaired collections of images, we aim to learn a mapping between the corpus-level style of each collection, while preserving semantic content shared across the two domains. We introduce XGAN ("Cross-GAN"), a dual adversarial autoencoder, which captures a shared representation of the common domain semantic content in an unsupervised way, while jointly learning the domain-to-domain image translations in both directions. We exploit ideas from the domain adaptation literature and define a semantic consistency loss which encourages the model to preserve semantics in the learned embedding space. We report promising qualitative results for the task of face-to-cartoon translation. The cartoon dataset we collected for this purpose is in the process of being released as a new benchmark for semantic style transfer.
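One way to read the semantic consistency loss mentioned above is as a penalty on how far an embedding drifts after an image is translated to the other domain and re-encoded. The sketch below assumes that reading; the mean absolute difference and the names `encode_src`, `decode_tgt`, and `encode_tgt` are hypothetical stand-ins, not the published formulation.

```python
def semantic_consistency_loss(x, encode_src, decode_tgt, encode_tgt):
    """Mean absolute difference between an embedding and its re-encoding.

    x          : input sample from the source domain (list of floats here)
    encode_src : source-domain encoder (hypothetical callable)
    decode_tgt : target-domain decoder (hypothetical callable)
    encode_tgt : target-domain encoder (hypothetical callable)
    """
    z = encode_src(x)                 # embed the source sample
    translated = decode_tgt(z)        # translate into the target domain
    z_back = encode_tgt(translated)   # re-embed the translated sample
    return sum(abs(a - b) for a, b in zip(z, z_back)) / len(z)


# Toy stand-ins: identity networks preserve the embedding exactly.
identity = lambda v: list(v)
shifted = lambda v: [a + 1.0 for a in v]

loss_same = semantic_consistency_loss([1.0, 2.0], identity, identity, identity)
loss_drift = semantic_consistency_loss([1.0, 2.0], identity, identity, shifted)
print(loss_same, loss_drift)  # 0.0 1.0
```

In a real model the three callables would be neural networks, and the same term would be computed in both translation directions.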