Ariel Ephrat

Authored Publications
    Teaching CLIP to Count to Ten
    Michal Irani
    Roni Paiss
    Shiran Zada
    Submission to CVPR 2023 (2023)
    Large vision-language models, such as CLIP, learn robust representations of text and images, facilitating advances in many downstream tasks, including zero-shot classification and text-to-image generation. However, these models have several well-documented limitations. In particular, they fail to capture compositional concepts, such as counting objects in an image or the relations between objects. To the best of our knowledge, this work is the first to extend CLIP to handle object counting. We introduce a simple yet effective method to improve the quantitative understanding of vision-language models, while maintaining their overall performance on common benchmarks. Our method automatically augments image captions to create hard negative samples that differ from the original captions only in the number of objects. For example, an image of three dogs can be contrasted with the negative caption "Six dogs playing in the yard". A dedicated loss encourages discrimination between the correct caption and its negative variant. We introduce CountBench, a new benchmark for evaluating a model's understanding of object counting, and demonstrate significant improvement over baseline models on this task. Furthermore, we leverage our improved CLIP representations for image generation, and show that our model can produce specific counts of objects more reliably than existing ones.
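    The caption augmentation described in the abstract above can be illustrated with a minimal sketch. The abstract does not say how the counting captions are detected or rewritten, so everything here is an assumption made for illustration: the function name `make_count_negative`, the restriction to the number words "two" through "ten", and the rule of swapping the first number word for a different one.

```python
import random
import re

# Assumption: only spelled-out counts from two to ten are handled.
NUMBER_WORDS = ["two", "three", "four", "five", "six",
                "seven", "eight", "nine", "ten"]
NUMBER_PATTERN = re.compile(r"\b(" + "|".join(NUMBER_WORDS) + r")\b",
                            re.IGNORECASE)


def make_count_negative(caption, rng):
    """Return a copy of `caption` whose first number word is swapped for a
    different one, or None if the caption contains no number word."""
    match = NUMBER_PATTERN.search(caption)
    if match is None:
        return None
    original = match.group(1)
    # Pick any count other than the one already in the caption.
    alternatives = [w for w in NUMBER_WORDS if w != original.lower()]
    replacement = rng.choice(alternatives)
    # Preserve sentence-initial capitalization.
    if original[0].isupper():
        replacement = replacement.capitalize()
    return caption[:match.start()] + replacement + caption[match.end():]


caption = "Three dogs playing in the yard"
negative = make_count_negative(caption, random.Random(0))
print(caption, "->", negative)
```

    Each (caption, negative) pair produced this way differs only in the stated count, which is the property the abstract asks of its hard negatives. The loss that contrasts the two captions is not sketched here.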
    We wish to automatically predict the "speediness" of moving objects in videos: whether they move faster than, at, or slower than their "natural" speed. The core component in our approach is SpeedNet, a novel deep network trained to detect whether a video is playing at its normal rate or is sped up. SpeedNet is trained on a large corpus of natural videos in a self-supervised manner, without requiring any manual annotations. We show how this single binary classification network can be used to detect arbitrary rates of speediness of objects. We demonstrate prediction results by SpeedNet on a wide range of videos containing complex natural motions, and examine the visual cues it utilizes for making those predictions. Importantly, we show that through predicting the speed of videos, the model learns a powerful and meaningful space-time representation that goes beyond simple motion cues. We demonstrate how those learned features can boost the performance of self-supervised action recognition, and can be used for video retrieval. Furthermore, we apply SpeedNet to generate time-varying, adaptive video speedups, which allow viewers to watch videos faster but with less of the jittery, unnatural motion typical of videos that are sped up uniformly.
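    The self-supervised setup in the abstract above, where labels come from the playback rate rather than from manual annotation, can be sketched as follows. This is an illustrative assumption, not the authors' pipeline: the function name `make_speed_pair`, the clip length, the 2x speedup factor, and the use of simple frame skipping to simulate a sped-up video are all hypothetical choices.

```python
import random


def make_speed_pair(frames, clip_len=8, speedup=2, rng=None):
    """Build one (normal, sped-up) training pair from a sequence of frames.

    Returns ((normal_clip, 0), (sped_up_clip, 1)); the labels are derived
    from how the clip was sampled, so no manual annotation is needed.
    """
    rng = rng or random.Random()
    needed = clip_len * speedup  # frames spanned by the sped-up clip
    if len(frames) < needed:
        raise ValueError("video too short for the requested clip length")
    start = rng.randrange(len(frames) - needed + 1)
    # Normal speed: consecutive frames.
    normal = frames[start:start + clip_len]
    # Sped up (assumed): keep every `speedup`-th frame over a longer span.
    sped_up = frames[start:start + needed:speedup]
    return (normal, 0), (sped_up, 1)


# Stand-in "video": integers play the role of frames.
video = list(range(40))
(normal, y0), (fast, y1) = make_speed_pair(video, rng=random.Random(1))
print(normal, y0)
print(fast, y1)
```

    Both clips have the same number of frames, so a binary classifier sees inputs of identical shape and has to rely on motion content to tell them apart.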
    We present a joint audio-visual model for isolating a single speech signal from a mixture of sounds such as other speakers and background noise. Solving this task using only audio as input is extremely challenging and does not provide an association of the separated speech signals with speakers in the video. In this paper, we present a deep network-based model that incorporates both visual and auditory signals to solve this task. The visual features are used to "focus" the audio on desired speakers in a scene and to improve the speech separation quality. To train our joint audio-visual model, we introduce AVSpeech, a new dataset composed of thousands of hours of video segments from the Web. We demonstrate the applicability of our method to classic speech separation tasks, as well as real-world scenarios involving heated interviews, noisy bars, and screaming children, requiring the user only to specify the face of the person in the video whose speech they want to isolate. Our method shows a clear advantage over state-of-the-art audio-only speech separation in cases of mixed speech. In addition, our model, which is speaker-independent (trained once, applicable to any speaker), produces better results than recent audio-visual speech separation methods that are speaker-dependent (requiring a separate model to be trained for each speaker of interest).
    We present a model for isolating and enhancing the speech of desired speakers in a video. The input is a video with one or more people speaking, where the speech of interest is corrupted by interference from other speakers and/or background noise. We leverage both audio and visual features for this task, which are fed into a joint audio-visual source separation model that we designed and trained using thousands of hours of video segments with clean speech from our new dataset, AVSpeech-90K. We present results for various real, practical scenarios involving heated debates and interviews, noisy bars, and screaming children, requiring users only to specify the face of the person in the video whose speech they would like to isolate.