Yonatan Bitton

Yonatan Bitton is a Research Scientist at Google Tel Aviv, working on vision-and-language generalization and multimodal consistency.
Authored Publications
    Despite recent advancements, text-to-image (T2I) models still exhibit critical limitations, such as errors in understanding spatial relationships, object counting, text rendering, and more. One challenge in overcoming these failure modes is the lack of resources; the majority of existing image-text datasets provide only brief captions that do not offer sufficient detail to reveal discrepancies between images and their descriptions. To advance the development of T2I models further, we introduce Descriptions of Connected and Contrasting Images (DOCCI), a dataset of 15k images taken by a single person with detailed human-annotated descriptions in English. We meticulously annotated detailed and coherent descriptions, averaging 136 words, which sufficiently differentiate images from related or similar ones. We intentionally curated images that showcase a diverse range of visual properties, including entities with their attributes, various orientations, and lighting effects, many of which are related to each other. We thoroughly analyze the quality and characteristics of the image-description pairs, and assess the performance of the latest T2I and I2T models. The experimental results indicate that even current state-of-the-art T2I models still struggle with the aforementioned challenges. DOCCI is publicly available, and we believe that this dataset will be a valuable benchmark for vision-language research.
    Despite the longstanding adage "an image is worth a thousand words," creating accurate and hyper-detailed image descriptions for training Vision-Language models remains challenging. Current datasets typically have web-scraped descriptions that are short, low-granularity, and often contain details unrelated to the visual content. As a result, models trained on such data generate descriptions replete with missing information, visual inconsistencies, and hallucinations. To address these issues, we introduce ImageInWords (IIW), a carefully designed human-in-the-loop annotation framework for curating hyper-detailed image descriptions and a new dataset resulting from this process. We validate the framework through evaluations focused on the quality of the dataset and its utility for fine-tuning with considerations for readability, comprehensiveness, specificity, hallucinations, and human-likeness. Our dataset significantly improves across these dimensions compared to recently released datasets (+66%) and GPT-4V outputs (+48%). Furthermore, models fine-tuned with IIW data excel by +31% against prior work along the same human evaluation dimensions. Given our fine-tuned models, we also evaluate text-to-image generation and vision-language reasoning. Our model's descriptions can generate images closest to the original, as judged by both automated and human metrics. We also find our model produces more compositionally rich descriptions, outperforming the best baseline by up to 6% on ARO, SVO-Probes, and Winoground datasets.
    q2d: Turning Questions into Dialogs to Teach Models How to Search
    Shlomi Cohen-Ganor
    Ido Hakimi
    Yoad Lewenberg
    Enav Weinreb
    arXiv (2023)
    One of the exciting capabilities of recent language models for dialog is their ability to independently search for relevant information to ground a given dialog response. However, obtaining training data to teach models how to issue search queries is time- and resource-consuming. In this work, we propose q2d: an automatic data generation pipeline that generates information-seeking dialogs from questions. We prompt a large language model (PaLM) to create conversational versions of question answering datasets, and use it to improve query generation models that communicate with external search APIs to ground dialog responses. Unlike previous approaches, which relied on human-written dialogs with search queries, our method automatically generates query-based grounded dialogs with better control and scale. Our experiments demonstrate that: (1) for query generation on the QReCC dataset, models trained on our synthetically generated data achieve 90% to 97% of the performance of models trained on the human-generated data; (2) we can successfully generate data for training dialog models in new domains without any existing dialog data, as demonstrated on the multi-hop MuSiQue and Bamboogle QA datasets; and (3) a thorough analysis of the generated dialogs shows that humans find them to be of high quality and struggle to distinguish them from human-written dialogs.
    While existing image/text alignment models reach high-quality binary assessments, they fall short of pinpointing the exact source of misalignment. In this paper, we present a method to provide detailed textual and visual explanations of detected misalignments between text/image pairs. We leverage large language models to automatically construct a training set that holds plausible misaligned captions for a given image and corresponding textual explanations and visual indicators. We also introduce a new human-curated test set comprising ground-truth textual and visual misalignment annotations. Empirical results show that fine-tuning vision-language models on our training set enables them to articulate misalignments and visually indicate them within images, outperforming strong baselines both on the binary alignment classification and the explanation generation tasks.
    VisIT-Bench: A Benchmark for Vision-Language Instruction Following Inspired by Real-World Use
    Anas Awadalla
    Hritik Bansal
    Jack Hessel
    Josh Gardner
    Ludwig Schmidt
    Rohan Taori
    Rulin Shao
    Wanrong Zhu
    NeurIPS 2023, Datasets and Benchmarks (2023)
    We introduce VisIT-Bench (Visual Instruction Benchmark), a benchmark for evaluation of instruction-following vision-language models for real-world use. Our starting point is curating 70 "instruction families" that we envision instruction-tuned vision-language models should be able to address. Extending beyond evaluations like VQAv2 and COCO, tasks range from basic recognition to game playing and creative generation. Following curation, our dataset comprises 592 test queries, each with a human-authored instruction-conditioned caption. These descriptions surface instruction-specific factors, e.g., for an instruction asking about the accessibility of a storefront for wheelchair users, the instruction-conditioned caption describes ramps/potential obstacles. These descriptions enable 1) collecting human-verified reference outputs for each instance; and 2) automatic evaluation of candidate multimodal generations using a text-only LLM, aligning with human judgment. We quantify quality gaps between models and references using both human and automatic evaluations; e.g., the top-performing instruction-following model wins against the GPT-4 reference in just 27% of the comparisons. VisIT-Bench is dynamic: to participate, practitioners simply submit their model's response on the project website. Data, code, and leaderboard are available at https://visit-bench.github.io/.
    Text-to-image (T2I) generation methods are widely popular in generating art and other creative artifacts. While hallucination can be a positive factor in scenarios where creativity is appreciated, such artifacts are poorly suited for tasks where the generated image needs to be grounded in a strict manner, e.g., as an illustration of a task, an action, or in the context of a story. In this paper, we propose to strengthen the factual consistency properties of T2I methods in the presence of natural prompts. First, we cast the problem as a machine translation (MT) problem that translates natural prompts into visual prompts. Then we filter the image with a VQA approach where we answer a set of questions in the visual domain (the image) and in the natural language domain (the natural prompt). Finally, to measure the alignment of answers, we depart from the recent literature that does string matching, and compare answers in an embedding space that assesses the semantic and entailment associations between a natural prompt and its generated image.
    Automatically determining whether a text and a corresponding image are semantically aligned is a significant challenge for vision-language models, with applications in generative text-to-image and image-to-text tasks. In this work, we study methods for automatic image-text alignment evaluation. We first introduce a comprehensive evaluation set spanning multiple datasets from both text-to-image and image-to-text generation tasks, with human judgements for whether a given text-image pair is semantically aligned. We then describe two automatic methods to determine alignment: the first involving a pipeline based on question generation and visual question answering models, and the second employing an end-to-end classification approach based on synthetic data generation. Both methods surpass prior approaches in various text-image alignment tasks, with our analysis showing significant improvements in challenging cases that involve complex composition or unnatural images. Finally, we demonstrate how our approaches can localize specific misalignments between an image and a given text, and how they can be used to automatically re-rank candidates in text-to-image generation.
    q2d: Automatic Dialog Generation to Improve Models' Query Generation
    Enav Weinreb
    Ido Hakimi
    Shlomi Cohen-Ganor
    Yoad Lewenberg
    EMNLP 2023 (2023)
    We propose q2d: an automatic data generation pipeline that generates information-seeking dialogues based on questions. We apply our method to create conversational versions of question answering datasets, which we release as a new dataset. We use this data to improve query generation models, which communicate with external search APIs to generate factual responses. Unlike previous approaches, which relied on human annotators, our method automatically generates labeled dialogues with better control and scale. In experiments, we demonstrate that: (1) models trained on our synthetic data produce results comparable to those trained on natural data; (2) our generated datasets are effective as a benchmark and as a training signal that generalizes to human-annotated test sets. We also provide an extensive analysis of the quality and factuality of the generated datasets. Our studies indicate that our automatic dialogue generation pipeline is effective at improving query generation and factuality.
    The alignment of diverse data modalities, especially video and text, is a significant challenge in AI. This study introduces VideoCon, a novel dataset for robust video-language alignment evaluation. It provides contrast captions for originally matched video-caption pairs, complemented with natural language explanations (NLEs) that delineate the differences between the video and the contrast captions. Notably, VideoCon emphasizes temporally challenging scenarios to enhance the robustness of evaluations. To address misalignments observed in previous models, we propose AlignVideo, a video-language model trained on VideoCon that demonstrates enhanced alignment capabilities. Experiments reveal that AlignVideo surpasses existing baselines in video-text alignment and generates more precise NLEs. Moreover, it showcases state-of-the-art performance in zero-shot downstream tasks, emphasizing complex video understanding, such as action recognition and temporal event sequencing. Our work paves the way for advancements in video-text alignment evaluation and model development.