Jump to Content

Kenji Hata

Authored Publications
Google Publications
Other Publications
Sort By
  • Title
  • Title, descending
  • Year
  • Year, descending
    Visual Program Tuning: Training Large Multimodal Models to Reason like Programs
    Yushi Hu
    Krishna Viswanathan
    Enming Luo
    Ranjay Krishna
    Ariel Fuxman
    Conference on Computer Vision and Pattern Recognition (2024)
    Preview abstract Solving complex visual tasks (e.g., “Who invented the musical instrument on the right?”) involves back-and-forth between visual processing and reasoning. Visual programming is a recent multimodal framework that has shown promise in conducting visual reasoning in an interpretable and compositional manner. However, this framework is error-prone—it can lead to a wrong answer whenever the program itself is wrong, or when any of the steps of the program are solved incorrectly, thus leading to worse overall performance than end-to-end systems trained with labeled data. Moreover, it is inefficient to involve multiple steps (i.e., generating and then running programs) during inference. Ideally, a single large multimodal model (LMM) should directly conduct similar reasoning and yield the correct answer. In this work, we propose Visual Program Tuning (VPT), which leverages visual programs for teaching LLMs to reason via instruction tuning. VPT rewrites the execution traces of visual programs as chain-of-thought reasoning steps, and tunes an LMM to output not only the label but its reasoning as well. Extensive experiments on complex vision tasks show that models trained with VPT achieve state-of-the-art accuracy while being able to produce interpretable and faithful reasoning steps. PaLI-X + VPT outperforms all existing LMMs on a wide range of visual tasks, improving performance on counting, spatial relations, and compositional reasoning tasks. VPT is also helpful for quick adaptation on new tasks. Our experiments on content moderation show that fine-tuning LMMs with program-augmented examples is more sample efficient than traditional supervised training. View details
    No Results Found