Visual Planning: Let’s Think Only with Images
Abstract
Recent advancements in Large Language Models (LLMs) and their multimodal extensions (MLLMs) have significantly enhanced machine reasoning across diverse tasks. However, these models predominantly rely on language as the medium for both expressing and structuring reasoning, even when visual information is present. In this work, we argue that language may not always be the most natural or effective modality for reasoning, particularly in tasks involving spatial, geometric, or physical dynamics. Motivated by this, we propose a new paradigm, Visual Planning, which enables planning through purely visual representations, independent of textual mediation. In this paradigm, planning is executed via sequences of images that encode step-by-step inference in the visual domain, akin to how humans sketch or visualize future actions. We further introduce a novel two-stage reinforcement learning framework based on Group Relative Policy Optimization (GRPO) for post-training large vision models, which yields substantial improvements in planning accuracy and generalization to both seen and novel scenarios, validated on two representative visual navigation tasks, FrozenLake and Maze. Our results establish Visual Planning as a viable and promising alternative to language-based reasoning, opening new avenues for tasks that benefit from intuitive, image-based inference.
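As a rough illustration of the GRPO-style group-relative update underlying such post-training, consider the following minimal Python sketch. It is not the authors' implementation: the reward values, rollout structure, and tensor shapes are hypothetical placeholders assumed for clarity.

```python
# Illustrative sketch of a GRPO-style update for a visual planner (assumptions,
# not the paper's code): rewards come from whether a sampled visual plan
# rollout reaches the goal, and log-probabilities come from the policy model.
import torch


def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Normalize rewards within a group of rollouts sampled for the same state.

    rewards: shape (G,), one scalar reward per sampled plan rollout.
    Returns advantages of shape (G,): (r - mean) / (std + eps).
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)


def grpo_clipped_loss(logp_new: torch.Tensor,
                      logp_old: torch.Tensor,
                      advantages: torch.Tensor,
                      clip_eps: float = 0.2) -> torch.Tensor:
    """PPO-style clipped surrogate objective averaged over the group.

    logp_new / logp_old: shape (G,), log-probabilities of each rollout under
    the current and behavior policies (summed over the generated visual tokens).
    """
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Maximize the surrogate objective, so minimize its negation.
    return -torch.min(unclipped, clipped).mean()


# Toy usage: four sampled plan rollouts for one environment state,
# rewarded 1.0 if the plan reaches the goal, else 0.0.
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0])
adv = group_relative_advantages(rewards)
logp_old = torch.tensor([-5.2, -4.8, -6.1, -5.0])
logp_new = logp_old + 0.1  # pretend the policy shifted slightly after an update
loss = grpo_clipped_loss(logp_new, logp_old, adv)
print(adv, loss.item())
```

Successful rollouts receive positive group-relative advantages and failed ones negative, so the update pushes the vision model toward image sequences that reach the goal, without any value network or textual supervision.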