Thinking with Imagination

Agentic Visual Spatial Reasoning with World Simulators

Chenming Zhu (1,2,*), Jingli Lin (2,3,*), Yilin Long (2,4), Peizhou Cao (2,5), Jiangmiao Pang (2), Tai Wang (2,‡), Xihui Liu (1,‡)
(1) The University of Hong Kong; (2) Shanghai AI Laboratory; (3) Shanghai Jiao Tong University; (4) Fudan University; (5) Beihang University
* Equal contribution; ‡ Corresponding author
Policy & Reasoner: Astra-VL decides when to imagine, plans the camera-motion query, and grounds the returned view before answering.
World Simulator: Astra-WM generates action-conditioned novel views from context images and natural-language camera-motion instructions.
Two-Phase RL: A simulator-in-the-loop curriculum first teaches valid tool interaction, then rewards selective imagination.
Spatial Reasoning Gains: Astra improves Qwen3-VL-8B from 29.8 to 38.8 on MMSI-Bench and from 36.8 to 42.7 on MindCube-Tiny.

Overview

Figure 1: Active imagination for spatial reasoning. Astra casts spatial reasoning as an action-perception loop for visual evidence acquisition: a VLM queries a world simulator for a missing view, inspects the imagined observation, and grounds its answer in the updated visual context.

Spatial reasoning from multi-view images often requires evidence that is not visible in the given observations. Astra studies this problem as thinking with imagination: a VLM actively queries a world simulator for missing viewpoints and uses the returned imagined observations to answer spatial questions. The system couples Astra-VL, an RL-trained Qwen3-VL-based policy, with Astra-WM, a Bagel-based simulator trained for view consistency.

This framing resonates with enactive views of perception: understanding is not only internal processing over fixed inputs, but an active process of seeking the observations needed for skillful behavior. Astra brings this idea to VLMs through simulator-mediated interaction, turning visual spatial reasoning into active visual evidence acquisition.


Thinking with Imagination

Why imagination? When only limited egocentric observations are available, complex spatial understanding remains challenging, and conventional text-oriented chain-of-thought often provides little benefit. Instead of reasoning only over fixed images, Astra lets the policy perform a controlled perceptual action: it asks for a camera-motion-conditioned view and uses the returned observation as new visual evidence.

Figure 2: Astra framework. Astra-VL closes the action-perception loop by planning simulator calls and answering questions, while Astra-WM maps camera-motion instructions to novel visual observations.

Astra-WM. A useful world simulator must preserve scene identity, follow the requested motion, and maintain relative layout across views. In this sense, Astra-WM learns a visual sensorimotor contingency: how a camera action should transform the observation while keeping the underlying scene stable. View consistency tuning makes the generated view reliable spatial evidence, not merely a plausible image.

Astra-VL. Simulator access alone is not enough. The policy must learn when to invoke the simulator, what motion to request, and how to ground the imagined observation. Astra-VL is trained with a two-phase simulator-in-the-loop RL curriculum: the first phase teaches valid tool interaction, while the second encourages selective imagination by comparing tool-augmented reasoning with direct answering. The result is a policy that uses imagination as a selective reasoning action rather than as a mandatory preprocessing step.
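The action-perception loop described above can be sketched in a few lines. This is only an illustrative skeleton, not the released Astra code: `Step`, `run_episode`, `toy_policy`, `toy_simulator`, and the `max_calls` budget are hypothetical names, and plain strings stand in for images.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Step:
    """One policy decision: request an imagined view or commit to an answer."""
    motion: Optional[str] = None  # camera-motion instruction for the simulator
    answer: Optional[str] = None  # final answer; set when the policy is done


def run_episode(
    policy: Callable[[List[str], str, bool], Step],
    simulator: Callable[[List[str], str], str],
    views: List[str],
    question: str,
    max_calls: int = 2,  # assumed call budget, not a value from the paper
) -> Tuple[str, int]:
    """Alternate policy decisions and simulator calls until an answer is given."""
    context = list(views)  # observed views; imagined views are appended
    calls = 0
    while True:
        step = policy(context, question, calls < max_calls)
        if step.answer is not None:
            return step.answer, calls
        if step.motion is None or calls >= max_calls:
            raise ValueError("policy must answer once the call budget is spent")
        context.append(simulator(context, step.motion))
        calls += 1


# Toy stand-ins so the loop can be run end to end.
def toy_policy(context, question, can_imagine):
    # Imagine once if only the two observed views are available, then answer.
    if can_imagine and len(context) == 2:
        return Step(motion="Based on the last image, turn 60 degrees to the left.")
    return Step(answer="left")


def toy_simulator(context, motion):
    return f"imagined view after: {motion}"


answer, calls = run_episode(
    toy_policy, toy_simulator, ["view_1", "view_2"], "Where is the sofa?"
)
```

In this sketch the simulator call is optional: a policy that returns an answer on its first step never touches the simulator, which is the behavior the second training phase is meant to encourage when the observed views already suffice.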

Action-Conditioned World Simulator

Astra-WM is not used as an open-ended image generator. It receives context images and a natural-language camera-motion instruction, then synthesizes a novel observation that should stay consistent with the same scene. This makes the simulator a mechanism for controlled visual intervention: the policy changes its perceptual input by choosing an action, then judges whether the imagined observation provides useful spatial evidence.
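One way to picture the instruction side of this interface is a small helper that turns a structured camera motion into a natural-language query. The sketch below is an assumption for illustration: `CameraMotion` and `to_instruction` are hypothetical names, the sign convention (positive means right or up) is ours, and the wording simply mirrors example actions shown on this page, such as "Based on the last image, pitch up by 40 degrees."

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CameraMotion:
    axis: str       # "yaw" or "pitch" (hypothetical vocabulary)
    degrees: float  # signed angle: positive = right / up (assumed convention)


def to_instruction(motion: CameraMotion) -> str:
    """Render a camera motion as a natural-language simulator query."""
    if motion.axis not in ("yaw", "pitch"):
        raise ValueError(f"unsupported axis: {motion.axis!r}")
    if motion.degrees == 0:
        raise ValueError("a zero-degree motion requests no new view")
    angle = f"{abs(motion.degrees):g}"  # drop a trailing ".0" for whole angles
    if motion.axis == "yaw":
        direction = "right" if motion.degrees > 0 else "left"
        return f"Based on the last image, turn {angle} degrees to the {direction}."
    direction = "up" if motion.degrees > 0 else "down"
    return f"Based on the last image, pitch {direction} by {angle} degrees."


yaw_left = to_instruction(CameraMotion("yaw", -60))
pitch_up = to_instruction(CameraMotion("pitch", 40))
```

Keeping the motion structured until the last moment makes it easy to validate a request before it reaches the simulator; the actual instruction format accepted by Astra-WM may be broader than these two templates.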

Qualitative comparison grid. Columns: Context, Action, Nano-Banana, Qwen-Image-Edit, Bagel, Astra-WM, GT (ground truth). Each row shows two context images, the action instruction, each model's predicted view, and the ground-truth view.
Case 1955: Based on the last image, pitch up by 40 degrees.
Case 2252: Based on the last image, rotate 60 degrees to the right.
Case 3674: Based on the last image, turn 60 degrees to the left.
Case 958: Based on the last image, turn 36 degrees to the right.
Case 3932: Based on the last image, rotate to the right by 30 degrees.

Qualitative comparison. Given the same context images and text action, generic image editors may produce plausible images but often fail to preserve the intended viewpoint change or scene layout. Astra-WM is trained as an action-conditioned world simulator, so the generated observation is expected to follow the camera motion while remaining comparable to the target view.

Simulator                      Pose Cons.     Content Cons.         MMSI-Bench   PR.
Gemini-3-Flash, no simulator   --             --                    45.1         45.8
+ Bagel                        9.0 / 3.0      35.6 / 39.6 / 10.2    45.8         46.9
+ Astra-WM (30k)               72.5 / 70.5    53.2 / 56.0 / 23.0    46.3         47.1
+ Astra-WM (60k)               69.0 / 75.0    53.4 / 56.1 / 23.4    49.5         50.4

Simulator ablation. Scores follow the paper's pose-consistency and content-consistency evaluation. Content consistency is reported as object precision / recall / topology.

Finding 1

Spatial consistency matters. Plausible generation is not enough: only an action-faithful world simulator turns imagined views into reliable spatial evidence and reasoning gains.

Analysis. This ablation shows why Astra-WM needs task-specific training. Off-the-shelf Bagel is visually plausible but weak at following the requested motion and preserving layout, so its reasoning gain is small. View consistency tuning turns generated views into more reliable spatial evidence: the best Astra-WM variant raises simulator-augmented Gemini-3-Flash from 45.1 to 49.5 on MMSI-Bench and improves positional-relation accuracy from 45.8 to 50.4. This is why we present Astra-WM as an action-conditioned world simulator, not as generic novel-view synthesis.
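The gains quoted in this analysis follow directly from the ablation table. The short script below recomputes them; the scores are copied from the table, while the variable names are ours.

```python
# (MMSI-Bench, PR.) scores copied from the simulator ablation table.
scores = {
    "no simulator": (45.1, 45.8),
    "Bagel": (45.8, 46.9),
    "Astra-WM (30k)": (46.3, 47.1),
    "Astra-WM (60k)": (49.5, 50.4),
}

base_mmsi, base_pr = scores["no simulator"]

# Gain of each simulator over the no-simulator baseline, rounded to one
# decimal to match the table's precision and hide floating-point noise.
gains = {
    name: (round(mmsi - base_mmsi, 1), round(pr - base_pr, 1))
    for name, (mmsi, pr) in scores.items()
    if name != "no simulator"
}

for name, (d_mmsi, d_pr) in gains.items():
    print(f"{name}: +{d_mmsi} MMSI-Bench, +{d_pr} PR.")
```

Off-the-shelf Bagel adds 0.7 points on MMSI-Bench, whereas the best Astra-WM variant adds 4.4 points overall and 4.6 points on positional relations.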

Qualitative Visualization

Successful imagination

Imagination provides missing viewpoint evidence

The original views leave the spatial relation uncertain. Astra invokes the world simulator with a targeted camera motion, then uses the returned view to confirm the answer.

Case 1: Observation 1 and Observation 2 (input views), followed by the complete Visual-CoT trajectory.

Analysis. These trajectories should be read as a reasoning-chain diagnostic. A successful case is not just "tool used"; the policy must identify the missing spatial evidence, issue an informative camera motion, ground the simulator output relative to the reference image, and then update the answer. The failure analysis in the paper highlights the same bottlenecks: uninformative actions, spatially inconsistent simulator outputs, and useful imagined observations that are ignored or misused.

Results

Experiments show that useful imagination requires both reliable simulation and a policy that controls simulator use. Forced tool use can hurt open-source VLMs, while Astra's learned agentic tool use improves spatial reasoning.

Setting            Model                        MMSI-Bench   MindCube-Tiny
Direct Answer      Qwen3-VL-8B-Instruct         29.8         36.8
Forced Tool-Use    Qwen3-VL-8B-Instruct         28.6         27.6
Agentic Tool-Use   Astra                        38.8         42.7
Forced Tool-Use    Gemini-3-Flash + Astra-WM    49.5         72.7

Table 1: Main benchmark results. Astra improves the Qwen3-VL-8B backbone by learning when and how to use the simulator.

Finding 2

Simulator use must be learned. Strong proprietary VLMs can benefit from Astra-WM directly, while open-source VLMs need agentic training to decide when and how to imagine.

Analysis. The main result separates "having a simulator" from "knowing how to use it." Forced tool-use degrades Qwen3-VL-8B from 29.8 to 28.6 on MMSI-Bench and from 36.8 to 27.6 on MindCube-Tiny, showing that imagined observations can introduce noise when the policy is not trained to choose actions or ground generated views. Astra instead learns an agentic workflow and improves the same backbone to 38.8 and 42.7.

Training Recipe      Tool Rate   Calls / Row   PR.    Attr.   Mot.   MSR    All
Tool-gain only       4.9         0.049         36.5   36.8    28.4   31.2   34.3
Usage bonus only     98.1        1.400         39.4   38.7    30.2   32.5   36.1
Phase 1 only         98.0        1.120         40.1   39.2    30.8   32.9   36.8
Phase 1 -> Phase 2   61.5        0.780         42.3   41.0    32.1   33.6   38.8

Table 2: Two-phase RL curriculum. The full curriculum obtains the best MMSI-Bench score while reducing overuse relative to usage-bonus or Phase-1-only training.

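The two usage columns in Table 2 can be read as simple statistics over per-example simulator call counts. The sketch below assumes that Tool Rate is the percentage of evaluation rows with at least one simulator call and that Calls / Row is the mean number of calls over all rows; these definitions are our reading of the column names, and `tool_usage_stats` is a hypothetical helper.

```python
from typing import Sequence, Tuple


def tool_usage_stats(calls_per_row: Sequence[int]) -> Tuple[float, float]:
    """Return (tool rate in percent, mean simulator calls per row)."""
    n = len(calls_per_row)
    if n == 0:
        raise ValueError("need at least one evaluation row")
    # Assumed definition: a row counts toward the tool rate if it made any call.
    tool_rate = 100.0 * sum(1 for c in calls_per_row if c > 0) / n
    # Assumed definition: calls are averaged over all rows, including zero-call rows.
    mean_calls = sum(calls_per_row) / n
    return tool_rate, mean_calls


# Four toy rows: two answer directly, one calls once, one calls twice.
toy_rate, toy_calls = tool_usage_stats([0, 1, 2, 0])
```

Under this reading, the tool-gain-only row (4.9, 0.049) would correspond to one call per tool-using example, while the usage-bonus row (98.1, 1.400) would imply repeated calls on many examples.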
Finding 3

Selective imagination beats tool overuse. Rewarding tool calls alone leads to excessive simulator use, while the two-phase curriculum learns when imagined evidence is actually helpful.

Analysis. The two phases solve opposite failure modes. A sparse tool-gain reward collapses toward direct answering, with only a 4.9% tool-call rate. A dense usage bonus prevents collapse but over-corrects into near-universal simulator use. Phase 1 teaches valid tool interaction; Phase 2 changes the objective from "call the simulator" to "call it only when imagined evidence helps." This is why the strongest result uses fewer calls than the overuse baselines but achieves higher accuracy.
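To make the contrast between the two phases concrete, here is a minimal reward sketch. It is not the reward used in the paper: the functional forms, the `usage_bonus` and `gain_weight` values, and both function names are assumptions chosen only to illustrate a dense usage bonus versus a comparison with direct answering.

```python
def phase1_reward(correct: bool, valid_call: bool, usage_bonus: float = 0.2) -> float:
    """Phase 1 (assumed form): correctness plus a dense bonus for a valid call."""
    return float(correct) + (usage_bonus if valid_call else 0.0)


def phase2_reward(
    correct_with_tool: bool,
    correct_direct: bool,
    used_tool: bool,
    gain_weight: float = 0.5,
) -> float:
    """Phase 2 (assumed form): correctness plus a tool-gain term.

    The gain compares the tool-augmented answer with the direct answer on the
    same question, so a call is rewarded only when imagined evidence helps and
    penalized when it turns a correct direct answer into a wrong one.
    """
    if not used_tool:
        return float(correct_direct)
    gain = float(correct_with_tool) - float(correct_direct)
    return float(correct_with_tool) + gain_weight * gain


helpful_call = phase2_reward(correct_with_tool=True, correct_direct=False, used_tool=True)
harmful_call = phase2_reward(correct_with_tool=False, correct_direct=True, used_tool=True)
no_call = phase2_reward(correct_with_tool=False, correct_direct=True, used_tool=False)
```

With these assumed weights, a helpful call scores 1.5, a harmful call scores -0.5, and answering directly when that is already correct scores 1.0, so the policy has no incentive to call the simulator by default.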

Figure 3: Workflow mode ablation. Agentic tool use preserves gains on camera-centric relations while avoiding unnecessary generated views for object- or region-centric questions.
Finding 4

Imagination helps when evidence is viewpoint-dependent. Agentic tool use keeps the camera-centric gains while avoiding generated views when the original context is already sufficient.

Analysis. The workflow-mode ablation explains why agentic control matters at inference time. Forced tool-use helps camera-centric relations, where alternative viewpoints often resolve the question, but it can hurt object- and region-centric relations where the original context is already sufficient. Agentic tool-use keeps the camera-centric gains while reducing unnecessary imagination, improving positional-relation accuracy from 36.4 in direct-answer mode and 40.1 in forced mode to 42.3.

BibTeX

@misc{zhu2026thinkingimaginationagenticvisual,
      title={Thinking with Imagination: Agentic Visual Spatial Reasoning with World Simulators}, 
      author={Chenming Zhu and Jingli Lin and Yilin Long and Peizhou Cao and Tai Wang and Jiangmiao Pang and Xihui Liu},
      year={2026},
      eprint={2606.06476},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2606.06476}, 
}