Overview

**Figure 1: Active imagination for spatial reasoning.** Astra casts spatial reasoning as an action-perception loop for visual evidence acquisition: a VLM queries a world simulator for a missing view, inspects the imagined observation, and grounds its answer in the updated visual context.

Spatial reasoning from multi-view images often requires evidence that is not visible in the given observations. Astra studies this problem as thinking with imagination: a VLM actively queries a world simulator for missing viewpoints and uses the returned imagined observations to answer spatial questions. The system couples Astra-VL, an RL-trained Qwen3-VL-based policy, with Astra-WM, a Bagel-based simulator trained for view consistency.

This framing resonates with enactive views of perception: understanding is not only internal processing over fixed inputs, but an active process of seeking the observations needed for skillful behavior. Astra brings this idea to VLMs through simulator-mediated interaction, turning visual spatial reasoning into active visual evidence acquisition.


Thinking with Imagination

Why imagination? When only limited egocentric observations are available, complex spatial understanding remains challenging, and conventional text-oriented chain-of-thought often provides little benefit. Instead of reasoning only over fixed images, Astra lets the policy perform a controlled perceptual action: it asks for a camera-motion-conditioned view and uses the returned observation as new visual evidence.

**Figure 2: Astra framework.** Astra-VL closes the action-perception loop by planning simulator calls and answering questions, while Astra-WM maps camera-motion instructions to novel visual observations.

Astra-WM. A useful world simulator must preserve scene identity, follow the requested motion, and maintain relative layout across views. In this sense, Astra-WM learns a visual sensorimotor contingency: how a camera action should transform the observation while keeping the underlying scene stable. View consistency tuning makes the generated view reliable spatial evidence, not merely a plausible image.

Astra-VL. Simulator access alone is not enough. The policy must learn when to invoke the simulator, what motion to request, and how to ground the imagined observation. Astra-VL is trained with a two-phase simulator-in-the-loop RL curriculum: the first phase teaches valid tool interaction, while the second encourages selective imagination by comparing tool-augmented reasoning with direct answering. The result is a policy that uses imagination as a selective reasoning action rather than as a mandatory preprocessing step.
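The two-phase idea can be illustrated with a small reward sketch. This is not the paper's reward: the function names, the additive form, and the coefficients `usage_bonus` and `gain_weight` are all assumptions chosen for illustration. Phase 1 is modeled as a bonus for any valid simulator call, and Phase 2 as a bonus proportional to how much tool-augmented rollouts outperform direct-answer rollouts on the same question.

```python
from typing import Sequence


def mean(xs: Sequence[bool]) -> float:
    """Fraction of True values; 0.0 for an empty sequence."""
    return sum(xs) / len(xs) if xs else 0.0


def phase1_reward(correct: bool, valid_tool_call: bool,
                  usage_bonus: float = 0.25) -> float:
    # Assumed Phase 1 shape: answer correctness plus a dense bonus for any
    # well-formed simulator call, so the policy learns the tool format.
    return float(correct) + (usage_bonus if valid_tool_call else 0.0)


def tool_gain(tool_correct: Sequence[bool],
              direct_correct: Sequence[bool]) -> float:
    # Assumed comparison: accuracy of tool-augmented rollouts minus accuracy
    # of direct-answer rollouts sampled for the same question.
    if not tool_correct or not direct_correct:
        return 0.0
    return mean(tool_correct) - mean(direct_correct)


def phase2_reward(correct: bool, used_tool: bool, gain: float,
                  gain_weight: float = 0.5) -> float:
    # Assumed Phase 2 shape: only rollouts that called the simulator are
    # rewarded or penalized by the measured gain, so calling the tool pays
    # off only on questions where imagined evidence helps.
    base = float(correct)
    return base + gain_weight * gain if used_tool else base
```

Under this sketch, a question where three of four tool rollouts are correct but only one of four direct rollouts is correct yields a gain of 0.5, which raises the reward of tool-using rollouts; a question where the tool does not help yields a zero or negative gain, which discourages the call.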

Action-Conditioned World Simulator

Astra-WM is not used as an open-ended image generator. It receives context images and a natural-language camera-motion instruction, then synthesizes a novel observation that should stay consistent with the same scene. This makes the simulator a mechanism for controlled visual intervention: the policy changes its perceptual input by choosing an action, then judges whether the imagined observation provides useful spatial evidence.
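A minimal sketch of this action-perception loop is shown below. The interfaces are hypothetical: `Step`, `imagine_and_answer`, the `allow_tool` flag, and the `max_calls` budget are illustrative names rather than the project's API, and views are represented as plain strings instead of images.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Step:
    """One policy decision: a simulator query or a final answer."""
    action: Optional[str] = None  # camera-motion instruction, if querying
    answer: Optional[str] = None  # final answer, if stopping


# Hypothetical call signatures for the policy and the simulator.
Policy = Callable[[List[str], str, bool], Step]
Simulator = Callable[[List[str], str], str]


def imagine_and_answer(policy: Policy, simulator: Simulator,
                       views: List[str], question: str,
                       max_calls: int = 2) -> Tuple[Optional[str], List[str]]:
    context = list(views)          # observed views plus imagined ones
    actions: List[str] = []
    for _ in range(max_calls):
        step = policy(context, question, True)
        if step.action is None:    # policy chose to answer directly
            return step.answer, actions
        # Controlled visual intervention: the chosen action changes the
        # policy's perceptual input by appending an imagined view.
        context.append(simulator(context, step.action))
        actions.append(step.action)
    # Budget exhausted: ask for an answer with tool use disabled.
    return policy(context, question, False).answer, actions


def toy_policy(context: List[str], question: str, allow_tool: bool) -> Step:
    # Toy stand-in: request one extra view, then answer.
    if allow_tool and len(context) < 3:
        return Step(action="turn 60 degrees to the left")
    return Step(answer="left")


def toy_simulator(context: List[str], action: str) -> str:
    # Toy stand-in: a real simulator would synthesize an image.
    return f"imagined({context[-1]} | {action})"
```

With the toy stand-ins, `imagine_and_answer(toy_policy, toy_simulator, ["img0", "img1"], "q")` issues one simulator call and then answers; setting `max_calls=0` reduces the loop to direct answering.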

[Qualitative comparison figure. Columns: Context, Action, Nano-Banana, Qwen-Image-Edit, Bagel, Astra-WM, GT. Action prompts shown:]

Based on the last image, pitch up by 40 degrees.
Based on the last image, rotate 60 degrees to the right.
Based on the last image, turn 60 degrees to the left.
Based on the last image, turn 36 degrees to the right.
Based on the last image, rotate to the right by 30 degrees.

Qualitative comparison. Given the same context images and text action, generic image editors may produce plausible images but often fail to preserve the intended viewpoint change or scene layout. Astra-WM is trained as an action-conditioned world simulator, so the generated observation is expected to follow the camera motion while remaining comparable to the target view.
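The action prompts in the comparison share a regular pattern: a verb, a direction, and an angle in degrees. The sketch below parses such prompts into a structured motion, which could be useful when checking whether a generated view follows the requested motion. The `CameraMotion` type, the `parse_motion` helper, and the sign convention (right and up positive) are assumptions for illustration; the source does not describe how Astra-WM consumes the text.

```python
import re
from typing import NamedTuple


class CameraMotion(NamedTuple):
    axis: str       # "yaw" for left/right, "pitch" for up/down
    degrees: float  # signed angle; right and up are positive (assumed)


_DEGREES = re.compile(r"(\d+(?:\.\d+)?)\s*degrees", re.IGNORECASE)
_DIRECTION = re.compile(r"\b(left|right|up|down)\b", re.IGNORECASE)
_AXIS_AND_SIGN = {
    "right": ("yaw", 1.0),
    "left": ("yaw", -1.0),
    "up": ("pitch", 1.0),
    "down": ("pitch", -1.0),
}


def parse_motion(instruction: str) -> CameraMotion:
    """Extract axis and signed angle from a camera-motion instruction."""
    angle = _DEGREES.search(instruction)
    direction = _DIRECTION.search(instruction)
    if angle is None or direction is None:
        raise ValueError(f"cannot parse camera motion: {instruction!r}")
    axis, sign = _AXIS_AND_SIGN[direction.group(1).lower()]
    return CameraMotion(axis, sign * float(angle.group(1)))
```

For example, `parse_motion("Based on the last image, turn 60 degrees to the left.")` returns a yaw of -60 degrees under the assumed sign convention, regardless of whether the angle appears before or after the direction.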

Simulator	Pose Cons.	Content Cons.	MMSI-Bench	PR.
Gemini-3-Flash, no simulator	--	--	45.1	45.8
+ Bagel	9.0 / 3.0	35.6 / 39.6 / 10.2	45.8	46.9
+ Astra-WM_30k	72.5 / 70.5	53.2 / 56.0 / 23.0	46.3	47.1
+ Astra-WM_60k	69.0 / 75.0	53.4 / 56.1 / 23.4	49.5	50.4

Simulator ablation. Scores follow the paper's pose-consistency and content-consistency evaluation. Content consistency is reported as object precision / recall / topology. PR. is positional-relation accuracy.

Finding 1

Spatial consistency matters. Plausible generation is not enough: only an action-faithful world simulator turns imagined views into reliable spatial evidence and reasoning gains.

Analysis. This ablation shows why Astra-WM needs task-specific training. Off-the-shelf Bagel is visually plausible but weak at following the requested motion and preserving layout, so its reasoning gain is small. View consistency tuning turns generated views into more reliable spatial evidence: the best variant, Astra-WM_60k, raises Gemini-3-Flash from 45.1 without a simulator to 49.5 on MMSI-Bench and improves positional-relation accuracy from 45.8 to 50.4. This is why we present Astra-WM as an action-conditioned world simulator, not as generic novel-view synthesis.
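The gains quoted above can be recomputed from the simulator ablation table. The snippet below uses only the table's MMSI-Bench and PR. columns; the variable names are illustrative.

```python
# (MMSI-Bench, PR.) per simulator setting, copied from the ablation table.
scores = {
    "no simulator": (45.1, 45.8),
    "Bagel": (45.8, 46.9),
    "Astra-WM_30k": (46.3, 47.1),
    "Astra-WM_60k": (49.5, 50.4),
}

base_mmsi, base_pr = scores["no simulator"]

# Absolute gain over the no-simulator baseline, rounded to one decimal.
gains = {
    name: (round(mmsi - base_mmsi, 1), round(pr - base_pr, 1))
    for name, (mmsi, pr) in scores.items()
    if name != "no simulator"
}

for name, (d_mmsi, d_pr) in gains.items():
    print(f"{name}: +{d_mmsi} MMSI-Bench, +{d_pr} PR.")
# Bagel: +0.7 MMSI-Bench, +1.1 PR.
# Astra-WM_30k: +1.2 MMSI-Bench, +1.3 PR.
# Astra-WM_60k: +4.4 MMSI-Bench, +4.6 PR.
```

Off-the-shelf Bagel adds less than one point on MMSI-Bench, while the 60k variant adds more than four.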

Qualitative Visualization

Successful imagination

Imagination provides missing viewpoint evidence

The original views leave the spatial relation uncertain. Astra invokes the world simulator with a targeted camera motion, then uses the returned view to confirm the answer.

Complete Visual-CoT Trajectory

Analysis. These trajectories should be read as a reasoning-chain diagnostic. A successful case is not just "tool used"; the policy must identify the missing spatial evidence, issue an informative camera motion, ground the simulator output relative to the reference image, and then update the answer. The failure analysis in the paper highlights the same bottlenecks: uninformative actions, spatially inconsistent simulator outputs, and useful imagined observations that are ignored or misused.

Results

Experiments show that useful imagination requires both reliable simulation and a policy that controls simulator use. Forced tool use can hurt open-source VLMs, while Astra's learned agentic tool use improves spatial reasoning.

Setting	Model	MMSI-Bench	MindCube-Tiny
Direct Answer	Qwen3-VL-8B-Instruct	29.8	36.8
Forced Tool-Use	Qwen3-VL-8B-Instruct	28.6	27.6
Agentic Tool-Use	Astra	38.8	42.7
Forced Tool-Use	Gemini-3-Flash + Astra-WM	49.5	72.7

Table 1: Main benchmark results. Astra improves the Qwen3-VL-8B backbone by learning when and how to use the simulator.

Finding 2

Simulator use must be learned. Strong proprietary VLMs can benefit from Astra-WM directly, while open-source VLMs need agentic training to decide when and how to imagine.

Analysis. The main result separates "having a simulator" from "knowing how to use it." Forced tool-use degrades Qwen3-VL-8B from 29.8 to 28.6 on MMSI-Bench and from 36.8 to 27.6 on MindCube-Tiny, showing that imagined observations can introduce noise when the policy is not trained to choose actions or ground generated views. Astra instead learns an agentic workflow and improves the same backbone to 38.8 and 42.7.

Training Recipe	Tool Rate (%)	Calls / Row	PR.	Attr.	Mot.	MSR	All
Tool-gain only	4.9	0.049	36.5	36.8	28.4	31.2	34.3
Usage bonus only	98.1	1.400	39.4	38.7	30.2	32.5	36.1
Phase 1 only	98.0	1.120	40.1	39.2	30.8	32.9	36.8
Phase 1 -> Phase 2	61.5	0.780	42.3	41.0	32.1	33.6	38.8

Table 2: Two-phase RL curriculum. The full curriculum obtains the best MMSI-Bench score while reducing overuse relative to usage-bonus or Phase-1-only training.

Finding 3

Selective imagination beats tool overuse. Rewarding tool calls alone leads to excessive simulator use, while the two-phase curriculum learns when imagined evidence is actually helpful.

Analysis. The two phases solve opposite failure modes. A sparse tool-gain reward collapses toward direct answering, with only a 4.9% tool-call rate. A dense usage bonus prevents collapse but over-corrects into near-universal simulator use. Phase 1 teaches valid tool interaction; Phase 2 changes the objective from "call the simulator" to "call it only when imagined evidence helps." This is why the strongest result uses fewer calls than the overuse baselines but achieves higher accuracy.
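One way to read Table 2 is to divide Calls / Row by the tool rate, giving the average number of simulator calls on rows that use the tool at all. This relies on an assumption not stated in the source: that Tool Rate is the percentage of rows with at least one call and Calls / Row is averaged over all rows.

```python
# (tool rate in %, calls per row, overall accuracy) copied from Table 2.
recipes = {
    "Tool-gain only": (4.9, 0.049, 34.3),
    "Usage bonus only": (98.1, 1.400, 36.1),
    "Phase 1 only": (98.0, 1.120, 36.8),
    "Phase 1 -> Phase 2": (61.5, 0.780, 38.8),
}


def calls_per_tool_row(tool_rate_pct: float, calls_per_row: float) -> float:
    # Assumes tool_rate_pct counts rows with at least one simulator call.
    return round(calls_per_row / (tool_rate_pct / 100.0), 2)


per_tool_row = {
    name: calls_per_tool_row(rate, calls)
    for name, (rate, calls, _) in recipes.items()
}

for name, value in per_tool_row.items():
    print(f"{name}: {value} calls per tool-using row, All = {recipes[name][2]}")
# Tool-gain only: 1.0 calls per tool-using row, All = 34.3
# Usage bonus only: 1.43 calls per tool-using row, All = 36.1
# Phase 1 only: 1.14 calls per tool-using row, All = 36.8
# Phase 1 -> Phase 2: 1.27 calls per tool-using row, All = 38.8
```

Under that assumption, the full curriculum saves calls mainly by skipping the simulator on more rows (61.5% versus about 98%), not by issuing fewer calls once it decides to imagine.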

**Figure 3: Workflow mode ablation.** Agentic tool use preserves gains on camera-centric relations while avoiding unnecessary generated views for object- or region-centric questions.

Finding 4

Imagination helps when evidence is viewpoint-dependent. Agentic tool use keeps the camera-centric gains while avoiding generated views when the original context is already sufficient.

Analysis. The workflow-mode ablation explains why agentic control matters at inference time. Forced tool-use helps camera-centric relations, where alternative viewpoints often resolve the question, but it can hurt object- and region-centric relations where the original context is already sufficient. Agentic tool-use keeps the camera-centric gains while reducing unnecessary imagination: positional-relation accuracy reaches 42.3, compared with 36.4 in direct-answer mode and 40.1 in forced mode.

BibTeX

@misc{zhu2026thinkingimaginationagenticvisual,
      title={Thinking with Imagination: Agentic Visual Spatial Reasoning with World Simulators}, 
      author={Chenming Zhu and Jingli Lin and Yilin Long and Peizhou Cao and Tai Wang and Jiangmiao Pang and Xihui Liu},
      year={2026},
      eprint={2606.06476},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2606.06476}, 
}