Spatial reasoning from multi-view images often requires evidence that is not visible in the given observations. Astra studies this problem as thinking with imagination: a VLM actively queries a world simulator for missing viewpoints and uses the returned imagined observations to answer spatial questions. The system couples Astra-VL, an RL-trained Qwen3-VL-based policy, with Astra-WM, a Bagel-based simulator trained for view consistency.
This framing resonates with enactive views of perception: understanding is not only internal processing over fixed inputs, but an active process of seeking the observations needed for skillful behavior. Astra brings this idea to VLMs through simulator-mediated interaction, turning visual spatial reasoning into active visual evidence acquisition.
Why imagination? When only limited egocentric observations are available, complex spatial understanding remains challenging, and conventional text-oriented chain-of-thought often provides little benefit. Instead of reasoning only over fixed images, Astra lets the policy perform a controlled perceptual action: it asks for a camera-motion-conditioned view and uses the returned observation as new visual evidence.
Astra-WM. A useful world simulator must preserve scene identity, follow the requested motion, and maintain relative layout across views. In this sense, Astra-WM learns a visual sensorimotor contingency: how a camera action should transform the observation while keeping the underlying scene stable. View consistency tuning makes the generated view reliable spatial evidence, not merely a plausible image.
Astra-VL. Simulator access alone is not enough. The policy must learn when to invoke the simulator, what motion to request, and how to ground the imagined observation. Astra-VL is trained with a two-phase simulator-in-the-loop RL curriculum: the first phase teaches valid tool interaction, while the second encourages selective imagination by comparing tool-augmented reasoning with direct answering. The result is a policy that uses imagination as a selective reasoning action rather than as a mandatory preprocessing step.
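The two-phase idea can be sketched as a pair of reward functions. This is only an illustration under stated assumptions: the `Rollout` record, the function names, the bonus and weight values, and the exact reward forms are hypothetical and are not taken from the paper. The sketch shows only the shift from rewarding valid tool interaction (phase 1) to rewarding tool use by its gain over direct answering (phase 2).

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    """One sampled trajectory for a single question (hypothetical record)."""
    correct: bool            # final answer matches the ground truth
    tool_calls: int          # number of simulator invocations
    calls_well_formed: bool  # every call parsed into a valid camera-motion request


def phase1_reward(r: Rollout, usage_bonus: float = 0.2) -> float:
    """Assumed phase-1 form: answer reward plus a bonus for valid tool use."""
    answer_reward = 1.0 if r.correct else 0.0
    if r.tool_calls > 0 and r.calls_well_formed:
        return answer_reward + usage_bonus
    return answer_reward


def phase2_reward(r: Rollout, direct_accuracy: float,
                  gain_weight: float = 0.5) -> float:
    """Assumed phase-2 form: tool use is credited only by its gain.

    direct_accuracy is the fraction of direct-answer rollouts (no simulator)
    that solved the same question, so the extra term is positive when
    imagination helped and negative when it hurt.
    """
    answer_reward = 1.0 if r.correct else 0.0
    if r.tool_calls == 0:
        return answer_reward
    return answer_reward + gain_weight * (answer_reward - direct_accuracy)
```

Under these assumed forms, a correct tool-augmented rollout earns nothing extra on a question that direct answering already solves every time, which is one way to discourage unnecessary simulator calls.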
Astra-WM is not used as an open-ended image generator. It receives context images and a natural-language camera-motion instruction, then synthesizes a novel observation that should stay consistent with the same scene. This makes the simulator a mechanism for controlled visual intervention: the policy changes its perceptual input by choosing an action, then judges whether the imagined observation provides useful spatial evidence.
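The interaction described above can be sketched as a short loop. Everything named here is an assumption made for illustration: `Step`, `answer_with_imagination`, the callable signatures, the call budget, and the choice to condition the simulator on the original context views only. None of it is the released Astra interface.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

Image = bytes  # placeholder image type for this sketch


@dataclass
class Step:
    """Policy output for one turn: a simulator request or a final answer."""
    camera_motion: Optional[str] = None  # natural-language camera motion
    answer: Optional[str] = None


def answer_with_imagination(
    policy: Callable[[str, List[Image]], Step],
    simulator: Callable[[List[Image], str], Image],
    question: str,
    context: List[Image],
    max_calls: int = 2,
) -> str:
    """Let the policy query the simulator until it commits to an answer."""
    views = list(context)
    for _ in range(max_calls):
        step = policy(question, views)
        if step.answer is not None:
            return step.answer  # the policy chose to answer
        # Assumption: a step without an answer carries a camera motion, and
        # the simulator is conditioned on the original context images only.
        imagined = simulator(context, step.camera_motion or "")
        views.append(imagined)  # the imagined view becomes new evidence
    # Call budget exhausted: ask once more for a final answer.
    final = policy(question, views)
    return final.answer if final.answer is not None else ""
```

Because the policy may return an answer on the first turn, the same loop covers direct answering; imagination is an optional action rather than a fixed preprocessing step.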
Qualitative comparison. Given the same context images and text action, generic image editors may produce plausible images but often fail to preserve the intended viewpoint change or scene layout. Astra-WM is trained as an action-conditioned world simulator, so the generated observation is expected to follow the camera motion while remaining comparable to the target view.
| Simulator | Pose Consistency | Content Consistency | MMSI-Bench | Positional Relation (PR) |
|---|---|---|---|---|
| Gemini-3-Flash, no simulator | -- | -- | 45.1 | 45.8 |
| + Bagel | 9.0 / 3.0 | 35.6 / 39.6 / 10.2 | 45.8 | 46.9 |
| + Astra-WM30k | 72.5 / 70.5 | 53.2 / 56.0 / 23.0 | 46.3 | 47.1 |
| + Astra-WM60k | 69.0 / 75.0 | 53.4 / 56.1 / 23.4 | 49.5 | 50.4 |
Spatial consistency matters. Plausible generation is not enough: only an action-faithful world simulator turns imagined views into reliable spatial evidence and reasoning gains.
Analysis. This ablation shows why Astra-WM needs task-specific training. Off-the-shelf Bagel is visually plausible but weak at following the requested motion and preserving layout (pose consistency of 9.0 / 3.0), so its reasoning gain is small: 45.1 to 45.8 on MMSI-Bench. View consistency tuning turns generated views into more reliable spatial evidence: with Astra-WM60k, the best variant, Gemini-3-Flash rises from 45.1 without a simulator to 49.5 on MMSI-Bench, and its positional-relation accuracy improves from 45.8 to 50.4. This is why we present Astra-WM as an action-conditioned world simulator, not as generic novel-view synthesis.
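The gains quoted above follow directly from the table. A minimal sketch of the arithmetic, with the scores copied from the table and a helper name (`gains_over_baseline`) that is ours:

```python
# (MMSI-Bench overall, positional-relation accuracy), copied from the table.
results = {
    "no simulator": (45.1, 45.8),
    "Bagel": (45.8, 46.9),
    "Astra-WM30k": (46.3, 47.1),
    "Astra-WM60k": (49.5, 50.4),
}


def gains_over_baseline(scores, baseline="no simulator"):
    """Return per-simulator point gains over the no-simulator baseline."""
    base = scores[baseline]
    return {
        name: tuple(round(v - b, 1) for v, b in zip(vals, base))
        for name, vals in scores.items()
        if name != baseline
    }


gains = gains_over_baseline(results)
for name, (d_all, d_pr) in gains.items():
    print(f"{name}: {d_all:+.1f} MMSI-Bench, {d_pr:+.1f} PR")
```

Off-the-shelf Bagel adds 0.7 points on MMSI-Bench and 1.1 on positional relations, while Astra-WM60k adds 4.4 and 4.6.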
Successful imagination
The original views leave the spatial relation uncertain. Astra invokes the world simulator with a targeted camera motion, then uses the returned view to confirm the answer.
Analysis. These trajectories should be read as a reasoning-chain diagnostic. A successful case is not just "tool used"; the policy must identify the missing spatial evidence, issue an informative camera motion, ground the simulator output relative to the reference image, and then update the answer. The failure analysis in the paper highlights the same bottlenecks: uninformative actions, spatially inconsistent simulator outputs, and useful imagined observations that are ignored or misused.
Experiments show that useful imagination requires both reliable simulation and a policy that controls simulator use. Forced tool use can hurt open-source VLMs, while Astra's learned agentic tool use improves spatial reasoning.
| Setting | Model | MMSI-Bench | MindCube-Tiny |
|---|---|---|---|
| Direct Answer | Qwen3-VL-8B-Instruct | 29.8 | 36.8 |
| Forced Tool-Use | Qwen3-VL-8B-Instruct | 28.6 | 27.6 |
| Agentic Tool-Use | Astra | 38.8 | 42.7 |
| Forced Tool-Use | Gemini-3-Flash + Astra-WM | 49.5 | 72.7 |
Simulator use must be learned. Strong proprietary VLMs can benefit from Astra-WM directly, while open-source VLMs need agentic training to decide when and how to imagine.
Analysis. The main result separates "having a simulator" from "knowing how to use it." Forced tool use degrades Qwen3-VL-8B from 29.8 to 28.6 on MMSI-Bench and from 36.8 to 27.6 on MindCube-Tiny, showing that imagined observations can introduce noise when the policy is not trained to choose actions or ground generated views. Astra instead learns an agentic workflow and improves the same backbone to 38.8 on MMSI-Bench and 42.7 on MindCube-Tiny.
| Training Recipe | Tool Rate | Calls / Row | PR. | Attr. | Mot. | MSR | All |
|---|---|---|---|---|---|---|---|
| Tool-gain only | 4.9 | 0.049 | 36.5 | 36.8 | 28.4 | 31.2 | 34.3 |
| Usage bonus only | 98.1 | 1.400 | 39.4 | 38.7 | 30.2 | 32.5 | 36.1 |
| Phase 1 only | 98.0 | 1.120 | 40.1 | 39.2 | 30.8 | 32.9 | 36.8 |
| Phase 1 -> Phase 2 | 61.5 | 0.780 | 42.3 | 41.0 | 32.1 | 33.6 | 38.8 |
Selective imagination beats tool overuse. Rewarding tool calls alone leads to excessive simulator use, while the two-phase curriculum learns when imagined evidence is actually helpful.
Analysis. The two phases solve opposite failure modes. A sparse tool-gain reward collapses toward direct answering, with only a 4.9% tool-call rate. A dense usage bonus prevents collapse but over-corrects into near-universal simulator use (a 98.1% tool-call rate). Phase 1 teaches valid tool interaction; Phase 2 changes the objective from "call the simulator" to "call it only when imagined evidence helps." This is why the strongest result uses fewer calls than the overuse baselines (0.780 calls per row versus 1.120 and 1.400) but achieves higher overall accuracy (38.8 versus 36.8 and 36.1).
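The two usage columns can be combined into a rough measure of how heavily the simulator is used once it is invoked at all. The sketch below assumes that Tool Rate is the percentage of rows with at least one simulator call and that Calls / Row is averaged over all rows; the table does not state either definition, and the helper name is ours.

```python
# (tool rate in percent, calls per row), copied from the table.
recipes = {
    "Tool-gain only": (4.9, 0.049),
    "Usage bonus only": (98.1, 1.400),
    "Phase 1 only": (98.0, 1.120),
    "Phase 1 -> Phase 2": (61.5, 0.780),
}


def calls_per_tool_row(tool_rate_pct: float, calls_per_row: float) -> float:
    """Average calls among rows that use the simulator (assumed definitions)."""
    return calls_per_row / (tool_rate_pct / 100.0)


for name, (rate, calls) in recipes.items():
    print(f"{name}: {calls_per_tool_row(rate, calls):.2f} calls per tool-using row")
```

Under these assumptions the ratio is about 1.00, 1.43, 1.14, and 1.27 for the four recipes, so the drop in Calls / Row after Phase 2 comes mainly from invoking the simulator on fewer rows rather than from making fewer calls per invocation.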
Imagination helps when evidence is viewpoint-dependent. Agentic tool use keeps the camera-centric gains while avoiding generated views when the original context is already sufficient.
Analysis. The workflow-mode ablation explains why agentic control matters at inference time. Forced tool-use helps camera-centric relations, where alternative viewpoints often resolve the question, but it can hurt object- and region-centric relations where the original context is already sufficient. Agentic tool-use keeps the camera-centric gains while reducing unnecessary imagination, improving positional-relation accuracy from 36.4 in direct-answer mode and 40.1 in forced mode to 42.3.
@misc{zhu2026thinkingimaginationagenticvisual,
title={Thinking with Imagination: Agentic Visual Spatial Reasoning with World Simulators},
author={Chenming Zhu and Jingli Lin and Yilin Long and Peizhou Cao and Tai Wang and Jiangmiao Pang and Xihui Liu},
year={2026},
eprint={2606.06476},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2606.06476},
}