TL;DR: G2TAM supports text, point, and box prompts to perform joint 3D reconstruction and spatio-temporally consistent instance segmentation, enabling promptable instance tracking in 3D space.
Demo
Human spatial understanding arises from jointly perceiving geometry and semantics, enabling consistent object identification and localization across viewpoints and time. Current video segmentation models depend on explicit object appearance memory banks for instance tracking, yet they remain vulnerable to large viewpoint changes and long-term occlusions. Leveraging the spatial consistency afforded by modern feed-forward 3D reconstruction models, we propose G2TAM, a unified framework for promptable instance tracking in 3D using only unordered RGB images or videos. G2TAM employs spatially aligned geometric representations as implicit memory, ensuring stable instance identity and localization across frames and views. At its core is a cross-modal spatial encoder that integrates visual and textual prompts into a shared geometric space, enabling end-to-end spatial reconstruction and instance-consistent mask prediction. To support training and evaluation, we construct InsTrack, a large-scale dataset with a dedicated validation split for benchmarking. Extensive experiments show that G2TAM achieves strong results in cross-view consistency, promptable instance spatial tracking, video object segmentation, and spatial reconstruction, establishing a foundation for interactive, geometry-grounded spatial reasoning.
At the core of G2TAM is the idea of Geometry as Implicit Memory. To our knowledge, this is the first work to use a spatially aligned geometric representation as the persistent memory underlying object identity. Instead of relying on explicit temporal storage, G2TAM embeds text and visual prompts directly into a unified geometric-semantic representation through a simple yet highly effective cross-modal spatial encoder.
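To make the idea concrete, here is a minimal PyTorch sketch of how prompt tokens could be fused into spatially aligned geometric tokens via cross-attention. The module name, token shapes, and single-layer design are our own illustrative assumptions, not the released G2TAM architecture.

```python
import torch
import torch.nn as nn


class CrossModalSpatialEncoder(nn.Module):
    """Illustrative fusion of prompt embeddings into per-view geometric tokens.

    The geometric tokens play the role of implicit memory: because they are
    spatially aligned across views, an instance selected by a prompt on one
    view remains identifiable in every other view.
    """

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, geo_tokens, prompt_tokens):
        # geo_tokens:    (B, V*N, D) geometric tokens concatenated over V views
        # prompt_tokens: (B, P, D)   embedded point/box/text prompts
        fused, _ = self.cross_attn(geo_tokens, prompt_tokens, prompt_tokens)
        x = self.norm(geo_tokens + fused)
        return x + self.mlp(x)


# Toy shapes: 2 views of 196 tokens each, 3 prompt tokens.
encoder = CrossModalSpatialEncoder()
geo = torch.randn(1, 2 * 196, 256)
prompts = torch.randn(1, 3, 256)
print(encoder(geo, prompts).shape)  # torch.Size([1, 392, 256])
```

Under this reading, the fused geometric tokens can be decoded into masks for any view, which is why no explicit per-frame memory bank is needed.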
We introduce Promptable Instance Spatial Tracking (PIST), a task designed to evaluate the spatial consistency of a model's instance segmentation across views in static scenes. Given an input prompt, such as a point or bounding box on any view, or a referring text expression, the goal is to produce spatially consistent masks for the corresponding instance across all views.
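As a sketch of the task interface, the snippet below shows how a single prompt on one view maps to masks over all views. The `Prompt` container and `predict_masks` stub are hypothetical and stand in for whatever interface G2TAM actually exposes.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

import numpy as np


@dataclass
class Prompt:
    """Hypothetical prompt container for PIST."""
    view_idx: int = 0                                # view the visual prompt is placed on
    point: Optional[Tuple[int, int]] = None          # (x, y) click
    box: Optional[Tuple[int, int, int, int]] = None  # (x1, y1, x2, y2)
    text: Optional[str] = None                       # referring expression


def predict_masks(views: np.ndarray, prompt: Prompt) -> np.ndarray:
    """Hypothetical model call: one binary mask per view for the prompted instance."""
    v, h, w, _ = views.shape
    return np.zeros((v, h, w), dtype=bool)  # placeholder; a real model predicts here


# A single prompt on view 2 should yield masks for the same instance on all 4 views.
views = np.zeros((4, 480, 640, 3), dtype=np.uint8)
masks = predict_masks(views, Prompt(view_idx=2, text="the red chair"))
assert masks.shape == (4, 480, 640)
```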