TL;DR: G2TAM supports text, point, and box prompts to perform joint 3D reconstruction and spatio-temporally consistent instance segmentation, enabling promptable instance tracking in 3D space.
Demo
Human spatial understanding arises from jointly perceiving geometry and semantics, enabling consistent object identification and localization across viewpoints and time. Current video segmentation models depend on explicit object appearance memory banks for instance tracking, yet they remain vulnerable to large viewpoint changes and long-term occlusions. Leveraging the spatial consistency afforded by modern feed-forward 3D reconstruction models, we propose G2TAM, a unified framework for promptable instance tracking in 3D using only unordered RGB images or videos. G2TAM employs spatially aligned geometric representations as implicit memory, ensuring stable instance identity and localization across frames and views. At its core is a cross-modal spatial encoder that integrates visual and textual prompts into a shared geometric space, enabling end-to-end spatial reconstruction and instance-consistent mask prediction. To support training and evaluation, we construct InsTrack, a large-scale dataset with a dedicated validation split for benchmarking. Extensive experiments show that G2TAM achieves strong results in cross-view consistency, promptable instance spatial tracking, video object segmentation, and spatial reconstruction, establishing a foundation for interactive, geometry-grounded spatial reasoning.
At the core of G2TAM is the idea of Geometry as Implicit Memory. To our knowledge, this is the first work to use a spatially aligned geometric representation as the persistent memory underlying object identity. Instead of relying on explicit temporal storage, G2TAM embeds text and visual prompts directly into a unified geometric-semantic representation through a simple yet highly effective cross-modal spatial encoder.
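To make the idea concrete, here is a minimal PyTorch sketch of how prompt tokens could be fused into spatially aligned geometric tokens via cross-attention. The module name, token shapes, and single-layer design are our own illustrative assumptions, not the released G2TAM architecture.

```python
import torch
import torch.nn as nn


class CrossModalSpatialEncoder(nn.Module):
    """Illustrative fusion of prompt embeddings into per-view geometric tokens.

    The geometric tokens play the role of implicit memory: because they are
    spatially aligned across views, an instance selected by a prompt on one
    view remains identifiable in every other view.
    """

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, geo_tokens, prompt_tokens):
        # geo_tokens:    (B, V*N, D) geometric tokens concatenated over V views
        # prompt_tokens: (B, P, D)   embedded point/box/text prompts
        fused, _ = self.cross_attn(geo_tokens, prompt_tokens, prompt_tokens)
        x = self.norm(geo_tokens + fused)
        return x + self.mlp(x)


# Toy shapes: 2 views of 196 tokens each, 3 prompt tokens.
encoder = CrossModalSpatialEncoder()
geo = torch.randn(1, 2 * 196, 256)
prompts = torch.randn(1, 3, 256)
print(encoder(geo, prompts).shape)  # torch.Size([1, 392, 256])
```

Under this reading, the fused geometric tokens can be decoded into masks for any view, which is why no explicit per-frame memory bank is needed.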
We introduce Promptable Instance Spatial Tracking (PIST), a task designed to evaluate the spatial consistency of a model's instance segmentation across views in static scenes. Given an input prompt, such as a point or bounding box on any view, or a referring text expression, the goal is to produce spatially consistent masks for the corresponding instance across all views.
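As a sketch of the task interface, the snippet below shows how a single prompt on one view maps to masks over all views. The `Prompt` container and `predict_masks` stub are hypothetical and stand in for whatever interface G2TAM actually exposes.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

import numpy as np


@dataclass
class Prompt:
    """Hypothetical prompt container for PIST."""
    view_idx: int = 0                                # view the visual prompt is placed on
    point: Optional[Tuple[int, int]] = None          # (x, y) click
    box: Optional[Tuple[int, int, int, int]] = None  # (x1, y1, x2, y2)
    text: Optional[str] = None                       # referring expression


def predict_masks(views: np.ndarray, prompt: Prompt) -> np.ndarray:
    """Hypothetical model call: one binary mask per view for the prompted instance."""
    v, h, w, _ = views.shape
    return np.zeros((v, h, w), dtype=bool)  # placeholder; a real model predicts here


# A single prompt on view 2 should yield masks for the same instance on all 4 views.
views = np.zeros((4, 480, 640, 3), dtype=np.uint8)
masks = predict_masks(views, Prompt(view_idx=2, text="the red chair"))
assert masks.shape == (4, 480, 640)
```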