We propose ReGround3D, which consists of a visual-centric reasoning module and a 3D grounding module with geometry-enhanced look-back.
The visual-centric reasoning module reasons jointly over the language instruction and the visual scene, and predicts a special <LOC> token that represents the grounding information.
The 3D grounding module looks back to the original 3D scene, which retains comprehensive geometry information and fine-grained details. It uses the hidden embedding of the <LOC> token
to retrieve grounding-related information from the 3D features, and then predicts the 3D locations of the target objects.
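As a rough illustration of the look-back idea, the sketch below treats the <LOC> embedding as a query that attends over per-point scene features. The text above does not specify the mechanism, so everything here is an assumption made for illustration: the single-head dot-product attention, the function name `look_back`, the toy features, and the use of an attention-weighted average of point coordinates as the location estimate.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def look_back(loc_embedding, point_features, point_centers):
    """Hypothetical look-back step (not the paper's implementation).

    The <LOC> embedding acts as a query over per-point 3D features; the
    attention weights are then used to average the 3D point coordinates.
    """
    d = len(loc_embedding)
    scores = [
        sum(q * k for q, k in zip(loc_embedding, feat)) / math.sqrt(d)
        for feat in point_features
    ]
    weights = softmax(scores)
    center = [
        sum(w * c[i] for w, c in zip(weights, point_centers))
        for i in range(3)
    ]
    return center, weights

# Toy inputs (made up): three scene points with 2-D features and 3-D centers.
loc = [1.0, 0.0]
feats = [[5.0, 0.0], [0.0, 5.0], [-5.0, 0.0]]
centers = [[1.0, 2.0, 0.5], [4.0, 0.0, 0.5], [0.0, 0.0, 0.0]]
center, weights = look_back(loc, feats, centers)
```

In this toy setup the first point has the feature most aligned with the query, so it receives the largest attention weight and pulls the estimated center toward its coordinates.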
Furthermore, we propose the Chain-of-Grounding (CoG) mechanism, a chain of interleaved reasoning and grounding steps, to further synergize the grounding and
reasoning capabilities for the 3D reasoning grounding task.
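The interleaving of reasoning and grounding steps can be pictured as a simple control loop. The sketch below is only a schematic under stated assumptions: the function names (`chain_of_grounding`, `toy_reason`, `toy_ground`), the action labels, the step limit, and the toy scene are all hypothetical and stand in for the actual reasoning and grounding modules.

```python
def chain_of_grounding(instruction, scene, reason_step, ground_step, max_steps=4):
    """Alternate reasoning and grounding until the reasoner emits an answer.

    reason_step and ground_step are hypothetical callables standing in for
    the reasoning module and the 3D grounding module.
    """
    grounded = []  # (object query, predicted 3D location) pairs found so far
    for _ in range(max_steps):
        action, payload = reason_step(instruction, grounded)
        if action == "answer":
            return payload, grounded
        # Otherwise the reasoner asked to ground an object; feed the result back.
        grounded.append((payload, ground_step(payload, scene)))
    return None, grounded

# Toy stand-ins (made up): the scene maps object names to 3D locations.
scene = {"sofa": (1.0, 2.0, 0.4), "lamp": (1.5, 2.2, 1.1)}

def toy_ground(query, scene):
    return scene[query]

def toy_reason(instruction, grounded):
    plan = ["sofa", "lamp"]  # ground the anchor object first, then the target
    if len(grounded) < len(plan):
        return "ground", plan[len(grounded)]
    return "answer", grounded[-1][1]

answer, trace = chain_of_grounding(
    "Find something to light up the seat.", scene, toy_reason, toy_ground
)
```

Here each grounding result is appended to the context passed to the next reasoning step, which is the sense in which the two capabilities are chained.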