We open-source LLaVA-3D to facilitate future development of LMMs in the community.
Training Code: Cook a SOTA model with our released training code
🤗 Checkpoints: Access pre-trained model checkpoints (7B); see the loading sketch after this list
🤗 LLaVA-3D Data: Explore training datasets for 2D and 3D
🎨 Live Demo: Try it out yourself!
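As a quick-start sketch for the 7B checkpoint linked above, the snippet below shows one plausible way to load it with Hugging Face `transformers`. The repository id is a placeholder, and whether the released checkpoint loads through `AutoModelForCausalLM` (rather than the repo's own codebase) is an assumption, so treat this as illustrative only:

```python
# Minimal loading sketch. The repo id below is a placeholder; substitute the
# actual Hugging Face checkpoint name from the links above. Loading via
# AutoModelForCausalLM is an assumption, not the repo's documented interface.
import torch
from transformers import AutoProcessor, AutoModelForCausalLM

MODEL_ID = "your-org/llava-3d-7b"  # hypothetical id

processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.float16,  # half precision so the 7B model fits on a single GPU
    device_map="auto",
    trust_remote_code=True,
)
```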
Our LLaVA-3D exhibits powerful 3D understanding and reasoning capabilities. Based on 2D multi-view image observations, LLaVA-3D enables user-friendly interaction with 3D scenes across a variety of 3D understanding and reasoning tasks. Users can simply click on a 2D image or video frame to obtain the corresponding 3D object caption or 3D bounding box.
LLaVA-3D can perform 2D click-based 3D dense captioning, generating the corresponding object caption and 3D bounding box.
LLaVA-3D can also perform 2D click-based 3D question answering: users simply click on a 2D image and ask a question about the selected object.
LLaVA-3D exhibits powerful 3D visual grounding capability, producing accurate 3D bounding boxes.
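For readers curious how a 2D click can anchor a 3D query, the sketch below shows standard pinhole back-projection of a clicked pixel into a 3D point using a depth map and camera pose. This illustrates the geometry behind click-based 3D interaction under stated assumptions; the function and variable names are hypothetical and are not the repository's actual interface:

```python
import numpy as np

def backproject_click(u, v, depth, K, cam_to_world):
    """Hypothetical helper: lift a clicked pixel (u, v) into world coordinates.

    depth        -- H x W depth map (metres) aligned with the clicked frame
    K            -- 3 x 3 camera intrinsics
    cam_to_world -- 4 x 4 camera-to-world extrinsic matrix
    """
    z = depth[v, u]                      # depth at the clicked pixel
    x = (u - K[0, 2]) * z / K[0, 0]      # (u - cx) * z / fx
    y = (v - K[1, 2]) * z / K[1, 1]      # (v - cy) * z / fy
    p_cam = np.array([x, y, z, 1.0])     # homogeneous point in the camera frame
    p_world = cam_to_world @ p_cam       # transform into the world/scene frame
    return p_world[:3]
```

A query to the model could then reference this 3D point (or the object it falls inside) when asking for a caption or a 3D bounding box.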