We open-source LLaVA-3D to facilitate future development of LMMs in the community.
Training Code: Cook a SOTA model with our released training code
🤗 Checkpoints: Access pre-trained model checkpoints (7B); see the loading sketch after this list
🤗 LLaVA-3D Data: Explore training datasets for 2D and 3D
🎨 Live Demo: Try it out yourself!
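As a quick-start sketch for the 7B checkpoint linked above, the snippet below shows one plausible way to load it with Hugging Face `transformers`. The repository id is a placeholder, and whether the released checkpoint loads through `AutoModelForCausalLM` (rather than the repo's own codebase) is an assumption, so treat this as illustrative only:

```python
# Minimal loading sketch. The repo id below is a placeholder; substitute the
# actual Hugging Face checkpoint name from the links above. Loading via
# AutoModelForCausalLM is an assumption, not the repo's documented interface.
import torch
from transformers import AutoProcessor, AutoModelForCausalLM

MODEL_ID = "your-org/llava-3d-7b"  # hypothetical id

processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.float16,  # half precision so the 7B model fits on a single GPU
    device_map="auto",
    trust_remote_code=True,
)
```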
Our LLaVA-3D exhibits powerful 3D understanding and reasoning capabilities. Based on 2D multi-view image observations, LLaVA-3D enables user-friendly interaction with 3D scenes across a variety of 3D understanding and reasoning tasks. Users can simply click on a 2D image or video frame to obtain the corresponding 3D object caption or 3D bounding box.
LLaVA-3D can perform 2D click-based 3D dense captioning, generating the corresponding object caption and 3D bounding box.
LLaVA-3D can also perform 2D click-based 3D question answering: users simply click on a 2D image and ask a question about the selected object.
LLaVA-3D exhibits powerful 3D visual grounding capability, producing accurate 3D bounding boxes.
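For readers curious how a 2D click can anchor a 3D query, the sketch below shows standard pinhole back-projection of a clicked pixel into a 3D point using a depth map and camera pose. This illustrates the geometry behind click-based 3D interaction under stated assumptions; the function and variable names are hypothetical and are not the repository's actual interface:

```python
import numpy as np

def backproject_click(u, v, depth, K, cam_to_world):
    """Hypothetical helper: lift a clicked pixel (u, v) into world coordinates.

    depth        -- H x W depth map (metres) aligned with the clicked frame
    K            -- 3 x 3 camera intrinsics
    cam_to_world -- 4 x 4 camera-to-world extrinsic matrix
    """
    z = depth[v, u]                      # depth at the clicked pixel
    x = (u - K[0, 2]) * z / K[0, 0]      # (u - cx) * z / fx
    y = (v - K[1, 2]) * z / K[1, 1]      # (v - cy) * z / fy
    p_cam = np.array([x, y, z, 1.0])     # homogeneous point in the camera frame
    p_world = cam_to_world @ p_cam       # transform into the world/scene frame
    return p_world[:3]
```

A query to the model could then reference this 3D point (or the object it falls inside) when asking for a caption or a 3D bounding box.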