3D-LLaVA: Towards Generalist 3D LMMs with Omni Superpoint Transformer

arXiv | checkpoint

Jiajun Deng, Tianyu He, Li Jiang, Tianyu Wang, Feras Dayoub, Ian Reid

@inproceedings{deng20253dllava,
  title={3D-LLaVA: Towards Generalist 3D LMMs with Omni Superpoint Transformer},
  author={Deng, Jiajun and He, Tianyu and Jiang, Li and Wang, Tianyu and Dayoub, Feras and Reid, Ian},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2025}
}

Overview

(Figure: dialogue teaser)
3D-LLaVA (CVPR 2025) is a 3D Large Multimodal Model (LMM) that takes point clouds and text instructions as input to perform VQA, dense captioning, and 3D referring segmentation. At the core of 3D-LLaVA is a new Omni Superpoint Transformer (OST), which integrates three functionalities: (1) a visual feature selector that converts and selects visual tokens, (2) a visual prompt encoder that embeds interactive visual prompts into the visual token space, and (3) a referring mask decoder that produces 3D masks based on text descriptions.

Environment

We provide a Docker image to run 3D-LLaVA. Please run the following command to pull it:

docker pull djiajun1206/3d-llava-slim
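
Once the image is pulled, a container can be started with GPU access and the project directory mounted. The host path, container-side mount point, and shell entry point below are illustrative assumptions, not settings prescribed by the repository:

docker run --gpus all -it \
    -v /path/to/3D-LLaVA:/workspace/3D-LLaVA \
    djiajun1206/3d-llava-slim /bin/bash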

Data

We conduct experiments with the scan data from ScanNet, together with the text annotations from ScanRefer, ScanQA, SQA3D, ReferIt3D, and Multi3DRefer. For convenient access, we provide the processed data. The data should be placed in ./playground, with the following structure:

3D-LLaVA  # project root
├── playground
│   ├── data
│   │   ├── scannet
│   │   │   ├── super_points
│   │   │   ├── train
│   │   │   ├── val
│   │   │   └── scannet_axis_align_matrix_trainval.pkl
│   │   ├── train_info
│   │   │   ├── scanqa_train_3d_llava.json
│   │   │   ├── sqa3d_train_3d_llava.json
│   │   │   ├── scan2cap_train_3d_llava.json
│   │   │   └── ...
│   │   └── eval_info
│   │       ├── scanqa
│   │       ├── sqa3d
│   │       ├── densecap_scanrefer
│   │       └── ...
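
After downloading, a quick listing from the project root can confirm that the layout matches the structure above. This is only a minimal sanity check, assuming the processed data was extracted directly into ./playground:

ls playground/data/scannet
ls playground/data/train_info
ls playground/data/eval_info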

Training

We use LoRA tuning by default. Please train 3D-LLaVA with:

./scripts/train/finetune-3d-llava-lora.sh
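
The script is intended to be launched from the project root. Selecting GPUs through CUDA_VISIBLE_DEVICES, as sketched below, is a common pattern and an assumption here; the script may define its own GPU and distributed-training settings:

cd 3D-LLaVA
CUDA_VISIBLE_DEVICES=0,1,2,3 bash ./scripts/train/finetune-3d-llava-lora.sh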

Evaluation

We provide scripts to evaluate our model on ScanQA, SQA3D, Scan2Cap, ScanRefer, and Multi3DRefer. Please run:

./scripts/eval/multigpu_eval_sqa3d.sh

./scripts/eval/multigpu_eval_scanqa.sh

./scripts/eval/multigpu_eval_scan2cap.sh

./scripts/eval/multigpu_eval_scanrefer.sh

./scripts/eval/multigpu_eval_multi3drefer.sh
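
To run all benchmarks back to back, the scripts can be chained in a simple shell loop. This sketch assumes each script is launched from the project root and needs no extra arguments:

for task in sqa3d scanqa scan2cap scanrefer multi3drefer; do
    bash ./scripts/eval/multigpu_eval_${task}.sh
done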

Acknowledgements

Thanks to the following great repositories: LLaVA, PonderV2, OneFormer3d.
