3D-LLaVA: Towards Generalist 3D LMMs with Omni Superpoint Transformer

arXiv | checkpoint

Jiajun Deng, Tianyu He, Li Jiang, Tianyu Wang, Feras Dayoub, Ian Reid

@inproceedings{deng20253dllava,
  title={3D-LLaVA: Towards Generalist 3D LMMs with Omni Superpoint Transformer},
  author={Deng, Jiajun and He, Tianyu and Jiang, Li and Wang, Tianyu and Dayoub, Feras and Reid, Ian},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2025}
}

Overview

(Figure: dialogue teaser)
3D-LLaVA (CVPR 2025) is a 3D Large Multimodal Model (LMM) that takes point clouds and text instructions as input to perform VQA, dense captioning, and 3D referring segmentation. At the core of 3D-LLaVA is a new Omni Superpoint Transformer (OST), which integrates three functionalities: (1) a visual feature selector that converts and selects visual tokens, (2) a visual prompt encoder that embeds interactive visual prompts into the visual token space, and (3) a referring mask decoder that produces 3D masks based on text descriptions.

Environment

We provide a Docker image to run 3D-LLaVA. Please run the following command to pull it:

docker pull djiajun1206/3d-llava-slim
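
Once the image is pulled, a container can be started with GPU access and the project directory mounted. The host path, container-side mount point, and shell entry point below are illustrative assumptions, not settings prescribed by the repository:

docker run --gpus all -it \
    -v /path/to/3D-LLaVA:/workspace/3D-LLaVA \
    djiajun1206/3d-llava-slim /bin/bash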

Data

We conduct experiments with the scan data from ScanNet, together with the text annotations from ScanRefer, ScanQA, SQA3D, ReferIt3D, and Multi3DRefer. For convenient access, we provide the processed data. The data should be placed in ./playground, with the following structure:

3D-LLaVA  # project root
├── playground
│   ├── data
│   │   ├── scannet
│   │   │   ├── super_points
│   │   │   ├── train
│   │   │   ├── val
│   │   │   └── scannet_axis_align_matrix_trainval.pkl
│   │   ├── train_info
│   │   │   ├── scanqa_train_3d_llava.json
│   │   │   ├── sqa3d_train_3d_llava.json
│   │   │   ├── scan2cap_train_3d_llava.json
│   │   │   └── ...
│   │   └── eval_info
│   │       ├── scanqa
│   │       ├── sqa3d
│   │       ├── densecap_scanrefer
│   │       └── ...
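
After downloading, a quick listing from the project root can confirm that the layout matches the structure above. This is only a minimal sanity check, assuming the processed data was extracted directly into ./playground:

ls playground/data/scannet
ls playground/data/train_info
ls playground/data/eval_info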

Training

We use LoRA tuning by default. Please train 3D-LLaVA with:

./scripts/train/finetune-3d-llava-lora.sh
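
The script is intended to be launched from the project root. Selecting GPUs through CUDA_VISIBLE_DEVICES, as sketched below, is a common pattern and an assumption here; the script may define its own GPU and distributed-training settings:

cd 3D-LLaVA
CUDA_VISIBLE_DEVICES=0,1,2,3 bash ./scripts/train/finetune-3d-llava-lora.sh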

Evaluation

We provide scripts to evaluate our model on ScanQA, SQA3D, Scan2Cap, ScanRefer, and Multi3DRefer. Please run:

./scripts/eval/multigpu_eval_sqa3d.sh

./scripts/eval/multigpu_eval_scanqa.sh

./scripts/eval/multigpu_eval_scan2cap.sh

./scripts/eval/multigpu_eval_scanrefer.sh

./scripts/eval/multigpu_eval_multi3drefer.sh
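
To run all benchmarks back to back, the scripts can be chained in a simple shell loop. This sketch assumes each script is launched from the project root and needs no extra arguments:

for task in sqa3d scanqa scan2cap scanrefer multi3drefer; do
    bash ./scripts/eval/multigpu_eval_${task}.sh
done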

Acknowledgements

Thanks to the following great repositories: LLaVA, PonderV2, OneFormer3d.
