Inst3D-LMM: Instance-Aware 3D Scene Understanding with Multi-modal Instruction Tuning
[CVPR 2025 Highlight ⭐]
This is the official implementation of "Inst3D-LMM: Instance-Aware 3D Scene Understanding with Multi-modal Instruction Tuning".
All results of our Inst3D-LMM are evaluated on the same model without fine-tuning on specific tasks.
- Apr. 5th, 2025: Inst3D-LMM is accepted by CVPR 2025 (Highlight, 2.9%)!
- Mar. 1st, 2025: Paper is available at arXiv. ☕️
- Feb. 27th, 2025: We released our code! Paper is coming soon. Please stay tuned! ☕️
Despite encouraging progress in 3D scene understanding, it remains challenging to develop an effective Large Multi-modal Model (LMM) that is capable of understanding and reasoning in complex 3D environments. Most previous methods typically encode 3D point and 2D image features separately, neglecting interactions between 2D semantics and 3D object properties, as well as the spatial relationships within the 3D environment. This limitation not only hinders comprehensive representations of the 3D scene, but also compromises training and inference efficiency. To address these challenges, we propose a unified Instance-aware 3D Large Multi-modal Model (Inst3D-LMM) to deal with multiple 3D scene understanding tasks simultaneously. To obtain fine-grained instance-level visual tokens, we first introduce a novel Multi-view Cross-Modal Fusion (MCMF) module to inject the multi-view 2D semantics into their corresponding 3D geometric features. For scene-level relation-aware tokens, we further present a 3D Instance Spatial Relation (3D-ISR) module to capture the intricate pairwise spatial relationships among objects. Additionally, we perform end-to-end multi-task instruction tuning simultaneously, without subsequent task-specific fine-tuning. Extensive experiments demonstrate that our approach outperforms state-of-the-art methods across 3D scene understanding, reasoning and grounding tasks.
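For intuition only, the sketch below shows one plausible way to inject multi-view 2D semantics into per-instance 3D tokens via cross-attention, in the spirit of the MCMF module described above. The class name, feature dimensions and the use of `nn.MultiheadAttention` are our own assumptions for illustration; they do not reproduce the paper's actual implementation.

```python
# Illustrative sketch only -- NOT the actual MCMF module.
# Each 3D instance token (query) attends to its multi-view 2D features (keys/values),
# so 2D semantics are injected into the 3D geometric feature of that instance.
import torch
import torch.nn as nn


class NaiveCrossModalFusion(nn.Module):
    def __init__(self, dim_3d=1024, dim_2d=768, dim_out=4096, num_heads=8):
        super().__init__()
        self.proj_3d = nn.Linear(dim_3d, dim_out)   # project 3D instance features (e.g. from Uni3D)
        self.proj_2d = nn.Linear(dim_2d, dim_out)   # project multi-view 2D features (e.g. from CLIP)
        self.cross_attn = nn.MultiheadAttention(dim_out, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim_out)

    def forward(self, feat_3d, feat_2d):
        # feat_3d: (N, dim_3d), one geometric feature per 3D instance
        # feat_2d: (N, V, dim_2d), V multi-view 2D features per instance
        q = self.proj_3d(feat_3d).unsqueeze(1)       # (N, 1, dim_out) queries
        kv = self.proj_2d(feat_2d)                   # (N, V, dim_out) keys/values
        fused, _ = self.cross_attn(q, kv, kv)        # attend over the views
        return self.norm(fused + q).squeeze(1)       # (N, dim_out) instance tokens for the LMM


# Toy usage: 8 instances, 6 views each.
tokens = NaiveCrossModalFusion()(torch.randn(8, 1024), torch.randn(8, 6, 768))
print(tokens.shape)  # torch.Size([8, 4096])
```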
- Prepare the environment:

  ```bash
  conda create -n inst3d-lmm python=3.8  # create a virtual environment
  conda activate inst3d-lmm              # activate it
  bash requirements.sh                   # install requirements
  pip install -e .                       # install current repository in editable mode
  python -m paths --mk                   # create folder structure
  ```
- Download the LLM and other foundation-model backbones:
  - We use Vicuna-7B v1.5 (Hugging Face), the vision encoder from CLIP-ViT-L/14-336px (Hugging Face) and the ViT-H-based SAM (Hugging Face).
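  A minimal, hedged sketch of loading these backbones in Python is shown below; the Hugging Face model IDs and the SAM checkpoint path are our assumptions, so substitute the copies you actually downloaded.

  ```python
  # Hedged loading example; model IDs and the checkpoint path are assumptions.
  from transformers import AutoModelForCausalLM, AutoTokenizer, CLIPImageProcessor, CLIPVisionModel
  from segment_anything import sam_model_registry

  # LLM backbone: Vicuna-7B v1.5
  tokenizer = AutoTokenizer.from_pretrained("lmsys/vicuna-7b-v1.5", use_fast=False)
  llm = AutoModelForCausalLM.from_pretrained("lmsys/vicuna-7b-v1.5")

  # 2D vision encoder: CLIP-ViT-L/14-336px
  clip_vision = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14-336")
  clip_processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14-336")

  # ViT-H-based SAM (checkpoint downloaded separately)
  sam = sam_model_registry["vit_h"](checkpoint="checkpoints/sam_vit_h_4b8939.pth")
  ```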
- Dataset Preprocessing:
  - Download the full ScanNetv2 dataset and the original ScanRefer, Multi3DRefer, ScanQA and Scan2Cap datasets to `annotations/`;
  - run `bash scripts/preprocess_dataset.sh`.
- Our system messages, instructions, and prompts are provided at `instruction_templates/`.
Step 1: Instance-level 3D feature extraction (corresponding to the folder `3d_feature_extraction`):
- We use Mask3D (the model trained on the ScanNet200 training set) to obtain segmented 3D proposals in a class-agnostic manner;
- then we use Uni3D (the pre-trained model uni3d-g) to extract 3D instance-level features;
- run `bash scripts/3d_feature_extraction.sh`.
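For orientation, here is a minimal sketch of what this per-instance extraction boils down to: gather the points of each class-agnostic Mask3D proposal and encode them into a single feature vector. The function, array layout and the `point_encoder` callable are placeholders for illustration (a stand-in for Uni3D), not the repository's actual code.

```python
# Illustrative sketch of Step 1 (not the repository's actual code).
import torch


def extract_instance_features(points, proposal_masks, point_encoder, device="cuda"):
    """points: (P, 6) xyz+rgb numpy array; proposal_masks: (N, P) boolean masks from Mask3D;
    point_encoder: stand-in for Uni3D, mapping a point cloud to a (1, D) feature."""
    features = []
    for mask in proposal_masks:
        inst_points = points[mask]                    # points belonging to this instance
        if inst_points.shape[0] == 0:                 # skip empty proposals
            continue
        pts = torch.from_numpy(inst_points).float().unsqueeze(0).to(device)  # (1, P_i, 6)
        with torch.no_grad():
            feat = point_encoder(pts)                 # (1, D) instance-level 3D feature
        features.append(feat.squeeze(0).cpu())
    return torch.stack(features)                      # (N_valid, D)
```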
Step 2: Multi-view 2D feature extraction (corresponding to the folder `2d_feature_extraction`):
- Based on the 3D instance-level segmentation results, we use SAM and CLIP to extract multi-view 2D features for each 3D instance;
- run `bash scripts/2d_feature_extraction.sh`.
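As a rough illustration of this step (again, not the repository's actual pipeline): for each view in which a 3D instance is visible, prompt SAM with the instance's projected 2D box, crop the masked region, encode the crop with CLIP, and average across views. The checkpoint path, model IDs and the pre-computed per-view boxes are assumptions; the 3D-to-2D projection is omitted.

```python
# Rough illustration of Step 2; paths, model IDs and inputs are assumptions.
import numpy as np
import torch
from PIL import Image
from segment_anything import SamPredictor, sam_model_registry
from transformers import CLIPImageProcessor, CLIPVisionModelWithProjection

sam = sam_model_registry["vit_h"](checkpoint="checkpoints/sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)
clip_model = CLIPVisionModelWithProjection.from_pretrained("openai/clip-vit-large-patch14-336")
clip_processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14-336")


def multiview_feature(view_images, boxes_2d):
    """view_images: list of HxWx3 uint8 RGB frames; boxes_2d: per-view [x0, y0, x1, y1]
    boxes of one 3D instance projected into each view (projection code omitted)."""
    feats = []
    for image, box in zip(view_images, boxes_2d):
        predictor.set_image(image)
        masks, _, _ = predictor.predict(box=np.array(box), multimask_output=False)
        masked = image * masks[0][..., None]                   # keep only the instance pixels
        x0, y0, x1, y1 = map(int, box)
        crop = Image.fromarray(masked[y0:y1, x0:x1].astype(np.uint8))
        inputs = clip_processor(images=crop, return_tensors="pt")
        with torch.no_grad():
            feats.append(clip_model(**inputs).image_embeds.squeeze(0))  # one CLIP embedding per view
    return torch.stack(feats).mean(dim=0)                      # aggregate across views
```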
Step 3: End-to-end multi-task instruction tuning:
- The code for training and evaluating the model is provided in `run/train.py` and `run/eval.py`;
- modify `train_tag` in `scripts/run_train.sh` to specify the datasets used for end-to-end joint training; you can try different combinations of training datasets or add customized datasets as you want;
- run `scripts/run_train.sh`.
- For evaluation, modify `evaluate=True` and `pretrained_checkpoint="outputs/ckpt_latest.pth"` in `scripts/run_eval.sh`;
- modify `val_tag` in `scripts/run_eval.sh` to specify the datasets used for evaluation as you want;
- run `scripts/run_eval.sh`.
We are grateful for the open-source contributions of other projects.
If you find our Inst3D-LMM useful for your research, please consider giving this repository a star and citing our paper as follows:
```bibtex
@misc{Inst3D-LMM,
      title={Inst3D-LMM: Instance-Aware 3D Scene Understanding with Multi-modal Instruction Tuning},
      author={Hanxun Yu and Wentong Li and Song Wang and Junbo Chen and Jianke Zhu},
      year={2025},
      eprint={2503.00513},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2503.00513},
}
```