
✨MLLM-4D: Towards Visual-based Spatial-Temporal Intelligence✨

ArXiv | PDF | Model | Dataset

1 University of Macau, 2 GVC Lab, Great Bay University, 3 Xi’an Jiaotong University, 4 CUHKSZ

TL;DR: MLLM-4D achieves advanced visual-based spatial-temporal intelligence. Our method focuses on understanding and reasoning about the time-evolving relationships between objects and the camera in 3D space. Read our paper for more details.

*(Teaser figure)*

⚙️ Setup

1. Clone MLLM-4D

git clone https://github.com/GVCLab/MLLM-4D.git
cd MLLM-4D

2. Setup Environments

MLLM-4D is tested with CUDA 12.1/12.8 on an H100 GPU.

conda create -n mllm4d python=3.10
conda activate mllm4d 
pip install torch==2.8.0 torchvision==0.23.0 torchaudio==2.8.0 --index-url https://download.pytorch.org/whl/cu128
pip install -r requirements.txt
wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu12torch2.8cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
pip install flash_attn-2.8.3+cu12torch2.8cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
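The flash-attention wheel filename encodes the CPython ABI (`cp310`), the torch version (`torch2.8`), and the CUDA major version (`cu12`); installing a wheel whose tags do not match the environment typically fails at import time. A minimal sketch of checking the `cpXY` tag against the running interpreter (illustrative, standard library only):

```python
import re
import sys

def wheel_python_tag(wheel_name):
    """Extract the (major, minor) CPython version from a wheel's -cpXY- tag."""
    m = re.search(r"-cp(\d)(\d+)-", wheel_name)
    if m is None:
        return None  # no CPython tag in the filename
    return (int(m.group(1)), int(m.group(2)))

wheel = "flash_attn-2.8.3+cu12torch2.8cxx11abiFALSE-cp310-cp310-linux_x86_64.whl"
tag = wheel_python_tag(wheel)
if tag != sys.version_info[:2]:
    print(f"Warning: wheel built for Python {tag[0]}.{tag[1]}, "
          f"but running {sys.version_info.major}.{sys.version_info.minor}")
```

If the tag does not match, pick the wheel for your Python version from the same flash-attention release page instead of the `cp310` build shown above.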

3. Download Pretrained Models

python scripts/download_ckpt_hf.py

4. Download the Datasets

python scripts/download_dataset_hf.py

💫 Inference

1. Inference Demo

# for MLLM-4D-SFT
python scripts/inference.py --model_type "MLLM-4D-SFT" --model_path PATH-to-MLLM-4D-SFT
# for MLLM-4D-RFT
python scripts/inference.py --model_type "MLLM-4D-RFT" --model_path PATH-to-MLLM-4D-RFT

🚂 Training

1. Supervised Fine-Tuning Using Our MLLM4D-2M Dataset.

Please set the parameters in `__init__.py` and `sft_qwen3-vl-8b_mllm4d-2M.sh` before launching.

cd qwen-vl-sft
bash scripts/sft_qwen3-vl-8b_mllm4d-2M.sh

2. Cold-Start Fine-Tuning Using Our Cold-Start Data.

Please set the parameters in `__init__.py` and `sft_mllm-4d-sft_cold-start.sh` before launching.

cd qwen-vl-sft
bash scripts/sft_mllm-4d-sft_cold-start.sh

3. Reinforcement Fine-Tuning Using Our MLLM4D-R1-30k Dataset.

The MLLM4D-R1-30k dataset and the Reinforcement Fine-Tuning code have not been released yet; see the TODO list below.

📋 TODO

  • We have completed the code and data cleanup. Release coming soon!
  • RFT Stage: Release the MLLM4D-R1-30k dataset and Reinforcement Fine-Tuning code!
  • [2026.03.04] 🔥 Cold-Start Phase: Release the Cold-Start Data and Cold-Start Fine-Tuning code!
  • [2026.03.04] 🔥 SFT Stage: Release the MLLM4D-2M dataset and Supervised Fine-Tuning code!
  • [2026.02.28] 🔥 Release the arXiv paper, inference demo, and pretrained weights!

🤗 Acknowledgement

If you find MLLM-4D useful, please help by ⭐ starring this repo; stars mean a lot to open-source projects. Thanks 🙏!

Our work is built upon Qwen3-VL, thanks to their invaluable contributions!

📚 Citation

If you find the work useful, please consider citing:

@article{yin2026mllm4d,
    title={MLLM-4D: Towards Visual-based Spatial-Temporal Intelligence},
    author={Yin, Xingyilang and Li, Chengzhengxu and Chang, Jiahao and Pun, Chi-Man and Cun, Xiaodong},
    journal={arXiv preprint arXiv:2603.00515},
    year={2026}
}
