Xingyilang Yin*1,2, Chengzhengxu Li*3, Jiahao Chang4, Chi-Man Pun1,📫, Xiaodong Cun2,📫
TL;DR: MLLM-4D achieves advanced visual-based spatial-temporal intelligence. Our method focuses on understanding and reasoning about the time-evolving relationships between objects and the camera within 3D space. Read our paper for more details.
MLLM-4D is tested with CUDA 12.1/12.8 on an H100.

```bash
git clone https://github.com/GVCLab/MLLM-4D.git
cd MLLM-4D

conda create -n mllm4d python=3.10
conda activate mllm4d

pip install torch==2.8.0 torchvision==0.23.0 torchaudio==2.8.0 --index-url https://download.pytorch.org/whl/cu128
pip install -r requirements.txt

wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu12torch2.8cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
pip install flash_attn-2.8.3+cu12torch2.8cxx11abiFALSE-cp310-cp310-linux_x86_64.whl

python scripts/download_ckpt_hf.py
python scripts/download_dataset_hf.py
```
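After installation, a quick environment check can catch version mismatches before any inference or training run. This is a minimal sketch, not part of the MLLM-4D repo: the pinned versions mirror the `pip install` commands above, and the helper name `check_pins` is our own.

```python
# Minimal environment sanity check (not part of the MLLM-4D repo).
# The pins mirror the pip commands above; adjust them if you install
# a different CUDA build.
from importlib import metadata

PINS = {"torch": "2.8.0", "torchvision": "0.23.0", "torchaudio": "2.8.0"}

def check_pins(pins):
    """Return {name: 'ok' | 'found <version>' | 'missing'} for each pin."""
    status = {}
    for name, want in pins.items():
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            status[name] = "missing"
            continue
        status[name] = "ok" if found == want else f"found {found}"
    return status

if __name__ == "__main__":
    for name, state in check_pins(PINS).items():
        print(f"{name}: {state}")
```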
```bash
# for MLLM-4D-SFT
python scripts/inference.py --model_type "MLLM-4D-SFT" --model_path PATH-to-MLLM-4D-SFT

# for MLLM-4D-RFT
python scripts/inference.py --model_type "MLLM-4D-RFT" --model_path PATH-to-MLLM-4D-RFT
```

Please set up the parameters in `__init__.py` and `sft_qwen3-vl-8b_mllm4d-2M.sh`.
```bash
cd qwen-vl-sft
bash scripts/sft_qwen3-vl-8b_mllm4d-2M.sh
```

Please set up the parameters in `__init__.py` and `sft_mllm-4d-sft_cold-start.sh`.
```bash
cd qwen-vl-sft
bash scripts/sft_mllm-4d-sft_cold-start.sh
```

- We have completed the code and data cleanup. Release coming soon!
- RFT Stage: Release the `MLLM4D-R1-30k` dataset and Reinforcement Fine-Tuning code!
- [2026.03.04] 🔥 Cold-Start Phase: Release the Cold-Start Data and Cold-Start Fine-Tuning code!
- [2026.03.04] 🔥 SFT Stage: Release the `MLLM4D-2M` dataset and Supervised Fine-Tuning code!
- [2026.02.28] 🔥 Release the arXiv paper, inference demo, and pretrained weights!
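The two `scripts/inference.py` invocations from the inference section can also be driven from Python when you want to sweep both released checkpoints. This is a sketch under assumptions: `build_inference_cmd` and `run_all` are our own helper names, and the checkpoint paths are the same placeholders used above.

```python
# Sweep both released checkpoints through scripts/inference.py.
# Helper names are ours, not part of the MLLM-4D repo; the flag names
# mirror the README commands.
import subprocess
import sys

def build_inference_cmd(model_type, model_path):
    """Assemble the argv list for one scripts/inference.py run."""
    return [
        sys.executable, "scripts/inference.py",
        "--model_type", model_type,
        "--model_path", model_path,
    ]

def run_all(checkpoints, dry_run=True):
    """Run (or, by default, just print) one inference command per checkpoint."""
    for model_type, model_path in checkpoints.items():
        cmd = build_inference_cmd(model_type, model_path)
        if dry_run:
            print("would run:", " ".join(cmd))
        else:
            subprocess.run(cmd, check=True)

if __name__ == "__main__":
    # Placeholder paths: point these at your downloaded checkpoints.
    run_all({
        "MLLM-4D-SFT": "PATH-to-MLLM-4D-SFT",
        "MLLM-4D-RFT": "PATH-to-MLLM-4D-RFT",
    })
```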
If you find MLLM-4D useful, please help by ⭐ starring this repo; stars are important for open-source projects. Thanks 🙏!
Our work is built upon Qwen3-VL; we thank the authors for their invaluable contributions!
If you find the work useful, please consider citing:
@article{yin2026mllm4d,
title={MLLM-4D: Towards Visual-based Spatial-Temporal Intelligence},
author={Yin, Xingyilang and Li, Chengzhengxu and Chang, Jiahao and Pun, Chi-Man and Cun, Xiaodong},
journal={arXiv preprint arXiv:2603.00515},
year={2026}
}
