Xingyilang Yin*1,2, Chengzhengxu Li*3, Jiahao Chang4, Chi-Man Pun1,📫, Xiaodong Cun2,📫
TL;DR: MLLM-4D achieves advanced visual-based spatial-temporal intelligence. Our method focuses on understanding and reasoning about the time-evolving relationships between objects and the camera within 3D space. Read our paper for more details.
MLLM-4D is tested with CUDA 12.1/12.8 on an H100.

```bash
git clone https://github.com/GVCLab/MLLM-4D.git
cd MLLM-4D

conda create -n mllm4d python=3.10
conda activate mllm4d

pip install torch==2.8.0 torchvision==0.23.0 torchaudio==2.8.0 --index-url https://download.pytorch.org/whl/cu128
pip install -r requirements.txt

wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu12torch2.8cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
pip install flash_attn-2.8.3+cu12torch2.8cxx11abiFALSE-cp310-cp310-linux_x86_64.whl

python scripts/download_ckpt_hf.py
python scripts/download_dataset_hf.py
```
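After installation, a quick environment check can catch version mismatches before any inference or training run. This is a minimal sketch, not part of the MLLM-4D repo: the pinned versions mirror the `pip install` commands above, and the helper name `check_pins` is our own.

```python
# Minimal environment sanity check (not part of the MLLM-4D repo).
# The pins mirror the pip commands above; adjust them if you install
# a different CUDA build.
from importlib import metadata

PINS = {"torch": "2.8.0", "torchvision": "0.23.0", "torchaudio": "2.8.0"}

def check_pins(pins):
    """Return {name: 'ok' | 'found <version>' | 'missing'} for each pin."""
    status = {}
    for name, want in pins.items():
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            status[name] = "missing"
            continue
        status[name] = "ok" if found == want else f"found {found}"
    return status

if __name__ == "__main__":
    for name, state in check_pins(PINS).items():
        print(f"{name}: {state}")
```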
```bash
# for MLLM-4D-SFT
python scripts/inference.py --model_type "MLLM-4D-SFT" --model_path PATH-to-MLLM-4D-SFT

# for MLLM-4D-RFT
python scripts/inference.py --model_type "MLLM-4D-RFT" --model_path PATH-to-MLLM-4D-RFT
```

Please set up the parameters in `__init__.py` and `sft_qwen3-vl-8b_mllm4d-2M.sh`.
```bash
cd qwen-vl-sft
bash scripts/sft_qwen3-vl-8b_mllm4d-2M.sh
```

Please set up the parameters in `__init__.py` and `sft_mllm-4d-sft_cold-start.sh`.
```bash
cd qwen-vl-sft
bash scripts/sft_mllm-4d-sft_cold-start.sh
```

- We have completed the code and data cleanup. Release coming soon!
- RFT Stage: Release the `MLLM4D-R1-30k` dataset and Reinforcement Fine-Tuning code!
- [2026.03.04] 🔥 Cold-Start Phase: Release the Cold-Start Data and Cold-Start Fine-Tuning code!
- [2026.03.04] 🔥 SFT Stage: Release the `MLLM4D-2M` dataset and Supervised Fine-Tuning code!
- [2026.02.28] 🔥 Release the arXiv paper, inference demo, and pretrained weights!
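The two `scripts/inference.py` invocations from the inference section can also be driven from Python when you want to sweep both released checkpoints. This is a sketch under assumptions: `build_inference_cmd` and `run_all` are our own helper names, and the checkpoint paths are the same placeholders used above.

```python
# Sweep both released checkpoints through scripts/inference.py.
# Helper names are ours, not part of the MLLM-4D repo; the flag names
# mirror the README commands.
import subprocess
import sys

def build_inference_cmd(model_type, model_path):
    """Assemble the argv list for one scripts/inference.py run."""
    return [
        sys.executable, "scripts/inference.py",
        "--model_type", model_type,
        "--model_path", model_path,
    ]

def run_all(checkpoints, dry_run=True):
    """Run (or, by default, just print) one inference command per checkpoint."""
    for model_type, model_path in checkpoints.items():
        cmd = build_inference_cmd(model_type, model_path)
        if dry_run:
            print("would run:", " ".join(cmd))
        else:
            subprocess.run(cmd, check=True)

if __name__ == "__main__":
    # Placeholder paths: point these at your downloaded checkpoints.
    run_all({
        "MLLM-4D-SFT": "PATH-to-MLLM-4D-SFT",
        "MLLM-4D-RFT": "PATH-to-MLLM-4D-RFT",
    })
```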
If you find MLLM-4D useful, please help by ⭐ starring this repo; stars are important for open-source projects. Thanks 🙏!
Our work is built upon Qwen3-VL; we thank the authors for their invaluable contributions!
If you find the work useful, please consider citing:
@article{yin2026mllm4d,
title={MLLM-4D: Towards Visual-based Spatial-Temporal Intelligence},
author={Yin, Xingyilang and Li, Chengzhengxu and Chang, Jiahao and Pun, Chi-Man and Cun, Xiaodong},
journal={arXiv preprint arXiv:2603.00515},
year={2026}
}
