Thinking in Dynamics:
How Multimodal Large Language Models Perceive, Track, and Reason Dynamics in Physical 4D World
Yuzhi Huang*♠1 Kairun Wen*1 Rongxin Gao*1 Dongxuan Liu1 Yibin Lou3 Jie Wu2 Jing Xu7 Jian Zhang1 Zheng Yang1 Yunlong Lin1 Chenxin Li4 Panwang Pan1 Junbin Lu5 Jingyan Jiang6 Xinghao Ding1 Yue Huang† 1 Zhi Wang2
1XMU
2THU
3SUSTech
4CUHK
5UW
6SZTU
7JNU
*Equal Contribution †Corresponding author ♠Project lead
Humans inhabit a physical 4D world where geometric structure and semantic content evolve over time. While current Multimodal Large Language Models (MLLMs) excel in static visual understanding, can they also be adept at "thinking in dynamics" — perceiving, tracking and reasoning about spatio-temporal dynamics in evolving scenes? We introduce Dyn-Bench, a large-scale benchmark comprising 1k videos, 7k VQA pairs, and 3k dynamic object grounding pairs, systematically assessing MLLMs' ability to perceive, track, and reason about object motion, scene evolution, and camera motion in the physical 4D world.
- [2026.03.16] 📄 Thinking in Dynamics paper released on arXiv.
- [2026.03.01] 🚀 Dyn-Bench evaluation code and benchmark released.
- [2026.02.21] 🎉 Thinking in Dynamics has been accepted by CVPR 2026!
- Release
- Contents
- Dyn-Bench
- Results
- Run Your Own Evaluation
- Model List
- UniPixel Special Instructions
- Acknowledgement
- Citation
Overview: We introduce Dyn-Bench, a spatio-temporal dynamics reasoning benchmark built from diverse real-world and synthetic video datasets. Dyn-Bench comprises 1k videos, 7k visual question answering (VQA) pairs, and 3k dynamic object grounding pairs, enabling robust and scalable evaluation of spatio-temporal understanding across three evaluation dimensions.
Evaluation Dimensions
| Category | Task Suffix | Description |
|---|---|---|
| Camera-Object | cameraqa, cameramask | Reasoning about object dynamics relative to camera motion |
| Inter-Object | qa, objmask | Reasoning about interactions and relative dynamics between objects |
| Object-Scene | sceneqa, scenemask | Reasoning about how objects interact with and evolve within the scene |
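The six task suffixes partition into the three evaluation dimensions above. A tiny illustrative lookup (the names are hypothetical and not taken from the evaluation code) could be:

```python
# Hypothetical helper: map a task suffix from the table above to its
# evaluation dimension. Names are illustrative, not from the repo.
SUFFIX_TO_DIMENSION = {
    "cameraqa": "Camera-Object", "cameramask": "Camera-Object",
    "qa": "Inter-Object",        "objmask": "Inter-Object",
    "sceneqa": "Object-Scene",   "scenemask": "Object-Scene",
}

def dimension_for(task_suffix: str) -> str:
    """Return the evaluation dimension for a given task suffix."""
    return SUFFIX_TO_DIMENSION[task_suffix]
```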
Evaluation Metrics
- QA Accuracy: Answer-matching accuracy on VQA tasks
- Mask J&F Score: Average of segmentation mask IoU (J) and boundary F-measure (F) for grounding tasks
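As a sketch of the Mask J&F score under the standard DAVIS-style definition, assuming binary NumPy masks and, for simplicity, a zero-pixel boundary tolerance (the actual evaluation exposes a configurable boundary_threshold); this is an illustrative re-implementation, not the benchmark's official scorer:

```python
# Illustrative J&F sketch (assumption: binary masks, zero boundary tolerance).
import numpy as np

def jaccard(pred: np.ndarray, gt: np.ndarray) -> float:
    """Region similarity J: intersection-over-union of two binary masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return 1.0 if union == 0 else inter / union

def boundary(mask: np.ndarray) -> np.ndarray:
    """Boundary pixels: foreground pixels with a background 4-neighbor."""
    padded = np.pad(mask, 1, constant_values=False)
    interior = (padded[:-2, 1:-1] & padded[2:, 1:-1] &
                padded[1:-1, :-2] & padded[1:-1, 2:])
    return mask & ~interior

def boundary_f(pred: np.ndarray, gt: np.ndarray) -> float:
    """Simplified boundary F-measure (exact-overlap matching, no tolerance)."""
    bp, bg = boundary(pred), boundary(gt)
    if bp.sum() == 0 and bg.sum() == 0:
        return 1.0
    precision = (bp & bg).sum() / max(bp.sum(), 1)
    recall = (bp & bg).sum() / max(bg.sum(), 1)
    denom = precision + recall
    return 0.0 if denom == 0 else 2 * precision * recall / denom

def j_and_f(pred: np.ndarray, gt: np.ndarray) -> float:
    """Average of region similarity J and boundary F-measure."""
    return (jaccard(pred, gt) + boundary_f(pred, gt)) / 2
```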
Evaluation Setup: We probe general, spatial, and region-level MLLMs across all three evaluation dimensions, assessing both linguistic (VQA) and visual (mask grounding) dynamics understanding. We find that existing models cannot simultaneously maintain strong performance in both spatio-temporal reasoning and dynamic object grounding, often producing inconsistent interpretations of motion and interaction.
Note: Conventional prompting strategies (e.g., chain-of-thought or caption-based hints) provide limited improvement, whereas structured integration approaches — including Mask-Guided Fusion and Spatio-Temporal Textual Cognitive Map (ST-TCM) — significantly enhance MLLMs' dynamics perception and spatio-temporal reasoning in the physical 4D world.
Our benchmark is hosted on HuggingFace. You can access the benchmark data using:
```python
# NOTE: pip install datasets
from datasets import load_dataset

dyn_bench = load_dataset("kairunwen/DynamicVerse")
print(dyn_bench)
```

1. Environment Setup
```shell
conda create -n bench python=3.11
conda activate bench

# Install PyTorch (choose according to your CUDA version, see https://pytorch.org)
pip install -r requirements.txt

# Install flash-attn (pre-compiled wheels recommended)
# For Linux: https://github.com/Dao-AILab/flash-attention/releases
# For Windows: https://github.com/sdbds/flash-attention-for-windows/releases
```

2. Clone Repository
```shell
git clone https://github.com/LilyYang0504/bench.git
cd bench
```

Edit conf/config.yaml:
```yaml
datasets:
  repo_name: "Huggingface/DatasetsRepo"
  datasets_path: "path/for/your/datasets/download"
model:
  model_path: "path/to/model"
  model_name: "Huggingface/ModelID"
  download_path: "path/for/your/model/download"
  device: "cuda"
  torch_dtype: "bfloat16"
  use_flash_attn: true
  trust_remote_code: true
task: "all" # choices: all / qa / mask
evaluation:
  boundary_threshold: 2
  result_path: "results"
```

Download datasets:
```shell
python download_datasets.py
```

Download a single model:
```shell
python download_model.py --model "Huggingface/ModelID"
```

Batch model download (edit conf/model_list.txt first):
```shell
python download_model.py --batch
```

Run evaluation:
```shell
bash start_eval.sh
```

Project structure:

```
bench/
├── conf/
│   ├── config.yaml
│   └── model_list.txt
├── utils/
├── thirdparty/
├── download_datasets.py
├── download_model.py
├── eval.py
└── start_eval.sh
```
Models Supporting QA + Mask Tasks
| Model Family | HuggingFace Model ID |
|---|---|
| Sa2VA | ByteDance/Sa2VA-{x}B |
| Sa2VA-InternVL3 | ByteDance/Sa2VA-InternVL3-{x}B |
| Sa2VA-Qwen2.5-VL | ByteDance/Sa2VA-Qwen2_5-VL-{x}B |
| Sa2VA-Qwen3-VL | ByteDance/Sa2VA-Qwen3-VL-{x}B |
| UniPixel | PolyU-ChenLab/UniPixel-{x}B (requires additional installation) |
Models Supporting QA Tasks Only
| Model Family | HuggingFace Model ID |
|---|---|
| InternVL3 | OpenGVLab/InternVL3-{x}B |
| InternVL3.5 | OpenGVLab/InternVL3_5-{x}B |
| Qwen2.5-VL | Qwen/Qwen2.5-VL-{x}B-Instruct |
| Qwen3-VL | Qwen/Qwen3-VL-{x}B-Instruct |
| Qwen3-VL-MoE | Qwen/Qwen3-VL-235B-A22B-Instruct |
| LLaVA-OneVision | lmms-lab/LLaVA-One-Vision-1.5-{x}B-Instruct |
| SpaceR-SFT | RUBBISHLIKE/SpaceR-SFT-{x}B |
| VST | rayruiyang/VST-{x}B-RL |
| Spatial-SSRL | internlm/Spatial-SSRL-{x}B |
| SpatialLadder | hongxingli/SpatialLadder-{x}B |
Replace {x}B with the actual model parameter size. Please check HuggingFace for available sizes.
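The placeholder substitution can also be done programmatically; the helper below is hypothetical (not part of this repo) and simply expands the {x}B template:

```python
# Hypothetical helper (not part of this repo): expand the {x}B size
# placeholder in a HuggingFace model-ID template from the tables above.
def expand_model_id(template: str, size: int) -> str:
    """Replace the {x}B placeholder with a concrete parameter size."""
    return template.replace("{x}B", f"{size}B")

print(expand_model_id("ByteDance/Sa2VA-{x}B", 4))  # → ByteDance/Sa2VA-4B
```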
UniPixel requires additional dependencies. See the UniPixel GitHub repository:

```shell
cd bench
mkdir thirdparty
cd thirdparty
git clone https://github.com/PolyU-ChenLab/UniPixel.git
cd UniPixel
pip install -r requirements.txt
```

We gratefully acknowledge the open-source community and the authors of the video datasets and foundation models used in constructing Dyn-Bench. Our evaluation framework builds upon the excellent community toolkits for evaluating multimodal large language models.
If you find our paper and code useful in your research, please consider giving us a star ⭐ and citing our work 📝 :)
```bibtex
@misc{huang2026thinkingdynamicsmultimodallarge,
      title={Thinking in Dynamics: How Multimodal Large Language Models Perceive, Track, and Reason Dynamics in Physical 4D World},
      author={Yuzhi Huang and Kairun Wen and Rongxin Gao and Dongxuan Liu and Yibin Lou and Jie Wu and Jing Xu and Jian Zhang and Zheng Yang and Yunlong Lin and Chenxin Li and Panwang Pan and Junbin Lu and Jingyan Jiang and Xinghao Ding and Yue Huang and Zhi Wang},
      year={2026},
      eprint={2603.12746},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2603.12746},
}
```