Thinking in Dynamics:
How Multimodal Large Language Models Perceive, Track, and Reason Dynamics in Physical 4D World
Yuzhi Huang*♠1 Kairun Wen*1 Rongxin Gao*1 Dongxuan Liu1 Yibin Lou3 Jie Wu2 Jing Xu7 Jian Zhang1 Zheng Yang1 Yunlong Lin1 Chenxin Li4 Panwang Pan1 Junbin Lu5 Jingyan Jiang6 Xinghao Ding1 Yue Huang† 1 Zhi Wang2
1XMU
2THU
3SUSTech
4CUHK
5UW
6SZTU
7JNU
*Equal Contribution †Corresponding author ♠Project lead
Humans inhabit a physical 4D world where geometric structure and semantic content evolve over time. While current Multimodal Large Language Models (MLLMs) excel in static visual understanding, can they also be adept at "thinking in dynamics" — perceiving, tracking and reasoning about spatio-temporal dynamics in evolving scenes? We introduce Dyn-Bench, a large-scale benchmark comprising 1k videos, 7k VQA pairs, and 3k dynamic object grounding pairs, systematically assessing MLLMs' ability to perceive, track, and reason about object motion, scene evolution, and camera motion in the physical 4D world.
- [2026.03.16] 📄 Thinking in Dynamics paper released on arXiv.
- [2026.03.01] 🚀 Dyn-Bench evaluation code and benchmark released.
- [2026.02.21] 🎉 Thinking in Dynamics has been accepted by CVPR 2026!
- Release
- Contents
- Dyn-Bench
- Results
- Run Your Own Evaluation
- Model List
- UniPixel Special Instructions
- Acknowledgement
- Citation
Overview: We introduce Dyn-Bench, a spatio-temporal dynamics reasoning benchmark built from diverse real-world and synthetic video datasets. Dyn-Bench comprises 1k videos, 7k visual question answering (VQA) pairs, and 3k dynamic object grounding pairs, enabling robust and scalable evaluation of spatio-temporal understanding across three evaluation dimensions.
Evaluation Dimensions
| Category | Task Suffix | Description |
|---|---|---|
| Camera-Object | cameraqa, cameramask | Reasoning about object dynamics relative to camera motion |
| Inter-Object | qa, objmask | Reasoning about interactions and relative dynamics between objects |
| Object-Scene | sceneqa, scenemask | Reasoning about how objects interact with and evolve within the scene |
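The six task suffixes partition into the three evaluation dimensions above. A tiny illustrative lookup (the names are hypothetical and not taken from the evaluation code) could be:

```python
# Hypothetical helper: map a task suffix from the table above to its
# evaluation dimension. Names are illustrative, not from the repo.
SUFFIX_TO_DIMENSION = {
    "cameraqa": "Camera-Object", "cameramask": "Camera-Object",
    "qa": "Inter-Object",        "objmask": "Inter-Object",
    "sceneqa": "Object-Scene",   "scenemask": "Object-Scene",
}

def dimension_for(task_suffix: str) -> str:
    """Return the evaluation dimension for a given task suffix."""
    return SUFFIX_TO_DIMENSION[task_suffix]
```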
Evaluation Metrics
- QA Accuracy: Answer-matching accuracy on VQA tasks
- Mask J&F Score: Average of segmentation mask IoU (J) and boundary F-measure (F) for grounding tasks
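As a sketch of the Mask J&F score under the standard DAVIS-style definition, assuming binary NumPy masks and, for simplicity, a zero-pixel boundary tolerance (the actual evaluation exposes a configurable boundary_threshold); this is an illustrative re-implementation, not the benchmark's official scorer:

```python
# Illustrative J&F sketch (assumption: binary masks, zero boundary tolerance).
import numpy as np

def jaccard(pred: np.ndarray, gt: np.ndarray) -> float:
    """Region similarity J: intersection-over-union of two binary masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return 1.0 if union == 0 else inter / union

def boundary(mask: np.ndarray) -> np.ndarray:
    """Boundary pixels: foreground pixels with a background 4-neighbor."""
    padded = np.pad(mask, 1, constant_values=False)
    interior = (padded[:-2, 1:-1] & padded[2:, 1:-1] &
                padded[1:-1, :-2] & padded[1:-1, 2:])
    return mask & ~interior

def boundary_f(pred: np.ndarray, gt: np.ndarray) -> float:
    """Simplified boundary F-measure (exact-overlap matching, no tolerance)."""
    bp, bg = boundary(pred), boundary(gt)
    if bp.sum() == 0 and bg.sum() == 0:
        return 1.0
    precision = (bp & bg).sum() / max(bp.sum(), 1)
    recall = (bp & bg).sum() / max(bg.sum(), 1)
    denom = precision + recall
    return 0.0 if denom == 0 else 2 * precision * recall / denom

def j_and_f(pred: np.ndarray, gt: np.ndarray) -> float:
    """Average of region similarity J and boundary F-measure."""
    return (jaccard(pred, gt) + boundary_f(pred, gt)) / 2
```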
Evaluation Setup: We probe general, spatial, and region-level MLLMs across all three evaluation dimensions, assessing both linguistic (VQA) and visual (mask grounding) dynamics understanding. We find that existing models cannot simultaneously maintain strong performance in both spatio-temporal reasoning and dynamic object grounding, often producing inconsistent interpretations of motion and interaction.
Note: Conventional prompting strategies (e.g., chain-of-thought or caption-based hints) provide limited improvement, whereas structured integration approaches — including Mask-Guided Fusion and Spatio-Temporal Textual Cognitive Map (ST-TCM) — significantly enhance MLLMs' dynamics perception and spatio-temporal reasoning in the physical 4D world.
Our benchmark is hosted on HuggingFace. You can access the benchmark data using:
```python
# NOTE: pip install datasets
from datasets import load_dataset

dyn_bench = load_dataset("kairunwen/DynamicVerse")
print(dyn_bench)
```

1. Environment Setup
```shell
conda create -n bench python=3.11
conda activate bench

# Install PyTorch (choose according to your CUDA version, see https://pytorch.org)
pip install -r requirements.txt

# Install flash-attn (pre-compiled wheels recommended)
# For Linux: https://github.com/Dao-AILab/flash-attention/releases
# For Windows: https://github.com/sdbds/flash-attention-for-windows/releases
```

2. Clone Repository
```shell
git clone https://github.com/LilyYang0504/bench.git
cd bench
```

Edit conf/config.yaml:
```yaml
datasets:
  repo_name: "Huggingface/DatasetsRepo"
  datasets_path: "path/for/your/datasets/download"
model:
  model_path: "path/to/model"
  model_name: "Huggingface/ModelID"
  download_path: "path/for/your/model/download"
  device: "cuda"
  torch_dtype: "bfloat16"
  use_flash_attn: true
  trust_remote_code: true
task: "all" # choices: all / qa / mask
evaluation:
  boundary_threshold: 2
  result_path: "results"
```

Download datasets:
```shell
python download_datasets.py
```

Download a single model:
```shell
python download_model.py --model "Huggingface/ModelID"
```

Batch model download (edit conf/model_list.txt first):
```shell
python download_model.py --batch
```

Run evaluation:
```shell
bash start_eval.sh
```

Project structure:

```
bench/
├── conf/
│   ├── config.yaml
│   └── model_list.txt
├── utils/
├── thirdparty/
├── download_datasets.py
├── download_model.py
├── eval.py
└── start_eval.sh
```
Models Supporting QA + Mask Tasks
| Model Family | HuggingFace Model ID |
|---|---|
| Sa2VA | ByteDance/Sa2VA-{x}B |
| Sa2VA-InternVL3 | ByteDance/Sa2VA-InternVL3-{x}B |
| Sa2VA-Qwen2.5-VL | ByteDance/Sa2VA-Qwen2_5-VL-{x}B |
| Sa2VA-Qwen3-VL | ByteDance/Sa2VA-Qwen3-VL-{x}B |
| UniPixel | PolyU-ChenLab/UniPixel-{x}B (requires additional installation) |
Models Supporting QA Tasks Only
| Model Family | HuggingFace Model ID |
|---|---|
| InternVL3 | OpenGVLab/InternVL3-{x}B |
| InternVL3.5 | OpenGVLab/InternVL3_5-{x}B |
| Qwen2.5-VL | Qwen/Qwen2.5-VL-{x}B-Instruct |
| Qwen3-VL | Qwen/Qwen3-VL-{x}B-Instruct |
| Qwen3-VL-MoE | Qwen/Qwen3-VL-235B-A22B-Instruct |
| LLaVA-OneVision | lmms-lab/LLaVA-One-Vision-1.5-{x}B-Instruct |
| SpaceR-SFT | RUBBISHLIKE/SpaceR-SFT-{x}B |
| VST | rayruiyang/VST-{x}B-RL |
| Spatial-SSRL | internlm/Spatial-SSRL-{x}B |
| SpatialLadder | hongxingli/SpatialLadder-{x}B |
Replace {x}B with the actual model parameter size. Please check HuggingFace for available sizes.
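The placeholder substitution can also be done programmatically; the helper below is hypothetical (not part of this repo) and simply expands the {x}B template:

```python
# Hypothetical helper (not part of this repo): expand the {x}B size
# placeholder in a HuggingFace model-ID template from the tables above.
def expand_model_id(template: str, size: int) -> str:
    """Replace the {x}B placeholder with a concrete parameter size."""
    return template.replace("{x}B", f"{size}B")

print(expand_model_id("ByteDance/Sa2VA-{x}B", 4))  # → ByteDance/Sa2VA-4B
```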
UniPixel requires additional dependencies. See the UniPixel GitHub repository:

```shell
cd bench
mkdir thirdparty
cd thirdparty
git clone https://github.com/PolyU-ChenLab/UniPixel.git
cd UniPixel
pip install -r requirements.txt
```

We gratefully acknowledge the open-source community and the authors of the video datasets and foundation models used in constructing Dyn-Bench. Our evaluation framework builds upon the excellent community toolkits for evaluating multimodal large language models.
If you find our paper and code useful in your research, please consider giving us a star ⭐ and citing our work 📝 :)
```bibtex
@misc{huang2026thinkingdynamicsmultimodallarge,
      title={Thinking in Dynamics: How Multimodal Large Language Models Perceive, Track, and Reason Dynamics in Physical 4D World},
      author={Yuzhi Huang and Kairun Wen and Rongxin Gao and Dongxuan Liu and Yibin Lou and Jie Wu and Jing Xu and Jian Zhang and Zheng Yang and Yunlong Lin and Chenxin Li and Panwang Pan and Junbin Lu and Jingyan Jiang and Xinghao Ding and Yue Huang and Zhi Wang},
      year={2026},
      eprint={2603.12746},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2603.12746},
}
```