Dynamics-X/Thinking-in-Dynamics

Thinking in Dynamics:
How Multimodal Large Language Models Perceive, Track, and Reason Dynamics in Physical 4D World


arXiv Website HF Dataset: Dyn_Bench

Yuzhi Huang*♠1   Kairun Wen*1   Rongxin Gao*1   Dongxuan Liu1   Yibin Lou3   Jie Wu2   Jing Xu7   Jian Zhang1   Zheng Yang1   Yunlong Lin1   Chenxin Li4   Panwang Pan1   Junbin Lu5   Jingyan Jiang6   Xinghao Ding1   Yue Huang† 1   Zhi Wang2

1XMU    2THU    3SUSTech    4CUHK    5UW    6SZTU    7JNU

*Equal Contribution     †Corresponding author     ♠Project lead

🌟 CVPR 2026 🌟

Humans inhabit a physical 4D world where geometric structure and semantic content evolve over time. While current Multimodal Large Language Models (MLLMs) excel in static visual understanding, can they also be adept at "thinking in dynamics" — perceiving, tracking and reasoning about spatio-temporal dynamics in evolving scenes? We introduce Dyn-Bench, a large-scale benchmark comprising 1k videos, 7k VQA pairs, and 3k dynamic object grounding pairs, systematically assessing MLLMs' ability to perceive, track, and reason about object motion, scene evolution, and camera motion in the physical 4D world.

Release

  • [2026.03.16] 📄 Thinking in Dynamics paper released on arXiv.
  • [2026.03.01] 🚀 Dyn-Bench evaluation code and benchmark released.
  • [2026.02.21] 🎉 Thinking in Dynamics has been accepted by CVPR 2026!

Contents

Dyn-Bench

Overview: We introduce Dyn-Bench, a spatio-temporal dynamics reasoning benchmark built from diverse real-world and synthetic video datasets. Dyn-Bench comprises 1k videos, 7k visual question answering (VQA) pairs, and 3k dynamic object grounding pairs, enabling robust and scalable evaluation of spatio-temporal understanding across three evaluation dimensions.

Evaluation Dimensions

| Category | Task Suffix | Description |
| --- | --- | --- |
| Camera-Object | `cameraqa`, `cameramask` | Reasoning about object dynamics relative to camera motion |
| Inter-Object | `qa`, `objmask` | Reasoning about interactions and relative dynamics between objects |
| Object-Scene | `sceneqa`, `scenemask` | Reasoning about how objects interact with and evolve within the scene |
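Each task suffix determines which metric scores it: `*qa` suffixes are graded by answer accuracy and `*mask` suffixes by segmentation quality. A hypothetical routing helper (the function and metric names are illustrative, not from the repo) could make that mapping explicit:

```python
# Hypothetical helper (not from the repo): route a Dyn-Bench task suffix
# to the metric used to score it.
QA_SUFFIXES = {"cameraqa", "qa", "sceneqa"}
MASK_SUFFIXES = {"cameramask", "objmask", "scenemask"}

def metric_for(suffix: str) -> str:
    """Return 'qa_accuracy' for VQA tasks, 'jf_score' for grounding tasks."""
    if suffix in QA_SUFFIXES:
        return "qa_accuracy"
    if suffix in MASK_SUFFIXES:
        return "jf_score"
    raise ValueError(f"unknown task suffix: {suffix}")
```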

Evaluation Metrics

  • QA Accuracy: exact answer-matching accuracy for VQA tasks
  • Mask J&F Score: average of region similarity (J, mask IoU) and boundary F-measure (F) for grounding tasks
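The J&F score above can be computed from a pair of binary masks. The following NumPy-only sketch (an illustration of the standard definitions, not the repo's evaluation code; boundary extraction and dilation here use simple 4-connected morphology) shows how J, boundary F, and their average fit together:

```python
import numpy as np

def jaccard(pred: np.ndarray, gt: np.ndarray) -> float:
    """Region similarity J: intersection-over-union of two boolean masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return 1.0 if union == 0 else float(inter) / float(union)

def boundary(mask: np.ndarray) -> np.ndarray:
    """Boundary pixels: the mask minus its 4-connected erosion."""
    p = np.pad(mask, 1, mode="constant")
    core = (p[1:-1, 1:-1] & p[:-2, 1:-1] & p[2:, 1:-1]
            & p[1:-1, :-2] & p[1:-1, 2:])
    return mask & ~core

def dilate(mask: np.ndarray, iterations: int) -> np.ndarray:
    """Repeated 4-connected dilation, used as the matching tolerance band."""
    m = mask.copy()
    for _ in range(iterations):
        p = np.pad(m, 1, mode="constant")
        m = (p[1:-1, 1:-1] | p[:-2, 1:-1] | p[2:, 1:-1]
             | p[1:-1, :-2] | p[1:-1, 2:])
    return m

def boundary_f(pred: np.ndarray, gt: np.ndarray, tol: int = 2) -> float:
    """Boundary F-measure: harmonic mean of boundary precision and recall,
    with matches allowed within a tol-pixel tolerance band."""
    pb, gb = boundary(pred), boundary(gt)
    if pb.sum() == 0 and gb.sum() == 0:
        return 1.0
    if pb.sum() == 0 or gb.sum() == 0:
        return 0.0
    precision = (pb & dilate(gb, tol)).sum() / pb.sum()
    recall = (gb & dilate(pb, tol)).sum() / gb.sum()
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def j_and_f(pred: np.ndarray, gt: np.ndarray, tol: int = 2) -> float:
    """J&F: the average of region similarity and boundary F-measure."""
    return 0.5 * (jaccard(pred, gt) + boundary_f(pred, gt, tol))
```

The `tol` parameter plays the same role as `boundary_threshold` in the configuration below: it controls how far a predicted boundary pixel may sit from the ground-truth boundary and still count as a match.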

Results

Evaluation Setup: We probe general, spatial, and region-level MLLMs across all three evaluation dimensions, assessing both linguistic (VQA) and visual (mask grounding) dynamics understanding. We find that existing models cannot simultaneously maintain strong performance in both spatio-temporal reasoning and dynamic object grounding, often producing inconsistent interpretations of motion and interaction.

Note: Conventional prompting strategies (e.g., chain-of-thought or caption-based hints) provide limited improvement, whereas structured integration approaches — including Mask-Guided Fusion and Spatio-Temporal Textual Cognitive Map (ST-TCM) — significantly enhance MLLMs' dynamics perception and spatio-temporal reasoning in the physical 4D world.

Run Your Own Evaluation

Benchmark

Our benchmark is hosted on HuggingFace. You can access the benchmark data using:

```python
# NOTE: pip install datasets
from datasets import load_dataset

dyn_bench = load_dataset("kairunwen/DynamicVerse")
print(dyn_bench)
```

Installation

1. Environment Setup

```shell
conda create -n bench python=3.11
conda activate bench

# Install PyTorch first (choose the build matching your CUDA version; see https://pytorch.org)

pip install -r requirements.txt

# Install flash-attn (pre-compiled wheels are recommended)
# Linux:   https://github.com/Dao-AILab/flash-attention/releases
# Windows: https://github.com/sdbds/flash-attention-for-windows/releases
```

2. Clone Repository

```shell
git clone https://github.com/LilyYang0504/bench.git
cd bench
```

Configuration

Edit conf/config.yaml:

```yaml
datasets:
  repo_name: "Huggingface/DatasetsRepo"
  datasets_path: "path/for/your/datasets/download"

model:
  model_path: "path/to/model"
  model_name: "Huggingface/ModelID"
  download_path: "path/for/your/model/download"
  device: "cuda"
  torch_dtype: "bfloat16"
  use_flash_attn: true
  trust_remote_code: true

task: "all"  # choices: all / qa / mask

evaluation:
  boundary_threshold: 2

result_path: "results"
```
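A minimal sketch (assumed, not taken from the repo's code) of how the constrained fields in this config could be validated before launching a run:

```python
# Hypothetical validation of the config contract described above:
# task must be one of the three documented choices, and
# evaluation.boundary_threshold must be a positive integer.
VALID_TASKS = {"all", "qa", "mask"}

def validate_config(cfg: dict) -> dict:
    """Check constrained fields; return the config unchanged if valid."""
    task = cfg.get("task", "all")
    if task not in VALID_TASKS:
        raise ValueError(f"task must be one of {sorted(VALID_TASKS)}, got {task!r}")
    thr = cfg.get("evaluation", {}).get("boundary_threshold", 2)
    if not (isinstance(thr, int) and thr > 0):
        raise ValueError("evaluation.boundary_threshold must be a positive integer")
    return cfg
```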

Evaluation

Download datasets:

```shell
python download_datasets.py
```

Download a single model:

```shell
python download_model.py --model "Huggingface/ModelID"
```

Batch model download (edit conf/model_list.txt first):

```shell
python download_model.py --batch
```

Run evaluation:

```shell
bash start_eval.sh
```

Project Structure

```
bench/
├── conf/
│   ├── config.yaml
│   └── model_list.txt
├── utils/
├── thirdparty/
├── download_datasets.py
├── download_model.py
├── eval.py
└── start_eval.sh
```

Model List

Models Supporting QA + Mask Tasks

| Model Family | HuggingFace Model ID |
| --- | --- |
| Sa2VA | `ByteDance/Sa2VA-{x}B` |
| Sa2VA-InternVL3 | `ByteDance/Sa2VA-InternVL3-{x}B` |
| Sa2VA-Qwen2.5-VL | `ByteDance/Sa2VA-Qwen2_5-VL-{x}B` |
| Sa2VA-Qwen3-VL | `ByteDance/Sa2VA-Qwen3-VL-{x}B` |
| UniPixel | `PolyU-ChenLab/UniPixel-{x}B` (requires additional installation) |

Models Supporting QA Tasks Only

| Model Family | HuggingFace Model ID |
| --- | --- |
| InternVL3 | `OpenGVLab/InternVL3-{x}B` |
| InternVL3.5 | `OpenGVLab/InternVL3_5-{x}B` |
| Qwen2.5-VL | `Qwen/Qwen2.5-VL-{x}B-Instruct` |
| Qwen3-VL | `Qwen/Qwen3-VL-{x}B-Instruct` |
| Qwen3-VL-MoE | `Qwen/Qwen3-VL-235B-A22B-Instruct` |
| LLaVA-OneVision | `lmms-lab/LLaVA-One-Vision-1.5-{x}B-Instruct` |
| SpaceR-SFT | `RUBBISHLIKE/SpaceR-SFT-{x}B` |
| VST | `rayruiyang/VST-{x}B-RL` |
| Spatial-SSRL | `internlm/Spatial-SSRL-{x}B` |
| SpatialLadder | `hongxingli/SpatialLadder-{x}B` |

Replace {x}B with the actual model parameter size. Please check HuggingFace for available sizes.
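Substituting the size into the `{x}B` templates can be scripted; a small hypothetical helper (not part of the repo) for expanding a template into a concrete model ID:

```python
# Hypothetical helper (not from the repo): expand a "{x}B" model-ID
# template from the tables above into a concrete HuggingFace ID.
def expand_model_id(template: str, size: str) -> str:
    """Replace the '{x}' placeholder with a parameter-size string such as '7'."""
    if "{x}" not in template:
        raise ValueError(f"template has no {{x}} placeholder: {template}")
    return template.replace("{x}", size)

print(expand_model_id("OpenGVLab/InternVL3-{x}B", "8"))
# → OpenGVLab/InternVL3-8B
```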

UniPixel Special Instructions

UniPixel requires additional dependencies. See UniPixel GitHub:

```shell
cd bench
mkdir thirdparty
cd thirdparty
git clone https://github.com/PolyU-ChenLab/UniPixel.git
cd UniPixel
pip install -r requirements.txt
```

Acknowledgement

We gratefully acknowledge the open-source community and the authors of the video datasets and foundation models used in constructing Dyn-Bench. Our evaluation framework builds upon the excellent toolkits provided by the community for evaluating multimodal large language models.

Citation

If you find our paper and code useful in your research, please consider giving us a star ⭐ and citing our work 📝 :)

```bibtex
@misc{huang2026thinkingdynamicsmultimodallarge,
      title={Thinking in Dynamics: How Multimodal Large Language Models Perceive, Track, and Reason Dynamics in Physical 4D World},
      author={Yuzhi Huang and Kairun Wen and Rongxin Gao and Dongxuan Liu and Yibin Lou and Jie Wu and Jing Xu and Jian Zhang and Zheng Yang and Yunlong Lin and Chenxin Li and Panwang Pan and Junbin Lu and Jingyan Jiang and Xinghao Ding and Yue Huang and Zhi Wang},
      year={2026},
      eprint={2603.12746},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2603.12746},
}
```
