NJU-LINK/MT-Video-Bench

MT-Video-Bench: A Holistic Video Understanding Benchmark for Evaluating Multimodal LLMs in Multi-Turn Dialogues

Homepage Arxiv Paper Huggingface

✨ Introduction

Recent advances in multimodal large language models (MLLMs) have brought remarkable progress in video understanding.
However, existing evaluation benchmarks remain limited to single-turn question answering, overlooking the complexity of multi-turn dialogues in real-world scenarios.

🎬 MT-Video-Bench fills this gap.
It emphasizes cross-scene reasoning, long-range dependencies, and interactive adaptability, thereby aligning closely with real-world application demands.

Illustration of MT-Video-Bench
Figure 1. Illustration of multi-turn dialogues under single-scene and cross-scene settings. Evaluated questions corresponding to each task are underlined, and the scenes involved in the multi-turn dialogues are marked with blue dotted boxes.

🚀 Overview

MT-Video-Bench at a glance:

  • 📌 135 videos from 5 major categories & 23 subcategories

  • 💬 1,000 dialogues (each with 5–8 turns) and 5,887 QA pairs for evaluating six core abilities

    • Object Reference
    • Memory Recall
    • Content Summary
    • Answer Refusal
    • Topic Shifting
    • Proactive Interaction
  • 🧮 Long-Video Evaluation: durations up to 20 minutes

  • 🧠 Highly challenging: even the 🥇 best-performing model achieves only ⚠️ 76.95% overall accuracy, revealing the considerable difficulty of this dataset.
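To make the numbers above concrete, a single dialogue record in a benchmark like this could be organized as sketched below. This is a hypothetical illustration only; the field names (`video_id`, `dialogue`, `task`, etc.) are assumptions for exposition, not the released data format.

```python
# Hypothetical sketch of one MT-Video-Bench dialogue record.
# Field names are assumptions for illustration, not the actual schema.
record = {
    "video_id": "movie_0042",      # one of the 135 source videos
    "category": "Movie",           # one of the 5 major categories
    "dialogue": [                  # each dialogue has 5-8 turns
        {"turn": 1, "task": "Object Reference",
         "question": "What is the man in the red jacket holding?",
         "answer": "A worn leather briefcase."},
        {"turn": 2, "task": "Memory Recall",
         "question": "Where did that briefcase first appear?",
         "answer": "On the train platform in the opening scene."},
    ],
}

# Sanity check: every turn targets one of the six core abilities.
TASKS = {"Object Reference", "Memory Recall", "Content Summary",
         "Answer Refusal", "Topic Shifting", "Proactive Interaction"}
assert all(turn["task"] in TASKS for turn in record["dialogue"])
```

Later turns such as "that briefcase" depend on earlier ones, which is what separates this setup from single-turn QA benchmarks.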

Statistics of multi-turn dialogues
Figure 2. MT-Video-Bench covers a broad range of topics across five main categories: Movie, TV, Sports, Knowledge, and Life Record, each with multiple sub-topics, ensuring a diverse and balanced data distribution.

The table below compares MT-Video-Bench with existing video-language benchmarks.

Comparison with other benchmarks
Figure 3. Comparison with other benchmarks. Avg. Q/V: the average number of QA pairs per video. Long: whether the average video length exceeds 10 minutes. Q. Div. and A. Div.: the lexical diversity of questions and answers. Q. Len. and A. Len.: the average number of tokens in questions and answers, computed with the LLaMA-3.1-8B tokenizer.

🧩 Pipeline

A glance at how MT-Video-Bench was built👇

  1. 🔎 Video Collection & Single-Scene Splitting: Manually collect videos → split into short clips using PySceneDetect → generate captions for each clip → merge related clips based on captions to form coherent single-scene videos.
  2. 🧾 Cross-Scene Video Merging: Extract key frames → perform object detection → build a dynamic object memory bank → retrieve and merge segments sharing common objects or themes.
  3. 📦 Multi-Turn Dialogue Generation: Use Gemini 2.5 to automatically generate single-scene and cross-scene multi-turn dialogues → select the most suitable task for each scene → design cross-scene questions with an object-centered approach.
  4. 🚦 Human Quality Control: Remove cases with information leakage → manually verify QA alignment, factual correctness, and difficulty → ensure high-quality, contextually coherent multi-turn dialogues.
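The object-based merging in step 2 can be sketched as follows. This is a minimal, self-contained illustration: the segment representation and `merge_by_shared_objects` function are hypothetical stand-ins for the paper's dynamic object memory bank, assuming each segment carries a set of detected object labels.

```python
from collections import defaultdict

def merge_by_shared_objects(segments):
    """Group video segments that share at least one detected object.

    `segments` is a list of (segment_id, set_of_object_labels) pairs.
    A simple union-find over shared labels stands in for the paper's
    dynamic object memory bank.
    """
    parent = {sid: sid for sid, _ in segments}

    def find(x):
        # Find the group root, compressing the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    # Memory bank: object label -> first segment that introduced it.
    memory = {}
    for sid, objects in segments:
        for obj in objects:
            if obj in memory:
                union(sid, memory[obj])  # shared object -> same group
            else:
                memory[obj] = sid

    groups = defaultdict(list)
    for sid, _ in segments:
        groups[find(sid)].append(sid)
    return sorted(groups.values(), key=lambda g: g[0])

# Example: clip_a and clip_b share "briefcase", so they merge into
# one cross-scene video; clip_c stays on its own.
segs = [("clip_a", {"man", "briefcase"}),
        ("clip_b", {"briefcase", "train"}),
        ("clip_c", {"dog"})]
merged = merge_by_shared_objects(segs)
```

Grouping by shared objects is what makes later cross-scene questions ("that briefcase") answerable across scene boundaries.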

Data Pipeline
Figure 4. Data construction and refinement pipeline of MT-Video-Bench.

🌟 License

Our dataset is under the CC-BY-NC-SA-4.0 license.

⚠️ If you need to access and use our dataset, you must understand and agree: this dataset is for research purposes only and cannot be used for any commercial or other purposes. Users assume full responsibility for any other use or dissemination.

We do not own the copyright of any raw video files. We currently grant researchers access to the videos on the condition that they acknowledge the above license. For the video data used, we respect and acknowledge any copyrights of the video authors.

If the original authors believe that any videos should be removed, please contact ynpan24@m.fudan.edu.cn or directly raise an issue.

📊 Evaluation Results

We evaluate both closed- and open-source MLLMs on MT-Video-Bench. Closed-source models include Gemini 2.5 Pro, Gemini 2.5 Flash, and Doubao-Seed-1.6-vision, while open-source models cover 21 representative MLLMs from the Qwen3-VL, Qwen2.5-VL, InternVL3.5, LLaVA, InternVideo, VideoChat, VideoLLaMA3, and MiniCPM series.

Main Results
Figure 5. Evaluation results on MT-Video-Bench. "OR" - Object Reference. "MR" - Memory Recall. "CS" - Content Summary. "AR" - Answer Refusal. "TS" - Topic Shifting. "PI" - Proactive Interaction.

📦 More results can be seen here.

Comparison on single-scene vs cross-scene
Figure 6. Distribution of scene counts per multi-turn dialogue, and the effect of scene count on overall and per-task performance.

Comparison on video length
Figure 7. Performance comparison of four MLLMs across diverse video lengths.

Comparison on context
Figure 8. Performance comparison of golden context, self-predicted context, and without context for the Qwen3-VL-8B-Instruct model.

Comparison on ablation
Figure 9. Effect of Frame and Resolution Settings.

Comparison on resolution
Figure 10. Effect of Dialogue Turns.

🛠️ Evaluation

STEP 1

We take the InternVL3.5 model as an example and provide the inference script. You can run:

python infer_internvl.py --model_path /path/to/model

STEP 2

To evaluate the inference results, use the following command:

python eval.py
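The aggregation performed by eval.py can be sketched as below. This is one plausible way such a per-task accuracy computation could work, assuming inference results are stored as records with `task` and `correct` fields; the function name and schema are hypothetical, not the repository's actual implementation.

```python
from collections import defaultdict

def task_accuracies(results):
    """Compute per-task and overall accuracy from judged QA records.

    Each record is assumed to carry the evaluated task name and a
    boolean `correct` flag (hypothetical schema, for illustration).
    """
    hits, totals = defaultdict(int), defaultdict(int)
    for r in results:
        totals[r["task"]] += 1
        hits[r["task"]] += int(r["correct"])
    per_task = {t: hits[t] / totals[t] for t in totals}
    overall = sum(hits.values()) / sum(totals.values())
    return per_task, overall

# Toy example with three judged answers across two tasks.
results = [
    {"task": "Memory Recall", "correct": True},
    {"task": "Memory Recall", "correct": False},
    {"task": "Answer Refusal", "correct": True},
]
per_task, overall = task_accuracies(results)
```

Reporting per-task scores alongside the overall number matters here, since the six abilities (OR, MR, CS, AR, TS, PI) stress different aspects of multi-turn understanding.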

🪶 Citation

If you find MT-Video-Bench useful for your research, please cite:

@misc{pan2025mtvideobenchholisticvideounderstanding,
      title={MT-Video-Bench: A Holistic Video Understanding Benchmark for Evaluating Multimodal LLMs in Multi-Turn Dialogues}, 
      author={Yaning Pan and Zekun Wang and Qianqian Xie and Yongqian Wen and Yuanxing Zhang and Guohui Zhang and Haoxuan Hu and Zhiyu Pan and Yibing Huang and Zhidong Gan and Yonghong Lin and An Ping and Tianhao Peng and Jiaheng Liu},
      year={2025},
      eprint={2510.17722},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2510.17722}, 
}

About

The Source Code for MT-Video-Bench @ ACL Findings 2026
