NJU-LINK/MT-Video-Bench

MT-Video-Bench: A Holistic Video Understanding Benchmark for Evaluating Multimodal LLMs in Multi-Turn Dialogues

Homepage Arxiv Paper Huggingface

✨ Introduction

Recent advances in multimodal large language models (MLLMs) have brought remarkable progress in video understanding.
However, existing evaluation benchmarks remain limited to single-turn question answering, overlooking the complexity of multi-turn dialogues in real-world scenarios.

🎬 MT-Video-Bench fills this gap.
It emphasizes cross-scene reasoning, long-range dependencies, and interactive adaptability, thereby aligning closely with real-world application demands.

Illustration of MT-Video-Bench
Figure 1. Illustration of multi-turn dialogues under single-scene and cross-scene settings. Evaluated questions corresponding to each task are underlined, and the scenes involved in the multi-turn dialogues are marked with blue dotted boxes.

🚀 Overview

MT-Video-Bench at a glance:

  • 📌 135 videos from 5 major categories & 23 subcategories

  • 💬 1,000 dialogues (each with 5–8 turns) and 5,887 QA pairs for evaluating six core abilities

    • Object Reference
    • Memory Recall
    • Content Summary
    • Answer Refusal
    • Topic Shifting
    • Proactive Interaction
  • 🧮 Long-Video Evaluation: durations up to 20 minutes

  • 🧠 Highly challenging: even the 🥇 best-performing model achieves only ⚠️ 76.95% overall accuracy, revealing the considerable difficulty of this dataset.
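To make the numbers above concrete, a single dialogue record in a benchmark like this could be organized as sketched below. This is a hypothetical illustration only; the field names (`video_id`, `dialogue`, `task`, etc.) are assumptions for exposition, not the released data format.

```python
# Hypothetical sketch of one MT-Video-Bench dialogue record.
# Field names are assumptions for illustration, not the actual schema.
record = {
    "video_id": "movie_0042",      # one of the 135 source videos
    "category": "Movie",           # one of the 5 major categories
    "dialogue": [                  # each dialogue has 5-8 turns
        {"turn": 1, "task": "Object Reference",
         "question": "What is the man in the red jacket holding?",
         "answer": "A worn leather briefcase."},
        {"turn": 2, "task": "Memory Recall",
         "question": "Where did that briefcase first appear?",
         "answer": "On the train platform in the opening scene."},
    ],
}

# Sanity check: every turn targets one of the six core abilities.
TASKS = {"Object Reference", "Memory Recall", "Content Summary",
         "Answer Refusal", "Topic Shifting", "Proactive Interaction"}
assert all(turn["task"] in TASKS for turn in record["dialogue"])
```

Later turns such as "that briefcase" depend on earlier ones, which is what separates this setup from single-turn QA benchmarks.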

Statistics of multi-turn dialogues
Figure 2. MT-Video-Bench covers a broad range of topics across five main categories: Movie, TV, Sports, Knowledge, and Life Record, each with multiple sub-topics, ensuring a diverse and balanced data distribution.

The table below compares MT-Video-Bench with existing video-language benchmarks.

Comparison with other benchmarks
Figure 3. Comparison with other benchmarks. Avg. Q/V: the average number of QA pairs per video. Long: whether the average video length exceeds 10 minutes. Q. Div. and A. Div.: the lexical diversity of questions and answers. Q. Len. and A. Len.: the average number of tokens in questions and answers, computed with the LLaMA-3.1-8B tokenizer.

🧩 Pipeline

A glance at how MT-Video-Bench was built👇

  1. 🔎 Video Collection & Single-Scene Splitting: Manually collect videos → split into short clips using PySceneDetect → generate captions for each clip → merge related clips based on captions to form coherent single-scene videos.
  2. 🧾 Cross-Scene Video Merging: Extract key frames → perform object detection → build a dynamic object memory bank → retrieve and merge segments sharing common objects or themes.
  3. 📦 Multi-Turn Dialogue Generation: Use Gemini 2.5 to automatically generate single-scene and cross-scene multi-turn dialogues → select the most suitable task for each scene → design cross-scene questions with an object-centered approach.
  4. 🚦 Human Quality Control: Remove cases with information leakage → manually verify QA alignment, factual correctness, and difficulty → ensure high-quality, contextually coherent multi-turn dialogues.
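The object-based merging in step 2 can be sketched as follows. This is a minimal, self-contained illustration: the segment representation and `merge_by_shared_objects` function are hypothetical stand-ins for the paper's dynamic object memory bank, assuming each segment carries a set of detected object labels.

```python
from collections import defaultdict

def merge_by_shared_objects(segments):
    """Group video segments that share at least one detected object.

    `segments` is a list of (segment_id, set_of_object_labels) pairs.
    A simple union-find over shared labels stands in for the paper's
    dynamic object memory bank.
    """
    parent = {sid: sid for sid, _ in segments}

    def find(x):
        # Find the group root, compressing the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    # Memory bank: object label -> first segment that introduced it.
    memory = {}
    for sid, objects in segments:
        for obj in objects:
            if obj in memory:
                union(sid, memory[obj])  # shared object -> same group
            else:
                memory[obj] = sid

    groups = defaultdict(list)
    for sid, _ in segments:
        groups[find(sid)].append(sid)
    return sorted(groups.values(), key=lambda g: g[0])

# Example: clip_a and clip_b share "briefcase", so they merge into
# one cross-scene video; clip_c stays on its own.
segs = [("clip_a", {"man", "briefcase"}),
        ("clip_b", {"briefcase", "train"}),
        ("clip_c", {"dog"})]
merged = merge_by_shared_objects(segs)
```

Grouping by shared objects is what makes later cross-scene questions ("that briefcase") answerable across scene boundaries.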

Data Pipeline
Figure 4. Data construction and refinement pipeline of MT-Video-Bench.

🌟 License

Our dataset is under the CC-BY-NC-SA-4.0 license.

⚠️ If you need to access and use our dataset, you must understand and agree: this dataset is for research purposes only and cannot be used for any commercial or other purposes. Users assume full responsibility for any other use or dissemination.

We do not own the copyright of any raw video files. We currently grant researchers access to the videos on the condition that they acknowledge the above license. For the video data used, we respect and acknowledge any copyrights of the video authors.

If the original authors believe that any videos should be removed, please contact ynpan24@m.fudan.edu.cn or directly raise an issue.

📊 Evaluation Results

We evaluate both closed- and open-source MLLMs on MT-Video-Bench. Closed-source models include Gemini 2.5 Pro, Gemini 2.5 Flash, and Doubao-Seed-1.6-vision, while open-source models cover 21 representative MLLMs from the Qwen3-VL, Qwen2.5-VL, InternVL3.5, LLaVA, InternVideo, VideoChat, VideoLLaMA3, and MiniCPM series.

Main Results
Figure 5. Evaluation results on MT-Video-Bench. "OR" - Object Reference. "MR" - Memory Recall. "CS" - Content Summary. "AR" - Answer Refusal. "TS" - Topic Shifting. "PI" - Proactive Interaction.

📦 More results can be seen here.

Comparison on single-scene vs cross-scene
Figure 6. Distribution of scene counts per multi-turn dialogue, and the effect of scene count on overall and per-task performance.

Comparison on video length
Figure 7. Performance comparison of four MLLMs across diverse video lengths.

Comparison on context
Figure 8. Performance comparison of golden context, self-predicted context, and without context for the Qwen3-VL-8B-Instruct model.

Comparison on ablation
Figure 9. Effect of Frame and Resolution Settings.

Comparison on resolution
Figure 10. Effect of Dialogue Turns.

🛠️ Evaluation

STEP 1

We take the InternVL3.5 model as an example and provide the inference script. You can run:

python infer_internvl.py --model_path /path/to/model

STEP 2

To evaluate the inference results, use the following command:

python eval.py
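The aggregation performed by eval.py can be sketched as below. This is one plausible way such a per-task accuracy computation could work, assuming inference results are stored as records with `task` and `correct` fields; the function name and schema are hypothetical, not the repository's actual implementation.

```python
from collections import defaultdict

def task_accuracies(results):
    """Compute per-task and overall accuracy from judged QA records.

    Each record is assumed to carry the evaluated task name and a
    boolean `correct` flag (hypothetical schema, for illustration).
    """
    hits, totals = defaultdict(int), defaultdict(int)
    for r in results:
        totals[r["task"]] += 1
        hits[r["task"]] += int(r["correct"])
    per_task = {t: hits[t] / totals[t] for t in totals}
    overall = sum(hits.values()) / sum(totals.values())
    return per_task, overall

# Toy example with three judged answers across two tasks.
results = [
    {"task": "Memory Recall", "correct": True},
    {"task": "Memory Recall", "correct": False},
    {"task": "Answer Refusal", "correct": True},
]
per_task, overall = task_accuracies(results)
```

Reporting per-task scores alongside the overall number matters here, since the six abilities (OR, MR, CS, AR, TS, PI) stress different aspects of multi-turn understanding.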

🪶 Citation

If you find MT-Video-Bench useful for your research, please cite:

@misc{pan2025mtvideobenchholisticvideounderstanding,
      title={MT-Video-Bench: A Holistic Video Understanding Benchmark for Evaluating Multimodal LLMs in Multi-Turn Dialogues}, 
      author={Yaning Pan and Zekun Wang and Qianqian Xie and Yongqian Wen and Yuanxing Zhang and Guohui Zhang and Haoxuan Hu and Zhiyu Pan and Yibing Huang and Zhidong Gan and Yonghong Lin and An Ping and Tianhao Peng and Jiaheng Liu},
      year={2025},
      eprint={2510.17722},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2510.17722}, 
}

About

The Source Code for MT-Video-Bench @ ACL Findings 2026
