The advent of Multimodal Large Language Models (MLLMs) has expanded AI capabilities to visual modalities, yet existing benchmarks remain limited to single-video understanding.
To address this gap, we introduce MVU-Eval, the first comprehensive benchmark for evaluating multi-video understanding in MLLMs.
MVU-Eval contains 1,824 carefully curated QA pairs spanning 4,959 videos from diverse domains, covering both fundamental perception and high-order reasoning tasks.
It assesses eight core competencies: Object Recognition, Spatial Understanding, Counting, Comparison, Knowledge-Intensive Reasoning, In-Context Learning, Retrieval-Augmented Generation, and Temporal Reasoning.
Figure 1: Representative examples in MVU-Eval.
- 🎯 **First Multi-Video Understanding Benchmark**: 1,824 QA pairs and 4,959 videos across 8 task categories, bridging perception ↔ reasoning.
- 🧩 **Eight Core Competencies**: Object Recognition (OR), Spatial Understanding (SU), Counting, Comparison, Knowledge-Intensive Reasoning (KIR), In-Context Learning (ICL), Retrieval-Augmented Generation (RAG), and Temporal Reasoning (TR).
- ⚙️ **Rigorous Data Pipeline**: Automated QA generation, dual-round human verification, and leakage/utility checks ensure quality and fairness.
- 📊 **Comprehensive Evaluation**: Benchmarked on 30+ open- and closed-source MLLMs (e.g., Gemini 2.5 Pro, GPT-4o, Qwen 2.5-VL, InternVL 3), revealing major performance gaps.
Please download the dataset from:
👉 https://huggingface.co/datasets/MVU-Eval-Team/MVU-Eval-Data
After downloading, extract the data into the ./MVU-Eval-Data/ directory.
It contains all video clips (.mp4) and the corresponding QA annotation files (.json).
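As a quick sanity check after extraction, the sketch below counts the `.mp4` clips and `.json` annotation files under the dataset root. It assumes only the layout described above; exact subdirectory names are not assumed.

```python
from pathlib import Path

def summarize_dataset(root="./MVU-Eval-Data"):
    """Count video clips and QA annotation files under the dataset root."""
    root = Path(root)
    videos = list(root.rglob("*.mp4"))      # all video clips, any subfolder
    qa_files = list(root.rglob("*.json"))   # all QA annotation files
    return {"videos": len(videos), "qa_files": len(qa_files)}

# Example: print(summarize_dataset())
```

If the counts are zero, the archive was likely extracted to the wrong directory.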
Below is an example of how to launch the Qwen/Qwen2.5-VL-3B-Instruct model using vLLM
and run the inference script for evaluation.
```bash
# Start vLLM server (example: Qwen/Qwen2.5-VL-3B-Instruct)
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-VL-3B-Instruct \
    --served-model-name Qwen/Qwen2.5-VL-3B-Instruct \
    --api-key sk-abc123 \
    --tensor-parallel-size 4 \
    --pipeline-parallel-size 1 \
    --trust-remote-code \
    --dtype auto \
    --gpu-memory-utilization 0.85 \
    --port 8007 \
    --host localhost
```

Note: Adjust `--tensor-parallel-size` to your GPU count and memory. If you use another port, update `--port` in the next step accordingly.
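Before launching inference, you can confirm the server is reachable. A minimal standard-library sketch, assuming the base URL and API key match the flags above:

```python
import json
import urllib.request

def list_models(base_url, api_key):
    """Query the OpenAI-compatible /v1/models endpoint of the vLLM server."""
    req = urllib.request.Request(
        f"{base_url}/v1/models",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)

# Example: print(list_models("http://localhost:8007", "sk-abc123"))
```

The response should list the served model name; if the request fails, check the port and API key before proceeding.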
Then run the main inference script from the repository root:

```bash
# Replace paths/filenames as needed:
python inference/main.py \
    --model_name Qwen/Qwen2.5-VL-3B-Instruct \
    --port 8007 \
    --data_filename QA_json_file.json \
    --data_root /path/to/MVU-Eval-Data/videos \
    --nframes 32 \
    --max_pixels 720
```

- `--data_filename` points to a JSON file under `QA_output/` (e.g., `QA_json_file.json`).
- `--data_root` is the root directory containing all videos used in the QA file.
- `--nframes` (default: 32) is the number of uniformly sampled frames per video.
- `--max_pixels` (default: 720) is the maximum side length for frame resizing.
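The `--nframes` flag implies uniform temporal sampling. A hypothetical sketch of how such frame indices could be chosen (the script's exact sampling scheme may differ):

```python
def uniform_frame_indices(total_frames, nframes=32):
    """Pick nframes indices evenly spaced across a video of total_frames.

    If the video is shorter than nframes, every frame is kept.
    """
    if total_frames <= nframes:
        return list(range(total_frames))
    step = total_frames / nframes
    # Take the midpoint of each of the nframes equal-length segments.
    return [int(i * step + step / 2) for i in range(nframes)]
```

For a 300-frame clip this yields 32 indices spread across the full duration rather than a contiguous window.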
After execution, predictions are saved under:

```
inference/Model_output/max_pixel_{max_pixels}_nframes_{nframes}/{QA_json_file_stem}/main/
```
```bash
# Generate per-task and overall accuracy tables/plots from saved predictions
python inference/analyze.py
```

The analysis script will:

- Aggregate results from `Model_output/…/*.json`
- Compute overall and task-wise accuracy
- Export a markdown table and save comparison plots for reporting
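The per-task aggregation can be sketched as below. The field names `task`, `answer`, and `prediction` are assumptions for illustration; the actual prediction-file schema may differ.

```python
import json
from collections import defaultdict
from pathlib import Path

def accuracy_by_task(pred_dir):
    """Compute accuracy per task over all prediction JSON files in a directory."""
    correct, total = defaultdict(int), defaultdict(int)
    for f in sorted(Path(pred_dir).glob("*.json")):
        for rec in json.loads(f.read_text()):
            total[rec["task"]] += 1
            correct[rec["task"]] += int(rec["prediction"] == rec["answer"])
    return {t: correct[t] / total[t] for t in total}
```

Overall accuracy then follows by summing the correct and total counts across tasks.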
If you find MVU-Eval useful for your research, please cite:
@inproceedings{
peng2025mvueval,
title={{MVU}-Eval: Towards Multi-Video Understanding Evaluation for Multimodal {LLM}s},
author={Tianhao Peng and Haochen Wang and Yuanxing Zhang and Zekun Moore Wang and Zili Wang and Ge Zhang and Jian Yang and Shihao Li and Yanghai Wang and Xintao Wang and Houyi Li and Wei Ji and Pengfei Wan and Wenhao Huang and Zhaoxiang Zhang and Jiaheng Liu},
booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
year={2025},
url={https://openreview.net/forum?id=UZD5CQV6f9}
}




