[NeurIPS 2025] 𝓡𝓣𝓥-𝓑𝓮𝓷𝓬𝓱: Benchmarking MLLM Continuous Perception, Understanding and Reasoning through Real-Time Video.


📑 Paper | 🏠 Project Page | 🤗 Hugging Face | 🤖 ModelScope

If our project helps you, please give us a star ⭐ on GitHub to support us.

📰 News

  • 2026-01-13 🌟 We updated the evaluation code for VideoChat-Online and released new results for the model.
  • 2025-12-27 📚 We released an open-source survey repo on the landscape of video reasoning, covering CoT-based, CoF-based, Interleaved, and Streaming paradigms.

🔎 Overview

RTV-Bench is a fine-grained benchmark for online/streaming video reasoning with Multimodal Large Language Models (MLLMs).
It targets continuous perception, understanding, and reasoning over long, streaming videos.

RTV-Bench is built around three core ideas:

  • Multi-Timestamp Question Answering: answers evolve as video content changes over time.
  • Hierarchical Question Design: from basic perception to advanced reasoning.
  • Multi-Dimensional Evaluation: assessing continuous perception, understanding, and reasoning jointly.

The benchmark contains 552 videos and 4,608 high-quality QA pairs, covering diverse real-world scenarios.
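
To make the multi-timestamp design concrete, the sketch below shows one way such a QA entry could be represented in Python. The field names and values are purely illustrative and do not reflect the actual RTV-Bench annotation schema.

  # Hypothetical illustration of a multi-timestamp QA pair (NOT the real schema):
  # the same question is asked at several timestamps, and the correct option
  # changes as the streamed video content evolves.
  qa_pair = {
      "video_id": "example_video",  # placeholder identifier
      "question": "How many people are currently visible in the scene?",
      "options": ["A. 1", "B. 2", "C. 3", "D. 4"],
      "answers_over_time": [
          {"timestamp_sec": 30,  "answer": "A"},   # early in the stream
          {"timestamp_sec": 120, "answer": "C"},   # more people have entered
          {"timestamp_sec": 300, "answer": "B"},   # some have left again
      ],
  }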

Video Categories and Distribution of Question Difficulty and Query Characteristics.

(Left) RTV-Bench covers 3 key domains and 16 sub-class video types. (Center) Distribution of question difficulty levels across eight representative task types, measured by percentage-based performance ranges. (Right) Distribution of question queries by video length, categorized into Shallow, Moderate, and Deep levels. The bar heights indicate counts, while the line chart overlays query proportions for each duration bucket.


๐Ÿ› ๏ธ Evaluation

This section introduces the environment setup, data preparation, and evaluation pipeline for RTV-Bench, and presents a minimal working example based on Qwen2.5-VL for model inference and result evaluation.


1. Environment Setup

First, clone the repository and create a dedicated conda environment:

  git clone git@github.com:LJungang/RTV-Bench.git
  cd RTV-Bench

  conda create -n rtv-bench python=3.10
  conda activate rtv-bench

Install the required dependencies:

  pip install transformers==4.57.0
  pip install torch==2.9.0 torchvision==0.24.0 torchaudio==2.9.0 \
    --index-url https://download.pytorch.org/whl/cu128
  pip install qwen_vl_utils
  pip install accelerate
  pip install opencv-python==4.12.0.88
  pip install decord==0.6.0

  conda install -y -c conda-forge ffmpeg x264
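
After installation, a quick sanity check (a minimal sketch, assuming the packages above installed cleanly) confirms that the key dependencies import and that CUDA is visible:

  # Quick environment sanity check for the rtv-bench conda environment.
  import torch
  import torchvision
  import transformers
  import cv2
  import decord

  print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
  print("torchvision:", torchvision.__version__)
  print("transformers:", transformers.__version__)
  print("opencv:", cv2.__version__)
  print("decord:", decord.__version__)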

2. Download RTV-Bench Dataset

Download the RTV-Bench dataset from Hugging Face:

  mkdir rtv-bench
  huggingface-cli download \
    --repo-type dataset \
    --resume-download \
    RTVBench/RTV-Bench \
    --local-dir ./rtv-bench \
    --local-dir-use-symlinks False
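
If you prefer a programmatic download, a roughly equivalent call with the huggingface_hub Python API is:

  # Programmatic alternative to the huggingface-cli command above.
  from huggingface_hub import snapshot_download

  snapshot_download(
      repo_id="RTVBench/RTV-Bench",
      repo_type="dataset",
      local_dir="./rtv-bench",  # same target directory as the CLI example
  )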

3. Download Model Checkpoints

Download the Qwen2.5-VL model checkpoints:

  mkdir ckpts
  huggingface-cli download \
    --repo-type model \
    --resume-download \
    Qwen/Qwen2.5-VL-7B-Instruct \
    --local-dir ./ckpts \
    --local-dir-use-symlinks False

Note: Access to the model repository may require a valid Hugging Face token.
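
If authentication is needed, one option is to log in programmatically before downloading. This is only a sketch and assumes your token is exported as the HF_TOKEN environment variable; running huggingface-cli login once works just as well.

  # Authenticate and download the checkpoint with huggingface_hub.
  import os
  from huggingface_hub import login, snapshot_download

  login(token=os.environ.get("HF_TOKEN"))  # assumes HF_TOKEN is set in your shell
  snapshot_download(
      repo_id="Qwen/Qwen2.5-VL-7B-Instruct",
      local_dir="./ckpts",
  )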


4. Data Preparation

Some raw videos may have format or encoding issues. To ensure consistent decoding and stable evaluation, please transcode the raw videos using the provided script before running experiments:

  bash ./scripts/data_preparation/0_transcode_raw_videos_overwrite.sh

RTV-Bench evaluation operates on video clips rather than raw long videos.
Prepare the clips by splitting raw videos according to the provided timestamps:

  bash scripts/data_preparation/1_prepare_rtv_clips.sh

This script preprocesses the raw videos and generates temporally aligned video clips used for inference and evaluation.
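
The provided script handles this end to end; the sketch below only illustrates the core idea of cutting one clip with ffmpeg. The paths, timestamps, and output naming are placeholders, not the script's actual conventions.

  # Illustrative only: cut a single clip [start, end) out of a transcoded video
  # with ffmpeg. The real pipeline is implemented in
  # scripts/data_preparation/1_prepare_rtv_clips.sh.
  import subprocess

  def cut_clip(src: str, dst: str, start_sec: float, end_sec: float) -> None:
      subprocess.run(
          [
              "ffmpeg", "-y", "-i", src,
              "-ss", str(start_sec),  # output-side trim: frame-accurate
              "-to", str(end_sec),
              "-c:v", "libx264",      # re-encode for clean clip boundaries
              "-c:a", "aac",
              dst,
          ],
          check=True,
      )

  # Example (placeholder paths):
  # cut_clip("raw_videos/example.mp4", "clips/example_000030_000120.mp4", 30.0, 120.0)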


5. Model Inference

Run offline inference using Qwen2.5-VL-7B-Instruct:

  bash ./scripts/eval/infer_offline_qwen2_5_vl.sh

The script performs batch inference on the prepared video clips and saves model predictions for subsequent evaluation.
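
The core of a single inference step follows the public Qwen2.5-VL usage pattern; a minimal sketch is shown below. The clip path, question, and generation settings are placeholders, and the released script may differ in prompting and sampling details.

  # Minimal single-clip inference with Qwen2.5-VL-7B-Instruct (illustrative sketch).
  import torch
  from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
  from qwen_vl_utils import process_vision_info

  model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
      "./ckpts", torch_dtype="auto", device_map="auto"
  )
  processor = AutoProcessor.from_pretrained("./ckpts")

  messages = [{
      "role": "user",
      "content": [
          {"type": "video", "video": "clips/example_clip.mp4", "fps": 1.0},  # placeholder clip
          {"type": "text", "text": "How many people are visible? Answer with the option letter."},
      ],
  }]

  text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
  image_inputs, video_inputs = process_vision_info(messages)
  inputs = processor(
      text=[text], images=image_inputs, videos=video_inputs,
      padding=True, return_tensors="pt",
  ).to(model.device)

  with torch.no_grad():
      output_ids = model.generate(**inputs, max_new_tokens=64)

  # Decode only the newly generated tokens.
  answer = processor.batch_decode(
      output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
  )[0]
  print(answer)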


6. Computing Metrics

Compute the evaluation metrics from the saved predictions:

  python ./scripts/eval/compute_acc.py --inputs [your_json_file]
  # e.g.
  python ./scripts/eval/compute_acc.py --inputs ./eval_results/qwen2.5-VL-*.json

  python ./scripts/eval/compute_score.py --inputs [your_json_file]
  # e.g.
  python ./scripts/eval/compute_score.py --inputs ./eval_results/qwen2.5-VL-*.json
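
Purely as an illustration of what the accuracy script does conceptually, a simple pass over the prediction files could look like the sketch below. The JSON field names here are hypothetical; the actual schema is defined by the inference script, and the real metrics come from compute_acc.py and compute_score.py.

  # Conceptual sketch of option-level accuracy over prediction JSON files.
  # Field names ("prediction", "ground_truth") are hypothetical placeholders.
  import glob
  import json

  correct, total = 0, 0
  for path in glob.glob("./eval_results/qwen2.5-VL-*.json"):
      with open(path) as f:
          records = json.load(f)
      for rec in records:
          total += 1
          if rec["prediction"].strip().upper().startswith(rec["ground_truth"].strip().upper()):
              correct += 1

  print(f"accuracy: {correct / max(total, 1):.4f} ({correct}/{total})")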

7. Quick Evaluation for Other Models

VideoChat-Online
  mkdir baseline
  git clone git@github.com:MCG-NJU/VideoChat-Online.git ./baseline/VideoChat-Online

  conda create -n vco-rtv python=3.9
  conda activate vco-rtv

  pip install -r ./baseline/VideoChat-Online/requirements.txt
  pip install av

  bash ./scripts/eval/infer_online_videochat_online.sh

Note: In our quick evaluation setup, FlashAttention-2 was not installed.

🔖 Evaluation Results


๐Ÿ‘ Acknowledgements

We sincerely thank the authors and maintainers of the following projects, whose open-source models, codebases, and released checkpoints have been instrumental to our research and evaluation pipeline:

  • Qwen2.5-VL: The most powerful vision-language model in the Qwen series to date.

  • VideoChat-Online: A robust and efficient model for online video understanding.

Their high-quality implementations and transparent releases provide a solid foundation for reproducible research in video-centric multimodal understanding and benchmarking.

📑 Citation

If you find $\mathcal{RTV}\text{-}Bench$ useful for your research and applications, please cite using this BibTeX:

@inproceedings{xun2025rtv,
  title={RTV-Bench: Benchmarking MLLM Continuous Perception, Understanding and Reasoning through Real-Time Video},
  author={Xun, Shuhang and Tao, Sicheng and Li, Jungang and Shi, Yibo and Lin, Zhixin and Zhu, Zhanhui and Yan, Yibo and Li, Hanqian and Zhang, Linghao and Wang, Shikang and Liu, Yixin and Zhang, Hanbo and Ma, Ying and Hu, Xuming},
  booktitle={Advances in Neural Information Processing Systems},
  volume={38},
  year={2025},
  organization={NeurIPS}
}
