[NeurIPS 2025] 𝓡𝓣𝓥-𝓑𝓮𝓷𝓬𝓱: Benchmarking MLLM Continuous Perception, Understanding and Reasoning through Real-Time Video.


📑 Paper | 🏠 Project Page | 🤗 Hugging Face | 🤖 ModelScope

If our project helps you, please give us a star ⭐ on GitHub to support us.

📰 News

  • 2026-01-13 🌟 We updated the evaluation code for VideoChat-Online and released new results for the model.
  • 2025-12-27 📚 We released an open-source survey repo on the landscape of video reasoning, covering CoT-based, CoF-based, Interleaved, and Streaming paradigms.

🔎 Overview

RTV-Bench is a fine-grained benchmark for online/streaming video reasoning with Multimodal Large Language Models (MLLMs).
It targets continuous perception, understanding, and reasoning over long, streaming videos.

RTV-Bench is built around three core ideas:

  • Multi-Timestamp Question Answering: answers evolve as video content changes over time.
  • Hierarchical Question Design: from basic perception to advanced reasoning.
  • Multi-Dimensional Evaluation: assessing continuous perception, understanding, and reasoning jointly.

The benchmark contains 552 videos and 4,608 high-quality QA pairs, covering diverse real-world scenarios.
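
To make the multi-timestamp design concrete, the sketch below shows one way such a QA entry could be represented in Python. The field names and values are purely illustrative and do not reflect the actual RTV-Bench annotation schema.

  # Hypothetical illustration of a multi-timestamp QA pair (NOT the real schema):
  # the same question is asked at several timestamps, and the correct option
  # changes as the streamed video content evolves.
  qa_pair = {
      "video_id": "example_video",  # placeholder identifier
      "question": "How many people are currently visible in the scene?",
      "options": ["A. 1", "B. 2", "C. 3", "D. 4"],
      "answers_over_time": [
          {"timestamp_sec": 30,  "answer": "A"},   # early in the stream
          {"timestamp_sec": 120, "answer": "C"},   # more people have entered
          {"timestamp_sec": 300, "answer": "B"},   # some have left again
      ],
  }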

Video Categories and Distribution of Question Difficulty and Query Characteristics.

(Left) RTV-Bench covers 3 key domains and 16 sub-class video types. (Center) Distribution of question difficulty levels across eight representative task types, measured by percentage-based performance ranges. (Right) Distribution of question queries by video length, categorized into Shallow, Moderate, and Deep levels. The bar heights indicate counts, while the line chart overlays query proportions for each duration bucket.


๐Ÿ› ๏ธ Evaluation

This section introduces the environment setup, data preparation, and evaluation pipeline for RTV-Bench, and presents a minimal working example based on Qwen2.5-VL for model inference and result evaluation.


1. Environment Setup

First, clone the repository and create a dedicated conda environment:

  git clone git@github.com:LJungang/RTV-Bench.git
  cd RTV-Bench

  conda create -n rtv-bench python=3.10
  conda activate rtv-bench

Install the required dependencies:

  pip install transformers==4.57.0
  pip install torch==2.9.0 torchvision==0.24.0 torchaudio==2.9.0 \
    --index-url https://download.pytorch.org/whl/cu128
  pip install qwen_vl_utils
  pip install accelerate
  pip install opencv-python==4.12.0.88
  pip install decord==0.6.0

  conda install -y -c conda-forge ffmpeg x264
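
After installation, a quick sanity check (a minimal sketch, assuming the packages above installed cleanly) confirms that the key dependencies import and that CUDA is visible:

  # Quick environment sanity check for the rtv-bench conda environment.
  import torch
  import torchvision
  import transformers
  import cv2
  import decord

  print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
  print("torchvision:", torchvision.__version__)
  print("transformers:", transformers.__version__)
  print("opencv:", cv2.__version__)
  print("decord:", decord.__version__)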

2. Download RTV-Bench Dataset

Download the RTV-Bench dataset from Hugging Face:

  mkdir rtv-bench
  huggingface-cli download \
    --repo-type dataset \
    --resume-download \
    RTVBench/RTV-Bench \
    --local-dir ./rtv-bench \
    --local-dir-use-symlinks False
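
If you prefer a programmatic download, a roughly equivalent call with the huggingface_hub Python API is:

  # Programmatic alternative to the huggingface-cli command above.
  from huggingface_hub import snapshot_download

  snapshot_download(
      repo_id="RTVBench/RTV-Bench",
      repo_type="dataset",
      local_dir="./rtv-bench",  # same target directory as the CLI example
  )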

3. Download Model Checkpoints

Download the Qwen2.5-VL model checkpoints:

  mkdir ckpts
  huggingface-cli download \
    --repo-type model \
    --resume-download \
    Qwen/Qwen2.5-VL-7B-Instruct \
    --local-dir ./ckpts \
    --local-dir-use-symlinks False

Note: Access to the model repository may require a valid Hugging Face token.
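
If authentication is needed, one option is to log in programmatically before downloading. This is only a sketch and assumes your token is exported as the HF_TOKEN environment variable; running huggingface-cli login once works just as well.

  # Authenticate and download the checkpoint with huggingface_hub.
  import os
  from huggingface_hub import login, snapshot_download

  login(token=os.environ.get("HF_TOKEN"))  # assumes HF_TOKEN is set in your shell
  snapshot_download(
      repo_id="Qwen/Qwen2.5-VL-7B-Instruct",
      local_dir="./ckpts",
  )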


4. Data Preparation

Some raw videos may have format or encoding issues. To ensure consistent decoding and stable evaluation, please transcode the raw videos using the provided script before running experiments:

  bash ./scripts/data_preparation/0_transcode_raw_videos_overwrite.sh

RTV-Bench evaluation operates on video clips rather than raw long videos.
Prepare the clips by splitting raw videos according to the provided timestamps:

  bash scripts/data_preparation/1_prepare_rtv_clips.sh

This script preprocesses the raw videos and generates temporally aligned video clips used for inference and evaluation.
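
The provided script handles this end to end; the sketch below only illustrates the core idea of cutting one clip with ffmpeg. The paths, timestamps, and output naming are placeholders, not the script's actual conventions.

  # Illustrative only: cut a single clip [start, end) out of a transcoded video
  # with ffmpeg. The real pipeline is implemented in
  # scripts/data_preparation/1_prepare_rtv_clips.sh.
  import subprocess

  def cut_clip(src: str, dst: str, start_sec: float, end_sec: float) -> None:
      subprocess.run(
          [
              "ffmpeg", "-y", "-i", src,
              "-ss", str(start_sec),  # output-side trim: frame-accurate
              "-to", str(end_sec),
              "-c:v", "libx264",      # re-encode for clean clip boundaries
              "-c:a", "aac",
              dst,
          ],
          check=True,
      )

  # Example (placeholder paths):
  # cut_clip("raw_videos/example.mp4", "clips/example_000030_000120.mp4", 30.0, 120.0)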


5. Model Inference

Run offline inference using Qwen2.5-VL-7B-Instruct:

  bash ./scripts/eval/infer_offline_qwen2_5_vl.sh

The script performs batch inference on the prepared video clips and saves model predictions for subsequent evaluation.
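
The core of a single inference step follows the public Qwen2.5-VL usage pattern; a minimal sketch is shown below. The clip path, question, and generation settings are placeholders, and the released script may differ in prompting and sampling details.

  # Minimal single-clip inference with Qwen2.5-VL-7B-Instruct (illustrative sketch).
  import torch
  from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
  from qwen_vl_utils import process_vision_info

  model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
      "./ckpts", torch_dtype="auto", device_map="auto"
  )
  processor = AutoProcessor.from_pretrained("./ckpts")

  messages = [{
      "role": "user",
      "content": [
          {"type": "video", "video": "clips/example_clip.mp4", "fps": 1.0},  # placeholder clip
          {"type": "text", "text": "How many people are visible? Answer with the option letter."},
      ],
  }]

  text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
  image_inputs, video_inputs = process_vision_info(messages)
  inputs = processor(
      text=[text], images=image_inputs, videos=video_inputs,
      padding=True, return_tensors="pt",
  ).to(model.device)

  with torch.no_grad():
      output_ids = model.generate(**inputs, max_new_tokens=64)

  # Decode only the newly generated tokens.
  answer = processor.batch_decode(
      output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
  )[0]
  print(answer)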


6. Computing Metrics

Compute the evaluation metrics from the saved predictions:

  python ./scripts/eval/compute_acc.py --inputs [your_json_file]
  # e.g.
  python ./scripts/eval/compute_acc.py --inputs ./eval_results/qwen2.5-VL-*.json

  python ./scripts/eval/compute_score.py --inputs [your_json_file]
  # e.g.
  python ./scripts/eval/compute_score.py --inputs ./eval_results/qwen2.5-VL-*.json
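
Purely as an illustration of what the accuracy script does conceptually, a simple pass over the prediction files could look like the sketch below. The JSON field names here are hypothetical; the actual schema is defined by the inference script, and the real metrics come from compute_acc.py and compute_score.py.

  # Conceptual sketch of option-level accuracy over prediction JSON files.
  # Field names ("prediction", "ground_truth") are hypothetical placeholders.
  import glob
  import json

  correct, total = 0, 0
  for path in glob.glob("./eval_results/qwen2.5-VL-*.json"):
      with open(path) as f:
          records = json.load(f)
      for rec in records:
          total += 1
          if rec["prediction"].strip().upper().startswith(rec["ground_truth"].strip().upper()):
              correct += 1

  print(f"accuracy: {correct / max(total, 1):.4f} ({correct}/{total})")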

7. Quick Evaluation for Other Models

VideoChat-Online
  mkdir baseline
  git clone git@github.com:MCG-NJU/VideoChat-Online.git ./baseline/VideoChat-Online

  conda create -n vco-rtv python=3.9
  conda activate vco-rtv

  pip install -r ./baseline/VideoChat-Online/requirements.txt
  pip install av

  bash ./scripts/eval/infer_online_videochat_online.sh

Note: In our quick evaluation setup, FlashAttention-2 was not installed.

🔖 Evaluation Results


๐Ÿ‘ Acknowledgements

We sincerely thank the authors and maintainers of the following projects, whose open-source models, codebases, and released checkpoints have been instrumental to our research and evaluation pipeline:

  • Qwen2.5-VL: The most powerful vision-language model in the Qwen series to date.

  • VideoChat-Online: A robust and efficient model for online video understanding.

Their high-quality implementations and transparent releases provide a solid foundation for reproducible research in video-centric multimodal understanding and benchmarking.

📑 Citation

If you find $\mathcal{RTV}\text{-}Bench$ useful for your research and applications, please cite using this BibTeX:

@inproceedings{xun2025rtv,
  title={RTV-Bench: Benchmarking MLLM Continuous Perception, Understanding and Reasoning through Real-Time Video},
  author={Xun, Shuhang and Tao, Sicheng and Li, Jungang and Shi, Yibo and Lin, Zhixin and Zhu, Zhanhui and Yan, Yibo and Li, Hanqian and Zhang, Linghao and Wang, Shikang and Liu, Yixin and Zhang, Hanbo and Ma, Ying and Hu, Xuming},
  booktitle={Advances in Neural Information Processing Systems},
  volume={38},
  year={2025},
  organization={NeurIPS}
}
