Paper | Project Page | Hugging Face | ModelScope
If our project helps you, please give us a star ⭐ on GitHub to support us.
- 2026-01-13: We updated the evaluation code for VideoChat-Online and released new results for the model.
- 2025-12-27: We released an open-source survey repo on the landscape of video reasoning, covering CoT-based, CoF-based, Interleaved, and Streaming paradigms.
- 2025-09-20: Our paper has been accepted by NeurIPS 2025. We will update our dataset and code for the community as soon as possible.
- 2025-06-27: We updated the core code for evaluation.
- 2025-05-17: We released the label JSON file, named QA.json.
- 2025-05-04: We released the paper RTV-Bench: Benchmarking MLLM Continuous Perception, Understanding and Reasoning through Real-Time Video.
- 2025-05-03: We are happy to release $\mathcal{RTV}\text{-}Bench$. You can find it on Hugging Face or ModelScope.
RTV-Bench is a fine-grained benchmark for online/streaming video reasoning with Multimodal Large Language Models (MLLMs).
It targets continuous perception, understanding, and reasoning over long, streaming videos.
RTV-Bench is built around three core ideas:
- Multi-Timestamp Question Answering: answers evolve as video content changes over time.
- Hierarchical Question Design: from basic perception to advanced reasoning.
- Multi-Dimensional Evaluation: assessing continuous perception, understanding, and reasoning jointly.
The benchmark contains 552 videos and 4,608 high-quality QA pairs, covering diverse real-world scenarios.
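To make the multi-timestamp design concrete, the sketch below shows what a single entry could look like conceptually; the field names and values are illustrative assumptions, not the actual schema of the released QA.json.
```python
# Conceptual illustration only; the actual schema of QA.json may differ.
# A multi-timestamp question keeps the same query but pairs it with answers
# that can change as the streaming video unfolds.
example_entry = {
    "video": "surveillance/intersection_001.mp4",   # hypothetical video id
    "question": "How many cars are waiting at the light?",
    "options": ["A. 1", "B. 2", "C. 3", "D. 4"],
    "timestamps": [                                  # answers evolve over time
        {"query_time_s": 30.0, "answer": "A"},
        {"query_time_s": 90.0, "answer": "C"},
    ],
    "level": "basic perception",                     # hierarchical difficulty tag
}
```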
Video Categories and Distribution of Question Difficulty and Query Characteristics.
(Left) RTV-Bench covers 3 key domains and 16 sub-classes of video types. (Center) Distribution of question difficulty levels across eight representative task types, measured by percentage-based performance ranges. (Right) Distribution of question queries by video length, categorized into Shallow, Moderate, and Deep levels. The bar heights indicate counts, while the line chart overlays query proportions for each duration bucket.
This section introduces the environment setup, data preparation, and evaluation pipeline for RTV-Bench, and presents a minimal working example based on Qwen2.5-VL for model inference and result evaluation.
First, clone the repository and create a dedicated conda environment:
```bash
git clone git@github.com:LJungang/RTV-Bench.git
cd RTV-Bench
conda create -n rtv-bench python=3.10
conda activate rtv-bench
```
Install the required dependencies:
```bash
pip install transformers==4.57.0
pip install torch==2.9.0 torchvision==0.24.0 torchaudio==2.9.0 \
    --index-url https://download.pytorch.org/whl/cu128
pip install qwen_vl_utils
pip install accelerate
pip install opencv-python==4.12.0.88
pip install decord==0.6.0
conda install -y -c conda-forge ffmpeg x264
```
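Before downloading any data, it can help to confirm that the environment is consistent. The snippet below is a minimal sanity check (not part of the official scripts) that the key dependencies import and that CUDA is visible:
```python
# Minimal environment sanity check (not part of the official RTV-Bench scripts).
import torch
import transformers
import cv2
import decord

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("transformers:", transformers.__version__)
print("opencv:", cv2.__version__)
print("decord:", decord.__version__)
```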
Download the RTV-Bench dataset from Hugging Face:
```bash
mkdir rtv-bench
huggingface-cli download \
    --repo-type dataset \
    --resume-download \
    RTVBench/RTV-Bench \
    --local-dir ./rtv-bench \
    --local-dir-use-symlinks False
```
Download the Qwen2.5-VL model checkpoints:
```bash
mkdir ckpts
huggingface-cli download \
    --repo-type model \
    --resume-download \
    Qwen/Qwen2.5-VL-7B-Instruct \
    --local-dir ./ckpts \
    --local-dir-use-symlinks False
```
Note: Access to the model repository may require a valid Hugging Face token.
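If `huggingface-cli` is unavailable in your environment, the same downloads can be done from Python with `huggingface_hub`; the sketch below mirrors the commands above and is offered as an alternative, not as part of the official pipeline:
```python
# Alternative to huggingface-cli: fetch the dataset and checkpoints with huggingface_hub.
# Gated repositories may additionally require logging in with a valid HF token.
from huggingface_hub import snapshot_download

snapshot_download(repo_id="RTVBench/RTV-Bench", repo_type="dataset", local_dir="./rtv-bench")
snapshot_download(repo_id="Qwen/Qwen2.5-VL-7B-Instruct", repo_type="model", local_dir="./ckpts")
```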
Some raw videos may have format or encoding issues. To ensure consistent decoding and stable evaluation, please transcode the raw videos with the provided script before running experiments:
```bash
bash ./scripts/data_preparation/0_transcode_raw_videos_overwrite.sh
```
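The exact encoding settings live in the script above. For reference only, a per-video re-encode to H.264/AAC with ffmpeg might look like the following sketch; the directory layout and codec choices are assumptions, and the provided shell script remains authoritative:
```python
# Sketch: re-encode raw videos to H.264 + AAC for consistent decoding.
# The input/output paths and codec settings are illustrative assumptions;
# use 0_transcode_raw_videos_overwrite.sh for the official behavior.
import subprocess
from pathlib import Path

raw_dir = Path("./rtv-bench/raw_videos")    # hypothetical location of the raw videos
out_dir = Path("./rtv-bench/videos_h264")   # hypothetical output location
out_dir.mkdir(parents=True, exist_ok=True)

for src in sorted(raw_dir.glob("*.mp4")):
    dst = out_dir / src.name
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(src),
         "-c:v", "libx264", "-preset", "fast", "-crf", "23",
         "-c:a", "aac",
         str(dst)],
        check=True,
    )
```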
RTV-Bench evaluation operates on video clips rather than raw long videos. Prepare the clips by splitting the raw videos according to the provided timestamps:
```bash
bash scripts/data_preparation/1_prepare_rtv_clips.sh
```
This script preprocesses the raw videos and generates the temporally aligned video clips used for inference and evaluation.
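Before launching full inference, it can be useful to spot-check that the generated clips decode cleanly. The snippet below uses decord for a quick check; the clip directory is an assumption about the script's output layout:
```python
# Spot-check that prepared clips decode with decord before running inference.
# The clip directory below is an assumed output location of 1_prepare_rtv_clips.sh.
from pathlib import Path
from decord import VideoReader, cpu

clip_dir = Path("./rtv-bench/clips")        # hypothetical output directory
for clip in sorted(clip_dir.glob("**/*.mp4"))[:5]:
    vr = VideoReader(str(clip), ctx=cpu(0))
    print(f"{clip.name}: {len(vr)} frames @ {vr.get_avg_fps():.2f} fps")
```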
Run offline inference with Qwen2.5-VL-7B-Instruct:
```bash
bash ./scripts/eval/infer_offline_qwen2_5_vl.sh
```
The script performs batch inference on the prepared video clips and saves the model predictions for subsequent evaluation.
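For reference, single-clip inference follows the standard Hugging Face usage pattern for Qwen2.5-VL. The sketch below uses a placeholder clip path and question, and its settings may differ from those in the official script:
```python
# Sketch: single-clip inference with Qwen2.5-VL-7B-Instruct (standard transformers usage).
# The clip path and question are placeholders; infer_offline_qwen2_5_vl.sh is authoritative.
import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info

model_dir = "./ckpts"  # local checkpoint directory from the download step
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_dir, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_dir)

messages = [{
    "role": "user",
    "content": [
        {"type": "video", "video": "./rtv-bench/clips/example_clip.mp4", "fps": 1.0},
        {"type": "text", "text": "What is happening in the video? Answer with the option letter."},
    ],
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=64)
answer = processor.batch_decode(
    output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```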
Get the evaluation metric results:
```bash
python ./scripts/eval/compute_acc.py --inputs [your_json_file]
# e.g.
python ./scripts/eval/compute_acc.py --inputs ./eval_results/qwen2.5-VL-*.json

python ./scripts/eval/compute_score.py --inputs [your_json_file]
# e.g.
python ./scripts/eval/compute_score.py --inputs ./eval_results/qwen2.5-VL-*.json
```
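Both scripts consume the prediction JSON files written during inference, and their output defines the official metrics. Purely as an illustration of the accuracy computation, a sketch with assumed prediction and answer fields (the real schema is defined by the scripts) could look like this:
```python
# Illustrative sketch only; compute_acc.py / compute_score.py produce the official numbers.
# The field names "prediction" and "answer" are assumptions about the result schema.
import glob
import json

records = []
for path in glob.glob("./eval_results/qwen2.5-VL-*.json"):
    with open(path) as f:
        records.extend(json.load(f))

correct = sum(r["prediction"].strip().upper() == r["answer"].strip().upper() for r in records)
print(f"accuracy: {correct / len(records):.4f} over {len(records)} QA pairs")
```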
VideoChat-Online
To evaluate the VideoChat-Online baseline, set up a separate environment and run the online inference script:
```bash
mkdir baseline
git clone git@github.com:MCG-NJU/VideoChat-Online.git ./baseline/VideoChat-Online
conda create -n vco-rtv python=3.9
conda activate vco-rtv
pip install -r ./baseline/VideoChat-Online/requirements.txt
pip install av
bash ./scripts/eval/infer_online_videochat_online.sh
```
Note: In our quick evaluation setup, FlashAttention-2 was not installed.
We sincerely thank the authors and maintainers of the following projects, whose open-source models, codebases, and released checkpoints have been instrumental to our research and evaluation pipeline:
- Qwen2.5-VL: The most powerful vision-language model in the Qwen series to date.
- VideoChat-Online: A robust and efficient model for online video understanding.
Their high-quality implementations and transparent releases provide a solid foundation for reproducible research in video-centric multimodal understanding and benchmarking.
If you find RTV-Bench useful for your research, please consider citing our paper:
```bibtex
@inproceedings{xun2025rtv,
  title={RTV-Bench: Benchmarking MLLM Continuous Perception, Understanding and Reasoning through Real-Time Video},
  author={Xun, Shuhang and Tao, Sicheng and Li, Jungang and Shi, Yibo and Lin, Zhixin and Zhu, Zhanhui and Yan, Yibo and Li, Hanqian and Zhang, Linghao and Wang, Shikang and Liu, Yixin and Zhang, Hanbo and Ma, Ying and Hu, Xuming},
  booktitle={Advances in Neural Information Processing Systems},
  volume={38},
  year={2025},
  organization={NeurIPS}
}
```
