🚀 [2025-9-21] Q-Frame has been adopted in the 1st-place solution for the 7th LSVOS RVOS Track: SaSaSa2VA.
🚀 [2025-8-26] Our code is now available on GitHub.
🚀 [2025-6-29] Our paper "Q-Frame: Query-aware Frame Selection and Multi-Resolution Adaptation for Video-LLMs" is now available on arXiv.
🚀 [2025-6-26] Our paper "Q-Frame: Query-aware Frame Selection and Multi-Resolution Adaptation for Video-LLMs" has been accepted at ICCV 2025.
Multimodal Large Language Models (MLLMs) have demonstrated significant success in visual understanding tasks. However, challenges persist in adapting these models to video comprehension due to the large volume of data and temporal complexity. Existing Video-LLMs that use uniform frame sampling often struggle to effectively capture the crucial, query-related spatiotemporal cues in videos. In this paper, we introduce Q-Frame, a novel approach for adaptive frame selection and multi-resolution scaling tailored to the video's content and the specific query. Q-Frame employs a training-free, plug-and-play strategy driven by a text-image matching network such as CLIP, utilizing the Gumbel-Max trick for efficient frame selection. Q-Frame allows Video-LLMs to process more frames without exceeding computational limits, thereby preserving critical temporal and spatial information. We demonstrate Q-Frame's effectiveness through extensive experiments on benchmark datasets, including MLVU, LongVideoBench, and Video-MME, illustrating its superiority over existing methods and its applicability across various video understanding tasks.
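For intuition, the snippet below is a minimal sketch (illustrative only, not the released implementation) of the query-aware selection idea: CLIP-style text-frame similarity scores are perturbed with Gumbel noise and the top-k frames are kept, so higher-scoring frames are more likely to survive while the selection stays stochastic. The score tensor and frame counts are placeholders.

```python
# Illustrative sketch of query-aware frame selection with the Gumbel-Max (top-k) trick.
# `clip_scores` stands in for query-frame similarities produced by a CLIP-style model.
import torch

def gumbel_topk_select(clip_scores: torch.Tensor, k: int, tau: float = 1.0) -> torch.Tensor:
    """Pick k frame indices; frames with higher query similarity are more likely to be kept."""
    logits = clip_scores / tau                                    # temperature-scaled similarities
    u = torch.rand_like(logits).clamp_(1e-9, 1 - 1e-9)            # Uniform(0, 1) samples
    gumbel = -torch.log(-torch.log(u))                            # Gumbel(0, 1) noise
    perturbed = logits + gumbel                                   # Gumbel-Max perturbation
    topk = torch.topk(perturbed, k).indices
    return topk.sort().values                                     # keep frames in temporal order

# Example: 64 candidate frames, keep 16 for the Video-LLM
scores = torch.rand(64)            # placeholder for real CLIP text-frame similarities
print(gumbel_topk_select(scores, k=16))
```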
- We introduce Q-Frame, a novel framework for query-aware frame selection and multi-resolution frame scaling that improves long-form video understanding by focusing on the most relevant visual content for a given query.
- We propose a training-free, plug-and-play mechanism that leverages a CLIP-based vision-language model to guide frame selection and resolution assignment without any additional model training or fine-tuning.
- We validate the effectiveness of Q-Frame through extensive experiments on three benchmark datasets (MLVU, LongVideoBench, and Video-MME), demonstrating its superiority over previous methods and its practical usability for video understanding tasks.
- 🏆 MLVU: Qwen2-VL-7B w/ Q-Frame achieves 65.4 accuracy (+8.5)
- 🏆 LongVideoBench: GPT-4o w/ Q-Frame achieves 58.6 accuracy (+5.3)
- 🏆 Video-MME: GPT-4o w/ Q-Frame achieves solid improvements across various task types and video durations.
- Please install lmms-eval by following its official installation instructions.
conda create -n lmms-eval python=3.10.13
conda activate lmms-eval
pip install -e .
Please first clone Long-CLIP from GitHub by running the following command.
git clone https://github.com/beichenzbc/Long-CLIP.git
Then, download the checkpoints of the pretrained models LongCLIP-B and/or LongCLIP-L and place them under ./Long-CLIP/checkpoints.
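Once a checkpoint is in place, scoring candidate frames against a text query can be done roughly as sketched below. This follows the usage shown in the Long-CLIP README; the query string and frame filenames are illustrative placeholders, not files shipped with this repository.

```python
# Sketch of scoring candidate frames against a text query with Long-CLIP.
# Run from inside the cloned Long-CLIP directory so that `model.longclip` is importable.
import torch
from PIL import Image
from model import longclip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = longclip.load("./checkpoints/longclip-B.pt", device=device)

query = longclip.tokenize(["What happens after the car stops?"]).to(device)      # placeholder query
frames = torch.stack(
    [preprocess(Image.open(f"frame_{i:04d}.jpg")) for i in range(8)]             # placeholder frames
).to(device)

with torch.no_grad():
    text_feat = model.encode_text(query)
    frame_feat = model.encode_image(frames)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    frame_feat = frame_feat / frame_feat.norm(dim=-1, keepdim=True)
    scores = (frame_feat @ text_feat.T).squeeze(-1)   # one cosine similarity per frame

print(scores)   # higher score = frame more relevant to the query
```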
- Before conducting experiments on Q-Frame, please make sure you can run the baseline experimental scripts successfully.
# For LongVideoBench
bash experiments/longvideobench/gpt4o/baseline.sh
bash experiments/longvideobench/qwen2vl/baseline.sh
# For MLVU
bash experiments/mlvu/gpt4o/baseline.sh
bash experiments/mlvu/qwen2vl/baseline.sh
# For Video-MME
bash experiments/videomme/gpt4o/baseline.sh
bash experiments/videomme/qwen2vl/baseline.sh
# For Video-MME w subtitle
bash experiments/videomme_w_subtitle/gpt4o/baseline.sh
bash experiments/videomme_w_subtitle/qwen2vl/baseline.sh
- Now, you can reproduce the performance of Q-Frame.
# For LongVideoBench
bash experiments/longvideobench/gpt4o/qframe.sh
bash experiments/longvideobench/qwen2vl/qframe.sh
# For MLVU
bash experiments/mlvu/gpt4o/qframe.sh
bash experiments/mlvu/qwen2vl/qframe.sh
# For Video-MME
bash experiments/videomme/gpt4o/qframe.sh
bash experiments/videomme/qwen2vl/qframe.sh
# For Video-MME w subtitle
bash experiments/videomme_w_subtitle/gpt4o/qframe.sh
bash experiments/videomme_w_subtitle/qwen2vl/qframe.sh
- max_num_frames: the number of candidate frames sampled from a video.
- sample_frames: the number of frames fed into Video-LLMs.
- high_frames: the number of high resolution video frames.
- mid_frames: the number of medium resolution video frames.
- low_frames: the number of low resolution video frames.
Note:
- For the fixed-frames experimental setting:
  - sample_frames = high_frames
  - mid_frames = low_frames = 0
- For the fixed-tokens experimental setting:
  - sample_frames = high_frames + mid_frames/4 + low_frames/16
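As a quick sanity check on a fixed-tokens configuration, the hypothetical helper below (not part of the released scripts) applies the weighting above, where a medium-resolution frame costs 1/4 and a low-resolution frame 1/16 of a high-resolution frame's tokens.

```python
# Hypothetical helper: verify a (high, mid, low) split against the fixed-token budget.
def check_fixed_token_budget(sample_frames: float, high: int, mid: int, low: int) -> bool:
    effective = high + mid / 4 + low / 16      # token-equivalent frame count
    print(f"effective frame budget = {effective} (target {sample_frames})")
    return effective <= sample_frames

# Example: a 16-frame budget spent as 8 high + 16 mid + 64 low resolution frames
assert check_fixed_token_budget(16, high=8, mid=16, low=64)   # 8 + 4 + 4 = 16
```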
- lmms-eval: One for All Modalities Evaluation Toolkit - including text, image, video, audio tasks.
- Long-CLIP: [ECCV 2024] Long-CLIP: Unlocking the Long-Text Capability of CLIP
- VILA: VILA is a family of state-of-the-art vision language models (VLMs) for diverse multimodal AI tasks across the edge, data center, and cloud.
- Qwen2-VL: Qwen2-VL is the multimodal large language model series developed by Qwen team, Alibaba Cloud.
- GPT-4o: A closed-source multimodal large model developed by OpenAI.
Please cite this work if you find it useful:
@article{zhang2025QFrame,
title={Q-Frame: Query-aware Frame Selection and Multi-Resolution Adaptation for Video-LLMs},
author={Zhang, Shaojie and Yang, Jiahui and Yin, Jianqin and Luo, Zhenbo and Luan, Jian},
journal={arXiv preprint arXiv:2506.22139},
year={2025}
}
