🚀 [2025-9-21] Q-Frame has been adopted in the 1st-place solution for the 7th LSVOS RVOS Track: SaSaSa2VA.
🚀 [2025-8-26] Our code is now available on GitHub.
🚀 [2025-6-29] Our paper "Q-Frame: Query-aware Frame Selection and Multi-Resolution Adaptation for Video-LLMs" is now available on arXiv.
🚀 [2025-6-26] Our paper "Q-Frame: Query-aware Frame Selection and Multi-Resolution Adaptation for Video-LLMs" has been accepted at ICCV 2025.
Multimodal Large Language Models (MLLMs) have demonstrated significant success in visual understanding tasks. However, challenges persist in adapting these models to video comprehension due to the large volume of data and temporal complexity. Existing Video-LLMs that use uniform frame sampling often struggle to effectively capture the crucial, query-related spatiotemporal cues in videos. In this paper, we introduce Q-Frame, a novel approach for adaptive frame selection and multi-resolution scaling tailored to the video's content and the specific query. Q-Frame employs a training-free, plug-and-play strategy driven by a text-image matching network such as CLIP, utilizing the Gumbel-Max trick for efficient frame selection. Q-Frame allows Video-LLMs to process more frames without exceeding computational limits, thereby preserving critical temporal and spatial information. We demonstrate Q-Frame's effectiveness through extensive experiments on benchmark datasets, including MLVU, LongVideoBench, and Video-MME, illustrating its superiority over existing methods and its applicability across various video understanding tasks.
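For intuition, the snippet below is a minimal sketch (illustrative only, not the released implementation) of the query-aware selection idea: CLIP-style text-frame similarity scores are perturbed with Gumbel noise and the top-k frames are kept, so higher-scoring frames are more likely to survive while the selection stays stochastic. The score tensor and frame counts are placeholders.

```python
# Illustrative sketch of query-aware frame selection with the Gumbel-Max (top-k) trick.
# `clip_scores` stands in for query-frame similarities produced by a CLIP-style model.
import torch

def gumbel_topk_select(clip_scores: torch.Tensor, k: int, tau: float = 1.0) -> torch.Tensor:
    """Pick k frame indices; frames with higher query similarity are more likely to be kept."""
    logits = clip_scores / tau                                    # temperature-scaled similarities
    u = torch.rand_like(logits).clamp_(1e-9, 1 - 1e-9)            # Uniform(0, 1) samples
    gumbel = -torch.log(-torch.log(u))                            # Gumbel(0, 1) noise
    perturbed = logits + gumbel                                   # Gumbel-Max perturbation
    topk = torch.topk(perturbed, k).indices
    return topk.sort().values                                     # keep frames in temporal order

# Example: 64 candidate frames, keep 16 for the Video-LLM
scores = torch.rand(64)            # placeholder for real CLIP text-frame similarities
print(gumbel_topk_select(scores, k=16))
```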
- We introduce Q-Frame, a novel framework for query-aware frame selection and multi-resolution frame scaling that improves long-form video understanding by focusing on the most relevant visual content for a given query.
- We propose a training-free, plug-and-play mechanism that leverages a CLIP-based vision-language model to guide frame selection and resolution assignment without any additional model training or fine-tuning.
- We validate the effectiveness of Q-Frame through extensive experiments on three benchmark datasets (MLVU, LongVideoBench, and Video-MME), demonstrating its superiority over previous methods and its practical usability for video understanding tasks.
- 🏆 MLVU: Qwen2-VL-7B w/ Q-Frame achieves 65.4 accuracy (+8.5)
- 🏆 LongVideoBench: GPT-4o w/ Q-Frame achieves 58.6 accuracy (+5.3)
- 🏆 Video-MME: GPT-4o w/ Q-Frame achieves solid improvements across various task types and video durations.
- Please install lmms-eval by following its official installation instructions.
conda create -n lmms-eval python=3.10.13
conda activate lmms-eval
pip install -e .
Please first clone Long-CLIP from GitHub by running the following command.
git clone https://github.com/beichenzbc/Long-CLIP.git
Then, download the checkpoints of the pretrained models LongCLIP-B and/or LongCLIP-L and place them under ./Long-CLIP/checkpoints.
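Once a checkpoint is in place, scoring candidate frames against a text query can be done roughly as sketched below. This follows the usage shown in the Long-CLIP README; the query string and frame filenames are illustrative placeholders, not files shipped with this repository.

```python
# Sketch of scoring candidate frames against a text query with Long-CLIP.
# Run from inside the cloned Long-CLIP directory so that `model.longclip` is importable.
import torch
from PIL import Image
from model import longclip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = longclip.load("./checkpoints/longclip-B.pt", device=device)

query = longclip.tokenize(["What happens after the car stops?"]).to(device)      # placeholder query
frames = torch.stack(
    [preprocess(Image.open(f"frame_{i:04d}.jpg")) for i in range(8)]             # placeholder frames
).to(device)

with torch.no_grad():
    text_feat = model.encode_text(query)
    frame_feat = model.encode_image(frames)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    frame_feat = frame_feat / frame_feat.norm(dim=-1, keepdim=True)
    scores = (frame_feat @ text_feat.T).squeeze(-1)   # one cosine similarity per frame

print(scores)   # higher score = frame more relevant to the query
```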
- Before conducting experiments on Q-Frame, please make sure you can run the baseline experimental scripts successfully.
# For LongVideoBench
bash experiments/longvideobench/gpt4o/baseline.sh
bash experiments/longvideobench/qwen2vl/baseline.sh
# For MLVU
bash experiments/mlvu/gpt4o/baseline.sh
bash experiments/mlvu/qwen2vl/baseline.sh
# For Video-MME
bash experiments/videomme/gpt4o/baseline.sh
bash experiments/videomme/qwen2vl/baseline.sh
# For Video-MME w subtitle
bash experiments/videomme_w_subtitle/gpt4o/baseline.sh
bash experiments/videomme_w_subtitle/qwen2vl/baseline.sh
- Now, you can reproduce the performance of Q-Frame.
# For LongVideoBench
bash experiments/longvideobench/gpt4o/qframe.sh
bash experiments/longvideobench/qwen2vl/qframe.sh
# For MLVU
bash experiments/mlvu/gpt4o/qframe.sh
bash experiments/mlvu/qwen2vl/qframe.sh
# For Video-MME
bash experiments/videomme/gpt4o/qframe.sh
bash experiments/videomme/qwen2vl/qframe.sh
# For Video-MME w subtitle
bash experiments/videomme_w_subtitle/gpt4o/qframe.sh
bash experiments/videomme_w_subtitle/qwen2vl/qframe.sh
- max_num_frames: the number of candidate frames sampled from a video.
- sample_frames: the number of frames fed into Video-LLMs.
- high_frames: the number of high resolution video frames.
- mid_frames: the number of medium resolution video frames.
- low_frames: the number of low resolution video frames.
Note:
- For the fixed-frames experimental setting:
  - sample_frames = high_frames
  - mid_frames = low_frames = 0
- For the fixed-tokens experimental setting:
  - sample_frames = high_frames + mid_frames/4 + low_frames/16
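As a quick sanity check on a fixed-tokens configuration, the hypothetical helper below (not part of the released scripts) applies the weighting above, where a medium-resolution frame costs 1/4 and a low-resolution frame 1/16 of a high-resolution frame's tokens.

```python
# Hypothetical helper: verify a (high, mid, low) split against the fixed-token budget.
def check_fixed_token_budget(sample_frames: float, high: int, mid: int, low: int) -> bool:
    effective = high + mid / 4 + low / 16      # token-equivalent frame count
    print(f"effective frame budget = {effective} (target {sample_frames})")
    return effective <= sample_frames

# Example: a 16-frame budget spent as 8 high + 16 mid + 64 low resolution frames
assert check_fixed_token_budget(16, high=8, mid=16, low=64)   # 8 + 4 + 4 = 16
```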
- lmms-eval: One for All Modalities Evaluation Toolkit - including text, image, video, audio tasks.
- Long-CLIP: [ECCV 2024] Long-CLIP: Unlocking the Long-Text Capability of CLIP
- VILA: VILA is a family of state-of-the-art vision language models (VLMs) for diverse multimodal AI tasks across the edge, data center, and cloud.
- Qwen2-VL: Qwen2-VL is the multimodal large language model series developed by Qwen team, Alibaba Cloud.
- GPT-4o: A closed-source multimodal large model developed by OpenAI.
Please cite this work if you find it useful:
@article{zhang2025QFrame,
title={Q-Frame: Query-aware Frame Selection and Multi-Resolution Adaptation for Video-LLMs},
author={Zhang, Shaojie and Yang, Jiahui and Yin, Jianqin and Luo, Zhenbo and Luan, Jian},
journal={arXiv preprint arXiv:2506.22139},
year={2025}
}
