Paper: [arXiv:2509.16944](https://arxiv.org/abs/2509.16944)
- Jan. 27, 2026: Our paper is accepted by ICLR 2026.
- Oct. 17, 2025: We release the code, data, and checkpoints built upon Qwen2.5-VL.
- Sep. 21, 2025: We release the code, data, and checkpoints for SD-RPN.
- TODO: Release code and weights for DeepSeek-VL
While recent methods leverage a Region-of-Interest (RoI) mechanism to focus on salient areas, they typically present a difficult trade-off: training-based approaches depend on large-scale annotated datasets, while training-free methods that utilize the model's internal attention are computationally inefficient, requiring either multi-pass prefill stages or reliance on the slow auto-regressive decoding process for RoI identification. We propose an efficient, annotation-free Self-Distilled Region Proposal Network (SD-RPN) that resolves this trade-off. Our core innovation is a pipeline that processes and denoises the noisy cross-attention maps from the MLLM's middle layers to generate pseudo-RoI labels. We then use these labels to train a lightweight and tunable Region Proposal Network (RPN) that is built upon the frozen MLLM backbone. Our RPN predicts the RoI in a single forward pass using features available from the MLLM's middle layers, completely decoupling RoI identification from the auto-regressive generation process and avoiding costly multi-pass operations.
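For intuition, the pipeline can be sketched in a few lines of PyTorch. Everything below is illustrative: the tensor shapes, the top-K/threshold denoising rule (loosely suggested by the K15T3 tag in the released checkpoint names), and the one-layer head are assumptions rather than the repository's actual implementation.

```python
import torch
import torch.nn.functional as F

def pseudo_roi_labels(attn, k=15, tau=3.0):
    """Denoise middle-layer cross-attention into tri-state pseudo-RoI labels.

    attn: (layers, heads, num_image_tokens) text-to-image attention scores,
    already aggregated over the question tokens (this shape is an assumption).
    Returns per-token labels: 1 = RoI, 0 = background, -1 = ignored.
    """
    score = attn.mean(dim=(0, 1))             # average out per-layer/per-head noise
    z = (score - score.mean()) / score.std()  # standardize so tau is scale-free
    labels = torch.full_like(z, -1.0)         # default: ambiguous, skipped by the loss
    labels[z < tau] = 0.0                     # weakly attended tokens -> background
    labels[torch.topk(z, k).indices] = 1.0    # k most-attended tokens -> RoI
    return labels

class RPNHead(torch.nn.Module):
    """Lightweight, tunable head over the frozen MLLM's middle-layer features."""
    def __init__(self, dim):
        super().__init__()
        self.proj = torch.nn.Linear(dim, 1)

    def forward(self, feats):                 # feats: (num_image_tokens, dim)
        return self.proj(feats).squeeze(-1)   # per-token RoI logits, single pass

def rpn_loss(logits, labels):
    keep = labels >= 0                        # ignore the ambiguous (-1) tokens
    return F.binary_cross_entropy_with_logits(logits[keep], labels[keep])
```

Because the head reads features that already exist after the prefill, RoI prediction costs a single extra linear pass rather than additional prefill or decoding steps.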
We provide the code and instructions to train and evaluate SD-RPN on top of LLaVA; please follow the steps below.
For Qwen2.5-VL, please check the README in ./Qwen2.5-VL/README.md.
- Clone this repository and navigate to LLaVA folder
git clone https://github.com/YuHengsss/SD-RPN.git
cd LLaVA
- Install Package
conda create -n llava_roi python=3.10 -y
conda activate llava_roi
pip install --upgrade pip # enable PEP 660 support
pip install -e .
- Install additional packages for training cases
pip install -e ".[train]"
pip install flash-attn==2.1.1 --no-build-isolation
Note: we document fixes for common installation and training issues in issue_fix.
To train SD-RPN on LLaVA, please follow the instructions below:
- Download the pkl annotation and move it to your `dataset` folder.
# export HF_ENDPOINT=https://hf-mirror.com # for China users
huggingface-cli download YuhengSSS/roi_pseudo --local-dir ./
- Download the GQA and OCR-VQA datasets into your `dataset` folder. We provide a download script in `scripts/download_datasets.sh`; change the `DATAPATH` and `CODEPATH` in the script to your own folders.
- Start training! Note that before training, you need to configure the miniconda3 path, `HUGGINGFACE_HUB_CACHE`, and `DATASET_PATH` in `scripts/finetune_rois.sh` and `scripts/finetune_rois_13b.sh` to your own paths. It takes less than 4 hours to train SD-RPN+7B on 4 A6000 GPUs.
# for 7B
bash scripts/finetune_rois.sh
# for 13B
bash scripts/finetune_rois_13b.sh
- Merge the checkpoint with the original LLaVA weights to obtain the final model.
python migrate_weights.py # change FINETUNED_MODEL_PATH, ORIGINAL_MODEL_PATH and DESTINATION_MODEL_PATH to your own paths
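The merge itself conceptually overlays the tuned RPN parameters onto the frozen base checkpoint. A minimal sketch of that idea, assuming plain state-dict files (the paths and the exact key handling here are placeholders; the authoritative logic lives in migrate_weights.py):

```python
import torch

# Placeholder paths: point these at your own checkpoints (see migrate_weights.py).
FINETUNED_MODEL_PATH = "checkpoints/sd-rpn-finetuned/pytorch_model.bin"
ORIGINAL_MODEL_PATH = "checkpoints/llava-v1.5-7b/pytorch_model.bin"
DESTINATION_MODEL_PATH = "checkpoints/llava-v1.5-7b-roi/pytorch_model.bin"

base = torch.load(ORIGINAL_MODEL_PATH, map_location="cpu")    # frozen LLaVA weights
tuned = torch.load(FINETUNED_MODEL_PATH, map_location="cpu")  # tuned RPN parameters

base.update(tuned)  # overlay the tuned parameters onto the base state dict
torch.save(base, DESTINATION_MODEL_PATH)
```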
- (Optional) If you wish to train RPN using LLaVA's SFT response:
huggingface-cli download liuhaotian/LLaVA-Instruct-150K --repo-type dataset \
--include "llava_v1_5_mix665k.json" --local-dir ./ #replace with your own path
#get subset for RPN training
python make_152k_samples.py #change the path to your own path
bash scripts/finetune_rois_sft_label.sh #change the path to your own path
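For reference, the subset extraction in make_152k_samples.py conceptually filters the 665k instruction mixture down to the samples used for RoI supervision. A rough sketch, assuming the usual mix665k JSON layout where each record carries an image path (the filter rule and output file name are assumptions; the real selection logic is in make_152k_samples.py):

```python
import json

# Placeholder path: adjust to where you downloaded the annotation file.
with open("llava_v1_5_mix665k.json") as f:
    data = json.load(f)

# Illustrative filter: keep the GQA and OCR-VQA samples, matching the datasets
# used for pseudo-RoI training (their combined size is roughly 152k).
subset = [s for s in data if s.get("image", "").startswith(("gqa/", "ocr_vqa/"))]

with open("llava_rpn_subset.json", "w") as f:
    json.dump(subset, f)
```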
We utilize lmms-eval to evaluate the model. Please follow the instructions below:
- Download the pretrained model and move it to your `checkpoints` folder if you want to evaluate our pretrained model.
# export HF_ENDPOINT=https://hf-mirror.com # for China users
#7B
huggingface-cli download YuhengSSS/llava-v1.5-7b-roi-K15T3-152k-v1bf16Mheads-twiginit-filled --repo-type model --local-dir ./
#13B: this is not the complete model; run migrate_weights.py to merge the weights first
huggingface-cli download YuhengSSS/llava-v1.5-13b-roi-K15T3-152k-v1bf16Mheads-twiginit --local-dir ./ --repo-type model
- Install lmms-eval; check the instructions in `lmms-eval/README.md`.
- Run the evaluation script in `lmms-eval`. Change the `checkpoint_path` to your own path.
bash lmms-eval/examples/A6000/reproduce.sh
- Generate pseudo labels for OCR-VQA and GQA datasets
python make_jsonl.py #change the path to your own path
python make_pseudo_label.py #change the path to your own path
python make_pseudo_label2.py #change the path to your own path
- Merge the pseudo labels
python merge_pkls.py #change the path to your own path
This produces pseudo-label files such as `llava_v1_5_7b_pseudo_roi_release.pkl` and `llava_v1_5_13b_pseudo_roi_release.pkl`.
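If you want to inspect or adapt the merge step, it conceptually just concatenates the per-shard dictionaries. A minimal sketch, assuming each .pkl maps sample IDs to pseudo-RoI labels (the shard naming pattern and dict structure are assumptions; see merge_pkls.py for the real logic):

```python
import pickle
from glob import glob

merged = {}
for path in sorted(glob("pseudo_roi_shard_*.pkl")):  # placeholder shard pattern
    with open(path, "rb") as f:
        merged.update(pickle.load(f))                # assumes {sample_id: labels}

with open("llava_v1_5_7b_pseudo_roi_release.pkl", "wb") as f:
    pickle.dump(merged, f)
```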
If you find SD-RPN useful in your research, please consider citing the following paper:
@article{shi2025catching,
title={Catching the Details: Self-Distilled RoI Predictors for Fine-Grained MLLM Perception},
author={Shi, Yuheng and Pei, Xiaohuan and Dong, Minjing and Xu, Chang},
journal={arXiv preprint arXiv:2509.16944},
year={2025}
}
This project is based on LLaVA, Qwen2.5-VL, lmms-eval and DeepSeek-VL. We sincerely thank the authors for their great work and open-sourcing the code.