CropVLM: Learning to Zoom for Fine-Grained Vision-Language Perception

Abstract

Vision-Language Models (VLMs) often struggle with tasks that require fine-grained image understanding, such as scene-text recognition or document analysis, due to perception limitations and visual fragmentation. To address these challenges, we introduce CropVLM as an external low-cost method for boosting performance, enabling VLMs to dynamically "zoom in" on relevant image regions, enhancing their ability to capture fine details. CropVLM is trained using reinforcement learning, without using human-labeled bounding boxes as a supervision signal, and without expensive synthetic evaluations. The model is trained once and can be paired with both open-source and proprietary VLMs to improve their performance. Our approach delivers significant improvements on tasks that require high-resolution image understanding, notably for benchmarks that are out-of-domain for the target VLM, without modifying or fine-tuning the VLM, thus avoiding catastrophic forgetting.

Method Overview

Env Setup

git clone https://github.com/miguelscarv/cropvlm.git
cd cropvlm
mkdir models predictions datasets
conda create -n cropvlm python=3.12
conda activate cropvlm
pip install -r requirements.txt

Training

Data Preparation

Generate the synthetic bounding-box dataset used for SFT with Qwen/Qwen2.5-VL-7B-Instruct:

python3 create_datasets.py

By default, this applies the method described in the paper only to the TextVQA training split. To fully replicate the paper, concatenate TextVQA, DocVQA, ST-VQA and InfographicsVQA into a single HuggingFace dataset and pass --dataset_path <DATASET_PATH>.

SFT Stage

First, train a model capable of generating bounding boxes in the intended format.

bash scripts/train_sft.sh 0

GRPO Stage

Then, fine-tune the cropping network using GRPO. Be sure to replace the --base_model argument with the path to the previously trained SFT model.

bash scripts/train_grpo.sh 0
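
For readers unfamiliar with GRPO, the core idea is that several outputs are sampled for the same input and each one is scored relative to the rest of its group. The snippet below is a minimal sketch of that normalization only, not the training code in this repository. The function name, the eps term, the use of the population standard deviation, and the reward values are all assumptions chosen for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Score each reward relative to the other samples in the same group.

    Hypothetical helper for illustration; the repository's trainer may
    normalize differently (e.g. sample instead of population std).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every sample gets the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]


# Made-up rewards for four candidate crops sampled for one question
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Samples that beat the group average receive a positive advantage and the others a negative one, so no separate value network is needed to provide a baseline.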

Inference

Bounding Box Generation

To generate bounding boxes for the TextVQA validation split, run:

python3 generate_bboxes.py --model_path <CROPVLM_MODEL_PATH>

The bounding box predictions will then be stored in the predictions directory.

Final Answer Generation

To generate the final answers using SmolVLM (with and without the generated crops), run:

python3 generate_final_answers.py --bbox predictions/textvqa_bbox.json

These will also be stored in the predictions directory.
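
To make the cropping step concrete, here is a minimal sketch of turning a predicted box into pixel coordinates that are safe to crop with. It assumes boxes are given as normalized (x1, y1, x2, y2) values in [0, 1]; the actual format used by this repository is not stated here, so treat the format and the helper name to_pixel_box as hypothetical.

```python
def to_pixel_box(box, width, height):
    """Convert a normalized (x1, y1, x2, y2) box to clamped pixel coordinates.

    Hypothetical helper; the normalized input format is an assumption.
    """
    x1, y1, x2, y2 = box
    # Order the corners so the box is valid even if they arrive swapped
    x1, x2 = sorted((x1, x2))
    y1, y2 = sorted((y1, y2))

    def clamp(v):
        # Keep coordinates inside the image
        return min(max(v, 0.0), 1.0)

    left, right = int(clamp(x1) * width), int(clamp(x2) * width)
    top, bottom = int(clamp(y1) * height), int(clamp(y2) * height)

    # Guarantee a crop of at least one pixel in each dimension
    if right <= left:
        left = min(left, width - 1)
        right = left + 1
    if bottom <= top:
        top = min(top, height - 1)
        bottom = top + 1
    return left, top, right, bottom


# A box that spills past the bottom edge of a 640x480 image
print(to_pixel_box((0.25, 0.5, 0.75, 1.2), 640, 480))  # (160, 240, 480, 480)
```

The returned tuple follows the (left, upper, right, lower) order that Pillow's Image.crop expects, so the crop can be passed to the VLM alongside the full image.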

Evaluation

To calculate VQA Accuracy for the predictions generated above, run:

python3 vqa_accuracy.py --predictions_file <PREDICTIONS_PATH>
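
As a reminder of what this metric measures, the sketch below implements the commonly used simplified form of VQA accuracy, where a prediction is fully correct once at least three annotators gave the same answer. It is an illustration only and is not taken from vqa_accuracy.py: the function name and the lowercase/strip normalization are assumptions, and the official evaluation additionally normalizes punctuation and articles and averages over annotator subsets.

```python
def vqa_accuracy(prediction, human_answers):
    """Simplified VQA accuracy: min(1, number of matching annotators / 3).

    Hypothetical helper; only lowercasing and whitespace stripping are applied.
    """
    pred = prediction.strip().lower()
    matches = sum(1 for answer in human_answers if answer.strip().lower() == pred)
    return min(1.0, matches / 3)


# Made-up annotations: ten annotators, two of whom wrote "coca cola"
answers = ["coca cola"] * 2 + ["coke"] * 8
print(vqa_accuracy("Coca Cola", answers))  # two matches: partial credit
print(vqa_accuracy("coke", answers))       # eight matches: full credit
print(vqa_accuracy("pepsi", answers))      # no matches: zero
```

A dataset-level score is then the mean of this value over all questions.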

Bibtex

If you find CropVLM helpful for your work, please cite:

@article{carvalho2025cropvlm,
  title={CropVLM: Learning to Zoom for Fine-Grained Vision-Language Perception},
  author={Carvalho, Miguel and Dias, Helder and Martins, Bruno},
  journal={arXiv preprint arXiv:2511.19820},
  year={2025}
}


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

CropVLM: Learning to Zoom for Fine-Grained Vision-Language Perception

Abstract

Method Overview

Env Setup

Training

Data Preparation

SFT Stage

GRPO Stage

Inference

Bounding Box Generation

Final Answer Generation

Evaluation

Bibtex

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

CropVLM: Learning to Zoom for Fine-Grained Vision-Language Perception

Abstract

Method Overview

Env Setup

Training

Data Preparation

SFT Stage

GRPO Stage

Inference

Bounding Box Generation

Final Answer Generation

Evaluation

Bibtex

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages