📖 Paper | 📚 HuggingFace Dataset
This repo contains the official evaluation code and dataset for the paper "V-REX: Benchmarking Exploratory Visual Reasoning via Chain-of-Questions".
- 🔥 700+ samples and 2,500+ questions covering diverse application scenarios and practical challenges, such as diagrams, time estimation, GUI interpretation, and others.
- 🔥 15 real-world scenarios across 4 reasoning categories: Deduction, Guessing, Navigation, and Retrieval.
- 🔎 By following the Chain-of-Questions (CoQ) hints, VLMs consistently achieve better performance on the final questions, demonstrating the importance of exploration in visual reasoning.
- 🔎 Scaling laws persist on V-REX, while performance variance among models of the same size is smaller in Following than in Planning.
- 🔎 Following and Planning abilities both contribute positively to the overall performance of the model.
- 🔎 Smaller models are better at Following than Planning while larger models have more balanced performance.
- 🔎 VLMs are better at recovering from failed planning steps than from failed following steps.
We conduct an extensive evaluation of 30+ vision-language models (VLMs) spanning a range of large language model (LLM) sizes and families. Our evaluation includes state-of-the-art models such as GPT, Gemini, InternVL2.5, InternVL3, InternVL3.5, Qwen2.5-VL, and Qwen3-VL. This selection covers a diverse set of architectures, including both proprietary and open-source models, enabling a comprehensive assessment of their reasoning capabilities under different computational constraints.
We provide detailed instructions for evaluation below.

Load the planning and following splits from the HuggingFace dataset:
```python
from datasets import load_dataset

dataset = load_dataset("umd-zhou-lab/V-REX", split="planning")
dataset = load_dataset("umd-zhou-lab/V-REX", split="following")
```
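As a quick sanity check, a minimal sketch that loads the planning split and prints its size, schema, and one raw sample. It only uses the standard `datasets` API and makes no assumptions about the specific field names in V-REX:

```python
from datasets import load_dataset

# Quick sanity check of the planning split: size, schema, and one raw sample.
planning = load_dataset("umd-zhou-lab/V-REX", split="planning")

print(len(planning))          # number of samples in the split
print(planning.column_names)  # field names actually provided by the dataset
print(planning[0])            # one sample as a Python dict
```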
Inference code for some VLMs is prepared in the `inference` folder. The inference results will be saved in the folder `RESULT_DIR`.

For the planning track, run:

```bash
python3 inference/planning/infer_pipe.py --model=$MODEL --save_dir=$RESULT_DIR --category=$CATEGORY
```

- The mappings between each model's full name and its abbreviation are provided in the corresponding model files (see `inference/planning/utils`).
For the following track, run:

```bash
python3 inference/following/[model_class].py --model=$MODEL --save_dir=$RESULT_DIR --category=$CATEGORY
```
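If you want to sweep several models or categories in one go, a minimal Python sketch along the following lines may help. The model abbreviation and the `--category` values below are hypothetical placeholders; take the real ones from the model files in `inference/planning/utils` and the benchmark categories:

```python
import subprocess

# Sketch: batch planning-track inference over models and categories.
# NOTE: the model abbreviation and --category values are placeholders/assumptions;
# replace them with the actual names used by the repo.
RESULT_DIR = "results"
MODELS = ["qwen2.5-vl-7b"]  # hypothetical abbreviation
CATEGORIES = ["Deduction", "Guessing", "Navigation", "Retrieval"]  # assumed values

for model in MODELS:
    for category in CATEGORIES:
        subprocess.run(
            [
                "python3", "inference/planning/infer_pipe.py",
                f"--model={model}",
                f"--save_dir={RESULT_DIR}",
                f"--category={category}",
            ],
            check=True,
        )
```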
We provide scripts to compute the evaluation metrics from the inference results. You can run the commands below to get the final results, which will be saved in the folder `EVAL_DIR`.

```bash
python3 evaluation/planning.py --result_dir=$RESULT_DIR --eval_dir=$EVAL_DIR --model=$MODEL --category=$CATEGORY
python3 evaluation/following.py --result_dir=$RESULT_DIR --eval_dir=$EVAL_DIR --model=$MODEL --category=$CATEGORY
```
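To score one model on both tracks across all categories, a small sketch such as the following can wrap the two evaluation scripts; `RESULT_DIR`, `EVAL_DIR`, the model abbreviation, and the category names are placeholders to adapt:

```python
import subprocess

# Sketch: run planning and following evaluation for one model.
# NOTE: MODEL, RESULT_DIR, EVAL_DIR, and the category names are placeholders;
# the scripts and flags are the ones documented above.
RESULT_DIR, EVAL_DIR = "results", "eval"
MODEL = "qwen2.5-vl-7b"  # hypothetical abbreviation
CATEGORIES = ["Deduction", "Guessing", "Navigation", "Retrieval"]  # assumed values

for script in ("evaluation/planning.py", "evaluation/following.py"):
    for category in CATEGORIES:
        subprocess.run(
            [
                "python3", script,
                f"--result_dir={RESULT_DIR}",
                f"--eval_dir={EVAL_DIR}",
                f"--model={MODEL}",
                f"--category={category}",
            ],
            check=True,
        )
```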

