📖 Paper | 📚 HuggingFace Dataset
This repo contains the official evaluation code and dataset for the paper "V-REX: Benchmarking Exploratory Visual Reasoning via Chain-of-Questions".
- 🔥 700+ samples and 2,500+ questions covering diverse application scenarios and practical challenges, such as diagrams, time estimation, GUI interpretation, and others.
- 🔥 15 real-world scenarios across 4 reasoning categories: Deduction, Guessing, Navigation, and Retrieval.
- 🔎 By following the Chain-of-Questions (CoQ) hints, VLMs consistently achieve better performance on the final questions, demonstrating the importance of exploration in visual reasoning.
- 🔎 Scaling laws persist on V-REX, while performance variance among models of the same size is smaller in Following than in Planning.
- 🔎 Following and Planning abilities both contribute positively to the overall performance of the model.
- 🔎 Smaller models are better at Following than Planning while larger models have more balanced performance.
- 🔎 VLMs are better at recovering from failed planning steps than from failed following steps.
We conduct an extensive evaluation of 30+ vision-language models (VLMs) spanning a range of large language model (LLM) sizes and families. Our evaluation includes state-of-the-art models such as GPT, Gemini, InternVL2.5, InternVL3, InternVL3.5, Qwen2.5-VL, and Qwen3-VL. This selection covers a diverse set of architectures, including both proprietary and open-source models, enabling a comprehensive assessment of their reasoning capabilities under different computational constraints.
We provide detailed instructions for evaluation below.

Load the planning and following splits from the HuggingFace dataset:
```python
from datasets import load_dataset

dataset = load_dataset("umd-zhou-lab/V-REX", split="planning")
dataset = load_dataset("umd-zhou-lab/V-REX", split="following")
```
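As a quick sanity check, a minimal sketch that loads the planning split and prints its size, schema, and one raw sample. It only uses the standard `datasets` API and makes no assumptions about the specific field names in V-REX:

```python
from datasets import load_dataset

# Quick sanity check of the planning split: size, schema, and one raw sample.
planning = load_dataset("umd-zhou-lab/V-REX", split="planning")

print(len(planning))          # number of samples in the split
print(planning.column_names)  # field names actually provided by the dataset
print(planning[0])            # one sample as a Python dict
```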
Inference code for some VLMs is prepared in the `inference` folder. The inference results will be saved in the folder `RESULT_DIR`.

For the planning track, run:

```bash
python3 inference/planning/infer_pipe.py --model=$MODEL --save_dir=$RESULT_DIR --category=$CATEGORY
```

- The mappings between each model's full name and its abbreviation are provided in the corresponding model files (see `inference/planning/utils`).
For the following track, run:

```bash
python3 inference/following/[model_class].py --model=$MODEL --save_dir=$RESULT_DIR --category=$CATEGORY
```
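If you want to sweep several models or categories in one go, a minimal Python sketch along the following lines may help. The model abbreviation and the `--category` values below are hypothetical placeholders; take the real ones from the model files in `inference/planning/utils` and the benchmark categories:

```python
import subprocess

# Sketch: batch planning-track inference over models and categories.
# NOTE: the model abbreviation and --category values are placeholders/assumptions;
# replace them with the actual names used by the repo.
RESULT_DIR = "results"
MODELS = ["qwen2.5-vl-7b"]  # hypothetical abbreviation
CATEGORIES = ["Deduction", "Guessing", "Navigation", "Retrieval"]  # assumed values

for model in MODELS:
    for category in CATEGORIES:
        subprocess.run(
            [
                "python3", "inference/planning/infer_pipe.py",
                f"--model={model}",
                f"--save_dir={RESULT_DIR}",
                f"--category={category}",
            ],
            check=True,
        )
```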
We provide scripts to compute the evaluation metrics from the inference results. You can run the commands below to get the final results, which will be saved in the folder `EVAL_DIR`.

```bash
python3 evaluation/planning.py --result_dir=$RESULT_DIR --eval_dir=$EVAL_DIR --model=$MODEL --category=$CATEGORY
python3 evaluation/following.py --result_dir=$RESULT_DIR --eval_dir=$EVAL_DIR --model=$MODEL --category=$CATEGORY
```
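To score one model on both tracks across all categories, a small sketch such as the following can wrap the two evaluation scripts; `RESULT_DIR`, `EVAL_DIR`, the model abbreviation, and the category names are placeholders to adapt:

```python
import subprocess

# Sketch: run planning and following evaluation for one model.
# NOTE: MODEL, RESULT_DIR, EVAL_DIR, and the category names are placeholders;
# the scripts and flags are the ones documented above.
RESULT_DIR, EVAL_DIR = "results", "eval"
MODEL = "qwen2.5-vl-7b"  # hypothetical abbreviation
CATEGORIES = ["Deduction", "Guessing", "Navigation", "Retrieval"]  # assumed values

for script in ("evaluation/planning.py", "evaluation/following.py"):
    for category in CATEGORIES:
        subprocess.run(
            [
                "python3", script,
                f"--result_dir={RESULT_DIR}",
                f"--eval_dir={EVAL_DIR}",
                f"--model={MODEL}",
                f"--category={category}",
            ],
            check=True,
        )
```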

