VREX

📖 Paper | 📚 HuggingFace Dataset

(Figure: overview of the Planning and Following tasks.)

This repo contains the official evaluation code and dataset for the paper "V-REX: Benchmarking Exploratory Visual Reasoning via Chain-of-Questions".

Highlights

  • 🔥 700+ samples and 2,500+ questions covering diverse application scenarios and practical challenges, such as diagrams, time estimation, and GUI interpretation.
  • 🔥 15 real-world scenarios across 4 reasoning categories: Deduction, Guessing, Navigation, and Retrieval.

Findings

  • 🔎 By following the hints in the Chain-of-Questions (CoQ), VLMs consistently achieve better performance on the final questions, demonstrating the importance of exploration in visual reasoning.
  • 🔎 Scaling laws persist on V-REX, while performance variance among models of the same size is smaller in Following than in Planning.
  • 🔎 Following and Planning abilities both contribute positively to the overall performance of the model.
  • 🔎 Smaller models are better at Following than Planning, while larger models show more balanced performance.
  • 🔎 VLMs are better at recovering from failed planning steps than from failed following steps.

Dataset Introduction

(Figures: V-REX dataset overview.)

VLMs' Result

We conduct an extensive evaluation of 30+ vision-language models (VLMs) spanning a range of large language model (LLM) sizes and families. Our evaluation includes state-of-the-art models such as GPT, Gemini, InternVL2.5, InternVL3, InternVL3.5, Qwen2.5-VL, and Qwen3-VL. This selection covers a diverse set of architectures, including both proprietary and open-source models, enabling a comprehensive assessment of their reasoning capabilities under different computational constraints.

Evaluation Pipeline

We provide detailed instructions for evaluation as follows.

Dataset Installation

from datasets import load_dataset
dataset = load_dataset("umd-zhou-lab/V-REX", split="planning")
dataset = load_dataset("umd-zhou-lab/V-REX", split="following")
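
After loading, the splits behave like ordinary Hugging Face datasets, so you can check their size, fields, and individual samples directly. The snippet below is a minimal sketch; the exact column names depend on the released schema.

from datasets import load_dataset

# Load both task splits.
planning = load_dataset("umd-zhou-lab/V-REX", split="planning")
following = load_dataset("umd-zhou-lab/V-REX", split="following")

# Inspect the number of samples and the available fields.
print(len(planning), planning.column_names)
print(len(following), following.column_names)

# Look at one example (column names depend on the released schema).
print(planning[0])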

Inference with Models

Inference code for several VLMs is provided in the inference folder. The final results will be saved in the folder RESULT_DIR (a minimal sketch of this flow follows the two commands below).

Planning Task

python3 inference/planning/infer_pipe.py --model=$MODEL --save_dir=$RESULT_DIR --category=$CATEGORY
  • The mappings between each model’s full name and its abbreviation are provided in the corresponding model files (see inference/planning/utils).

Following Task

python3 inference/following/[model_class].py --model=$MODEL --save_dir=$RESULT_DIR --category=$CATEGORY
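
Both commands follow the same pattern: load a split, query the model on each sample, and write predictions to RESULT_DIR. The sketch below illustrates that flow only; the field names image and question and the query_vlm helper are placeholders, not the repo's actual schema or API, so refer to the scripts in inference/ for the real implementation.

import json, os
from datasets import load_dataset

RESULT_DIR = "results"  # hypothetical output folder
os.makedirs(RESULT_DIR, exist_ok=True)

def query_vlm(image, prompt):
    # Placeholder for a call to the VLM under evaluation.
    raise NotImplementedError

planning = load_dataset("umd-zhou-lab/V-REX", split="planning")
predictions = []
for sample in planning:
    # "image" and "question" are assumed field names; check the dataset schema.
    predictions.append({"prediction": query_vlm(sample["image"], sample["question"])})

with open(os.path.join(RESULT_DIR, "planning_predictions.json"), "w") as f:
    json.dump(predictions, f, indent=2)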

Evaluation Results

We provide scripts to compute evaluation metrics from the inference results. Run the commands below to obtain the final results, which will be saved in the folder EVAL_DIR.

Planning Task

python3 evaluation/planning.py --result_dir=$RESULT_DIR --eval_dir=$EVAL_DIR --model=$MODEL --category=$CATEGORY 

Following Task

python3 evaluation/following.py --result_dir=$RESULT_DIR --eval_dir=$EVAL_DIR --model=$MODEL --category=$CATEGORY 
