VLM-SubtleBench is a benchmark for evaluating subtle comparative reasoning in vision-language models (VLMs): the ability to identify fine-grained differences between visually similar image pairs. Unlike prior comparison benchmarks that emphasize large and salient changes, VLM-SubtleBench focuses on nuanced variations that are often critical in real-world settings.
The benchmark covers 10 difference types—Attribute, State, Emotion, Temporal, Spatial, Existence, Quantity, Quality, Viewpoint, and Action—and spans diverse domains including natural, industrial, medical, aerial, and synthetic imagery. It supports both multiple-choice and free-form evaluation, enabling systematic analysis of how current VLMs perform across difference types and domains.
- Environment Setup
- Dataset
- API Keys
- Running Evaluations
- Using a Local Model
- Supported Models
- Results and Logs
## Environment Setup

Requires Python >= 3.8.
```bash
# Install core dependencies
pip install -r requirements.txt

# Install the package in editable mode
pip install -e .

# Install optional dependencies as needed
pip install anthropic                  # Anthropic Claude API
pip install google-genai               # Google Gemini API
pip install transformers torch         # Local model support
pip install matplotlib seaborn pandas  # Analysis/visualization
```

## Dataset

Download the dataset from Hugging Face.
By default, the code expects the dataset at `VLM-SubtleBench/` in the project root. You can either:

- Place (or symlink) the downloaded dataset as `VLM-SubtleBench/` in the project root, or
- Override the path via CLI:

```bash
python scripts/evaluate_multiple_choice.py data.dataset_path="/path/to/your/dataset"
```

Items can be filtered by:
- Split: `test` (default), `val`, or `null` (all)
- Category: `action`, `attribute`, `emotion`, `existence`, `quality`, `quantity`, `spatial`, `state`, `temporal`, `viewpoint`
- Domain: `natural`, `industrial`, `medical`, `aerial`, `synthetic`
Free-form evaluation uses only items where `has_caption == true`.
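The split/category/domain filtering described above can be sketched as a small helper. This is illustrative only; the field names (`split`, `category`, `domain`) are assumptions, not necessarily the dataset's actual schema.

```python
# Illustrative sketch of split/category/domain filtering.
# Field names are assumptions, not the benchmark's actual schema.
def filter_items(items, split=None, category=None, domain=None):
    """Keep items matching every non-None filter; None means 'all'."""
    def keep(item):
        return ((split is None or item["split"] == split)
                and (category is None or item["category"] == category)
                and (domain is None or item["domain"] == domain))
    return [item for item in items if keep(item)]

items = [
    {"split": "test", "category": "attribute", "domain": "natural"},
    {"split": "val", "category": "state", "domain": "medical"},
]
print(len(filter_items(items, split="test", category="attribute")))  # 1
```

Passing `None` (the CLI's `null`) for a filter disables it, matching the "null=all" behavior of the CLI options.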
## API Keys

API keys are loaded from files in the `keys/` directory. Create the following structure and add your keys:
```
keys/
├── openai-key/
│   └── key.env           # Line 1: API key, Line 2 (optional): org key
├── anthropic-key/
│   └── key.env           # Single line: API key
├── google-key/
│   └── gemini_gcp.json   # GCP service account JSON
└── openrouter-key/
    └── key.env           # Single line: API key
```
You only need to set up keys for the backends you plan to use. For example, if you only use GPT-4o, you only need `keys/openai-key/key.env`.
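The `key.env` layout above (first line: API key; optional second line: org key) can be illustrated with a hypothetical parser; this is not the repo's actual loading code:

```python
# Hypothetical sketch of parsing the key.env layout: line 1 is the API key,
# line 2 (optional) is the org key. Not the repo's actual loader.
def parse_key_env(text):
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    api_key = lines[0] if lines else None
    org_key = lines[1] if len(lines) > 1 else None
    return api_key, org_key

print(parse_key_env("sk-abc123\norg-xyz789\n"))  # ('sk-abc123', 'org-xyz789')
```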
Example — setting up an OpenAI key:
```bash
mkdir -p keys/openai-key
echo "sk-your-api-key-here" > keys/openai-key/key.env
```

## Running Evaluations

Configuration uses YAML files in `configs/` with CLI overrides via `key.subkey=value` syntax. All options shown below are optional and have sensible defaults.
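The `key.subkey=value` override syntax can be illustrated with a minimal parser sketch; the project's real config handling (e.g. a Hydra/OmegaConf-style system) may differ, including in how values are typed:

```python
# Minimal sketch of dotted CLI overrides; illustrative only.
# A real config system would also coerce value types (int, bool, null).
def apply_override(config, override):
    """Set a dotted key like 'model.max_workers=8' into a nested dict."""
    dotted, _, raw = override.partition("=")
    keys = dotted.split(".")
    node = config
    for key in keys[:-1]:
        node = node.setdefault(key, {})
    node[keys[-1]] = raw
    return config

cfg = {"model": {"llm_name": "gpt-4o"}}
apply_override(cfg, "model.max_workers=8")
print(cfg)  # {'model': {'llm_name': 'gpt-4o', 'max_workers': '8'}}
```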
Multiple-choice evaluation:

```bash
python scripts/evaluate_multiple_choice.py \
    model.llm_name="gpt-4o" \
    model.prompt_type="standard" \
    model.use_multithreading=true \
    model.max_workers=8 \
    data.max_questions=100 \
    data.split="test" \
    data.category="attribute" \
    data.domain="natural"
```

- `model.llm_name`: model to evaluate (default: `gpt-4o`)
- `model.prompt_type`: prompt template (default: `standard`)
- `model.use_multithreading`: enable concurrent API calls (default: `true`)
- `model.max_workers`: number of threads (default: `8`)
- `data.max_questions`: limit number of questions, `null` = all (default: `null`)
- `data.split`: data split: `test`, `val`, or `null` for all (default: `test`)
- `data.category`: filter by category, `null` = all (default: `null`)
- `data.domain`: filter by domain, `null` = all (default: `null`)

Dataset mode — evaluates all captioned items:
```bash
python scripts/evaluate_free_form.py \
    data.mode="dataset" \
    model.llm_name="gpt-4o" \
    model.use_multithreading=true \
    model.max_workers=8 \
    data.max_pairs=50 \
    data.split="test" \
    data.category="state" \
    data.domain="natural"
```

- `data.mode`: evaluation mode (default: `dataset`)
- `model.llm_name`: model to evaluate (default: `gpt-4o`)
- `model.use_multithreading`: enable concurrent API calls (default: `true`)
- `model.max_workers`: number of threads (default: `8`)
- `data.max_pairs`: limit number of pairs, `null` = all (default: `null`)
- `data.split`: data split: `test`, `val`, or `null` for all (default: `test`)
- `data.category`: filter by category, `null` = all (default: `null`)
- `data.domain`: filter by domain, `null` = all (default: `null`)

Pair mode — evaluate a specific image pair:
```bash
python scripts/evaluate_free_form.py \
    data.mode="pair" \
    data.first_image="path/to/image1.png" \
    data.second_image="path/to/image2.png"
```

## Using a Local Model

You can use any model served via an OpenAI-compatible API. This works with serving frameworks such as SGLang and vLLM.
The served model name must start with `local_` so it routes to the local backend instead of a cloud provider. Use `--served-model-name` (or equivalent) to set this.
```bash
# Example with SGLang
python -m sglang.launch_server \
    --model Qwen/Qwen3.5-0.8B \
    --port 8000 \
    --served-model-name local_Qwen3.5-0.8B

# Example with vLLM
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.2-11B-Vision-Instruct \
    --port 8000 \
    --served-model-name local_Llama-3.2-11B-Vision-Instruct
```

Provide `model.api_key` and `model.api_base_url` via CLI overrides:
```bash
python scripts/evaluate_multiple_choice.py \
    model.llm_name="local_Qwen3.5-0.8B" \
    model.api_key="dummy" \
    model.api_base_url="http://localhost:8000/v1" \
    data.max_questions=3
```

Available prompt types (`model.prompt_type`):

| Type | Description |
|---|---|
| `standard` | Default two-image comparison prompt |
| `no_reasoning` | Direct answer without chain-of-thought |
| `camera_augmented` | Adds camera/viewpoint context |
| `concatenated` | Horizontally concatenates the image pair |
| `grid` | Arranges images in a grid layout |
| `overlapped` | Overlays images for comparison |
| `substract` | Shows pixel difference between images |
## Supported Models

Model routing is determined by substring matching on the model name:
| Model Name Pattern | Backend | Example |
|---|---|---|
| `gpt-4*`, `gpt-5`, `o1*`, `o3*`, `o4*` | OpenAI | `gpt-4o`, `o3`, `gpt-5` |
| `claude*` | OpenRouter | `anthropic/claude-sonnet-4` |
| `gemini*` | Google Gemini | `gemini-2.5-flash`, `gemini-2.5-pro` |
| `llava*` | vLLM Server | `llava-hf/llava-onevision-qwen2-0.5b-ov-hf` |
| `qwen*`, `internvl*` | OpenRouter | `qwen/qwen2.5-vl-72b-instruct` |
| `local_*` | Local (OpenAI-compatible) | `local_Qwen3.5-0.8B`, `local_Llama-3.2-11B` |
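The substring routing in the table above can be sketched roughly as follows. This is an illustration, not the repo's actual matching code; in particular, the route order here (with `local_` checked first) is an assumption.

```python
# Illustrative sketch of substring-based model routing.
# Route order is an assumption; local_ is checked first so local
# models never fall through to a cloud backend.
ROUTES = [
    (("local_",), "local"),
    (("gpt-4", "gpt-5", "o1", "o3", "o4"), "openai"),
    (("claude",), "openrouter"),
    (("gemini",), "google"),
    (("llava",), "vllm"),
    (("qwen", "internvl"), "openrouter"),
]

def route(model_name):
    for substrings, backend in ROUTES:
        if any(sub in model_name for sub in substrings):
            return backend
    raise ValueError(f"No backend for model: {model_name}")

print(route("gpt-4o"))              # openai
print(route("local_Qwen3.5-0.8B"))  # local
```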
## Results and Logs

Results are saved to:

```
logs/<evaluator_type>/<model>/<prompt_type>/<dataset>/<timestamp>/
├── run.log                     # Execution log
└── mc_evaluation_results.json  # Predictions, accuracy, costs
```
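The directory layout above can be composed as a simple string join. The helper below is hypothetical (the repo builds this path internally), and the example `evaluator_type` value is an assumption:

```python
# Hypothetical helper composing the results directory from the layout above.
# The "multiple_choice" evaluator_type value is an assumption.
from datetime import datetime

def results_dir(evaluator_type, model, prompt_type, dataset, timestamp=None):
    timestamp = timestamp or datetime.now().strftime("%Y%m%d_%H%M%S")
    return "/".join(["logs", evaluator_type, model, prompt_type,
                     dataset, timestamp])

print(results_dir("multiple_choice", "gpt-4o", "standard",
                  "VLM-SubtleBench", "20250101_120000"))
# logs/multiple_choice/gpt-4o/standard/VLM-SubtleBench/20250101_120000
```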