VLM-SubtleBench is a benchmark for evaluating subtle comparative reasoning in vision-language models (VLMs): the ability to identify fine-grained differences between visually similar image pairs. Unlike prior comparison benchmarks that emphasize large and salient changes, VLM-SubtleBench focuses on nuanced variations that are often critical in real-world settings.
The benchmark covers 10 difference types—Attribute, State, Emotion, Temporal, Spatial, Existence, Quantity, Quality, Viewpoint, and Action—and spans diverse domains including natural, industrial, medical, aerial, and synthetic imagery. It supports both multiple-choice and free-form evaluation, enabling systematic analysis of how current VLMs perform across difference types and domains.
- Environment Setup
- Dataset
- API Keys
- Running Evaluations
- Using a Local Model
- Supported Models
- Results and Logs
## Environment Setup

Requires Python >= 3.8.
```bash
# Install core dependencies
pip install -r requirements.txt

# Install the package in editable mode
pip install -e .

# Install optional dependencies as needed
pip install anthropic                  # Anthropic Claude API
pip install google-genai               # Google Gemini API
pip install transformers torch         # Local model support
pip install matplotlib seaborn pandas  # Analysis/visualization
```

## Dataset

Download the dataset from Hugging Face.
By default, the code expects the dataset at `VLM-SubtleBench/` in the project root. You can either:

- Place (or symlink) the downloaded dataset as `VLM-SubtleBench/` in the project root, or
- Override the path via CLI:

```bash
python scripts/evaluate_multiple_choice.py data.dataset_path="/path/to/your/dataset"
```

Items can be filtered by:
- Split: `test` (default), `val`, or `null` (all)
- Category: `action`, `attribute`, `emotion`, `existence`, `quality`, `quantity`, `spatial`, `state`, `temporal`, `viewpoint`
- Domain: `natural`, `industrial`, `medical`, `aerial`, `synthetic`
Free-form evaluation uses only items where `has_caption == true`.
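The split/category/domain filtering described above can be sketched as a small helper. This is illustrative only; the field names (`split`, `category`, `domain`) are assumptions, not necessarily the dataset's actual schema.

```python
# Illustrative sketch of split/category/domain filtering.
# Field names are assumptions, not the benchmark's actual schema.
def filter_items(items, split=None, category=None, domain=None):
    """Keep items matching every non-None filter; None means 'all'."""
    def keep(item):
        return ((split is None or item["split"] == split)
                and (category is None or item["category"] == category)
                and (domain is None or item["domain"] == domain))
    return [item for item in items if keep(item)]

items = [
    {"split": "test", "category": "attribute", "domain": "natural"},
    {"split": "val", "category": "state", "domain": "medical"},
]
print(len(filter_items(items, split="test", category="attribute")))  # 1
```

Passing `None` (the CLI's `null`) for a filter disables it, matching the "null=all" behavior of the CLI options.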
## API Keys

API keys are loaded from files in the `keys/` directory. Create the following structure and add your keys:
```
keys/
├── openai-key/
│   └── key.env           # Line 1: API key, Line 2 (optional): org key
├── anthropic-key/
│   └── key.env           # Single line: API key
├── google-key/
│   └── gemini_gcp.json   # GCP service account JSON
└── openrouter-key/
    └── key.env           # Single line: API key
```
You only need to set up keys for the backends you plan to use. For example, if you only use GPT-4o, you only need `keys/openai-key/key.env`.
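The `key.env` layout above (first line: API key; optional second line: org key) can be illustrated with a hypothetical parser; this is not the repo's actual loading code:

```python
# Hypothetical sketch of parsing the key.env layout: line 1 is the API key,
# line 2 (optional) is the org key. Not the repo's actual loader.
def parse_key_env(text):
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    api_key = lines[0] if lines else None
    org_key = lines[1] if len(lines) > 1 else None
    return api_key, org_key

print(parse_key_env("sk-abc123\norg-xyz789\n"))  # ('sk-abc123', 'org-xyz789')
```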
Example — setting up an OpenAI key:
```bash
mkdir -p keys/openai-key
echo "sk-your-api-key-here" > keys/openai-key/key.env
```

## Running Evaluations

Configuration uses YAML files in `configs/` with CLI overrides via `key.subkey=value` syntax. All options shown below are optional and have sensible defaults.
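The `key.subkey=value` override syntax can be illustrated with a minimal parser sketch; the project's real config handling (e.g. a Hydra/OmegaConf-style system) may differ, including in how values are typed:

```python
# Minimal sketch of dotted CLI overrides; illustrative only.
# A real config system would also coerce value types (int, bool, null).
def apply_override(config, override):
    """Set a dotted key like 'model.max_workers=8' into a nested dict."""
    dotted, _, raw = override.partition("=")
    keys = dotted.split(".")
    node = config
    for key in keys[:-1]:
        node = node.setdefault(key, {})
    node[keys[-1]] = raw
    return config

cfg = {"model": {"llm_name": "gpt-4o"}}
apply_override(cfg, "model.max_workers=8")
print(cfg)  # {'model': {'llm_name': 'gpt-4o', 'max_workers': '8'}}
```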
Multiple-choice evaluation:

```bash
python scripts/evaluate_multiple_choice.py \
    model.llm_name="gpt-4o" \
    model.prompt_type="standard" \
    model.use_multithreading=true \
    model.max_workers=8 \
    data.max_questions=100 \
    data.split="test" \
    data.category="attribute" \
    data.domain="natural"
```

- `model.llm_name`: model to evaluate (default: `gpt-4o`)
- `model.prompt_type`: prompt template (default: `standard`)
- `model.use_multithreading`: enable concurrent API calls (default: `true`)
- `model.max_workers`: number of threads (default: `8`)
- `data.max_questions`: limit number of questions, `null` = all (default: `null`)
- `data.split`: data split: `test`, `val`, or `null` for all (default: `test`)
- `data.category`: filter by category, `null` = all (default: `null`)
- `data.domain`: filter by domain, `null` = all (default: `null`)

Dataset mode — evaluates all captioned items:
```bash
python scripts/evaluate_free_form.py \
    data.mode="dataset" \
    model.llm_name="gpt-4o" \
    model.use_multithreading=true \
    model.max_workers=8 \
    data.max_pairs=50 \
    data.split="test" \
    data.category="state" \
    data.domain="natural"
```

- `data.mode`: evaluation mode (default: `dataset`)
- `model.llm_name`: model to evaluate (default: `gpt-4o`)
- `model.use_multithreading`: enable concurrent API calls (default: `true`)
- `model.max_workers`: number of threads (default: `8`)
- `data.max_pairs`: limit number of pairs, `null` = all (default: `null`)
- `data.split`: data split: `test`, `val`, or `null` for all (default: `test`)
- `data.category`: filter by category, `null` = all (default: `null`)
- `data.domain`: filter by domain, `null` = all (default: `null`)

Pair mode — evaluate a specific image pair:
```bash
python scripts/evaluate_free_form.py \
    data.mode="pair" \
    data.first_image="path/to/image1.png" \
    data.second_image="path/to/image2.png"
```

## Using a Local Model

You can use any model served via an OpenAI-compatible API. This works with serving frameworks such as SGLang and vLLM.
The served model name must start with `local_` so it routes to the local backend instead of a cloud provider. Use `--served-model-name` (or equivalent) to set this.
```bash
# Example with SGLang
python -m sglang.launch_server \
    --model Qwen/Qwen3.5-0.8B \
    --port 8000 \
    --served-model-name local_Qwen3.5-0.8B

# Example with vLLM
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.2-11B-Vision-Instruct \
    --port 8000 \
    --served-model-name local_Llama-3.2-11B-Vision-Instruct
```

Provide `model.api_key` and `model.api_base_url` via CLI overrides:
```bash
python scripts/evaluate_multiple_choice.py \
    model.llm_name="local_Qwen3.5-0.8B" \
    model.api_key="dummy" \
    model.api_base_url="http://localhost:8000/v1" \
    data.max_questions=3
```

Available prompt types (`model.prompt_type`):

| Type | Description |
|---|---|
| `standard` | Default two-image comparison prompt |
| `no_reasoning` | Direct answer without chain-of-thought |
| `camera_augmented` | Adds camera/viewpoint context |
| `concatenated` | Horizontally concatenates the image pair |
| `grid` | Arranges images in a grid layout |
| `overlapped` | Overlays images for comparison |
| `substract` | Shows pixel difference between images |
## Supported Models

Model routing is determined by substring matching on the model name:
| Model Name Pattern | Backend | Example |
|---|---|---|
| `gpt-4*`, `gpt-5`, `o1*`, `o3*`, `o4*` | OpenAI | `gpt-4o`, `o3`, `gpt-5` |
| `claude*` | OpenRouter | `anthropic/claude-sonnet-4` |
| `gemini*` | Google Gemini | `gemini-2.5-flash`, `gemini-2.5-pro` |
| `llava*` | vLLM Server | `llava-hf/llava-onevision-qwen2-0.5b-ov-hf` |
| `qwen*`, `internvl*` | OpenRouter | `qwen/qwen2.5-vl-72b-instruct` |
| `local_*` | Local (OpenAI-compatible) | `local_Qwen3.5-0.8B`, `local_Llama-3.2-11B` |
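The substring routing in the table above can be sketched roughly as follows. This is an illustration, not the repo's actual matching code; in particular, the route order here (with `local_` checked first) is an assumption.

```python
# Illustrative sketch of substring-based model routing.
# Route order is an assumption; local_ is checked first so local
# models never fall through to a cloud backend.
ROUTES = [
    (("local_",), "local"),
    (("gpt-4", "gpt-5", "o1", "o3", "o4"), "openai"),
    (("claude",), "openrouter"),
    (("gemini",), "google"),
    (("llava",), "vllm"),
    (("qwen", "internvl"), "openrouter"),
]

def route(model_name):
    for substrings, backend in ROUTES:
        if any(sub in model_name for sub in substrings):
            return backend
    raise ValueError(f"No backend for model: {model_name}")

print(route("gpt-4o"))              # openai
print(route("local_Qwen3.5-0.8B"))  # local
```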
## Results and Logs

Results are saved to:

```
logs/<evaluator_type>/<model>/<prompt_type>/<dataset>/<timestamp>/
├── run.log                     # Execution log
└── mc_evaluation_results.json  # Predictions, accuracy, costs
```
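The directory layout above can be composed as a simple string join. The helper below is hypothetical (the repo builds this path internally), and the example `evaluator_type` value is an assumption:

```python
# Hypothetical helper composing the results directory from the layout above.
# The "multiple_choice" evaluator_type value is an assumption.
from datetime import datetime

def results_dir(evaluator_type, model, prompt_type, dataset, timestamp=None):
    timestamp = timestamp or datetime.now().strftime("%Y%m%d_%H%M%S")
    return "/".join(["logs", evaluator_type, model, prompt_type,
                     dataset, timestamp])

print(results_dir("multiple_choice", "gpt-4o", "standard",
                  "VLM-SubtleBench", "20250101_120000"))
# logs/multiple_choice/gpt-4o/standard/VLM-SubtleBench/20250101_120000
```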