
VLM-SubtleBench: How Far Are VLMs from Human-Level Subtle Comparative Reasoning?

arXiv Dataset

Overview

VLM-SubtleBench is a benchmark for evaluating subtle comparative reasoning in vision-language models (VLMs): the ability to identify fine-grained differences between visually similar image pairs. Unlike prior comparison benchmarks that emphasize large and salient changes, VLM-SubtleBench focuses on nuanced variations that are often critical in real-world settings.

The benchmark covers 10 difference types (Attribute, State, Emotion, Temporal, Spatial, Existence, Quantity, Quality, Viewpoint, and Action) and spans diverse domains including natural, industrial, medical, aerial, and synthetic imagery. It supports both multiple-choice and free-form evaluation, enabling systematic analysis of how current VLMs perform across difference types and domains.

Table of Contents

  • Overview
  • Environment Setup
  • Dataset
  • API Keys
  • Running Evaluations
  • Using a Local Model
  • Available Prompt Types
  • Supported Models
  • Results and Logs

Environment Setup

Requires Python >= 3.8.

# Install core dependencies
pip install -r requirements.txt

# Install the package in editable mode
pip install -e .

# Install optional dependencies as needed
pip install anthropic          # Anthropic Claude API
pip install google-genai       # Google Gemini API
pip install transformers torch # Local model support
pip install matplotlib seaborn pandas  # Analysis/visualization

Dataset

Download the dataset from Hugging Face.

By default, the code expects the dataset at VLM-SubtleBench/ in the project root. You can either:

  • Place (or symlink) the downloaded dataset as VLM-SubtleBench/ in the project root, or
  • Override the path via CLI:
python scripts/evaluate_multiple_choice.py data.dataset_path="/path/to/your/dataset"
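For the symlink option, a one-liner like the following works (the source path here is just an example; use wherever you downloaded the dataset):

```shell
# Link a dataset downloaded elsewhere into the location the code expects
ln -s ~/datasets/VLM-SubtleBench ./VLM-SubtleBench
```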

Filtering

Items can be filtered by:

  • Split: test (default), val, or null (all)
  • Category: action, attribute, emotion, existence, quality, quantity, spatial, state, temporal, viewpoint
  • Domain: natural, industrial, medical, aerial, synthetic

Free-form evaluation uses only items where has_caption == true.
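These filters amount to simple predicates over dataset items. As an illustration, the following sketch assumes items carry split, category, domain, and has_caption fields as described above; the repo's actual loader may differ:

```python
def filter_items(items, split="test", category=None, domain=None, require_caption=False):
    """Keep items matching the filters; None means 'no filtering', mirroring null in the CLI."""
    def keep(item):
        if split is not None and item["split"] != split:
            return False
        if category is not None and item["category"] != category:
            return False
        if domain is not None and item["domain"] != domain:
            return False
        # Free-form evaluation additionally requires a reference caption
        if require_caption and not item.get("has_caption", False):
            return False
        return True

    return [item for item in items if keep(item)]
```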

API Keys

API keys are loaded from files in the keys/ directory. Create the following structure and add your keys:

keys/
├── openai-key/
│   └── key.env           # Line 1: API key, Line 2 (optional): org key
├── anthropic-key/
│   └── key.env           # Single line: API key
├── google-key/
│   └── gemini_gcp.json   # GCP service account JSON
└── openrouter-key/
    └── key.env           # Single line: API key

You only need to set up keys for the backends you plan to use. For example, if you only use GPT-4o, you only need keys/openai-key/key.env.

Example — setting up an OpenAI key:

mkdir -p keys/openai-key
echo "sk-your-api-key-here" > keys/openai-key/key.env
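Reading a key file in this layout can be sketched as follows (a hypothetical helper for illustration, not the repo's actual loader):

```python
from pathlib import Path

def read_openai_key(path="keys/openai-key/key.env"):
    """Return (api_key, org_key): line 1 is the API key, optional line 2 the org key."""
    lines = Path(path).read_text().splitlines()
    api_key = lines[0].strip()
    org_key = lines[1].strip() if len(lines) > 1 and lines[1].strip() else None
    return api_key, org_key
```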

Running Evaluations

Configuration uses YAML files in configs/ with CLI overrides via key.subkey=value syntax. All options shown below are optional and have sensible defaults.
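A config file in configs/ then has keys mirroring the dotted override paths. A hypothetical sketch (the actual file names and defaults live in the repo):

```yaml
# Hypothetical sketch of a configs/ file; keys mirror the CLI override paths
model:
  llm_name: gpt-4o
  prompt_type: standard
  use_multithreading: true
  max_workers: 8
data:
  dataset_path: VLM-SubtleBench
  split: test
  category: null
  domain: null
```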

Multiple-Choice Evaluation

python scripts/evaluate_multiple_choice.py \
  model.llm_name="gpt-4o" \
  model.prompt_type="standard" \
  model.use_multithreading=true \
  model.max_workers=8 \
  data.max_questions=100 \
  data.split="test" \
  data.category="attribute" \
  data.domain="natural"

Options:

  • model.llm_name: model to evaluate (default: gpt-4o)
  • model.prompt_type: prompt template (default: standard)
  • model.use_multithreading: enable concurrent API calls (default: true)
  • model.max_workers: number of threads (default: 8)
  • data.max_questions: limit number of questions; null = all (default: null)
  • data.split: data split, "test", "val", or null for all (default: test)
  • data.category: filter by category; null = all (default: null)
  • data.domain: filter by domain; null = all (default: null)

Free-Form Evaluation

Dataset mode — evaluates all captioned items:

python scripts/evaluate_free_form.py \
  data.mode="dataset" \
  model.llm_name="gpt-4o" \
  model.use_multithreading=true \
  model.max_workers=8 \
  data.max_pairs=50 \
  data.split="test" \
  data.category="state" \
  data.domain="natural"

Options:

  • data.mode: evaluation mode (default: dataset)
  • model.llm_name: model to evaluate (default: gpt-4o)
  • model.use_multithreading: enable concurrent API calls (default: true)
  • model.max_workers: number of threads (default: 8)
  • data.max_pairs: limit number of pairs; null = all (default: null)
  • data.split: data split, "test", "val", or null for all (default: test)
  • data.category: filter by category; null = all (default: null)
  • data.domain: filter by domain; null = all (default: null)

Pair mode — evaluate a specific image pair:

python scripts/evaluate_free_form.py \
  data.mode="pair" \
  data.first_image="path/to/image1.png" \
  data.second_image="path/to/image2.png"

Using a Local Model

You can use any model served via an OpenAI-compatible API. This works with serving frameworks such as:

  • SGLang (python -m sglang.launch_server)
  • vLLM (python -m vllm.entrypoints.openai.api_server)

Step 1: Serve your model

The served model name must start with local_ so it routes to the local backend instead of a cloud provider. Use --served-model-name (or equivalent) to set this.

# Example with SGLang
python -m sglang.launch_server \
  --model Qwen/Qwen3.5-0.8B \
  --port 8000 \
  --served-model-name local_Qwen3.5-0.8B

# Example with vLLM
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.2-11B-Vision-Instruct \
  --port 8000 \
  --served-model-name local_Llama-3.2-11B-Vision-Instruct

Step 2: Run evaluation pointing to your server

Provide model.api_key and model.api_base_url via CLI overrides.

python scripts/evaluate_multiple_choice.py \
  model.llm_name="local_Qwen3.5-0.8B" \
  model.api_key="dummy" \
  model.api_base_url="http://localhost:8000/v1" \
  data.max_questions=3

Available Prompt Types

| Type             | Description                              |
|------------------|------------------------------------------|
| standard         | Default two-image comparison prompt      |
| no_reasoning     | Direct answer without chain-of-thought   |
| camera_augmented | Adds camera/viewpoint context            |
| concatenated     | Horizontally concatenates image pair     |
| grid             | Arranges images in a grid layout         |
| overlapped       | Overlays images for comparison           |
| substract        | Shows pixel difference between images    |

Supported Models

Model routing is determined by substring matching on the model name:

| Model Name Pattern           | Backend                   | Example                                   |
|------------------------------|---------------------------|-------------------------------------------|
| gpt-4*, gpt-5, o1*, o3*, o4* | OpenAI                    | gpt-4o, o3, gpt-5                         |
| claude*                      | OpenRouter                | anthropic/claude-sonnet-4                 |
| gemini*                      | Google Gemini             | gemini-2.5-flash, gemini-2.5-pro          |
| llava*                       | vLLM Server               | llava-hf/llava-onevision-qwen2-0.5b-ov-hf |
| qwen*, internvl*             | OpenRouter                | qwen/qwen2.5-vl-72b-instruct              |
| local_*                      | Local (OpenAI-compatible) | local_Qwen3.5-0.8B, local_Llama-3.2-11B   |
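The substring routing above can be sketched as follows (a hypothetical helper for illustration; the repo's actual dispatch logic may differ in details such as match order):

```python
def route_backend(model_name: str) -> str:
    """Map a model name to a backend label via substring matching, per the table above."""
    name = model_name.lower()
    if name.startswith("local_"):
        return "local"  # any OpenAI-compatible local server
    if "claude" in name:
        return "openrouter"
    if "gemini" in name:
        return "google"
    if "llava" in name:
        return "vllm"
    if "qwen" in name or "internvl" in name:
        return "openrouter"
    if any(p in name for p in ("gpt-4", "gpt-5", "o1", "o3", "o4")):
        return "openai"
    raise ValueError(f"no backend matched model name: {model_name}")
```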

Results and Logs

Results are saved to:

logs/<evaluator_type>/<model>/<prompt_type>/<dataset>/<timestamp>/
├── run.log                         # Execution log
└── mc_evaluation_results.json      # Predictions, accuracy, costs
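Given this layout, the newest results file can be located with a glob whose depth matches the tree above (a sketch, sorting runs by file modification time):

```python
from pathlib import Path

def latest_results(logs_dir="logs"):
    """Return the most recent mc_evaluation_results.json under logs/, or None if none exist."""
    # Five wildcard levels: evaluator_type / model / prompt_type / dataset / timestamp
    runs = sorted(
        Path(logs_dir).glob("*/*/*/*/*/mc_evaluation_results.json"),
        key=lambda p: p.stat().st_mtime,
    )
    return runs[-1] if runs else None
```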
