UnicEdit-10M: A Dataset and Benchmark Breaking the Scale-Quality Barrier via Unified Verification for Reasoning-Enriched Edits

πŸ“Œ Abstract

With the rapid advances of powerful multimodal models such as GPT-4o, Nano Banana, and Seedream 4.0 in image editing, the performance gap between closed-source and open-source models is widening, primarily due to the scarcity of large-scale, high-quality training data and of comprehensive benchmarks capable of diagnosing model weaknesses across diverse editing behaviors. Existing data construction methods face a scale-quality trade-off: human annotations are high-quality but not scalable, while automated pipelines suffer from error propagation and noise. To address this, we introduce a lightweight data pipeline that replaces multi-tool chains with an end-to-end model and a unified post-verification stage. For scalable quality control, we train a 7B dual-task expert model, Qwen-Verify, for efficient failure detection and instruction recaptioning. This pipeline yields UnicEdit-10M, a 10M-scale dataset spanning diverse basic and complex editing tasks. We also propose UnicBench, a general benchmark that extends beyond basic edits to explicitly assess spatial and knowledge-driven reasoning. To enable fine-grained diagnosis, we introduce novel metrics, including Non-edit Consistency and Reasoning Accuracy. Our analysis of mainstream models on UnicBench reveals their limitations and provides clear directions for future research. The dataset, benchmark, and code will be released.

πŸ”₯ News

  • [2026.2.8] We integrated the evaluation of LongCat-Image-Edit into the full benchmark comparison table.
  • [2026.2.5] UnicEdit-10M released.
  • [2025.12.2] Code and benchmark released.
  • [2025.12.2] Paper released on arXiv.

βœ… TODO

  • Release UnicBench evaluation code
  • Release benchmark test data
  • Release UnicEdit-10M dataset
  • Release Qwen-Verify model
  • Release data generation pipeline

🎯 Highlights

  • UnicEdit-10M: A quality-aware data curation pipeline with unified post-verification and a 10M-scale high-quality image editing dataset with diverse basic and complex editing tasks.
  • Qwen-Verify: A 7B dual-task expert model for efficient failure detection and instruction recaptioning.
  • UnicBench: A comprehensive benchmark with novel metrics (Non-edit Consistency, Reasoning Accuracy) for fine-grained diagnosis.

πŸ“Š Data Pipeline

πŸ–ΌοΈ Dataset Showcases

πŸ“ Project Structure

UnicBench/
├── assets/                 # Images for README
├── data/
│   ├── prompts.py          # VLM evaluation prompts (IF, NC, VQ, RA)
│   └── test_data.jsonl     # Benchmark test data
├── eval/
│   ├── eval_pipeline.py    # Main evaluation pipeline
│   └── calculate_scores.py # Score statistics tool
├── inference/
│   ├── gen_samples_flux.py # Generate samples using FLUX
│   └── gen_samples_flux.sh # Shell script for inference
└── models/                 # VLM models for evaluation

πŸ› οΈ Installation

# Create conda environment
conda create -n unicbench python=3.11
conda activate unicbench

# Install dependencies
pip install -r requirements.txt

πŸ“₯ Dataset

UnicEdit-10M Dataset

You can load the UnicEdit-10M dataset directly from Hugging Face using the datasets library:

from datasets import load_dataset

# Load the full dataset
dataset = load_dataset("xiaotanhua/UnicEdit-10M")

# Streaming mode (recommended for large datasets)
dataset = load_dataset("xiaotanhua/UnicEdit-10M", streaming=True)

# Access samples
for sample in dataset['train']:
    print(sample['key'])
    print(sample['prompt_en'])
    # sample['src_image'] and sample['edit_image'] are PIL Image objects
    break
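
As a quick sanity check, the sketch below streams a few samples and writes each source/edited image pair to disk. It relies only on the fields shown above (key, prompt_en, src_image, edit_image); the output directory name is arbitrary.

import os

from datasets import load_dataset

# Stream so the full 10M-scale dataset is not downloaded up front.
dataset = load_dataset("xiaotanhua/UnicEdit-10M", streaming=True)

out_dir = "unicedit_preview"  # arbitrary local folder for inspection
os.makedirs(out_dir, exist_ok=True)

for i, sample in enumerate(dataset['train']):
    if i >= 5:  # look at the first few samples only
        break
    key = sample['key']
    # src_image / edit_image are PIL Image objects, as noted above.
    sample['src_image'].save(os.path.join(out_dir, f"{key}_src.png"))
    sample['edit_image'].save(os.path.join(out_dir, f"{key}_edit.png"))
    print(key, sample['prompt_en'])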

UnicBench Benchmark

You can load the UnicBench benchmark directly from Hugging Face using the datasets library:

from datasets import load_dataset

# Load the dataset
ds = load_dataset("xiaotanhua/UnicBench")

# Access data
print(ds['train'][0])
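
To get a quick feel for the benchmark composition (detailed in the next section), you can count samples per subtask. The field name "subtask" below is an assumption about the schema; inspect ds['train'][0] to confirm the actual key before relying on it.

from collections import Counter

from datasets import load_dataset

ds = load_dataset("xiaotanhua/UnicBench")

# "subtask" is an assumed field name -- check ds['train'][0] for the real schema.
counts = Counter(sample.get("subtask", "unknown") for sample in ds['train'])
for subtask, n in sorted(counts.items()):
    print(f"{subtask}: {n}")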

πŸ“ UnicBench

Benchmark Overview

UnicBench consists of 1,100 samples across 4 task categories and 22 subtasks:

| Task Category | Subtasks | Samples |
|---|---|---|
| Object Editing | 7 subtasks | 350 |
| Attribute Editing | 5 subtasks | 250 |
| Scene Editing | 5 subtasks | 250 |
| Reasoning Editing | 5 subtasks | 250 |

Evaluation Metrics

| Metric | Description |
|---|---|
| IF (Instruction Following) | Measures how well the edit follows the given instruction |
| NC (Non-edit Consistency) | Measures consistency of the non-edited regions with the source image |
| VQ (Visual Quality) | Measures visual quality and naturalness of the edited image |
| RA (Reasoning Accuracy) | Measures reasoning accuracy (Reasoning Editing tasks only) |
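
IF, NC, and VQ are computed for every sample, while RA is produced only for Reasoning Editing tasks. The official aggregation is implemented in eval/calculate_scores.py; the snippet below is only an illustrative sketch that averages whatever metrics are present for a sample, and the real weighting may differ.

# Illustrative only: average whichever metric scores a sample has.
# The official aggregation lives in eval/calculate_scores.py and may differ.
def sample_score(scores: dict) -> float:
    keys = ["IF", "NC", "VQ"]
    if "RA" in scores:  # RA exists only for Reasoning Editing samples
        keys.append("RA")
    return sum(scores[k] for k in keys) / len(keys)

print(sample_score({"IF": 8.0, "NC": 9.0, "VQ": 7.5}))             # basic edit
print(sample_score({"IF": 8.0, "NC": 9.0, "VQ": 7.5, "RA": 6.0}))  # reasoning edit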

πŸš€ Usage

1. Generate Edited Images

First, generate edited images using your image editing model. The output should be saved following this path format:

{save_dir}/{model_name}/{subtask_name}/{language}/{key}.png
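
The sketch below shows one way to construct these paths when writing out your model's results. The subtask name and key should come from the benchmark data loaded above; the example values are placeholders.

import os

def output_path(save_dir, model_name, subtask_name, language, key):
    """Build {save_dir}/{model_name}/{subtask_name}/{language}/{key}.png and create its folder."""
    path = os.path.join(save_dir, model_name, subtask_name, language, f"{key}.png")
    os.makedirs(os.path.dirname(path), exist_ok=True)
    return path

# Placeholder values for illustration:
print(output_path("/path/to/results", "your_model_name", "some_subtask", "en", "000001"))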

We provide reference inference scripts for FLUX.1-Kontext and Qwen-Image-Edit:

bash inference/gen_samples_flux.sh  # for FLUX.1-Kontext
bash inference/gen_samples_qwen.sh  # for Qwen-Image-Edit

The output directory structure must follow the format below:

{save_dir}/
└── {model_name}/
    ├── {subtask_name}/{language}/      # Edited images
    └── eval_output/{vlm_name}/
        ├── {subtask_name}_{language}_results.jsonl  # Per-sample results
        └── statistics/
            └── {language}_statistics.json           # Aggregated statistics

2. Run Evaluation

Use eval_pipeline.py to evaluate edited images and compute final scores. You can load data from a local JSONL file or directly from Hugging Face.

Option 1: Using Hugging Face Dataset (Recommended)

cd eval

python eval_pipeline.py \
    --data_path xiaotanhua/UnicBench \
    --save_dir /path/to/results \
    --edit_model_name your_model_name \
    --vlm_model_name gpt-4.1 \
    --languages en \
    --num_workers 8

Option 2: Using Local JSONL File

cd eval

python eval_pipeline.py \
    --data_path ../data/test_data.jsonl \
    --image_dir /path/to/benchmark/images \
    --save_dir /path/to/results \
    --edit_model_name your_model_name \
    --vlm_model_name gpt-4.1 \
    --languages en \
    --num_workers 8

Parameters:

| Parameter | Description |
|---|---|
| --data_path | Path to the test data JSONL file OR a Hugging Face dataset name (e.g., xiaotanhua/UnicBench) |
| --image_dir | Directory containing the original benchmark images (required for JSONL, optional for the HF dataset) |
| --save_dir | Root directory for saving results |
| --edit_model_name | Name of your editing model |
| --vlm_model_name | VLM used for evaluation (default: gpt-4.1-2025-04-14) |
| --languages | Languages to evaluate: en, cn, or both |
| --num_workers | Number of parallel workers (for API-based VLMs) |
| --skip_evaluation | Skip evaluation and only compute statistics |

3. Calculate Statistics (Optional)

If evaluation has already been completed, use calculate_scores.py to aggregate score statistics from the per-sample evaluation results:

python calculate_scores.py \
    --save_dir /path/to/results \
    --edit_model_name your_model_name \
    --vlm_model_name gpt-4.1 \
    --languages en cn
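
The aggregated scores are written to {save_dir}/{model_name}/eval_output/{vlm_name}/statistics/{language}_statistics.json (see the directory layout above). Below is a minimal sketch for loading that file; its exact keys are produced by calculate_scores.py and are not documented here, so it is simply pretty-printed.

import json
import os

# Example values matching the commands above.
save_dir, model_name, vlm_name, language = "/path/to/results", "your_model_name", "gpt-4.1", "en"

stats_path = os.path.join(save_dir, model_name, "eval_output", vlm_name,
                          "statistics", f"{language}_statistics.json")

with open(stats_path) as f:
    stats = json.load(f)

print(json.dumps(stats, indent=2, ensure_ascii=False))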

πŸ“ˆ Benchmark Results

Evaluation results of mainstream image editing models on UnicBench:

| Model | IF (EN) | NC (EN) | VQ (EN) | RA (EN) | Overall (EN) | IF (CN) | NC (CN) | VQ (CN) | RA (CN) | Overall (CN) |
|---|---|---|---|---|---|---|---|---|---|---|
| **Open-source models** | | | | | | | | | | |
| Instruct-Pix2Pix | 2.8526 | 4.0983 | 3.9672 | 1.9560 | 2.9221 | - | - | - | - | - |
| MagicBrush | 2.3403 | 3.3849 | 3.4559 | 1.7240 | 2.3407 | - | - | - | - | - |
| OmniGen2 | 6.2455 | 7.4973 | 6.4891 | 5.1240 | 6.1246 | - | - | - | - | - |
| UniWorld-v1 | 5.3055 | 7.3091 | 6.4827 | 4.0160 | 5.6013 | - | - | - | - | - |
| FLUX.1-Kontext | 6.7755 | 8.4718 | 7.3600 | 5.5040 | 6.8045 | - | - | - | - | - |
| BAGEL | 7.2491 | 8.1982 | 7.1391 | 5.2600 | 6.9794 | 7.3018 | 8.2845 | 7.3118 | 5.2840 | 7.1056 |
| Step1X-Edit-v1.1 | 6.9945 | 8.2045 | 7.3382 | 5.0400 | 6.9202 | 7.0282 | 8.4118 | 7.5600 | 5.0560 | 7.0620 |
| Qwen-Image-Edit | 8.2055 | 8.0264 | 8.0745 | 6.4480 | 7.7273 | 8.3718 | 7.8000 | 8.2118 | 6.6560 | 7.7790 |
| LongCat-Image-Edit | 8.6058 | 8.8321 | 8.2774 | 7.3482 | 8.2344 | 8.6427 | 8.9109 | 8.3500 | 7.3800 | 8.2993 |
| **Closed-source models** | | | | | | | | | | |
| Nano Banana | 7.9753 | 8.9808 | 8.1954 | 6.8680 | 7.8792 | 8.1550 | 9.0438 | 8.3291 | 6.8960 | 8.0358 |
| Seedit 3.0 | 8.2717 | 8.4251 | 7.8392 | 6.9393 | 7.8671 | 8.3721 | 8.4502 | 7.9795 | 6.8395 | 7.9753 |
| Seedream 4.0 | 8.3764 | 8.7200 | 8.0736 | 7.5960 | 8.0428 | 8.3418 | 8.6600 | 8.1364 | 7.1240 | 8.0474 |
| GPT-Image-1 | 9.1551 | 7.8449 | 8.6830 | 8.3392 | 8.3546 | 9.2759 | 7.8906 | 8.6980 | 8.2247 | 8.4506 |

πŸ“œ Citation

@article{ye2025unicedit,
  title={UnicEdit-10M: A Dataset and Benchmark Breaking the Scale-Quality Barrier via Unified Verification for Reasoning-Enriched Edits},
  author={Ye, Keming and Huang, Zhipeng and Fu, Canmiao and Liu, Qingyang and Cai, Jiani and Lv, Zheqi and Li, Chen and Lyu, Jing and Zhao, Zhou and Zhang, Shengyu},
  journal={arXiv preprint arXiv:2512.02790},
  year={2025}
}

πŸ“„ License

This project is released under the Apache 2.0 License.

πŸ™ Acknowledgements

We thank all contributors and the open-source community for their support.
