UnicEdit-10M: A Dataset and Benchmark Breaking the Scale-Quality Barrier via Unified Verification for Reasoning-Enriched Edits
With the rapid advances of powerful multimodal models such as GPT-4o, Nano Banana, and Seedream 4.0 in image editing, the performance gap between closed-source and open-source models is widening, primarily due to the scarcity of large-scale, high-quality training data and of comprehensive benchmarks capable of diagnosing model weaknesses across diverse editing behaviors. Existing data construction methods face a scale-quality trade-off: human annotation is high-quality but not scalable, while automated pipelines suffer from error propagation and noise. To address this, we introduce a lightweight data pipeline that replaces complex multi-tool chains with an end-to-end model and a unified post-verification stage. For scalable quality control, we train a 7B dual-task expert model, Qwen-Verify, for efficient failure detection and instruction recaptioning. This pipeline yields UnicEdit-10M, a 10M-scale dataset spanning diverse basic and complex editing tasks. We also propose UnicBench, a general benchmark that extends beyond basic edits to explicitly assess spatial and knowledge-driven reasoning. To enable fine-grained diagnosis, we introduce novel metrics, including Non-edit Consistency and Reasoning Accuracy. Our analysis of mainstream models on UnicBench reveals their limitations and provides clear directions for future research. The dataset, benchmark, and code will be released.
- [2026.2.8] We integrated the evaluation of LongCat-Image-Edit into the full benchmark comparison table.
- [2026.2.5] UnicEdit-10M released.
- [2025.12.2] Code and benchmark released.
- [2025.12.2] Paper released on arXiv.
- Release UnicBench evaluation code
- Release benchmark test data
- Release UnicEdit-10M dataset
- Release Qwen-Verify model
- Release data generation pipeline
- UnicEdit-10M: A 10M-scale, high-quality image editing dataset spanning diverse basic and complex editing tasks, built with a quality-aware data curation pipeline that uses unified post-verification.
- Qwen-Verify: A 7B dual-task expert model for efficient failure detection and instruction recaptioning.
- UnicBench: A comprehensive benchmark with novel metrics (Non-edit Consistency, Reasoning Accuracy) for fine-grained diagnosis.
UnicBench/
├── assets/                    # Images for README
├── data/
│   ├── prompts.py             # VLM evaluation prompts (IF, NC, VQ, RA)
│   └── test_data.jsonl        # Benchmark test data
├── eval/
│   ├── eval_pipeline.py       # Main evaluation pipeline
│   └── calculate_scores.py    # Score statistics tool
├── inference/
│   ├── gen_samples_flux.py    # Generate samples using FLUX
│   └── gen_samples_flux.sh    # Shell script for inference
└── models/                    # VLM models for evaluation
# Create conda environment
conda create -n unicbench python=3.11
conda activate unicbench
# Install dependencies
pip install -r requirements.txt

You can load the UnicEdit-10M dataset directly from Hugging Face using the `datasets` library:
from datasets import load_dataset
# Load the full dataset
dataset = load_dataset("xiaotanhua/UnicEdit-10M")
# Streaming mode (recommended for large datasets)
dataset = load_dataset("xiaotanhua/UnicEdit-10M", streaming=True)
# Access samples
for sample in dataset['train']:
    print(sample['key'])
    print(sample['prompt_en'])
    # sample['src_image'] and sample['edit_image'] are PIL Image objects
    break
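As a follow-up, here is a minimal sketch that materializes a few source/edited pairs from the streaming split to disk. The output layout is arbitrary and it assumes `key` values are filesystem-safe; adjust as needed.

```python
import os
from datasets import load_dataset

os.makedirs("unicedit_samples", exist_ok=True)
stream = load_dataset("xiaotanhua/UnicEdit-10M", streaming=True)["train"]

# Save the first few source/edited image pairs plus their English prompts.
for i, sample in enumerate(stream):
    if i >= 4:
        break
    sample["src_image"].save(f"unicedit_samples/{sample['key']}_src.png")
    sample["edit_image"].save(f"unicedit_samples/{sample['key']}_edit.png")
    with open(f"unicedit_samples/{sample['key']}.txt", "w", encoding="utf-8") as f:
        f.write(sample["prompt_en"])
```
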
You can load the UnicBench benchmark directly from Hugging Face using the `datasets` library:

from datasets import load_dataset
# Load the dataset
ds = load_dataset("xiaotanhua/UnicBench")
# Access data
print(ds['train'][0])

UnicBench consists of 1,100 samples across 4 task categories and 22 subtasks (a quick local sanity check is sketched after the table):
| Task Category | Subtasks | Samples |
|---|---|---|
| Object Editing | 7 subtasks | 350 |
| Attribute Editing | 5 subtasks | 250 |
| Scene Editing | 5 subtasks | 250 |
| Reasoning Editing | 5 subtasks | 250 |
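To verify these counts locally, the sketch below groups the benchmark by category. The column name `task_type` is a hypothetical placeholder; check `ds.column_names` for the actual category/subtask fields in the release.

```python
from collections import Counter
from datasets import load_dataset

ds = load_dataset("xiaotanhua/UnicBench", split="train")
print(len(ds), ds.column_names)  # expect 1,100 rows

# "task_type" is a hypothetical column name; replace it with the real
# category field reported by ds.column_names above.
if "task_type" in ds.column_names:
    print(Counter(ds["task_type"]))
```
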
| Metric | Description |
|---|---|
| IF (Instruction Following) | Measures how well the edit follows the given instruction |
| NC (Non-edit Consistency) | Measures consistency of non-edited regions |
| VQ (Visual Quality) | Measures visual quality and naturalness of edited images |
| RA (Reasoning Accuracy) | Measures reasoning accuracy (only for Reasoning Editing tasks) |
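For intuition only, here is one way per-sample scores on these metrics could be combined. The equal-weight mean below is an assumption, not the official formula; `eval/calculate_scores.py` defines the aggregation actually used.

```python
from statistics import mean

def overall_score(scores: dict) -> float:
    """Average the metrics present for one sample (0-10 scale).

    RA exists only for Reasoning Editing tasks, so it is included only
    when reported. This simple mean is an illustrative assumption.
    """
    keys = [k for k in ("IF", "NC", "VQ", "RA") if k in scores]
    return mean(scores[k] for k in keys)

print(overall_score({"IF": 8.0, "NC": 9.0, "VQ": 7.5}))              # basic edit
print(overall_score({"IF": 8.0, "NC": 9.0, "VQ": 7.5, "RA": 6.0}))   # reasoning edit
```
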
First, generate edited images using your image editing model. The output should be saved following this path format:
{save_dir}/{model_name}/{subtask_name}/{language}/{key}.png
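For example, a small helper (hypothetical, not part of the repo) that builds this path and creates the parent directories:

```python
import os

def output_path(save_dir: str, model_name: str, subtask_name: str,
                language: str, key: str) -> str:
    """Return {save_dir}/{model_name}/{subtask_name}/{language}/{key}.png,
    creating the parent directories if needed."""
    out_dir = os.path.join(save_dir, model_name, subtask_name, language)
    os.makedirs(out_dir, exist_ok=True)
    return os.path.join(out_dir, f"{key}.png")

# e.g. edited.save(output_path("results", "my_model", "object_removal", "en", "0001"))
```
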
We provide reference inference scripts for FLUX.1-Kontext and Qwen-Image-Edit:
bash inference/gen_samples_flux.sh # for FLUX.1-Kontext
bash inference/gen_samples_qwen.sh # for Qwen-Image-Edit
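If you prefer to script inference directly instead of using the provided shells, a minimal sketch with `diffusers` might look like the following. The model ID, sampling settings, and benchmark field names are assumptions; `inference/gen_samples_flux.py` remains the reference implementation.

```python
import torch
from datasets import load_dataset
from diffusers import FluxKontextPipeline

pipe = FluxKontextPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-Kontext-dev", torch_dtype=torch.bfloat16
).to("cuda")

ds = load_dataset("xiaotanhua/UnicBench", split="train")
sample = ds[0]
# Field names mirror the UnicEdit-10M example above ('src_image', 'prompt_en');
# verify them against ds.column_names for the benchmark release.
edited = pipe(image=sample["src_image"], prompt=sample["prompt_en"],
              guidance_scale=2.5).images[0]
edited.save("edited.png")  # in practice, save to {save_dir}/{model_name}/{subtask_name}/{language}/{key}.png
```
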
The output directory structure must follow the format below:

{save_dir}/
└── {model_name}/
    ├── {subtask_name}/{language}/                      # Edited images
    └── eval_output/{vlm_name}/
        ├── {subtask_name}_{language}_results.jsonl     # Per-sample results
        └── statistics/
            └── {language}_statistics.json              # Aggregated statistics
Use eval_pipeline.py to evaluate edited images and compute final scores. You can load data from a local JSONL file or directly from Hugging Face.
Option 1: Using Hugging Face Dataset (Recommended)
cd eval
python eval_pipeline.py \
--data_path xiaotanhua/UnicBench \
--save_dir /path/to/results \
--edit_model_name your_model_name \
--vlm_model_name gpt-4.1 \
--languages en \
    --num_workers 8

Option 2: Using Local JSONL File
cd eval
python eval_pipeline.py \
--data_path ../data/test_data.jsonl \
--image_dir /path/to/benchmark/images \
--save_dir /path/to/results \
--edit_model_name your_model_name \
--vlm_model_name gpt-4.1 \
--languages en \
    --num_workers 8

Parameters:
| Parameter | Description |
|---|---|
| `--data_path` | Path to the test data JSONL file OR a Hugging Face dataset name (e.g., `xiaotanhua/UnicBench`) |
| `--image_dir` | Directory containing the original benchmark images (required for JSONL, optional for the HF dataset) |
| `--save_dir` | Root directory to save results |
| `--edit_model_name` | Name of your editing model |
| `--vlm_model_name` | VLM model for evaluation (default: `gpt-4.1-2025-04-14`) |
| `--languages` | Languages to evaluate: `en`, `cn`, or both |
| `--num_workers` | Number of parallel workers (for API-based VLMs) |
| `--skip_evaluation` | Skip evaluation and only compute statistics |
If evaluation has already been completed and you only need the aggregated statistics, use calculate_scores.py to compute them from the per-sample evaluation results:
python calculate_scores.py \
--save_dir /path/to/results \
--edit_model_name your_model_name \
--vlm_model_name gpt-4.1 \
    --languages en cn

Evaluation results of mainstream image editing models on UnicBench:
| Model | IF (EN) | NC (EN) | VQ (EN) | RA (EN) | Overall (EN) | IF (CN) | NC (CN) | VQ (CN) | RA (CN) | Overall (CN) |
|---|---|---|---|---|---|---|---|---|---|---|
| **Open-Source Models** | | | | | | | | | | |
| Instruct-Pix2Pix | 2.8526 | 4.0983 | 3.9672 | 1.9560 | 2.9221 | - | - | - | - | - |
| MagicBrush | 2.3403 | 3.3849 | 3.4559 | 1.7240 | 2.3407 | - | - | - | - | - |
| OmniGen2 | 6.2455 | 7.4973 | 6.4891 | 5.1240 | 6.1246 | - | - | - | - | - |
| UniWorld-v1 | 5.3055 | 7.3091 | 6.4827 | 4.0160 | 5.6013 | - | - | - | - | - |
| FLUX.1-Kontext | 6.7755 | 8.4718 | 7.3600 | 5.5040 | 6.8045 | - | - | - | - | - |
| BAGEL | 7.2491 | 8.1982 | 7.1391 | 5.2600 | 6.9794 | 7.3018 | 8.2845 | 7.3118 | 5.2840 | 7.1056 |
| Step1X-Edit-v1.1 | 6.9945 | 8.2045 | 7.3382 | 5.0400 | 6.9202 | 7.0282 | 8.4118 | 7.5600 | 5.0560 | 7.0620 |
| Qwen-Image-Edit | 8.2055 | 8.0264 | 8.0745 | 6.4480 | 7.7273 | 8.3718 | 7.8000 | 8.2118 | 6.6560 | 7.7790 |
| LongCat-Image-Edit | 8.6058 | 8.8321 | 8.2774 | 7.3482 | 8.2344 | 8.6427 | 8.9109 | 8.3500 | 7.3800 | 8.2993 |
| **Closed-Source Models** | | | | | | | | | | |
| Nano Banana | 7.9753 | 8.9808 | 8.1954 | 6.8680 | 7.8792 | 8.1550 | 9.0438 | 8.3291 | 6.8960 | 8.0358 |
| SeedEdit 3.0 | 8.2717 | 8.4251 | 7.8392 | 6.9393 | 7.8671 | 8.3721 | 8.4502 | 7.9795 | 6.8395 | 7.9753 |
| Seedream 4.0 | 8.3764 | 8.7200 | 8.0736 | 7.5960 | 8.0428 | 8.3418 | 8.6600 | 8.1364 | 7.1240 | 8.0474 |
| GPT-Image-1 | 9.1551 | 7.8449 | 8.6830 | 8.3392 | 8.3546 | 9.2759 | 7.8906 | 8.6980 | 8.2247 | 8.4506 |
@article{ye2025unicedit,
title={UnicEdit-10M: A Dataset and Benchmark Breaking the Scale-Quality Barrier via Unified Verification for Reasoning-Enriched Edits},
author={Ye, Keming and Huang, Zhipeng and Fu, Canmiao and Liu, Qingyang and Cai, Jiani and Lv, Zheqi and Li, Chen and Lyu, Jing and Zhao, Zhou and Zhang, Shengyu},
journal={arXiv preprint arXiv:2512.02790},
year={2025}
}

This project is released under the Apache 2.0 License.
We thank all contributors and the open-source community for their support.


