
KRETA Benchmark

🤗 KRETA | 📖 Paper | 🏆 Leaderboard

KRETA: A Benchmark for Korean Reading and Reasoning in Text-Rich VQA Attuned to Diverse Visual Contexts (EMNLP 2025 Main Conference)
Taebaek Hwang*, Minseo Kim*, Gisang Lee, Seonuk Kim, Hyunjun Eun

Abstract

Understanding and reasoning over text within visual contexts poses a significant challenge for Vision-Language Models (VLMs), given the complexity and diversity of real-world scenarios. To address this challenge, text-rich Visual Question Answering (VQA) datasets and evaluation benchmarks have emerged for high-resource languages like English. However, a critical gap persists for low-resource languages such as Korean, where the lack of comprehensive benchmarks hinders robust model evaluation and comparison. To bridge this gap, we introduce KRETA, a benchmark for Korean Reading and rEasoning in Text-rich VQA Attuned to diverse visual contexts. KRETA facilitates an in-depth evaluation of both visual text understanding and reasoning capabilities, while also supporting a multifaceted assessment across 15 domains and 26 image types. Additionally, we introduce a semi-automated VQA generation pipeline specifically optimized for text-rich settings, leveraging refined stepwise image decomposition and a rigorous seven-metric evaluation protocol to ensure data quality. While KRETA is tailored for Korean, we hope our adaptable and extensible pipeline will facilitate the development of similar benchmarks in other languages, thereby accelerating multilingual VLM research.


(a) Distribution of samples across 15 domains (inner ring) and 26 image types (outer ring). Dark green and light green segments in the inner ring represent the number of samples associated with System 2 and System 1, respectively. (b) The semi-automated VQA generation pipeline.

Examples

(Figure: KRETA examples)

Leaderboard

| Rank | Model | Release | Type | Overall | System1 | System2 |
|------|-------|---------|------|---------|---------|---------|
| 1 | Gemini-2.0-flash | 25.02.05 | Closed | 85.4 | 98.0 | 69.8 |
| 2 | GPT-4o | 24.11.20 | Closed | 84.6 | 95.9 | 70.5 |
| 3 | Claude-3.5-Sonnet | 24.10.22 | Closed | 80.5 | 93.4 | 64.5 |
| 4 | A.X-4.0-VL-LIGHT (7B) | 25.07.31 | Open-Source | 78.0 | 95.3 | 56.5 |
| 5 | VARCO-VISION-2.0 (14B) | 25.07.16 | Open-Source | 75.4 | 93.5 | 53.1 |
| 6 | Kanana-1.5-V (3B) | 25.07.24 | Open-Source | 75.0 | 94.0 | 51.4 |
| 7 | GPT-4o-mini | 24.07.18 | Closed | 73.3 | 88.7 | 54.1 |
Full Leaderboard
| Model | Overall | System1 | System2 | Gov. | Econ. | Mktg. | Comm. | Edu. | Med. | Tech. | Arts. | Transp. | Tour. | FnB. | Ent. | Life. | Sci. | Hist. |
|-------|---------|---------|---------|------|-------|-------|-------|------|------|-------|-------|---------|-------|------|------|-------|------|-------|
| Gemini-2.0-flash (25.02.05) | 85.4 | 98.0 | 69.8 | 95.1 | 95.2 | 99.3 | 96.1 | 96.7 | 92.2 | 93.5 | 98.8 | 90.4 | 98.1 | 93.2 | 95.2 | 96.6 | 44.1 | 78.3 |
| GPT-4o (24.11.20) | 84.6 | 95.9 | 70.5 | 93.5 | 92.3 | 97.2 | 90.3 | 96.7 | 91.1 | 96.7 | 100.0 | 84.4 | 93.5 | 93.6 | 97.0 | 95.1 | 44.1 | 93.3 |
| Claude-3.5-Sonnet (24.10.22) | 80.5 | 93.4 | 64.5 | 93.5 | 91.3 | 92.4 | 87.0 | 93.0 | 91.1 | 87.0 | 91.6 | 84.4 | 94.4 | 89.8 | 92.3 | 92.2 | 37.4 | 70.0 |
| A.X-4.0-VL-LIGHT (25.07.31) | 78.0 | 95.3 | 56.5 | 90.2 | 87.5 | 91.7 | 89.6 | 94.0 | 88.9 | 87.0 | 92.8 | 82.0 | 94.4 | 86.0 | 86.3 | 86.3 | 33.9 | 63.3 |
| VARCO-VISION-2.0 (14B) (25.07.16) | 75.4 | 93.5 | 53.1 | 90.6 | 94.2 | 88.3 | 88.3 | 90.7 | 90.0 | 88.0 | 89.2 | 79.0 | 87.0 | 83.3 | 87.5 | 92.2 | 26.8 | 33.3 |
| KANANA-1.5-V (3B) (25.07.24) | 75.0 | 94.0 | 51.4 | 86.5 | 81.7 | 94.5 | 84.4 | 87.9 | 80.0 | 80.4 | 92.8 | 77.3 | 93.5 | 89.4 | 85.1 | 86.8 | 29.7 | 48.3 |
| GPT-4o-mini (24.07.18) | 73.3 | 88.7 | 54.1 | 82.4 | 82.7 | 85.5 | 84.4 | 87.4 | 83.3 | 80.4 | 89.2 | 80.2 | 84.3 | 81.4 | 86.3 | 87.3 | 30.3 | 45.0 |
| VARCO-VISION (14B) | 72.3 | 90.9 | 49.3 | 81.6 | 87.5 | 83.4 | 83.1 | 84.2 | 86.7 | 84.8 | 79.5 | 82.6 | 83.3 | 76.1 | 81.5 | 85.3 | 33.7 | 31.7 |
| Qwen2.5-VL (3B) | 71.8 | 94.2 | 43.9 | 81.6 | 76.9 | 85.5 | 77.9 | 87.4 | 80.0 | 79.3 | 85.5 | 75.4 | 84.3 | 76.9 | 87.5 | 83.3 | 33.9 | 36.7 |
| InternVL2.5 (8B) | 70.8 | 89.8 | 47.3 | 81.6 | 76.9 | 85.5 | 81.8 | 83.7 | 81.1 | 77.2 | 78.3 | 76.0 | 83.3 | 74.2 | 78.6 | 85.8 | 34.1 | 38.3 |
| InternVL2.5 (4B) | 70.7 | 90.7 | 45.9 | 82.0 | 76.9 | 87.6 | 83.1 | 83.7 | 78.9 | 79.3 | 79.5 | 75.4 | 77.8 | 69.3 | 81.0 | 86.3 | 33.9 | 46.7 |
| Qwen2.5-VL (7B) | 68.5 | 94.5 | 36.1 | 80.0 | 77.9 | 85.5 | 81.2 | 87.4 | 76.7 | 75.0 | 89.2 | 77.8 | 82.4 | 77.7 | 86.3 | 85.8 | 15.1 | 36.7 |
| MiniCPM-o-2.6 (8B) | 64.3 | 84.1 | 39.9 | 75.9 | 83.7 | 79.3 | 75.9 | 76.7 | 65.6 | 75.0 | 73.5 | 69.5 | 79.6 | 67.8 | 77.4 | 74.0 | 25.5 | 25.0 |
| Ovis1.6-Gemma2 (9B) | 58.4 | 68.9 | 45.4 | 64.1 | 69.2 | 71.0 | 72.7 | 60.9 | 71.1 | 67.4 | 53.0 | 68.9 | 75.9 | 65.2 | 58.9 | 63.2 | 30.5 | 28.3 |
| LLaVA-OneVision (7B) | 54.0 | 65.1 | 40.1 | 64.1 | 63.5 | 63.4 | 63.6 | 58.6 | 55.6 | 64.1 | 45.8 | 68.3 | 65.7 | 55.3 | 55.4 | 55.9 | 30.8 | 33.3 |
| Deepseek-VL2-small (2.8B) | 53.3 | 67.3 | 36.1 | 61.6 | 63.5 | 66.9 | 63.0 | 57.2 | 64.4 | 68.5 | 50.6 | 59.9 | 63.0 | 48.9 | 56.0 | 57.4 | 30.8 | 36.7 |
| Ovis1.6-Llama3.2 (3B) | 52.2 | 62.8 | 39.1 | 64.5 | 69.2 | 60.7 | 57.1 | 55.8 | 54.4 | 62.0 | 51.8 | 60.5 | 61.1 | 56.8 | 52.4 | 49.5 | 30.5 | 31.7 |
| Deepseek-VL2-tiny (1B) | 48.8 | 60.8 | 34.0 | 57.1 | 55.8 | 63.4 | 58.4 | 51.2 | 57.8 | 57.6 | 45.8 | 54.5 | 58.3 | 43.9 | 47.0 | 54.4 | 30.5 | 31.7 |
| Phi-3.5-Vision (4.2B) | 42.6 | 52.2 | 30.8 | 53.5 | 55.8 | 40.0 | 49.4 | 43.3 | 40.0 | 53.3 | 50.6 | 44.3 | 46.3 | 42.8 | 43.5 | 44.6 | 27.6 | 36.7 |
| LLaVA-OneVision (0.5B) | 42.3 | 49.6 | 33.3 | 51.8 | 48.1 | 47.6 | 44.8 | 39.5 | 50.0 | 44.6 | 40.9 | 49.7 | 51.9 | 41.7 | 44.6 | 46.1 | 28.0 | 31.7 |
| MiniCPM-V-2.6 (8B) | 41.0 | 50.4 | 29.4 | 50.2 | 54.8 | 50.3 | 53.2 | 44.7 | 41.1 | 52.2 | 33.7 | 43.7 | 48.1 | 43.6 | 45.8 | 46.1 | 18.2 | 25.0 |

Settings

make setup # default: GPU=0 (installs paddlepaddle CPU version); for GPU OCR, run: make setup GPU=1
make help # print manual

Environment (.env)

Create a .env file at the project root.

# Set only what you need
OPENAI_API_KEY=<your API key>
GOOGLE_API_KEY=<your API key>
CLAUDE_API_KEY=<your API key>

Text-Rich VQA Generation

Before running, prepare input images:

  • Create data/images and place your images there (default INPUT_DIR), or set INPUT_DIR to your custom folder.
make filter   # 1) filter out low-quality images with the OCR model
make generate # 2) automatically generate VQA with the 4-stage pipeline (options: INPUT_DIR)
make editor   # 3) refine the generated VQA in the Streamlit-based editor (options: INPUT_DIR, OUTPUT_DIR, SAVE_BATCH)
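
The listed Makefile variables can be overridden on the command line in the usual make fashion. A rough sketch (the folder names and batch size below are placeholders; the exact variable handling is defined in the Makefile):

make filter   INPUT_DIR=data/my_images                                       # filter a custom image folder
make generate INPUT_DIR=data/my_images                                       # generate VQA for the same folder
make editor   INPUT_DIR=data/my_images OUTPUT_DIR=data/vqa SAVE_BATCH=10     # review and refine the results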

Evaluation

The eval folder contains the inference and evaluation scripts for KRETA.

  1. infer_xxx.py: For model inference
  2. evaluate.py: For evaluating inference results

1. Model Inference

Each inference script loads the specified model and runs inference on the benchmark. To run a script, use the following commands:

cd eval
python infer/infer_gpt.py [MODEL_NAME] [SETTING]
  • [MODEL_NAME]: Specify the model's name (e.g., gpt-4o-mini, gpt-4o-mini-2024-07-18, etc.).
  • [SETTING]: Specify the prompt setting (e.g., default, direct).

Example:

python infer/infer_gpt.py gpt-4o-mini default
python infer/infer_hf_vlm.py kakaocorp/kanana-1.5-v-3b-instruct default
python infer/infer_hf_vlm.py NCSOFT/VARCO-VISION-2.0-14B default
python infer/infer_hf_vlm.py skt/A.X-4.0-VL-Light default
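
To run several open-source checkpoints in one pass, a plain shell loop over the same invocations works; this is only a convenience sketch reusing the exact commands shown above:

for model in kakaocorp/kanana-1.5-v-3b-instruct NCSOFT/VARCO-VISION-2.0-14B skt/A.X-4.0-VL-Light; do
    python infer/infer_hf_vlm.py "$model" default   # same 'default' prompt setting as in the examples
done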

2. Evaluation

This script evaluates the results generated from the inference step. To run the evaluation, use the following command:

cd eval
python evaluate.py

Once executed, the script will:

  • Load the inference results from the ./output directory.
  • Generate and display the evaluation report in the console.
  • Save the evaluation report to the ./output directory.
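
Putting both steps together, a typical run looks roughly like this (model name and setting taken from the earlier example; the ./output layout is as described above):

cd eval
python infer/infer_gpt.py gpt-4o-mini default   # writes inference results to ./output
python evaluate.py                              # prints the report and saves it to ./output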

Acknowledgement

  • MMMU-Pro: We would like to thank the authors for providing the codebase that our work builds upon.
  • This work was supported by the Korea Institute for Advancement of Technology (KIAT) grant funded by the Ministry of Education, Korea Government, by Seoul National University (Semiconductor-Specialized University), Waddle, and AttentionX.

If you find KRETA useful for your research and applications, please cite using this BibTeX:

@article{hwang2025kreta,
  title={KRETA: A Benchmark for Korean Reading and Reasoning in Text-Rich VQA Attuned to Diverse Visual Contexts},
  author={Hwang, Taebaek and Kim, Minseo and Lee, Gisang and Kim, Seonuk and Eun, Hyunjun},
  journal={arXiv preprint arXiv:2508.19944},
  year={2025}
}
