🤗 KRETA | 📖 Paper | 🏆 Leaderboard
KRETA: A Benchmark for Korean Reading and Reasoning in Text-Rich VQA Attuned to Diverse Visual Contexts (EMNLP 2025 Main Conference)
Taebaek Hwang*, Minseo Kim*, Gisang Lee,
Seonuk Kim, Hyunjun Eun
Understanding and reasoning over text within visual contexts poses a significant challenge for Vision-Language Models (VLMs), given the complexity and diversity of real-world scenarios. To address this challenge, text-rich Visual Question Answering (VQA) datasets and evaluation benchmarks have emerged for high-resource languages like English. However, a critical gap persists for low-resource languages such as Korean, where the lack of comprehensive benchmarks hinders robust model evaluation and comparison. To bridge this gap, we introduce KRETA, a benchmark for Korean Reading and rEasoning in Text-rich VQA Attuned to diverse visual contexts. KRETA facilitates an in-depth evaluation of both visual text understanding and reasoning capabilities, while also supporting a multifaceted assessment across 15 domains and 26 image types. Additionally, we introduce a semi-automated VQA generation pipeline specifically optimized for text-rich settings, leveraging refined stepwise image decomposition and a rigorous seven-metric evaluation protocol to ensure data quality. While KRETA is tailored for Korean, we hope our adaptable and extensible pipeline will facilitate the development of similar benchmarks in other languages, thereby accelerating multilingual VLM research.
(a) Distribution of samples across 15 domains (inner ring) and 26 image types (outer ring). Dark green and light green segments in the inner ring represent the number of samples associated with System 2 and System 1, respectively. (b) The semi-automated VQA generation pipeline.
| Rank | Model | Release | Type | Overall | System1 | System2 |
|---|---|---|---|---|---|---|
| 1 | Gemini-2.0-flash | 25.02.05 | Closed | 85.4 | 98.0 | 69.8 |
| 2 | GPT-4o | 24.11.20 | Closed | 84.6 | 95.9 | 70.5 |
| 3 | Claude-3.5-Sonnet | 24.10.22 | Closed | 80.5 | 93.4 | 64.5 |
| 4 | A.X-4.0-VL-LIGHT (7B) | 25.07.31 | Open-Source | 78.0 | 95.3 | 56.5 |
| 5 | VARCO-VISION-2.0 (14B) | 25.07.16 | Open-Source | 75.4 | 93.5 | 53.1 |
| 6 | Kanana-1.5-V (3B) | 25.07.24 | Open-Source | 75.0 | 94.0 | 51.4 |
| 7 | GPT-4o-mini | 24.07.18 | Closed | 73.3 | 88.7 | 54.1 |
Full Leaderboard (click to expand)
| Models | Open-Source | Overall | System1 | System2 | Gov. | Econ. | Mktg. | Comm. | Edu. | Med. | Tech. | Arts. | Transp. | Tour. | FnB. | Ent. | Life. | Sci. | Hist. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Gemini-2.0-flash (25.02.05) | ✘ | 85.4 | 98.0 | 69.8 | 95.1 | 95.2 | 99.3 | 96.1 | 96.7 | 92.2 | 93.5 | 98.8 | 90.4 | 98.1 | 93.2 | 95.2 | 96.6 | 44.1 | 78.3 |
| GPT-4o (24.11.20) | ✘ | 84.6 | 95.9 | 70.5 | 93.5 | 92.3 | 97.2 | 90.3 | 96.7 | 91.1 | 96.7 | 100.0 | 84.4 | 93.5 | 93.6 | 97.0 | 95.1 | 44.1 | 93.3 |
| Claude-3.5-Sonnet (24.10.22) | ✘ | 80.5 | 93.4 | 64.5 | 93.5 | 91.3 | 92.4 | 87.0 | 93.0 | 91.1 | 87.0 | 91.6 | 84.4 | 94.4 | 89.8 | 92.3 | 92.2 | 37.4 | 70.0 |
| A.X-4.0-VL-LIGHT (7B) (25.07.31) | ✅ | 78.0 | 95.3 | 56.5 | 90.2 | 87.5 | 91.7 | 89.6 | 94.0 | 88.9 | 87.0 | 92.8 | 82.0 | 94.4 | 86.0 | 86.3 | 86.3 | 33.9 | 63.3 |
| VARCO-VISION-2.0 (14B) (25.07.16) | ✅ | 75.4 | 93.5 | 53.1 | 90.6 | 94.2 | 88.3 | 88.3 | 90.7 | 90.0 | 88.0 | 89.2 | 79.0 | 87.0 | 83.3 | 87.5 | 92.2 | 26.8 | 33.3 |
| Kanana-1.5-V (3B) (25.07.24) | ✅ | 75.0 | 94.0 | 51.4 | 86.5 | 81.7 | 94.5 | 84.4 | 87.9 | 80.0 | 80.4 | 92.8 | 77.3 | 93.5 | 89.4 | 85.1 | 86.8 | 29.7 | 48.3 |
| GPT-4o-mini (24.07.18) | ✘ | 73.3 | 88.7 | 54.1 | 82.4 | 82.7 | 85.5 | 84.4 | 87.4 | 83.3 | 80.4 | 89.2 | 80.2 | 84.3 | 81.4 | 86.3 | 87.3 | 30.3 | 45.0 |
| VARCO-VISION (14B) | ✅ | 72.3 | 90.9 | 49.3 | 81.6 | 87.5 | 83.4 | 83.1 | 84.2 | 86.7 | 84.8 | 79.5 | 82.6 | 83.3 | 76.1 | 81.5 | 85.3 | 33.7 | 31.7 |
| Qwen2.5-VL (3B) | ✅ | 71.8 | 94.2 | 43.9 | 81.6 | 76.9 | 85.5 | 77.9 | 87.4 | 80.0 | 79.3 | 85.5 | 75.4 | 84.3 | 76.9 | 87.5 | 83.3 | 33.9 | 36.7 |
| InternVL2.5 (8B) | ✅ | 70.8 | 89.8 | 47.3 | 81.6 | 76.9 | 85.5 | 81.8 | 83.7 | 81.1 | 77.2 | 78.3 | 76.0 | 83.3 | 74.2 | 78.6 | 85.8 | 34.1 | 38.3 |
| InternVL2.5 (4B) | ✅ | 70.7 | 90.7 | 45.9 | 82.0 | 76.9 | 87.6 | 83.1 | 83.7 | 78.9 | 79.3 | 79.5 | 75.4 | 77.8 | 69.3 | 81.0 | 86.3 | 33.9 | 46.7 |
| Qwen2.5-VL (7B) | ✅ | 68.5 | 94.5 | 36.1 | 80.0 | 77.9 | 85.5 | 81.2 | 87.4 | 76.7 | 75.0 | 89.2 | 77.8 | 82.4 | 77.7 | 86.3 | 85.8 | 15.1 | 36.7 |
| MiniCPM-o-2.6 (8B) | ✅ | 64.3 | 84.1 | 39.9 | 75.9 | 83.7 | 79.3 | 75.9 | 76.7 | 65.6 | 75.0 | 73.5 | 69.5 | 79.6 | 67.8 | 77.4 | 74.0 | 25.5 | 25.0 |
| Ovis1.6-Gemma2 (9B) | ✅ | 58.4 | 68.9 | 45.4 | 64.1 | 69.2 | 71.0 | 72.7 | 60.9 | 71.1 | 67.4 | 53.0 | 68.9 | 75.9 | 65.2 | 58.9 | 63.2 | 30.5 | 28.3 |
| LLaVA-OneVision (7B) | ✅ | 54.0 | 65.1 | 40.1 | 64.1 | 63.5 | 63.4 | 63.6 | 58.6 | 55.6 | 64.1 | 45.8 | 68.3 | 65.7 | 55.3 | 55.4 | 55.9 | 30.8 | 33.3 |
| Deepseek-VL2-small (2.8B) | ✅ | 53.3 | 67.3 | 36.1 | 61.6 | 63.5 | 66.9 | 63.0 | 57.2 | 64.4 | 68.5 | 50.6 | 59.9 | 63.0 | 48.9 | 56.0 | 57.4 | 30.8 | 36.7 |
| Ovis1.6-Llama3.2 (3B) | ✅ | 52.2 | 62.8 | 39.1 | 64.5 | 69.2 | 60.7 | 57.1 | 55.8 | 54.4 | 62.0 | 51.8 | 60.5 | 61.1 | 56.8 | 52.4 | 49.5 | 30.5 | 31.7 |
| Deepseek-VL2-tiny (1B) | ✅ | 48.8 | 60.8 | 34.0 | 57.1 | 55.8 | 63.4 | 58.4 | 51.2 | 57.8 | 57.6 | 45.8 | 54.5 | 58.3 | 43.9 | 47.0 | 54.4 | 30.5 | 31.7 |
| Phi-3.5-Vision (4.2B) | ✅ | 42.6 | 52.2 | 30.8 | 53.5 | 55.8 | 40.0 | 49.4 | 43.3 | 40.0 | 53.3 | 50.6 | 44.3 | 46.3 | 42.8 | 43.5 | 44.6 | 27.6 | 36.7 |
| LLaVA-OneVision (0.5B) | ✅ | 42.3 | 49.6 | 33.3 | 51.8 | 48.1 | 47.6 | 44.8 | 39.5 | 50.0 | 44.6 | 40.9 | 49.7 | 51.9 | 41.7 | 44.6 | 46.1 | 28.0 | 31.7 |
| MiniCPM-V-2.6 (8B) | ✅ | 41.0 | 50.4 | 29.4 | 50.2 | 54.8 | 50.3 | 53.2 | 44.7 | 41.1 | 52.2 | 33.7 | 43.7 | 48.1 | 43.6 | 45.8 | 46.1 | 18.2 | 25.0 |
```bash
make setup   # default: GPU=0 (installs the CPU version of paddlepaddle); for GPU OCR, run: make setup GPU=1
make help    # print the manual
```

Create a `.env` file at the project root:
```
# Set only what you need
OPENAI_API_KEY=<your API key>
GOOGLE_API_KEY=<your API key>
CLAUDE_API_KEY=<your API key>
```
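As a quick sanity check before running the pipeline, you can verify that the keys are visible to Python. The snippet below is a minimal sketch that assumes the scripts read these variables from the environment (loaded here with `python-dotenv`); the repository's actual loading logic may differ.

```python
# check_env.py: minimal sketch; assumes keys are read from .env via python-dotenv
import os

from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # load variables from the .env file at the project root

for key in ("OPENAI_API_KEY", "GOOGLE_API_KEY", "CLAUDE_API_KEY"):
    # only the keys you actually set need to be present
    print(f"{key}: {'set' if os.getenv(key) else 'missing'}")
```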
Before running, prepare input images:

- Create `data/images` and place your images there (the default `INPUT_DIR`), or set `INPUT_DIR` to your custom folder.
```bash
make filter     # 1) filter out low-quality images with an OCR model
make generate   # 2) automatically generate VQA using the 4-stage pipeline (options: INPUT_DIR)
make editor     # 3) refine VQA with the Streamlit-based editor (options: INPUT_DIR, OUTPUT_DIR, SAVE_BATCH)
```
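For example, to run the generation step on a custom image folder, you can override `INPUT_DIR` (one of the Makefile options listed above):

```bash
make generate INPUT_DIR=/path/to/your/images
```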
The `eval` folder contains the inference and evaluation scripts for KRETA:

- `infer_xxx.py`: for model inference
- `evaluate.py`: for evaluating inference results
An inference script loads the specified model and performs inference. To run it, use the following commands:
```bash
cd eval
python infer/infer_gpt.py [MODEL_NAME] [SETTING]
```

- `[MODEL_NAME]`: specify the model's name (e.g., `gpt-4o-mini`, `gpt-4o-mini-2024-07-18`, etc.).
- `[SETTING]`: specify the prompt setting (e.g., `default`, `direct`).
Example:

```bash
python infer/infer_gpt.py gpt-4o-mini default
python infer/infer_hf_vlm.py kakaocorp/kanana-1.5-v-3b-instruct default
python infer/infer_hf_vlm.py NCSOFT/VARCO-VISION-2.0-14B default
python infer/infer_hf_vlm.py skt/A.X-4.0-VL-Light default
```

The evaluation script (`evaluate.py`) evaluates the results generated during the inference step. To run the evaluation, use the following command:
```bash
cd eval
python evaluate.py
```

Once executed, the script will:

- Load the inference results from the `./output` directory.
- Generate and display the evaluation report in the console.
- Save the evaluation report to the `./output` directory.
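To illustrate how the Overall / System 1 / System 2 scores in the leaderboard can be aggregated from per-sample results, here is a minimal sketch. It assumes each result record carries a `system` label and a boolean `correct` flag; these field names are hypothetical, and the actual report format produced by `evaluate.py` may differ.

```python
# aggregate_sketch.py: illustrative only; the "system" and "correct" fields are assumed
from collections import defaultdict

def accuracy_by_system(records):
    """Compute overall, System 1, and System 2 accuracy (in %) from per-sample records."""
    totals, hits = defaultdict(int), defaultdict(int)
    for r in records:
        for key in ("overall", r["system"]):
            totals[key] += 1
            hits[key] += int(r["correct"])
    return {k: round(100.0 * hits[k] / totals[k], 1) for k in totals}

# Example with dummy records
records = [
    {"system": "system1", "correct": True},
    {"system": "system1", "correct": True},
    {"system": "system2", "correct": False},
]
print(accuracy_by_system(records))  # {'overall': 66.7, 'system1': 100.0, 'system2': 0.0}
```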
- MMMU-Pro: we would like to thank the authors for providing the codebase that our work builds upon.
- This work was supported by the Korea Institute for Advancement of Technology (KIAT) grant funded by the Ministry of Education, Korea Government, by Seoul National University (Semiconductor-Specialized University), Waddle, and AttentionX.
If you find KRETA useful for your research and applications, please cite using this BibTeX:
@article{hwang2025kreta,
title={KRETA: A Benchmark for Korean Reading and Reasoning in Text-Rich VQA Attuned to Diverse Visual Contexts},
author={Hwang, Taebaek and Kim, Minseo and Lee, Gisang and Kim, Seonuk and Eun, Hyunjun},
journal={arXiv preprint arXiv:2508.19944},
year={2025}
}