
KRETA Benchmark

🤗 KRETA | 📖 Paper | 🏆 Leaderboard

KRETA: A Benchmark for Korean Reading and Reasoning in Text-Rich VQA Attuned to Diverse Visual Contexts (EMNLP 2025 Main Conference)
Taebaek Hwang*, Minseo Kim*, Gisang Lee, Seonuk Kim, Hyunjun Eun

Abstract

Understanding and reasoning over text within visual contexts poses a significant challenge for Vision-Language Models (VLMs), given the complexity and diversity of real-world scenarios. To address this challenge, text-rich Visual Question Answering (VQA) datasets and evaluation benchmarks have emerged for high-resource languages like English. However, a critical gap persists for low-resource languages such as Korean, where the lack of comprehensive benchmarks hinders robust model evaluation and comparison. To bridge this gap, we introduce KRETA, a benchmark for Korean Reading and rEasoning in Text-rich VQA Attuned to diverse visual contexts. KRETA facilitates an in-depth evaluation of both visual text understanding and reasoning capabilities, while also supporting a multifaceted assessment across 15 domains and 26 image types. Additionally, we introduce a semi-automated VQA generation pipeline specifically optimized for text-rich settings, leveraging refined stepwise image decomposition and a rigorous seven-metric evaluation protocol to ensure data quality. While KRETA is tailored for Korean, we hope our adaptable and extensible pipeline will facilitate the development of similar benchmarks in other languages, thereby accelerating multilingual VLM research.


(a) Distribution of samples across 15 domains (inner ring) and 26 image types (outer ring). Dark green and light green segments in the inner ring represent the number of samples associated with System 2 and System 1, respectively. (b) The semi-automated VQA generation pipeline.

Examples

(Figure: KRETA examples)

Leaderboard

| Rank | Model | Release | Type | Overall | System1 | System2 |
|------|-------|---------|------|---------|---------|---------|
| 1 | Gemini-2.0-flash | 25.02.05 | Closed | 85.4 | 98.0 | 69.8 |
| 2 | GPT-4o | 24.11.20 | Closed | 84.6 | 95.9 | 70.5 |
| 3 | Claude-3.5-Sonnet | 24.10.22 | Closed | 80.5 | 93.4 | 64.5 |
| 4 | A.X-4.0-VL-LIGHT (7B) | 25.07.31 | Open-Source | 78.0 | 95.3 | 56.5 |
| 5 | VARCO-VISION-2.0 (14B) | 25.07.16 | Open-Source | 75.4 | 93.5 | 53.1 |
| 6 | Kanana-1.5-V (3B) | 25.07.24 | Open-Source | 75.0 | 94.0 | 51.4 |
| 7 | GPT-4o-mini | 24.07.18 | Closed | 73.3 | 88.7 | 54.1 |
Full Leaderboard
| Model | Overall | System1 | System2 | Gov. | Econ. | Mktg. | Comm. | Edu. | Med. | Tech. | Arts. | Transp. | Tour. | FnB. | Ent. | Life. | Sci. | Hist. |
|-------|---------|---------|---------|------|-------|-------|-------|------|------|-------|-------|---------|-------|------|------|-------|------|-------|
| Gemini-2.0-flash (25.02.05) | 85.4 | 98.0 | 69.8 | 95.1 | 95.2 | 99.3 | 96.1 | 96.7 | 92.2 | 93.5 | 98.8 | 90.4 | 98.1 | 93.2 | 95.2 | 96.6 | 44.1 | 78.3 |
| GPT-4o (24.11.20) | 84.6 | 95.9 | 70.5 | 93.5 | 92.3 | 97.2 | 90.3 | 96.7 | 91.1 | 96.7 | 100.0 | 84.4 | 93.5 | 93.6 | 97.0 | 95.1 | 44.1 | 93.3 |
| Claude-3.5-Sonnet (24.10.22) | 80.5 | 93.4 | 64.5 | 93.5 | 91.3 | 92.4 | 87.0 | 93.0 | 91.1 | 87.0 | 91.6 | 84.4 | 94.4 | 89.8 | 92.3 | 92.2 | 37.4 | 70.0 |
| A.X-4.0-VL-LIGHT (25.07.31) | 78.0 | 95.3 | 56.5 | 90.2 | 87.5 | 91.7 | 89.6 | 94.0 | 88.9 | 87.0 | 92.8 | 82.0 | 94.4 | 86.0 | 86.3 | 86.3 | 33.9 | 63.3 |
| VARCO-VISION-2.0 (14B) (25.07.16) | 75.4 | 93.5 | 53.1 | 90.6 | 94.2 | 88.3 | 88.3 | 90.7 | 90.0 | 88.0 | 89.2 | 79.0 | 87.0 | 83.3 | 87.5 | 92.2 | 26.8 | 33.3 |
| KANANA-1.5-V (3B) (25.07.24) | 75.0 | 94.0 | 51.4 | 86.5 | 81.7 | 94.5 | 84.4 | 87.9 | 80.0 | 80.4 | 92.8 | 77.3 | 93.5 | 89.4 | 85.1 | 86.8 | 29.7 | 48.3 |
| GPT-4o-mini (24.07.18) | 73.3 | 88.7 | 54.1 | 82.4 | 82.7 | 85.5 | 84.4 | 87.4 | 83.3 | 80.4 | 89.2 | 80.2 | 84.3 | 81.4 | 86.3 | 87.3 | 30.3 | 45.0 |
| VARCO-VISION (14B) | 72.3 | 90.9 | 49.3 | 81.6 | 87.5 | 83.4 | 83.1 | 84.2 | 86.7 | 84.8 | 79.5 | 82.6 | 83.3 | 76.1 | 81.5 | 85.3 | 33.7 | 31.7 |
| Qwen2.5-VL (3B) | 71.8 | 94.2 | 43.9 | 81.6 | 76.9 | 85.5 | 77.9 | 87.4 | 80.0 | 79.3 | 85.5 | 75.4 | 84.3 | 76.9 | 87.5 | 83.3 | 33.9 | 36.7 |
| InternVL2.5 (8B) | 70.8 | 89.8 | 47.3 | 81.6 | 76.9 | 85.5 | 81.8 | 83.7 | 81.1 | 77.2 | 78.3 | 76.0 | 83.3 | 74.2 | 78.6 | 85.8 | 34.1 | 38.3 |
| InternVL2.5 (4B) | 70.7 | 90.7 | 45.9 | 82.0 | 76.9 | 87.6 | 83.1 | 83.7 | 78.9 | 79.3 | 79.5 | 75.4 | 77.8 | 69.3 | 81.0 | 86.3 | 33.9 | 46.7 |
| Qwen2.5-VL (7B) | 68.5 | 94.5 | 36.1 | 80.0 | 77.9 | 85.5 | 81.2 | 87.4 | 76.7 | 75.0 | 89.2 | 77.8 | 82.4 | 77.7 | 86.3 | 85.8 | 15.1 | 36.7 |
| MiniCPM-o-2.6 (8B) | 64.3 | 84.1 | 39.9 | 75.9 | 83.7 | 79.3 | 75.9 | 76.7 | 65.6 | 75.0 | 73.5 | 69.5 | 79.6 | 67.8 | 77.4 | 74.0 | 25.5 | 25.0 |
| Ovis1.6-Gemma2 (9B) | 58.4 | 68.9 | 45.4 | 64.1 | 69.2 | 71.0 | 72.7 | 60.9 | 71.1 | 67.4 | 53.0 | 68.9 | 75.9 | 65.2 | 58.9 | 63.2 | 30.5 | 28.3 |
| LLaVA-OneVision (7B) | 54.0 | 65.1 | 40.1 | 64.1 | 63.5 | 63.4 | 63.6 | 58.6 | 55.6 | 64.1 | 45.8 | 68.3 | 65.7 | 55.3 | 55.4 | 55.9 | 30.8 | 33.3 |
| Deepseek-VL2-small (2.8B) | 53.3 | 67.3 | 36.1 | 61.6 | 63.5 | 66.9 | 63.0 | 57.2 | 64.4 | 68.5 | 50.6 | 59.9 | 63.0 | 48.9 | 56.0 | 57.4 | 30.8 | 36.7 |
| Ovis1.6-Llama3.2 (3B) | 52.2 | 62.8 | 39.1 | 64.5 | 69.2 | 60.7 | 57.1 | 55.8 | 54.4 | 62.0 | 51.8 | 60.5 | 61.1 | 56.8 | 52.4 | 49.5 | 30.5 | 31.7 |
| Deepseek-VL2-tiny (1B) | 48.8 | 60.8 | 34.0 | 57.1 | 55.8 | 63.4 | 58.4 | 51.2 | 57.8 | 57.6 | 45.8 | 54.5 | 58.3 | 43.9 | 47.0 | 54.4 | 30.5 | 31.7 |
| Phi-3.5-Vision (4.2B) | 42.6 | 52.2 | 30.8 | 53.5 | 55.8 | 40.0 | 49.4 | 43.3 | 40.0 | 53.3 | 50.6 | 44.3 | 46.3 | 42.8 | 43.5 | 44.6 | 27.6 | 36.7 |
| LLaVA-OneVision (0.5B) | 42.3 | 49.6 | 33.3 | 51.8 | 48.1 | 47.6 | 44.8 | 39.5 | 50.0 | 44.6 | 40.9 | 49.7 | 51.9 | 41.7 | 44.6 | 46.1 | 28.0 | 31.7 |
| MiniCPM-V-2.6 (8B) | 41.0 | 50.4 | 29.4 | 50.2 | 54.8 | 50.3 | 53.2 | 44.7 | 41.1 | 52.2 | 33.7 | 43.7 | 48.1 | 43.6 | 45.8 | 46.1 | 18.2 | 25.0 |

Settings

make setup # default: GPU=0 (installs paddlepaddle CPU version); for GPU OCR, run: make setup GPU=1
make help # print manual

Environment (.env)

Create a .env file at the project root.

# Set only what you need
OPENAI_API_KEY=<your API key>
GOOGLE_API_KEY=<your API key>
CLAUDE_API_KEY=<your API key>

Text-Rich VQA Generation

Before running, prepare input images:

  • Create data/images and place your images there (default INPUT_DIR), or set INPUT_DIR to your custom folder.
make filter   # 1) filter out low-quality images with the OCR model
make generate # 2) automatically generate VQA with the 4-stage pipeline (options: INPUT_DIR)
make editor   # 3) refine the generated VQA in the Streamlit-based editor (options: INPUT_DIR, OUTPUT_DIR, SAVE_BATCH)
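
The listed Makefile variables can be overridden on the command line in the usual make fashion. A rough sketch (the folder names and batch size below are placeholders; the exact variable handling is defined in the Makefile):

make filter   INPUT_DIR=data/my_images                                       # filter a custom image folder
make generate INPUT_DIR=data/my_images                                       # generate VQA for the same folder
make editor   INPUT_DIR=data/my_images OUTPUT_DIR=data/vqa SAVE_BATCH=10     # review and refine the results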

Evaluation

The eval folder contains the inference and evaluation scripts for KRETA.

  1. infer_xxx.py: For model inference
  2. evaluate.py: For evaluating inference results

1. Model Inference

Each inference script loads the specified model and runs inference on the benchmark. To run a script, use the following commands:

cd eval
python infer/infer_gpt.py [MODEL_NAME] [SETTING]
  • [MODEL_NAME]: Specify the model's name (e.g., gpt-4o-mini, gpt-4o-mini-2024-07-18, etc.).
  • [SETTING]: Specify the prompt setting (e.g., default, direct).

Example:

python infer/infer_gpt.py gpt-4o-mini default
python infer/infer_hf_vlm.py kakaocorp/kanana-1.5-v-3b-instruct default
python infer/infer_hf_vlm.py NCSOFT/VARCO-VISION-2.0-14B default
python infer/infer_hf_vlm.py skt/A.X-4.0-VL-Light default
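
To run several open-source checkpoints in one pass, a plain shell loop over the same invocations works; this is only a convenience sketch reusing the exact commands shown above:

for model in kakaocorp/kanana-1.5-v-3b-instruct NCSOFT/VARCO-VISION-2.0-14B skt/A.X-4.0-VL-Light; do
    python infer/infer_hf_vlm.py "$model" default   # same 'default' prompt setting as in the examples
done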

2. Evaluation

This script evaluates the results generated from the inference step. To run the evaluation, use the following command:

cd eval
python evaluate.py

Once executed, the script will:

  • Load the inference results from the ./output directory.
  • Generate and display the evaluation report in the console.
  • Save the evaluation report to the ./output directory.
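
Putting both steps together, a typical run looks roughly like this (model name and setting taken from the earlier example; the ./output layout is as described above):

cd eval
python infer/infer_gpt.py gpt-4o-mini default   # writes inference results to ./output
python evaluate.py                              # prints the report and saves it to ./output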

Acknowledgement

  • MMMU-Pro: We would like to thank the authors for providing the codebase that our work builds upon.
  • This work was supported by the Korea Institute for Advancement of Technology (KIAT) grant funded by the Ministry of Education, Korea Government, by Seoul National University (Semiconductor-Specialized University), Waddle, and AttentionX.

If you find KRETA useful for your research and applications, please cite using this BibTeX:

@article{hwang2025kreta,
  title={KRETA: A Benchmark for Korean Reading and Reasoning in Text-Rich VQA Attuned to Diverse Visual Contexts},
  author={Hwang, Taebaek and Kim, Minseo and Lee, Gisang and Kim, Seonuk and Eun, Hyunjun},
  journal={arXiv preprint arXiv:2508.19944},
  year={2025}
}
