tilde-nlp · nicemanis · Jun 19, 2025
diff --git a/tasks/llmzszl/README.md b/tasks/llmzszl/README.md
@@ -0,0 +1,32 @@
+# LLMzSzŁ
+
+### Paper
+
+https://arxiv.org/pdf/2501.02266v1
+
+This article introduces the first comprehensive
+benchmark for the Polish language at this scale:
+LLMzSzŁ (LLMs Behind the School Desk).
+It is based on a coherent collection of Polish
+national exams, including both academic and
+professional tests extracted from the archives
+of the Polish Central Examination Board. It
+covers 4 types of exams, coming from 154 domains. Altogether, it consists of almost 19k
+closed-ended questions. We investigate the performance of open-source multilingual, English,
+and Polish LLMs to verify LLMs’ abilities to
+transfer knowledge between languages. Also,
+the correlation between LLMs and humans at
+model accuracy and exam pass rate levels is
+examined. We show that multilingual LLMs
+can obtain superior results over monolingual
+ones; however, monolingual models may be
+beneficial when model size matters. Our analysis highlights the potential of LLMs in assisting
+with exam validation, particularly in identifying anomalies or errors in examination tasks.
+
+### Dataset
+
+https://huggingface.co/datasets/amu-cai/llmzszl-dataset
+
+### Leaderboard
+
+https://huggingface.co/spaces/amu-cai/LLMZSZL_Leaderboard
diff --git a/tasks/llmzszl/llmzszl.yaml b/tasks/llmzszl/llmzszl.yaml
@@ -0,0 +1,19 @@
+task: llmzszl
+dataset_path: amu-cai/llmzszl-dataset
+output_type: multiple_choice
+test_split: test
+doc_to_text: |
+  {{question.strip()}}
+  A. {{answers[0]}}
+  B. {{answers[1]}}
+  C. {{answers[2]}}
+  D. {{answers[3]}}
+  Prawidłowa odpowiedź:
+doc_to_choice: "{{answers}}"  # candidates are the full answer texts, not the letters A-D
+doc_to_target: "{{correct_answer_index}}"  # index into the answers list
+metric_list:
+  - metric: acc
+    aggregation: mean
+    higher_is_better: true
+metadata:
+  version: 1.0
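
A rough way to sanity-check the prompt format in `llmzszl.yaml` is to reproduce the `doc_to_text` template in plain Python. The sketch below is only an illustration: `render_prompt`, `gold_choice`, and `example_doc` are hypothetical names, the sample question is invented, and only the field names (`question`, `answers`, `correct_answer_index`) come from the config. It assumes every record has exactly four answers, as the template itself does.

```python
# Illustrative sketch: mimics the doc_to_text template from llmzszl.yaml
# without Jinja. The record below is made up; it is not from the dataset.

LETTERS = ["A", "B", "C", "D"]


def render_prompt(doc: dict) -> str:
    """Question, four lettered options, then the Polish answer cue."""
    lines = [doc["question"].strip()]
    lines += [f"{letter}. {answer}" for letter, answer in zip(LETTERS, doc["answers"])]
    lines.append("Prawidłowa odpowiedź:")
    return "\n".join(lines)


def gold_choice(doc: dict) -> str:
    """doc_to_choice is the answer list, so the target index picks an answer text."""
    # int() is a precaution in case the index is stored as a string (assumption).
    return doc["answers"][int(doc["correct_answer_index"])]


# Hypothetical record using the field names from the YAML config.
example_doc = {
    "question": "  Ile wynosi 2 + 2?  ",
    "answers": ["3", "4", "5", "6"],
    "correct_answer_index": 1,
}

print(render_prompt(example_doc))
print("gold:", gold_choice(example_doc))
```

Note that the prompt shows lettered options while the candidates in `doc_to_choice` are the answer texts themselves, so the gold continuation in this sketch is `4` rather than `B`.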