32 changes: 32 additions & 0 deletions tasks/llmzszl/README.md
@@ -0,0 +1,32 @@
# LLMzSzŁ

### Paper

https://arxiv.org/pdf/2501.02266v1

This article introduces the first comprehensive
benchmark for the Polish language at this scale:
LLMzSzŁ (LLMs Behind the School Desk).
It is based on a coherent collection of Polish
national exams, including both academic and
professional tests extracted from the archives
of the Polish Central Examination Board. It
covers 4 types of exams, coming from 154 domains. Altogether, it consists of almost 19k
closed-ended questions. We investigate the performance of open-source multilingual, English,
and Polish LLMs to verify LLMs’ abilities to
transfer knowledge between languages. Also,
the correlation between LLMs and humans at
model accuracy and exam pass rate levels is
examined. We show that multilingual LLMs
can obtain superior results over monolingual
ones; however, monolingual models may be
beneficial when model size matters. Our analysis highlights the potential of LLMs in assisting
with exam validation, particularly in identifying anomalies or errors in examination tasks.

### Dataset

https://huggingface.co/datasets/amu-cai/llmzszl-dataset

### Leaderboard

https://huggingface.co/spaces/amu-cai/LLMZSZL_Leaderboard
19 changes: 19 additions & 0 deletions tasks/llmzszl/llmzszl.yaml
@@ -0,0 +1,19 @@
task: llmzszl
dataset_path: amu-cai/llmzszl-dataset
output_type: multiple_choice
test_split: test
doc_to_text: |
  {{question.strip()}}
  A. {{answers[0]}}
  B. {{answers[1]}}
  C. {{answers[2]}}
  D. {{answers[3]}}
  Prawidłowa odpowiedź:
doc_to_choice: "{{answers}}"
doc_to_target: "{{correct_answer_index}}"
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
metadata:
  version: 1.0
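
To make the config above easier to review, here is a minimal plain-Python sketch of what it asks for. It builds the prompt the way the `doc_to_text` template lays it out (the final line, "Prawidłowa odpowiedź:", is Polish for "Correct answer:") and scores a `multiple_choice` item by picking the answer with the highest log-likelihood, then averaging. This is not the harness's own code: the function names, the sample document, and the log-likelihood values are all hypothetical, and the exact whitespace handling of the real template is not reproduced.

```python
from typing import Dict, List, Sequence

LETTERS = "ABCD"


def render_prompt(doc: Dict) -> str:
    """Approximate the doc_to_text template: question, lettered answers, cue."""
    lines = [doc["question"].strip()]
    for letter, answer in zip(LETTERS, doc["answers"]):
        lines.append(f"{letter}. {answer}")
    lines.append("Prawidłowa odpowiedź:")
    return "\n".join(lines)


def is_correct(choice_loglikelihoods: Sequence[float], target_index: int) -> bool:
    """An item counts as correct when the highest-scoring choice is the target."""
    predicted = max(
        range(len(choice_loglikelihoods)),
        key=lambda i: choice_loglikelihoods[i],
    )
    return predicted == target_index


def mean_accuracy(results: List[bool]) -> float:
    """Aggregate per-item correctness with a plain mean (aggregation: mean)."""
    return sum(results) / len(results)


# Hypothetical document using the field names referenced in the config.
sample_doc = {
    "question": "  Ile wynosi 2 + 2?  ",
    "answers": ["3", "4", "5", "6"],
    "correct_answer_index": 1,
}

if __name__ == "__main__":
    print(render_prompt(sample_doc))
    # Made-up log-likelihoods, one per entry in `answers`.
    scores = [-4.0, -0.5, -3.2, -5.1]
    print(is_correct(scores, sample_doc["correct_answer_index"]))
```

Note that with `doc_to_choice: "{{answers}}"` the scored continuations are the full answer texts rather than the letters A to D, so the letters in the prompt serve only as presentation.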