EuroEval

The robust European language model benchmark

(formerly known as ScandEval)


Installation and usage

See the documentation for more information.

Configurable Finetuning Hyperparameters

EuroEval exposes a comprehensive set of finetuning hyperparameters that can be customised both via the CLI and programmatically. All parameters have sensible defaults based on standard best practices.

Core Training Parameters

| Parameter | Default | Description |
|---|---|---|
| `--learning-rate` | `2e-5` | Maximum learning rate for training |
| `--warmup-ratio` | `0.01` | Fraction of training steps for learning rate warmup |
| `--finetuning-batch-size` | `32` | Training batch size per device |
| `--max-steps` | `10000` | Maximum number of training steps |
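As a quick arithmetic check on the defaults above, the warmup phase spans `warmup_ratio × max_steps` optimiser steps. A minimal sketch (EuroEval's internal rounding may differ slightly):

```python
# Hypothetical illustration: warmup length implied by the default
# hyperparameters. Warmup steps = warmup_ratio * max_steps.
warmup_ratio = 0.01   # --warmup-ratio default
max_steps = 10_000    # --max-steps default

warmup_steps = int(warmup_ratio * max_steps)
print(warmup_steps)  # 100
```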

Evaluation & Checkpointing

| Parameter | Default | Description |
|---|---|---|
| `--eval-steps` | `30` | How often to evaluate the model during training |
| `--save-steps` | `30` | How often to save model checkpoints |
| `--save-total-limit` | `1` | Maximum number of checkpoints to keep |
| `--logging-steps` | `30` | How often to log training metrics |

Optimization & Regularization

| Parameter | Default | Description |
|---|---|---|
| `--optimizer-name` | `adamw_torch` | Optimizer to use (`adamw_torch`, `adamw_8bit`, `sgd`, etc.) |
| `--weight-decay` | `0.0` | L2 regularization coefficient |
| `--lr-scheduler-type` | `linear` | LR scheduler type (`linear`, `cosine`, `polynomial`, etc.) |
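To illustrate what the default `linear` scheduler does with the defaults above, here is a sketch of the standard linear warmup-then-decay shape. This is an illustration of the common scheduler family, not EuroEval's internal implementation:

```python
def linear_schedule_lr(step: int,
                       max_lr: float = 2e-5,
                       warmup_steps: int = 100,
                       max_steps: int = 10_000) -> float:
    """Sketch of a linear warmup-then-decay schedule (illustrative only)."""
    if step < warmup_steps:
        # Ramp linearly from 0 up to max_lr during warmup.
        return max_lr * step / warmup_steps
    # Then decay linearly from max_lr down to 0 over the remaining steps.
    remaining = max_steps - step
    return max_lr * max(0.0, remaining / (max_steps - warmup_steps))

print(linear_schedule_lr(50))      # halfway through warmup
print(linear_schedule_lr(100))     # peak learning rate
print(linear_schedule_lr(10_000))  # end of training: 0.0
```

A `cosine` scheduler replaces the linear decay segment with a cosine curve; the warmup phase is typically unchanged.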

Gradient Accumulation & Early Stopping

| Parameter | Default | Description |
|---|---|---|
| `--gradient-accumulation-base` | `32` | Base value for computing gradient accumulation (final: base ÷ batch_size) |
| `--eval-accumulation-steps` | `32` | Accumulation steps for evaluation |
| `--early-stopping-patience` | `5` | Evaluation steps with no improvement before stopping |
| `--per-device-eval-batch-size` | `None` | Separate eval batch size (if `None`, uses training batch size) |
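The `--gradient-accumulation-base` convention keeps the effective batch size constant: halving the per-device batch size doubles the accumulation steps. A minimal sketch of that computation, assuming the final value is simply base ÷ batch size as the table describes (the `accumulation_steps` helper is hypothetical, not part of EuroEval's API):

```python
def accumulation_steps(base: int, batch_size: int) -> int:
    # Final accumulation steps = base // batch_size (at least 1), so the
    # effective batch size (batch_size * steps) stays roughly constant.
    return max(1, base // batch_size)

for bs in (32, 16, 8):
    steps = accumulation_steps(32, bs)
    # per-device batch size, accumulation steps, effective batch size
    print(bs, steps, bs * steps)
```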

Example Usage

```bash
# Default hyperparameters
euroeval-benchmark --model bert-base-uncased --language en

# Custom learning rate and batch size
euroeval-benchmark --model bert-base-uncased --learning-rate 5e-5 --finetuning-batch-size 16

# Aggressive early stopping with custom optimizer
euroeval-benchmark --model bert-base-uncased \
  --early-stopping-patience 3 \
  --optimizer-name adamw_8bit \
  --weight-decay 0.01

# Cosine annealing with custom evaluation frequency
euroeval-benchmark --model bert-base-uncased \
  --lr-scheduler-type cosine \
  --eval-steps 50 \
  --logging-steps 50

# Different eval batch size for memory efficiency
euroeval-benchmark --model bert-base-uncased \
  --finetuning-batch-size 32 \
  --per-device-eval-batch-size 64
```

Programmatic Usage

```python
from euroeval import Benchmarker

benchmarker = Benchmarker(
    learning_rate=5e-5,
    warmup_ratio=0.1,
    finetuning_batch_size=16,
    max_steps=5000,
    eval_steps=50,
    save_steps=50,
    optimizer_name="adamw_8bit",
    weight_decay=0.01,
    lr_scheduler_type="cosine",
    early_stopping_patience=3,
)

benchmarker.benchmark(model=["bert-base-uncased"])
```

Reproducing the evaluation datasets

All datasets used in this project are generated using the scripts located in the `src/scripts` folder. To reproduce a dataset, run the corresponding script with the following command:

```bash
uv run src/scripts/<name-of-script>.py
```

Replace `<name-of-script>` with the specific script you wish to execute, e.g.,

```bash
uv run src/scripts/create_allocine.py
```

Contributors 🙏

A huge thank you to all the contributors who have helped make this project a success!

@peter-sk, @AJDERS, @oliverkinch, @versae, @KennethEnevoldsen, @viggo-gascou, @mathiasesn, @Alkarex, @marksverdhei, @Mikeriess, @ThomasKluiters, @BramVanroy, @peregilk, @Rijgersberg, @duarteocarmo, @slowwavesleep, @mrkowalski, @simonevanbruggen, @tvosch, @Touzen, @caldaibis, @SwekeR-463, @N-essuno, @harderj

Contribute to EuroEval

We welcome contributions to EuroEval! Whether you're fixing bugs, adding features, or contributing new datasets, your help makes this project better for everyone.

  • General contributions: Check out our contribution guidelines for information on how to get started.
  • Adding datasets: If you're interested in adding a new dataset to EuroEval, we have a dedicated guide with step-by-step instructions.

Special thanks

  • Thanks to Google for sponsoring Gemini credits as part of their Google Cloud for Researchers Program.
  • Thanks to @Mikeriess for evaluating many of the larger models on the leaderboards.
  • Thanks to OpenAI for sponsoring OpenAI credits as part of their Researcher Access Program.
  • Thanks to UWV and KU Leuven for sponsoring the Azure OpenAI credits used to evaluate GPT-4-turbo in Dutch.
  • Thanks to Miðeind for sponsoring the OpenAI credits used to evaluate GPT-4-turbo in Icelandic and Faroese.
  • Thanks to CHC for sponsoring the OpenAI credits used to evaluate GPT-4-turbo in German.

Citing EuroEval

If you want to cite the framework, feel free to use either of the following:

```bibtex
@article{smart2024encoder,
  title = {Encoder vs Decoder: Comparative Analysis of Encoder and Decoder Language Models on Multilingual NLU Tasks},
  author = {Smart, Dan Saattrup and Enevoldsen, Kenneth and Schneider-Kamp, Peter},
  journal = {arXiv preprint arXiv:2406.13469},
  year = {2024}
}

@inproceedings{smart2023scandeval,
  author = {Smart, Dan Saattrup},
  booktitle = {Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)},
  month = may,
  pages = {185--201},
  title = {{ScandEval: A Benchmark for Scandinavian Natural Language Processing}},
  year = {2023}
}
```
