EuroEval

The robust European language model benchmark

(formerly known as ScandEval)


Installation and usage

See the documentation for more information.

Configurable Finetuning Hyperparameters

EuroEval exposes a comprehensive set of finetuning hyperparameters that can be customised both via the CLI and programmatically. All parameters have sensible defaults based on standard best practices.

Core Training Parameters

| Parameter | Default | Description |
|---|---|---|
| `--learning-rate` | `2e-5` | Maximum learning rate for training |
| `--warmup-ratio` | `0.01` | Fraction of training steps for learning rate warmup |
| `--finetuning-batch-size` | `32` | Training batch size per device |
| `--max-steps` | `10000` | Maximum number of training steps |
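As a quick arithmetic check on the defaults above, the warmup phase spans `warmup_ratio × max_steps` optimiser steps. A minimal sketch (EuroEval's internal rounding may differ slightly):

```python
# Hypothetical illustration: warmup length implied by the default
# hyperparameters. Warmup steps = warmup_ratio * max_steps.
warmup_ratio = 0.01   # --warmup-ratio default
max_steps = 10_000    # --max-steps default

warmup_steps = int(warmup_ratio * max_steps)
print(warmup_steps)  # 100
```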

Evaluation & Checkpointing

| Parameter | Default | Description |
|---|---|---|
| `--eval-steps` | `30` | How often to evaluate the model during training |
| `--save-steps` | `30` | How often to save model checkpoints |
| `--save-total-limit` | `1` | Maximum number of checkpoints to keep |
| `--logging-steps` | `30` | How often to log training metrics |

Optimization & Regularization

| Parameter | Default | Description |
|---|---|---|
| `--optimizer-name` | `adamw_torch` | Optimizer to use (`adamw_torch`, `adamw_8bit`, `sgd`, etc.) |
| `--weight-decay` | `0.0` | L2 regularization coefficient |
| `--lr-scheduler-type` | `linear` | LR scheduler type (`linear`, `cosine`, `polynomial`, etc.) |
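To illustrate what the default `linear` scheduler does with the defaults above, here is a sketch of the standard linear warmup-then-decay shape. This is an illustration of the common scheduler family, not EuroEval's internal implementation:

```python
def linear_schedule_lr(step: int,
                       max_lr: float = 2e-5,
                       warmup_steps: int = 100,
                       max_steps: int = 10_000) -> float:
    """Sketch of a linear warmup-then-decay schedule (illustrative only)."""
    if step < warmup_steps:
        # Ramp linearly from 0 up to max_lr during warmup.
        return max_lr * step / warmup_steps
    # Then decay linearly from max_lr down to 0 over the remaining steps.
    remaining = max_steps - step
    return max_lr * max(0.0, remaining / (max_steps - warmup_steps))

print(linear_schedule_lr(50))      # halfway through warmup
print(linear_schedule_lr(100))     # peak learning rate
print(linear_schedule_lr(10_000))  # end of training: 0.0
```

A `cosine` scheduler replaces the linear decay segment with a cosine curve; the warmup phase is typically unchanged.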

Gradient Accumulation & Early Stopping

| Parameter | Default | Description |
|---|---|---|
| `--gradient-accumulation-base` | `32` | Base value for computing gradient accumulation (final: base ÷ batch_size) |
| `--eval-accumulation-steps` | `32` | Accumulation steps for evaluation |
| `--early-stopping-patience` | `5` | Evaluation steps with no improvement before stopping |
| `--per-device-eval-batch-size` | `None` | Separate eval batch size (if `None`, uses training batch size) |
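The `--gradient-accumulation-base` convention keeps the effective batch size constant: halving the per-device batch size doubles the accumulation steps. A minimal sketch of that computation, assuming the final value is simply base ÷ batch size as the table describes (the `accumulation_steps` helper is hypothetical, not part of EuroEval's API):

```python
def accumulation_steps(base: int, batch_size: int) -> int:
    # Final accumulation steps = base // batch_size (at least 1), so the
    # effective batch size (batch_size * steps) stays roughly constant.
    return max(1, base // batch_size)

for bs in (32, 16, 8):
    steps = accumulation_steps(32, bs)
    # per-device batch size, accumulation steps, effective batch size
    print(bs, steps, bs * steps)
```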

Example Usage

```bash
# Default hyperparameters
euroeval-benchmark --model bert-base-uncased --language en

# Custom learning rate and batch size
euroeval-benchmark --model bert-base-uncased --learning-rate 5e-5 --finetuning-batch-size 16

# Aggressive early stopping with custom optimizer
euroeval-benchmark --model bert-base-uncased \
  --early-stopping-patience 3 \
  --optimizer-name adamw_8bit \
  --weight-decay 0.01

# Cosine annealing with custom evaluation frequency
euroeval-benchmark --model bert-base-uncased \
  --lr-scheduler-type cosine \
  --eval-steps 50 \
  --logging-steps 50

# Different eval batch size for memory efficiency
euroeval-benchmark --model bert-base-uncased \
  --finetuning-batch-size 32 \
  --per-device-eval-batch-size 64
```

Programmatic Usage

```python
from euroeval import Benchmarker

benchmarker = Benchmarker(
    learning_rate=5e-5,
    warmup_ratio=0.1,
    finetuning_batch_size=16,
    max_steps=5000,
    eval_steps=50,
    save_steps=50,
    optimizer_name="adamw_8bit",
    weight_decay=0.01,
    lr_scheduler_type="cosine",
    early_stopping_patience=3,
)

benchmarker.benchmark(model=["bert-base-uncased"])
```

Reproducing the evaluation datasets

All datasets used in this project are generated using the scripts located in the `src/scripts` folder. To reproduce a dataset, run the corresponding script with the following command:

```bash
uv run src/scripts/<name-of-script>.py
```

Replace `<name-of-script>` with the specific script you wish to execute, e.g.,

```bash
uv run src/scripts/create_allocine.py
```

Contributors 🙏

A huge thank you to all the contributors who have helped make this project a success!

@peter-sk, @AJDERS, @oliverkinch, @versae, @KennethEnevoldsen, @viggo-gascou, @mathiasesn, @Alkarex, @marksverdhei, @Mikeriess, @ThomasKluiters, @BramVanroy, @peregilk, @Rijgersberg, @duarteocarmo, @slowwavesleep, @mrkowalski, @simonevanbruggen, @tvosch, @Touzen, @caldaibis, @SwekeR-463, @N-essuno, @harderj

Contribute to EuroEval

We welcome contributions to EuroEval! Whether you're fixing bugs, adding features, or contributing new datasets, your help makes this project better for everyone.

  • General contributions: Check out our contribution guidelines for information on how to get started.
  • Adding datasets: If you're interested in adding a new dataset to EuroEval, we have a dedicated guide with step-by-step instructions.

Special thanks

  • Thanks to Google for sponsoring Gemini credits as part of their Google Cloud for Researchers Program.
  • Thanks to @Mikeriess for evaluating many of the larger models on the leaderboards.
  • Thanks to OpenAI for sponsoring OpenAI credits as part of their Researcher Access Program.
  • Thanks to UWV and KU Leuven for sponsoring the Azure OpenAI credits used to evaluate GPT-4-turbo in Dutch.
  • Thanks to Miðeind for sponsoring the OpenAI credits used to evaluate GPT-4-turbo in Icelandic and Faroese.
  • Thanks to CHC for sponsoring the OpenAI credits used to evaluate GPT-4-turbo in German.

Citing EuroEval

If you want to cite the framework, feel free to use either of the following:

```bibtex
@article{smart2024encoder,
  title = {Encoder vs Decoder: Comparative Analysis of Encoder and Decoder Language Models on Multilingual NLU Tasks},
  author = {Smart, Dan Saattrup and Enevoldsen, Kenneth and Schneider-Kamp, Peter},
  journal = {arXiv preprint arXiv:2406.13469},
  year = {2024}
}

@inproceedings{smart2023scandeval,
  author = {Smart, Dan Saattrup},
  booktitle = {Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)},
  month = may,
  pages = {185--201},
  title = {{ScandEval: A Benchmark for Scandinavian Natural Language Processing}},
  year = {2023}
}
```
