Code and analysis for the EMNLP 2025 paper "Confounding Factors in Relating Model Performance to Morphology" by Wessel Poelman*, Thomas Bauwens*, and Miryam de Lhoneux.
To run the Python scripts in this repository, it suffices to install the Tokeniser Toolkit (TkTkT) using
pip install "tktkt[github] @ git+https://github.com/bauwenst/TkTkT"
which will then automatically install all the necessary nested dependencies.
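As a quick sanity check after installing, the following minimal sketch reports which of the top-level packages imported by the example script are not importable in the current environment. The helper name `missing_packages` is hypothetical and not part of TkTkT; the package list is simply taken from the imports used further down in this README.

```python
import importlib.util
from typing import Iterable, List


def missing_packages(names: Iterable[str]) -> List[str]:
    """Return the top-level package names that cannot be found for import.

    Hypothetical helper for illustration. It only checks that a package is
    discoverable, not that its version is compatible.
    """
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Top-level packages imported by the example script in this README.
    missing = missing_packages(["tktkt", "transformers", "datasets"])
    if missing:
        print("Not importable:", ", ".join(missing))
    else:
        print("All required packages were found.")
```

If anything is reported as not importable, re-running the install command above in the active environment is the first thing to try.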
First we clone the repository and enter it:
git clone "https://github.com/LAGoM-NLP/ConfoundingFactors"
cd ./ConfoundingFactors
Then we create a script anywhere inside this folder that looks like this:
from scripts.perplexity.av import main
from tktkt.models.huggingface.wrapper import HuggingFaceTokeniser
from tktkt.factories.preprocessing import TraditionalPreprocessor
from tktkt.util.types import NamedIterable
from transformers import AutoTokenizer
from datasets import load_dataset
hf_tokeniser = AutoTokenizer.from_pretrained("goldfish-models/fin_latn_1000mb")
hf_dataset = load_dataset("HuggingFaceFW/fineweb-2", "fin_Latn", split="train", streaming=True).take(200_000)
results = main(
    tokeniser=HuggingFaceTokeniser(hf_tokeniser),
    word_extractor=TraditionalPreprocessor(),
    iterable=NamedIterable(iterable=hf_dataset, name=hf_dataset.info.dataset_name).map(lambda example: example["text"]).tqdm()
)
results = results.assemble().pop()
print(results["right_entropic_efficiency_mean"])
Note that none of the statistics in the results variable is meaningful by itself.
They are points on a spectrum, useful only for comparing languages, and only if you control for both the corpus and the tokeniser training regime.
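To make the comparison concrete, here is a minimal sketch of how one might run the same measurement for several languages and rank them by a single statistic. Everything in it is an assumption for illustration: `goldfish_checkpoint` and `fineweb2_config` generalise the naming seen in the example script ("goldfish-models/fin_latn_1000mb" and "fin_Latn") and may not hold for every language, and `run` is a hypothetical callable that you would write yourself to wrap the script above for one checkpoint and one corpus configuration.

```python
from typing import Callable, Dict, Iterable, List, Tuple

# The statistic printed by the example script above.
STATISTIC = "right_entropic_efficiency_mean"


def goldfish_checkpoint(iso: str, script: str, size: str = "1000mb") -> str:
    """ASSUMED naming pattern, generalised from "goldfish-models/fin_latn_1000mb"."""
    return f"goldfish-models/{iso.lower()}_{script.lower()}_{size}"


def fineweb2_config(iso: str, script: str) -> str:
    """ASSUMED naming pattern, generalised from the "fin_Latn" config used above."""
    return f"{iso.lower()}_{script.capitalize()}"


def compare_languages(
    languages: Iterable[Tuple[str, str]],
    run: Callable[[str, str], Dict[str, float]],
    statistic: str = STATISTIC,
) -> List[Tuple[str, float]]:
    """Measure every language with the same setup and sort by one statistic.

    `run(checkpoint, config)` is a hypothetical user-supplied function that
    loads the tokeniser and corpus, calls `main`, and returns the results
    dictionary. Keeping the corpus family and the tokeniser training regime
    fixed across languages is what makes the values comparable.
    """
    scores = []
    for iso, script in languages:
        results = run(goldfish_checkpoint(iso, script), fineweb2_config(iso, script))
        scores.append((iso, results[statistic]))
    # Ascending order: only the relative position of a language is informative.
    return sorted(scores, key=lambda pair: pair[1])
```

Only the relative order of the languages in the returned list should be interpreted, not the absolute values.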
Finally, we execute the script in an IDE or using a call like
PYTHONPATH=. python ./path/to/script.py

@inproceedings{poelman-etal-2025-confounding,
    title = "Confounding Factors in Relating Model Performance to Morphology",
    author = "Poelman, Wessel and Bauwens, Thomas and de Lhoneux, Miryam",
    booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2025",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.emnlp-main.369/",
    pages = "7273--7298",
}