Confounding Factors

Code and analysis for the EMNLP 2025 paper Confounding Factors in Relating Model Performance to Morphology by Wessel Poelman*, Thomas Bauwens*, and Miryam de Lhoneux.

Dependencies

To run the Python scripts in this repository, it suffices to install the Tokeniser Toolkit (TkTkT) using

pip install "tktkt[github] @ git+https://github.com/bauwenst/TkTkT"

which will then automatically install all the necessary nested dependencies.

Example: compute AV-based statistics for a HuggingFace dataset and tokeniser

First we clone the repository and enter it:

git clone "https://github.com/LAGoM-NLP/ConfoundingFactors"
cd ./ConfoundingFactors

Then we create a script anywhere inside this folder that looks like this:

from scripts.perplexity.av import main

from tktkt.models.huggingface.wrapper import HuggingFaceTokeniser
from tktkt.factories.preprocessing import TraditionalPreprocessor
from tktkt.util.types import NamedIterable

from transformers import AutoTokenizer
from datasets import load_dataset

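# Load the tokeniser to analyse, and stream the first 200,000 examples of the Finnish (Latin script) subset of FineWeb-2.
hf_tokeniser = AutoTokenizer.from_pretrained("goldfish-models/fin_latn_1000mb")
hf_dataset   = load_dataset("HuggingFaceFW/fineweb-2", "fin_Latn", split="train", streaming=True).take(200_000)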

results = main(
    tokeniser=HuggingFaceTokeniser(hf_tokeniser),
    word_extractor=TraditionalPreprocessor(),
    iterable=NamedIterable(iterable=hf_dataset, name=hf_dataset.info.dataset_name).map(lambda example: example["text"]).tqdm()
)
results = results.assemble().pop()
print(results["right_entropic_efficiency_mean"])

Note that the statistics in the results variable are not meaningful in isolation. They are points on a spectrum and are useful only for comparing languages, and only if you control for both the corpus and the tokeniser training regime.
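For intuition about what a statistic like right_entropic_efficiency_mean might be built from, here is a toy sketch. It assumes that "AV" stands for accessor variety (the number of distinct tokens seen next to a given token) and that the "right" statistics describe the tokens that follow a token. Both are assumptions, not something this README confirms. The helper names and the token sequence are invented for illustration, and this is not the implementation in scripts.perplexity.av.

```python
from collections import Counter, defaultdict
from math import log2


def right_neighbour_counts(tokens):
    """Count, for every token type, which tokens directly follow it and how often."""
    counts = defaultdict(Counter)
    for left, right in zip(tokens, tokens[1:]):
        counts[left][right] += 1
    return counts


def right_accessor_variety(successors):
    """Number of distinct token types observed directly after a token."""
    return len(successors)


def right_entropy(successors):
    """Shannon entropy (in bits) of the distribution over following tokens."""
    total = sum(successors.values())
    return sum((n / total) * log2(total / n) for n in successors.values())


# Toy token sequence, invented for illustration (not real tokeniser output).
tokens = ["talo", "ssa", "talo", "sta", "talo", "ssa", "auto", "ssa"]
counts = right_neighbour_counts(tokens)

for token, successors in counts.items():
    print(token, right_accessor_variety(successors), round(right_entropy(successors), 3))
```

A token followed by many different tokens in roughly equal proportion gets a high variety and a high entropy. A token that is always followed by the same token gets a variety of 1 and an entropy of 0. Such values depend on the corpus and on the tokeniser, which is why they need to be controlled for when comparing languages.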

Finally, we execute the script in an IDE or using a call like

PYTHONPATH=. python ./path/to/script.py
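The PYTHONPATH=. prefix puts the repository root on Python's import path, so that from scripts.perplexity.av import main resolves when the command is run from the repository root. If you would rather not set an environment variable, one possible alternative is to add the root to sys.path at the top of your script. The sketch below does this with two hypothetical helpers, find_repo_root and add_repo_root_to_path, which are not part of this repository. It assumes the repository root is the nearest parent folder that contains a scripts folder.

```python
import sys
from pathlib import Path


def find_repo_root(start: Path, marker: str = "scripts") -> Path:
    """Walk upwards from `start` until a folder containing `marker` is found."""
    start = start.resolve()
    for candidate in [start, *start.parents]:
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No parent of {start} contains a '{marker}' folder.")


def add_repo_root_to_path(start: Path) -> Path:
    """Put the repository root at the front of sys.path (hypothetical helper)."""
    root = find_repo_root(start)
    if str(root) not in sys.path:
        sys.path.insert(0, str(root))
    return root


# At the top of your own script, before `from scripts.perplexity.av import main`:
# add_repo_root_to_path(Path(__file__).parent)
```

With this in place, the script can be started with a plain python ./path/to/script.py from any working directory.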

Citation

@inproceedings{poelman-etal-2025-confounding,
    title = "Confounding Factors in Relating Model Performance to Morphology",
    author = "Poelman, Wessel  and Bauwens, Thomas  and de Lhoneux, Miryam",
    booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2025",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.emnlp-main.369/",
    pages = "7273--7298",
}
