Confounding Factors in Relating Model Performance to Morphology - EMNLP 2025

Confounding Factors

Code and analysis for the EMNLP 2025 paper Confounding Factors in Relating Model Performance to Morphology by Wessel Poelman*, Thomas Bauwens*, and Miryam de Lhoneux.

Dependencies

To run the Python scripts in this repository, it suffices to install the Tokeniser Toolkit (TkTkT) with

pip install "tktkt[github] @ git+https://github.com/bauwenst/TkTkT"

which will then automatically install all the necessary nested dependencies.
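
To check that the installation succeeded, a quick import test along the following lines can help. This is only a sketch: `missing_packages` is a hypothetical helper, and the list of package names is an assumption based on the imports used in the example script further down.

```python
from importlib.util import find_spec
from typing import Iterable, List


def missing_packages(names: Iterable[str]) -> List[str]:
    """Return the top-level package names that cannot be found in the current environment."""
    return [name for name in names if find_spec(name) is None]


# Assumption: these are the top-level packages imported by the example script.
print(missing_packages(["tktkt", "transformers", "datasets"]))
```

An empty list means all three packages can be imported from the active environment.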

Example: compute AV-based statistics for a HuggingFace dataset and tokeniser

First we clone the repository and enter it:

git clone "https://github.com/LAGoM-NLP/ConfoundingFactors"
cd ./ConfoundingFactors

Then we create a script anywhere inside this folder that looks like this:

from scripts.perplexity.av import main

from tktkt.models.huggingface.wrapper import HuggingFaceTokeniser
from tktkt.factories.preprocessing import TraditionalPreprocessor
from tktkt.util.types import NamedIterable

from transformers import AutoTokenizer
from datasets import load_dataset

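# Tokeniser of a Finnish Goldfish model, and the first 200,000 streamed examples of the Finnish part of FineWeb-2.
hf_tokeniser = AutoTokenizer.from_pretrained("goldfish-models/fin_latn_1000mb")
hf_dataset   = load_dataset("HuggingFaceFW/fineweb-2", "fin_Latn", split="train", streaming=True).take(200_000)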

results = main(
    tokeniser=HuggingFaceTokeniser(hf_tokeniser),
    word_extractor=TraditionalPreprocessor(),
    iterable=NamedIterable(iterable=hf_dataset, name=hf_dataset.info.dataset_name).map(lambda example: example["text"]).tqdm()
)
results = results.assemble().pop()
print(results["right_entropic_efficiency_mean"])

Note that the statistics in the results variable are meaningless by themselves. They are points on a spectrum, useful only for comparing languages, and only if you control for both the corpus and the tokeniser training regime.
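
Because the statistics are only informative relative to one another, a comparison across languages could be organised as in the sketch below. The helper `rank_languages` and the `results_by_language` mapping are hypothetical names, not part of this repository. The sketch only assumes that each results object supports lookup by statistic name (as in `results["right_entropic_efficiency_mean"]` above) and that the looked-up value is a number.

```python
from typing import List, Mapping, Tuple


def rank_languages(results_by_language: Mapping[str, Mapping[str, float]],
                   statistic: str) -> List[Tuple[str, float]]:
    """Order languages by one statistic, highest value first.

    Assumption: every entry was computed under comparable conditions
    (comparable corpus, same tokeniser training regime); otherwise the
    ordering says nothing about the languages themselves.
    """
    pairs = [(language, stats[statistic]) for language, stats in results_by_language.items()]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)
```

Here `results_by_language` would be filled by running the example script once per language and storing each `results` object under its language code.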

Finally, we execute the script from the repository root, either in an IDE or using a call like

PYTHONPATH=. python ./path/to/script.py
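
The `PYTHONPATH=.` prefix is what lets Python resolve `from scripts.perplexity.av import main` when the call is made from the repository root. As an alternative, a script could add the repository root to `sys.path` itself. A minimal sketch, assuming the root is the nearest parent folder that contains a `scripts` directory; `find_repo_root` and `ensure_importable` are hypothetical helpers, not part of this repository:

```python
import sys
from pathlib import Path


def find_repo_root(start: Path, marker: str = "scripts") -> Path:
    """Return `start` or its nearest ancestor that contains a `marker` directory."""
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No folder containing '{marker}' found at or above {start}")


def ensure_importable(start: Path) -> Path:
    """Put the repository root at the front of sys.path so that `scripts` can be imported."""
    root = find_repo_root(start)
    if str(root) not in sys.path:
        sys.path.insert(0, str(root))
    return root
```

Calling `ensure_importable(Path(__file__).parent)` at the top of the script, before the `from scripts.perplexity.av import main` line, would then make the `PYTHONPATH=.` prefix unnecessary.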

Citation

@inproceedings{poelman-etal-2025-confounding,
    title = "Confounding Factors in Relating Model Performance to Morphology",
    author = "Poelman, Wessel  and Bauwens, Thomas  and de Lhoneux, Miryam",
    booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2025",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.emnlp-main.369/",
    pages = "7273--7298",
}
