This repository contains code, artifacts and links to datasets for the paper "Tokenization Is Sensitive To Language Variation."
Video: https://youtu.be/GnEpTTj4fc8
- `fit_tokenizer.py` fits the tokenizers (sketched below)
- `train_bert.py` trains BERT
- `run_sensitive.sh` is a shell script demonstrating the evaluation of a trained BERT model on variation-sensitive tasks
- `run_robust.sh` is a shell script demonstrating the evaluation of a trained BERT model on variation-robust tasks
- `run_logreg.py` runs the logistic regression model
- `intrinsic_eval.py` runs the "intrinsic" evaluation of the tokenization models (also sketched below)
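For orientation, here is a minimal sketch of the tokenizer-fitting step using the Hugging Face `tokenizers` library. The vocabulary size, special tokens, and corpus file name are assumptions for illustration; the actual `fit_tokenizer.py` may use different settings.

```python
# Minimal sketch: fit a BPE tokenizer on a fitting corpus.
# Vocab size, special tokens, and file name are assumptions, not from the paper.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

trainer = BpeTrainer(
    vocab_size=32_000,  # assumed size
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
)
tokenizer.train(files=["fitting_corpus.txt"], trainer=trainer)  # hypothetical file
tokenizer.save("tokenizer.json")
```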
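And a sketch of one common intrinsic metric, "fertility" (the mean number of subword tokens per whitespace-separated word). This is an illustrative stand-in; `intrinsic_eval.py` may compute different metrics.

```python
# Sketch of a tokenizer fertility metric; not necessarily what
# intrinsic_eval.py computes.
from tokenizers import Tokenizer

def fertility(tokenizer: Tokenizer, texts: list[str]) -> float:
    """Mean number of subword tokens per whitespace-separated word."""
    n_tokens = sum(len(tokenizer.encode(t).tokens) for t in texts)
    n_words = sum(len(t.split()) for t in texts)
    return n_tokens / n_words

tok = Tokenizer.from_file("tokenizer.json")
print(fertility(tok, ["Tokenization is sensitive to language variation."]))
```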
For now, this repository mainly serves documentation purposes. I plan to eventually release a slimmed-down version of the code that makes it easy to evaluate your own tokenizers. Unfortunately, this could take some time.
Most of the remaining scripts are the sweat and tears of dataset creation (see the create_XXX files). Unless you want to understand that process, you can ignore them.
The datasets are ready to use on Hugging Face (except for those we lack the license to share):
Fitting Datasets
- Wikipedia: https://huggingface.co/datasets/AnnaWegmann/Fitting-Wikipedia
- PubMed: https://huggingface.co/datasets/AnnaWegmann/Fitting-PubMed
- Twitter: cannot be shared due to licensing constraints
Training Datasets
- Miscellaneous: https://huggingface.co/datasets/AnnaWegmann/Training-Misc
Tasks
Variation-Sensitive
- Authorship Verification: https://huggingface.co/datasets/AnnaWegmann/AV
- PAN: https://huggingface.co/datasets/AnnaWegmann/PAN
- CORE: https://huggingface.co/datasets/AnnaWegmann/CORE
- NUCLE: cannot be shared due to licensing constraints; you can request access via https://www.comp.nus.edu.sg/~nlp/corpora.html
- Dialect: https://huggingface.co/datasets/AnnaWegmann/Dialect
Variation-Robust
- GLUE: see https://huggingface.co/datasets/nyu-mll/glue
- GLUE+typo: https://huggingface.co/datasets/AnnaWegmann/GLUE-typo
- GLUE+dialect: https://huggingface.co/datasets/AnnaWegmann/GLUE-dialect
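All shared datasets can be loaded with the Hugging Face `datasets` library. Below is a minimal sketch for the Fitting-Wikipedia dataset; split and column names are not specified here, so check the respective dataset card for the actual structure (some datasets may also require a config name).

```python
# Sketch: load one of the released datasets from the Hugging Face Hub.
from datasets import load_dataset

fitting_wiki = load_dataset("AnnaWegmann/Fitting-Wikipedia")
print(fitting_wiki)  # prints the available splits and columns
```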
Requirements
- Python 3.11.9
- Install the dependencies with `pip install -r requirements.txt`