Repo for Tokenization Is Sensitive To Language Variation

This repository contains code, artifacts and links to datasets for the paper "Tokenization Is Sensitive To Language Variation."

Video: https://youtu.be/GnEpTTj4fc8

Code

fit_tokenizer.py fits the tokenizers.
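
For orientation, fitting a BPE tokenizer with the Hugging Face `tokenizers` library could look roughly like the sketch below; the corpus file name and hyperparameters are placeholders, not the settings used in fit_tokenizer.py:

```python
# Minimal sketch of fitting a BPE tokenizer (illustrative, not the
# exact configuration of fit_tokenizer.py).
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

trainer = BpeTrainer(
    vocab_size=32_000,  # hypothetical vocabulary size
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
)
# "fitting_corpus.txt" is a placeholder for one of the fitting datasets
tokenizer.train(files=["fitting_corpus.txt"], trainer=trainer)
tokenizer.save("tokenizer.json")
```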

train_bert.py trains BERT.

run_sensitive.sh is a shell script demonstrating the evaluation of the trained BERT on variation-sensitive tasks.
run_robust.sh is a shell script demonstrating the evaluation of the trained BERT on variation-robust tasks.
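
As a rough illustration, masked-language-model training with a fitted tokenizer and the `transformers` Trainer is sketched below; the tokenizer wiring, config values, and data file are assumptions, not the exact setup of train_bert.py:

```python
# Minimal sketch of BERT-style pre-training from scratch (illustrative).
from datasets import load_dataset
from transformers import (
    BertConfig, BertForMaskedLM, DataCollatorForLanguageModeling,
    PreTrainedTokenizerFast, Trainer, TrainingArguments,
)

# Wrap the fitted tokenizer from the previous step
tokenizer = PreTrainedTokenizerFast(
    tokenizer_file="tokenizer.json",
    unk_token="[UNK]", pad_token="[PAD]", mask_token="[MASK]",
    cls_token="[CLS]", sep_token="[SEP]",
)
model = BertForMaskedLM(BertConfig(vocab_size=tokenizer.vocab_size))

# "training_corpus.txt" is a placeholder for the training data
raw = load_dataset("text", data_files={"train": "training_corpus.txt"})
tokenized = raw.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True, remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-out"),
    train_dataset=tokenized["train"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15),
)
trainer.train()
```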

run_logreg.py implements the logistic regression model.
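
A minimal sketch of such a baseline with scikit-learn, assuming a simple bag-of-words feature setup (which may differ from what run_logreg.py actually does):

```python
# Minimal sketch of a logistic regression text classifier (illustrative).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy two-class data, e.g. British vs. American spelling variants
texts = ["colour of the armour", "color of the armor"]
labels = [0, 1]

clf = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(texts, labels)
print(clf.predict(["the colour of it"]))
```
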
intrinsic_eval.py performs the "intrinsic" evaluation of the tokenization models.
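
One common intrinsic tokenizer metric is fertility, the average number of subword tokens per whitespace word; below is a sketch of computing it, under the assumption that fertility is among the metrics of interest (intrinsic_eval.py may measure others):

```python
# Minimal sketch of the fertility metric (illustrative).
from tokenizers import Tokenizer

def fertility(tokenizer: Tokenizer, texts: list[str]) -> float:
    """Average number of subword tokens per whitespace-separated word."""
    n_subwords = sum(len(tokenizer.encode(text).tokens) for text in texts)
    n_words = sum(len(text.split()) for text in texts)
    return n_subwords / n_words

tok = Tokenizer.from_file("tokenizer.json")
print(fertility(tok, ["tokenization is sensitive to language variation"]))
```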

Currently, this repository mainly serves documentation purposes. I'm planning to eventually release a slimmed-down version of the code so that it can easily be used to evaluate your own tokenizers; unfortunately, this could take some time.

Datasets

Most scripts are the sweat and tears of dataset creation (see the create_XXX files). Unless you want to understand that process, ignore those files. The datasets are available ready to use on Hugging Face (except for those where we lack licenses to share):

Fitting Datasets

Training Datasets
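
The released datasets can be loaded with the `datasets` library; the identifier below is a placeholder, not an actual Hub id, so substitute the dataset name from the links above:

```python
from datasets import load_dataset

# "nlpsoc/fitting-corpus" is a hypothetical placeholder id
ds = load_dataset("nlpsoc/fitting-corpus")
print(ds)
```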

Tasks

Variation-Sensitive

Variation-Robust

Requirements

  • Python 3.11.9

  • requirements.txt lists the package dependencies
