Randomized BPE for machine translation

Install required packages

External dependencies

Install xsv
Install GNU parallel
Optionally install fzf, jq and bat, too.

Python dependencies

pip install -r requirements.txt

Download the data

For Uzbek, use scripts/download_til.py.

After extracting the data, create a detokenized version of each {train,dev,test}.{eng,fin,deu,uzb} file using sacremoses:

sacremoses detokenize < input_file > output_file
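To cover every split and language in one pass, a loop along these lines may help. It is only a sketch: the function name `detokenize_all`, the data directory argument, and the `.detok` output suffix are assumptions made for illustration, not names used by the scripts in this repository.

```shell
# Hypothetical helper: detokenize every {train,dev,test}.{eng,fin,deu,uzb}
# file found in a data directory. The ".detok" suffix is an assumption.
detokenize_all() {
    local data_dir="${1:-.}"
    local split lang input_file output_file
    for split in train dev test; do
        for lang in eng fin deu uzb; do
            input_file="${data_dir}/${split}.${lang}"
            output_file="${input_file}.detok"
            # Skip split/language combinations that are not present.
            [ -f "${input_file}" ] || continue
            sacremoses detokenize < "${input_file}" > "${output_file}"
        done
    done
}
```

Calling `detokenize_all path/to/data` would then write one `.detok` file next to each input file it finds.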

Run experiments

The general workflow for running an experiment is the same regardless of language or segmentation method. Here is an example for English-Finnish translation using regular BPE and 32k merge operations.

Create the experiments and eng_{fin,deu,est,uzb}_bin directories in the root folder.

mkdir experiments eng_{fin,deu,est,uzb}_bin

Set the randseg_experiment_name environment variable.

export randseg_experiment_name=english2finnish_vanillabpe

Set variables for the experiment config file (randseg_cfg_file) and hyperparameter folder (randseg_hparams_folder).

export randseg_cfg_file=$(realpath config/english2finnish_sweep_vanillabpe_cfg.sh)
export randseg_hparams_folder=$(realpath config/sweep_confitions_32k_1worker)
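Because realpath may return a path that does not exist, it can be worth checking the variables before submitting a job. The following is a minimal sketch; the function name `check_randseg_env` is hypothetical, and it assumes the config variable should point to a regular file and the hyperparameter variable to a directory.

```shell
# Hypothetical pre-flight check: confirm the three randseg_* variables are
# set and point at existing paths before submitting the job.
check_randseg_env() {
    local status=0
    if [ -z "${randseg_experiment_name:-}" ]; then
        echo "randseg_experiment_name is not set" >&2
        status=1
    fi
    # Assumption: the config variable names a regular file.
    if [ ! -f "${randseg_cfg_file:-}" ]; then
        echo "randseg_cfg_file does not point to a file" >&2
        status=1
    fi
    # Assumption: the hyperparameter variable names a directory.
    if [ ! -d "${randseg_hparams_folder:-}" ]; then
        echo "randseg_hparams_folder does not point to a directory" >&2
        status=1
    fi
    return "${status}"
}
```

One way to use it is to chain it in front of the submission command, for example `check_randseg_env && sbatch -J your_job_name sweep_experiment.sh`.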

You can also customize your config file if you're running a custom experiment. See ./config/english2*_cfg.sh for inspiration.

Run the experiment using SLURM with 10 parallel jobs, one for each seed:

sbatch -J your_job_name sweep_experiment.sh

Analyze results

After the experiments finish, the scores can be found in test.eval.score.{bleu,chrf} in each experiment folder.
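To gather the scores in one place, something like the following could be used. It is a sketch under assumptions: the function name `collect_scores` is hypothetical, the experiment folders are assumed to live under the experiments directory, and the score files are assumed to be short plain-text files whose exact format is not documented here, so their contents are printed unchanged.

```shell
# Hypothetical helper: list every BLEU and chrF score file under a root
# directory, printing one "path<TAB>contents" line per file.
collect_scores() {
    local root="${1:-experiments}"
    find "${root}" -type f \
        \( -name 'test.eval.score.bleu' -o -name 'test.eval.score.chrf' \) \
        | sort \
        | while IFS= read -r score_file; do
            # Join multi-line files onto one line; contents are not parsed.
            printf '%s\t%s\n' "${score_file}" "$(paste -s -d ' ' "${score_file}")"
        done
}
```

Running `collect_scores experiments` prints a tab-separated listing that can be redirected to a file or passed to a tool such as xsv.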
