- Install xsv
- Install GNU parallel
- Optionally install fzf, jq and bat, too.
pip install -r requirements.txt
For Uzbek, use scripts/download_til.py.
After extracting, create a detokenized version of all of {train,dev,test}.{eng,fin,deu,uzb} using sacremoses:
sacremoses detokenize < input_file > output_file

The general workflow to run an experiment is the same regardless of language or segmentation method. Here is an example for English-Finnish translation using regular BPE with 32k merge operations.
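For the detokenization step, running the command once per file by hand gets tedious. The following is a minimal sketch of a loop over every split and language, assuming the extracted files sit together in one directory. The helper name `detok_all`, the `.detok` output suffix, and the `DETOK_CMD` override are hypothetical and not part of this repository.

```shell
# Hypothetical helper: detokenize every {train,dev,test}.{eng,fin,deu,uzb}
# file found in the given directory. Output goes to <file>.detok (assumed name).
detok_all() {
  local data_dir="$1"
  local split lang src
  for split in train dev test; do
    for lang in eng fin deu uzb; do
      src="${data_dir}/${split}.${lang}"
      # Skip split/language combinations that do not exist.
      [ -f "$src" ] || continue
      # DETOK_CMD is an assumed override hook; it defaults to the command above.
      ${DETOK_CMD:-sacremoses detokenize} < "$src" > "${src}.detok"
    done
  done
}

# Example call (the path is a placeholder):
# detok_all path/to/extracted_data
```

Missing files are skipped rather than treated as errors, since not every language pair necessarily ships all three splits.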
- Create the experiments and eng_{fin,deu,est,uzb}_bin directories in the root folder.
mkdir experiments eng_{fin,deu,est,uzb}_bin
- Set the randseg_experiment_name environment variable.
export randseg_experiment_name=english2finnish_vanillabpe
- Set variables for the experiment config file (randseg_cfg_file) and hyperparameter folder (randseg_hparams_folder).
export randseg_cfg_file=$(realpath config/english2finnish_sweep_vanillabpe_cfg.sh)
export randseg_hparams_folder=$(realpath config/sweep_confitions_32k_1worker)
You can also customize your config file if you're running a custom experiment. See ./config/english2*_cfg.sh for inspiration.
- Run the experiment using SLURM with 10 parallel jobs, one for each seed
sbatch -J your_job_name sweep_experiment.sh
- Analyze results
After the experiments finish, the scores can be found in test.eval.score.{bleu,chrf} in each experiment folder.
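To compare seeds side by side, the score files can be gathered into one listing. This is a rough sketch, assuming the first line of each score file holds the value of interest (the exact file format is not described here). The helper name `collect_scores` is hypothetical.

```shell
# Hypothetical helper: print "<score file path><TAB><first line>" for every
# test.eval.score.{bleu,chrf} file under the given root (default: experiments).
collect_scores() {
  local root="${1:-experiments}"
  find "$root" -type f \
    \( -name 'test.eval.score.bleu' -o -name 'test.eval.score.chrf' \) |
    sort |
    while IFS= read -r f; do
      # Assumption: the first line of the file contains the score.
      printf '%s\t%s\n' "$f" "$(head -n 1 "$f")"
    done
}

# Example call:
# collect_scores experiments
```

The output is tab-separated, so it can be redirected to a file and inspected with a tabular tool such as xsv using a tab delimiter.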