Create vocab.txt from a corpus
$ pip install -r requirements.txt
SPM.py accepts only one corpus file.
$ python SPM.py --help   # or -h, for details
$ python SPM.py --corpus .../corpus.txt --size 32000 --output .../vocab.txt
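Because SPM.py takes a single corpus file, several files have to be combined before they are passed to --corpus. A minimal sketch of one way to do that with the standard library follows. The helper name merge_corpora and the file names are hypothetical and not part of this repository, and the sketch assumes UTF-8 text with one sentence per line.

```python
def merge_corpora(inputs, output):
    """Concatenate several UTF-8 text files into a single corpus file.

    Hypothetical helper, not part of this repository.
    """
    with open(output, "w", encoding="utf-8") as out:
        for path in inputs:
            with open(path, encoding="utf-8") as src:
                for line in src:
                    # Re-add the newline so the last line of a file that lacks
                    # a trailing newline is not glued to the next file.
                    out.write(line.rstrip("\n") + "\n")
    return output
```

The merged file can then be passed to SPM.py through --corpus, for example after calling merge_corpora(["corpus.txt", "corpus2.txt"], "merged.txt").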
WPM.py accepts multiple corpus files.
$ pip install tokenizers
$ python WPM.py --help   # or -h, for details
$ python WPM.py --corpus .../corpus.txt .../corpus2.txt --size 32000 --output .../vocab.txt
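The command above passes several paths after a single --corpus flag. As an illustration only, a command-line interface of that shape can be declared with argparse as sketched below. The flag names come from the commands in this README. The defaults, the help texts, and the function name build_parser are assumptions, and the actual WPM.py may be written differently.

```python
import argparse


def build_parser():
    """Illustrative parser mirroring the flags shown in this README."""
    parser = argparse.ArgumentParser(description="Create vocab.txt from corpus files.")
    # nargs="+" collects one or more paths given after a single --corpus flag.
    parser.add_argument("--corpus", nargs="+", required=True,
                        help="one or more corpus files")
    # The default values below are assumptions, not taken from the scripts.
    parser.add_argument("--size", type=int, default=32000,
                        help="target vocabulary size")
    parser.add_argument("--output", default="vocab.txt",
                        help="path of the vocabulary file to write")
    return parser


if __name__ == "__main__":
    # Parse an explicit argument list so the sketch runs without a real command line.
    args = build_parser().parse_args(
        ["--corpus", "corpus.txt", "corpus2.txt", "--size", "32000", "--output", "vocab.txt"]
    )
    print(args.corpus, args.size, args.output)
```

With nargs="+", args.corpus is always a list, so the same code path handles one file or many.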
WPM2.py accepts multiple corpus files.
$ pip install tokenizers==0.7.0   # default
$ python WPM2.py --help   # or -h, for details
$ python WPM2.py --corpus .../corpus.txt .../corpus2.txt --size 32000 --limit_alphabet 6000 --output .../vocab.txt
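In the tokenizers library, limit_alphabet is a training option that caps how many distinct characters are kept in the initial alphabet. Assuming WPM2.py forwards --limit_alphabet to that option, the effect can be pictured with the standard-library sketch below, which keeps only the most frequent characters. This is an illustration of the idea, not the library's implementation, and the function name limited_alphabet is hypothetical.

```python
from collections import Counter


def limited_alphabet(lines, limit):
    """Return the `limit` most frequent non-whitespace characters in `lines`.

    Illustrative only: approximates the idea behind limit_alphabet.
    """
    counts = Counter(ch for line in lines for ch in line if not ch.isspace())
    # most_common(limit) keeps the highest-count characters and drops the rest.
    return {ch for ch, _ in counts.most_common(limit)}
```

For a corpus with a very large character inventory, a cap such as 6000 keeps rare characters from taking up slots in a 32000-entry vocabulary.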