The tokenization logic should be more generic, so it is not tied to a single language. For Japanese, we could use something like https://www.atilika.org/ to do the tokenizing.
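
One possible shape for this, as a rough sketch only: keep a registry of tokenizers keyed by language code and dispatch through a single `tokenize` entry point. All names here (`register_tokenizer`, `tokenize`, `character_tokenize`, the `"ja"` key) are hypothetical and not taken from the existing code. The Japanese tokenizer is a per-character placeholder standing in for a real morphological analyzer such as the one linked above.

```python
from typing import Callable, Dict, List

# A tokenizer is any callable that maps text to a list of tokens.
Tokenizer = Callable[[str], List[str]]

# Hypothetical registry keyed by language code; not part of the existing code.
_TOKENIZERS: Dict[str, Tokenizer] = {}


def register_tokenizer(lang: str, tokenizer: Tokenizer) -> None:
    """Associate a tokenizer with a language code such as "en" or "ja"."""
    _TOKENIZERS[lang] = tokenizer


def whitespace_tokenize(text: str) -> List[str]:
    """Default tokenizer: split on runs of whitespace."""
    return text.split()


def character_tokenize(text: str) -> List[str]:
    """Placeholder for Japanese: one token per non-whitespace character.

    A real implementation would delegate to a morphological analyzer here.
    """
    return [ch for ch in text if not ch.isspace()]


def tokenize(text: str, lang: str = "en") -> List[str]:
    """Use the tokenizer registered for lang, else fall back to whitespace."""
    return _TOKENIZERS.get(lang, whitespace_tokenize)(text)


# Assumption: Japanese is identified by the language code "ja".
register_tokenizer("ja", character_tokenize)
```

With this layout, adding a language means registering one more callable, and callers only ever see `tokenize(text, lang)`. Swapping the placeholder for a proper analyzer would not change any call sites.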