additional test data for tokenization tests

I didn't check your data carefully, but this is what we use to test lingpy's "ipa2tokens" function:

* https://github.com/lingpy/lingpy/blob/master/lingpy/tests/test_data/test_tokenization.tsv

You might want to check your tokenizer against those, since lingpy tokenizes 100% of them correctly.
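
In case it helps, here is a rough sketch of how such a check could look. It is not taken from lingpy's test suite. The helper name `check_tokenizer` is made up for illustration, and the column layout (input string in the first column, expected tokens separated by spaces in the second) is an assumption you should verify against the actual file. The tokenizer is passed in as a plain callable, so you can plug in your own function.

```python
import csv
import io


def check_tokenizer(tokenize, tsv_text):
    """Run `tokenize` over TSV test rows and return (passed, failures).

    ASSUMED layout (check against the real file): column 1 holds the
    input string, column 2 the expected tokens separated by spaces.
    """
    passed = 0
    failures = []
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t",
                        quoting=csv.QUOTE_NONE)
    for row in reader:
        # Skip blank lines, comment lines, and rows without an expectation.
        if len(row) < 2 or row[0].startswith("#"):
            continue
        source, expected = row[0], row[1].split()
        actual = list(tokenize(source))
        if actual == expected:
            passed += 1
        else:
            failures.append((source, expected, actual))
    return passed, failures


# Toy data and a stand-in tokenizer (split into single characters),
# only to show the call; replace both with the real file and tokenizer.
sample = "abc\ta b c\nxy\tx y\n"
passed, failures = check_tokenizer(list, sample)
print(passed, len(failures))
```

Each entry in `failures` is a tuple of the input, the expected tokens, and what the tokenizer actually returned, which makes it easy to see where the two disagree.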