I didn't check your data carefully, but this is the test data we use for lingpy's `ipa2tokens` function:

* https://github.com/lingpy/lingpy/blob/master/lingpy/tests/test_data/test_tokenization.tsv

You might want to check your tokenizer against those cases, as lingpy tokenizes 100% of them correctly.
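A rough sketch of how such a check could look, in case it helps. I'm assuming here that each row of the TSV has the untokenized string in the first column and the expected tokens, separated by spaces, in the second; please verify that against the actual file. `check_tokenizer` is just a hypothetical helper name, not something lingpy provides.

```python
import csv
import io


def check_tokenizer(tokenize, tsv_text):
    """Compare a tokenizer against expected tokenizations.

    ASSUMED layout (not verified against the lingpy file):
    column 1 = input string, column 2 = space-separated expected tokens.
    Returns (number_of_passing_rows, list_of_failures).
    """
    failures = []
    total = 0
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t",
                        quoting=csv.QUOTE_NONE)
    for row in reader:
        # Skip blank, short, or comment rows.
        if len(row) < 2 or row[0].startswith("#"):
            continue
        source, expected = row[0], row[1].split()
        total += 1
        got = list(tokenize(source))
        if got != expected:
            failures.append((source, expected, got))
    return total - len(failures), failures


# Tiny made-up sample, only to show the mechanics.
sample = "tʰa\ttʰ a\nabc\ta b c\n"
passed, failures = check_tokenizer(list, sample)  # list() splits per character
for source, expected, got in failures:
    print(source, "expected", expected, "got", got)
```

With lingpy installed you could pass `ipa2tokens` (imported from `lingpy`) as the `tokenize` argument and the contents of the downloaded file as `tsv_text`, then run your own tokenizer through the same function to compare.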