If endOfWordSuffix is not nil, the BPE tokenizer should append it to the end of each word, but it does not.
For example:
In Python, "Hello World!" is split into the tokens ['<|startoftext|>', 'hello</w>', 'world</w>', '!</w>', '<|endoftext|>'].
In this project, however, the result is ['<|startoftext|>', 'hello', 'world', '!', '<|endoftext|>'], with no suffix on any word.
This bug occurs in the method MergeWord of BPE.
The line currRuneIdx++ comes after the if currRuneIdx == len(chars) branch, so the index has not yet reached len(chars) when the check runs and the branch never takes effect. In addition, the if currRuneIdx == len(chars) branch comes after the if byteIdx == 0 branch, so single-letter words are never combined with endOfWordSuffix.
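The off-by-one can be reproduced in isolation. The sketch below is not the project's MergeWord; splitWordBuggy and splitWordFixed are hypothetical stand-ins that only split a word into per-character pieces, and the "</w>" suffix is an assumed value used for illustration. It does not model the byteIdx == 0 branch.

```go
package main

import "fmt"

// splitWordBuggy mirrors the ordering described above: the end-of-word
// check compares against len(chars) before currRuneIdx is incremented,
// so the condition is never true inside the loop.
func splitWordBuggy(word string, endOfWordSuffix *string) []string {
	chars := []rune(word)
	pieces := make([]string, 0, len(chars))
	currRuneIdx := 0
	for _, r := range chars {
		piece := string(r)
		if endOfWordSuffix != nil && currRuneIdx == len(chars) {
			piece += *endOfWordSuffix // unreachable: index is at most len(chars)-1 here
		}
		currRuneIdx++
		pieces = append(pieces, piece)
	}
	return pieces
}

// splitWordFixed advances the index first, so the check fires on the
// last character, including when the word has only one character.
func splitWordFixed(word string, endOfWordSuffix *string) []string {
	chars := []rune(word)
	pieces := make([]string, 0, len(chars))
	currRuneIdx := 0
	for _, r := range chars {
		piece := string(r)
		currRuneIdx++
		if endOfWordSuffix != nil && currRuneIdx == len(chars) {
			piece += *endOfWordSuffix
		}
		pieces = append(pieces, piece)
	}
	return pieces
}

func main() {
	suffix := "</w>" // assumed suffix, for illustration only
	fmt.Println(splitWordBuggy("hi", &suffix))
	fmt.Println(splitWordFixed("hi", &suffix))
	fmt.Println(splitWordFixed("a", &suffix))
}
```

An equivalent fix would keep the increment where it is and compare against len(chars)-1 instead. Either way, the end-of-word check has to run before any early exit taken for the first character, otherwise single-letter words still miss the suffix.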