endOfWordSuffix does not take effect in BPE model #65

@season-studio

Description

When endOfWordSuffix is set (not nil), the BPE tokenizer does not append it to the end of each word.

For example:
In Python, "Hello World!" is split into the tokens ['<|startoftext|>', 'hello</w>', 'world</w>', '!</w>', '<|endoftext|>'].
In this project, the result is ['<|startoftext|>', 'hello', 'world', '!', '<|endoftext|>'], with no suffix on any word.

The bug is in the MergeWord method of BPE.
The currRuneIdx++ statement comes after the if currRuneIdx == len(chars) branch, so the condition is tested before the index is incremented and the branch never takes effect. In addition, the if currRuneIdx == len(chars) branch comes after the if byteIdx == 0 branch, so a single-letter word, which matches byteIdx == 0 first, is never combined with endOfWordSuffix.
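
The ordering problem described above could be avoided along the following lines. This is only a minimal sketch and not the project's actual MergeWord: symbolsWithAffixes is a hypothetical helper, the prefix and suffix are plain strings (empty meaning "not set") rather than the nil-able fields in the real struct, and "</w>" is used purely as an example suffix. It shows the two points that matter: the last-rune test compares against len(chars)-1 using the loop index, and the first-rune and last-rune checks are independent, so a one-rune word still gets the suffix.

```go
package main

import "fmt"

// symbolsWithAffixes is a hypothetical helper for illustration only.
// It splits word into one symbol per rune, prepends prefix to every
// symbol except the first, and appends suffix to the last symbol.
// An empty prefix or suffix stands for "not configured".
func symbolsWithAffixes(word, prefix, suffix string) []string {
	chars := []rune(word)
	symbols := make([]string, 0, len(chars))
	for i, c := range chars {
		s := string(c)
		// Independent checks (no else-if): a single-rune word is both
		// the first and the last symbol and must still get the suffix.
		if i > 0 {
			s = prefix + s
		}
		// Compare against the last index. Comparing an index that has
		// not been incremented yet with len(chars) is never true here.
		if i == len(chars)-1 {
			s += suffix
		}
		symbols = append(symbols, s)
	}
	return symbols
}

func main() {
	fmt.Println(symbolsWithAffixes("hello", "", "</w>")) // [h e l l o</w>]
	fmt.Println(symbolsWithAffixes("!", "", "</w>"))     // [!</w>]
}
```

With this ordering, both multi-letter words and single-character words such as "!" end in the suffix before any merges are applied.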
