endOfWordSuffix does not take effect in BPE model

Even when endOfWordSuffix is set (not nil), the BPE tokenizer does not append it to the end of each word.

For example:
In Python, "Hello World!" is split into the tokens `['<|startoftext|>', 'hello</w>', 'world</w>', '!</w>', '<|endoftext|>']`.
In this project, the result is `['<|startoftext|>', 'hello', 'world', '!', '<|endoftext|>']`.

The bug is in the BPE method MergeWord, and it has two causes:
- The line `currRuneIdx++` appears after the `if currRuneIdx == len(chars)` branch, so when the check runs `currRuneIdx` is still at most `len(chars) - 1` and the branch never takes effect.
- The `if currRuneIdx == len(chars)` branch comes after the `if byteIdx == 0` branch, so single-letter words are never combined with endOfWordSuffix.
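A minimal sketch of the ordering problem and one possible fix. This is not the project's actual MergeWord code: `splitBuggy` and `splitFixed` are hypothetical helpers that only reproduce the branch order described above, and they split a word into per-character symbols without running any merges.

```go
package main

import "fmt"

// splitBuggy is a hypothetical reduction of the control flow described in
// this issue. The end-of-word check runs before currRuneIdx is incremented
// and sits behind the byteIdx == 0 branch.
func splitBuggy(word, suffix string) []string {
	chars := []rune(word)
	out := make([]string, 0, len(chars))
	currRuneIdx := 0
	for byteIdx, r := range word {
		var s string
		if byteIdx == 0 {
			// First character: a one-letter word stops here and
			// never reaches the end-of-word check.
			s = string(r)
		} else if currRuneIdx == len(chars) {
			// Never true: currRuneIdx is at most len(chars)-1 here.
			s = string(r) + suffix
		} else {
			s = string(r)
		}
		currRuneIdx++ // incremented too late for the check above
		out = append(out, s)
	}
	return out
}

// splitFixed checks for the last rune independently of the first-character
// case, so the suffix is also applied to one-letter words.
func splitFixed(word, suffix string) []string {
	chars := []rune(word)
	out := make([]string, 0, len(chars))
	for i, r := range chars {
		s := string(r)
		if i == len(chars)-1 {
			s += suffix
		}
		out = append(out, s)
	}
	return out
}

func main() {
	fmt.Println(splitBuggy("hello", "</w>")) // [h e l l o]
	fmt.Println(splitFixed("hello", "</w>")) // [h e l l o</w>]
	fmt.Println(splitFixed("a", "</w>"))     // [a</w>]
}
```

With the suffix attached to the last symbol before merging, the merge rules could then produce tokens such as `hello</w>`, assuming the vocabulary contains them.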

endOfWordSuffix does not take effect in BPE model #65
