Description
Currently, we pool bytes into "words". This means a 4-letter word in Hebrew pools 8 bytes, while a 4-letter word in English pools 4 bytes.
4 of the 8 bytes in the Hebrew word are also mostly "useless": every Hebrew letter starts with the same UTF-8 leading byte (0xD7).
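We could work at the "character" level instead (strictly, one UTF-8 encoded code point; a full grapheme can span several code points):
We first pool bytes into characters - in UTF-8, English is 1 byte, Hebrew is 2 bytes, and Chinese is 3 bytes for common characters and 4 for rarer ones -
then we pool the characters into words.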
This could perhaps create a more stable representation for characters, without requiring more compute for languages whose characters take more bytes to encode.
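
To make the two-level grouping concrete, here is a minimal sketch of how bytes could be split into per-character chunks before word-level pooling. The function names `utf8_char_spans` and `word_groups` are hypothetical, and splitting words on whitespace is an assumption for illustration only, not necessarily how this project segments words. Character boundaries are found from the UTF-8 bit pattern alone: continuation bytes have the form `0b10xxxxxx`, and any other byte starts a new character.

```python
from typing import List, Tuple


def utf8_char_spans(data: bytes) -> List[Tuple[int, int]]:
    """Return (start, end) byte spans, one per UTF-8 encoded character.

    Assumes `data` is well-formed UTF-8.
    """
    spans = []
    start = 0
    for i in range(1, len(data)):
        # A byte that is NOT a continuation byte (0b10xxxxxx) starts a new character.
        if data[i] & 0xC0 != 0x80:
            spans.append((start, i))
            start = i
    if data:
        spans.append((start, len(data)))
    return spans


def word_groups(text: str) -> List[List[bytes]]:
    """Group per-character byte chunks by word (whitespace split is an assumption)."""
    groups = []
    for word in text.split():
        raw = word.encode("utf-8")
        groups.append([raw[s:e] for s, e in utf8_char_spans(raw)])
    return groups


groups = word_groups("שלום abcd 中")
# Each inner list is what a first pooling stage would reduce to one vector per character;
# a second stage would then reduce each word's character vectors to one word vector.
for chars in groups:
    print(len(chars), [len(c) for c in chars])
```

With this grouping, the Hebrew and English words both contribute 4 character-level units to the word pool, even though their byte counts differ.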
At decoding time, either an additional level of decoder hierarchy is implemented, or the decoder always predicts a fixed "4 bytes" that should form one valid UTF-8 character (shorter characters would presumably need padding). The second option might be more complex, because the decoder could predict an invalid character (for example, 4 leading bytes in a row), but I assume that with training it will stabilize quickly.
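
As a rough illustration of the fixed "4 bytes" option, the sketch below packs one character into a 4-byte slot and checks whether a predicted slot decodes to a valid character. The padding value `PAD` and the helpers `expected_length`, `to_slot`, and `from_slot` are all hypothetical; the issue does not specify how unused positions would be filled. Validity is delegated to Python's strict UTF-8 decoder, which also rejects overlong encodings and surrogates.

```python
from typing import Optional

PAD = 0x00  # hypothetical filler for unused positions in the 4-byte slot


def expected_length(lead: int) -> Optional[int]:
    """Number of bytes implied by a UTF-8 leading byte, or None if it cannot lead."""
    if lead < 0x80:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return None  # continuation byte, or a byte that never appears in UTF-8


def to_slot(char: str) -> bytes:
    """Encode a single character and pad it to exactly 4 bytes."""
    if len(char) != 1:
        raise ValueError("expected exactly one character")
    raw = char.encode("utf-8")
    return raw + bytes([PAD]) * (4 - len(raw))


def from_slot(slot: bytes) -> Optional[str]:
    """Decode a predicted 4-byte slot; return None if it is not a valid character."""
    if len(slot) != 4:
        return None
    n = expected_length(slot[0])
    if n is None:
        return None
    try:
        # Only the first n bytes are meaningful; the rest is treated as padding.
        return slot[:n].decode("utf-8")
    except UnicodeDecodeError:
        return None


print(from_slot(to_slot("ש")))                      # round-trips a 2-byte character
print(from_slot(bytes([0xD7, 0xD7, 0xD7, 0xD7])))   # four leading bytes: invalid
```

A check like `from_slot` could be used to measure how often an untrained decoder emits invalid slots, or to mask invalid continuations at inference time; whether either is needed in practice is an open question here.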