Add "Grapheme" block to the encoder/decoder hierarchy #50

@AmitMY

Currently, we pool bytes into "words", which means that a 4-letter word in Hebrew pools 8 bytes, while a 4-letter word in English pools 4 bytes.
4 of the bytes in Hebrew are also "useless": they are all the same leading byte.
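
To make the byte counts concrete, here is a small sketch. The two example words are my own choice for illustration, not something taken from the codebase:

```python
# Two 4-letter words; the Hebrew one is written with escapes to avoid
# right-to-left rendering issues in editors.
hebrew = "\u05e9\u05dc\u05d5\u05dd"  # shin, lamed, vav, final mem
english = "word"

heb_bytes = hebrew.encode("utf-8")
eng_bytes = english.encode("utf-8")

print(len(heb_bytes), len(eng_bytes))  # 8 4

# Every second byte of the Hebrew word is the leading byte of a letter.
leading = [hex(b) for b in heb_bytes[::2]]
print(leading)  # ['0xd7', '0xd7', '0xd7', '0xd7']
```

For the basic Hebrew letters (U+05D0 to U+05EA) the leading byte is always `0xD7`, so half of the pooled bytes carry no information about which letter it is.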

We could work on the "character" (grapheme) level instead:
We first pool bytes into graphemes (an English letter is 1 byte, a Hebrew letter is 2 bytes, a Chinese character is 3 or 4 bytes),
then we pool the graphemes into words.
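
A minimal sketch of the first grouping step, assuming we split on UTF-8 continuation bytes. The function name `group_utf8_bytes` is hypothetical and not part of the current code. Note that this groups bytes per code point; a true grapheme cluster (for example a Hebrew letter plus its vowel marks) can span several code points and would need a Unicode segmentation step on top.

```python
from typing import List


def group_utf8_bytes(data: bytes) -> List[List[int]]:
    """Group a UTF-8 byte string into one list of bytes per code point.

    Hypothetical helper: a byte of the form 0b10xxxxxx is a continuation
    byte and is attached to the group opened by the previous leading byte.
    """
    groups: List[List[int]] = []
    for b in data:
        if (b & 0xC0) == 0x80 and groups:
            groups[-1].append(b)  # continuation byte: extend current group
        else:
            groups.append([b])  # ASCII or leading byte: start a new group
    return groups


# 'a' (1 byte), Hebrew shin (2), a CJK ideograph (3), an emoji (4)
sample = "a\u05e9\u4e2d\U0001F600".encode("utf-8")
groups = group_utf8_bytes(sample)
print([len(g) for g in groups])  # [1, 2, 3, 4]
```

Each inner list would then be pooled into a single grapheme vector, and the grapheme vectors pooled into a word vector, so the word-level pooling sees 4 items for a 4-letter word in either language.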

This could perhaps create a more stable representation for characters, without requiring more compute for languages whose characters are encoded with more bytes.

At decoding time, either an additional decoder level is added to the hierarchy, or the decoder always predicts "4 bytes" that together form a valid UTF-8 character. The second option might be more complex because the decoder could predict an invalid character (e.g. 4 leading bytes), but I assume that with training it will stabilize quickly.
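
To illustrate the "always predict 4 bytes" option, here is a rough sketch of how a predicted slot could be checked for validity. The padding id `PAD = 256` and the function `decode_slot` are assumptions of mine for illustration; the issue does not specify how unused byte positions would be represented.

```python
from typing import List, Optional

PAD = 256  # hypothetical padding id, chosen outside the byte range 0..255


def decode_slot(slot: List[int]) -> Optional[str]:
    """Return the single character encoded by a 4-position slot, or None.

    Assumes unused trailing positions are filled with PAD.
    """
    if len(slot) != 4:
        raise ValueError("expected exactly 4 positions")
    body = list(slot)
    while body and body[-1] == PAD:
        body.pop()  # drop trailing padding
    if not body or PAD in body:
        return None  # empty slot, or padding in the middle
    try:
        text = bytes(body).decode("utf-8")  # strict decoding
    except UnicodeDecodeError:
        return None  # e.g. four leading bytes in a row
    return text if len(text) == 1 else None  # exactly one character


print(decode_slot([0x61, PAD, PAD, PAD]))      # a
print(decode_slot([0xD7, 0xD7, 0xD7, 0xD7]))   # None
```

A check like this could be used at inference time to reject or re-sample invalid slots, or during evaluation to measure how quickly invalid predictions disappear with training.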
