Add "Grapheme" block to the encoder/decoder hierarchy #50

Currently, we pool bytes into "words". This means a 4-letter word in Hebrew pools 8 bytes, while a 4-letter word in English pools 4 bytes.
4 of the Hebrew bytes are also "useless": each letter repeats the same UTF-8 leading byte (0xD7).
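
A quick way to see the mismatch, as a minimal sketch. The two example words are my own choice, not taken from the model's data:

```python
# Hebrew "shalom" (4 letters), written with escapes to avoid RTL rendering issues.
hebrew = "\u05e9\u05dc\u05d5\u05dd"
english = "word"

hebrew_bytes = hebrew.encode("utf-8")
english_bytes = english.encode("utf-8")

# Same number of letters, different number of bytes to pool.
print(len(hebrew), len(hebrew_bytes))    # 4 8
print(len(english), len(english_bytes))  # 4 4

# Every other Hebrew byte is the same leading byte.
lead_bytes = hebrew_bytes[0::2]
print(lead_bytes.hex(" "))               # d7 d7 d7 d7
```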

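We could work on the "character" (grapheme) level instead:
We first pool bytes into graphemes - English is 1 byte, Hebrew is 2 bytes, Chinese is usually 3 and occasionally 4 -
then we pool the graphemes into words.
(These byte counts assume one code point per grapheme; a grapheme cluster in the Unicode sense, e.g. a Hebrew letter with niqqud, can span several code points.)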
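
The two-level grouping could look roughly like this. It is only a sketch: `utf8_char_length` and `group_bytes` are hypothetical names, splitting words on whitespace is an assumption, and each "grapheme" here is a single UTF-8 code point rather than a full grapheme cluster.

```python
def utf8_char_length(lead_byte: int) -> int:
    """Length in bytes of the UTF-8 sequence that starts with lead_byte."""
    if lead_byte < 0x80:
        return 1  # 0xxxxxxx: ASCII
    if lead_byte >> 5 == 0b110:
        return 2  # 110xxxxx
    if lead_byte >> 4 == 0b1110:
        return 3  # 1110xxxx
    if lead_byte >> 3 == 0b11110:
        return 4  # 11110xxx
    raise ValueError(f"not a UTF-8 leading byte: {lead_byte:#04x}")


def group_bytes(text: str) -> list[list[bytes]]:
    """Group UTF-8 bytes into words, each word a list of per-character byte groups."""
    words = []
    for word in text.split():  # assumption: words are whitespace-separated
        data = word.encode("utf-8")
        chars, i = [], 0
        while i < len(data):
            n = utf8_char_length(data[i])
            chars.append(data[i:i + n])
            i += n
        words.append(chars)
    return words


# Both words have 2 groups at the character level, regardless of byte count.
groups = group_bytes("hi \u05e9\u05dc")
print([len(word) for word in groups])  # [2, 2]
```

With this grouping, the word-level pooling sees the same number of inputs for a 2-letter English word and a 2-letter Hebrew word.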

This could perhaps create a more stable representation for characters, without requiring more compute for languages whose text is encoded with more bytes.

At decoding time, either an additional decoder hierarchy is implemented, or the decoder always predicts "4 bytes" that form a valid UTF-8 character. The second option might be more complex because it *could* predict an invalid character (e.g. 4 leading bytes), but I assume that with training it will stabilize quickly.
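
For the fixed "4 bytes" option, a validity check on the predicted slot might look like the sketch below. The padding scheme is an assumption (the issue does not say how shorter characters fill the slot): I use 0xFF as a hypothetical pad byte because it never occurs in valid UTF-8. `encode_slot` and `decode_slot` are hypothetical helper names.

```python
from typing import Optional

PAD = 0xFF  # hypothetical padding byte; 0xFF never appears in valid UTF-8
SLOT_SIZE = 4


def encode_slot(char: str) -> bytes:
    """Encode one character into a fixed 4-byte slot, right-padded with PAD."""
    if len(char) != 1:
        raise ValueError("expected exactly one character")
    data = char.encode("utf-8")
    return data + bytes([PAD]) * (SLOT_SIZE - len(data))


def decode_slot(slot: bytes) -> Optional[str]:
    """Return the character stored in a 4-byte slot, or None if the slot is invalid."""
    if len(slot) != SLOT_SIZE:
        raise ValueError("expected exactly 4 bytes")
    payload = slot.rstrip(bytes([PAD]))
    try:
        text = payload.decode("utf-8")  # strict decoding rejects malformed bytes
    except UnicodeDecodeError:
        return None
    # Reject empty slots and slots that hold more than one character.
    return text if len(text) == 1 else None


print(encode_slot("a").hex(" "))            # 61 ff ff ff
print(decode_slot(b"\xd7\xa9\xff\xff"))     # the Hebrew letter shin
print(decode_slot(b"\xd7\xd7\xd7\xd7"))     # None: 4 leading bytes is invalid
```

A check like this could be used to mask or resample invalid predictions at inference time, although whether that is needed after training is an open question.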
