Description
We could potentially compress the udata better. I've been researching this a bit, and we could save a good number of bytes by changing the data layout and storing the values in base 36 (which is fast for JavaScript to decode with parseInt).
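To illustrate the base-36 idea, here is a minimal sketch of a round trip through `Number.prototype.toString(36)` and `parseInt(s, 36)`. The helper names and the comma-separated layout are assumptions made for the example; they are not the actual udata format.

```javascript
// Hypothetical helpers: the names and the comma separator are assumptions
// for illustration, not the real udata layout.

// Encode an array of non-negative integers as comma-separated base-36 text.
function encodeBase36(values) {
  return values.map((v) => v.toString(36)).join(",");
}

// Decode the text back into integers with parseInt and radix 36.
function decodeBase36(text) {
  if (text === "") return [];
  return text.split(",").map((s) => parseInt(s, 36));
}

const sample = [0, 35, 36, 0x10ffff];
const packed = encodeBase36(sample);
const unpacked = decodeBase36(packed);
```

Base 36 uses the digits 0-9 and a-z, so every value up to 35 fits in one character and every Unicode code point (up to 0x10FFFF) fits in four.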
I also think it's an issue that the code points are laid out in this binary format: yyyyyxxxxxxxxyyyyyyyy. This makes the x=0 section quite big, even though text often uses only Latin-1 characters and nothing outside the BMP. A better format would be xxxxxxxxxxxxxyyyyyyyy. This creates more data rows, but you have to decompress less data on average, assuming that normal text only draws on a few Unicode scripts. Alternatively, we could store BMP and non-BMP code points differently.
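A rough sketch of how the two layouts would split a 21-bit code point. It assumes the bit patterns are written most significant bit first, that `x` selects the data row, and that `y` is the position within that row; the function names are hypothetical and this is not the library's actual code.

```javascript
// Assumed reading of yyyyyxxxxxxxxyyyyyyyy (5 + 8 + 8 bits, MSB first):
// x is bits 8..15, y is the top 5 bits joined with the low 8 bits.
function splitCurrent(cp) {
  const x = (cp >> 8) & 0xff;
  const y = ((cp >> 16) << 8) | (cp & 0xff);
  return { x, y };
}

// Assumed reading of xxxxxxxxxxxxxyyyyyyyy (13 + 8 bits, MSB first):
// x is the top 13 bits, y is the low 8 bits.
function splitProposed(cp) {
  return { x: cp >> 8, y: cp & 0xff };
}

const latinA = 0x41;          // U+0041, in the BMP
const linearB = 0x10041;      // U+10041, outside the BMP
const currentA = splitCurrent(latinA);
const currentB = splitCurrent(linearB);
const proposedA = splitProposed(latinA);
const proposedB = splitProposed(linearB);
```

Under these assumptions, U+0041 and U+10041 both land in row x=0 of the current layout, so decoding that row for Latin-1 text also pulls in a 256-code-point slice of every other plane. In the proposed layout, row x=0 holds only U+0000 to U+00FF.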
I still need to go back through my research files and write up the findings in this issue.