
Additional Notes


What about Unicode characters?

Why only printable ASCII? There are many considerations...

  • UTF-8 is by far the dominant text encoding, and in the format's use case it is not something that can be changed. UTF-8 is backwards-compatible with ASCII (an encoding rarely used on its own anymore, but still referenced because its 128 characters are encoded identically in UTF-8). The printable ASCII characters, which fall within this overlap, take 1 byte each in UTF-8, while all other characters take several. If other characters cost more space anyway, it is simpler to stick to multiple 1-byte characters. Multi-byte sequences also carry framing overhead: a 2-byte sequence, for example, only holds 11 payload bits, so 5 bits are lost to the encoding itself (see the short demonstration after this list).
  • Scratch is typically not case-sensitive and treats many characters as identical. Costume names are case-sensitive, so the encoder/decoder uses one costume per character, which becomes increasingly unwieldy as the character set grows. All the characters could instead be condensed into a single costume name, but decoding then requires a much slower enumeration over that name's characters.
  • There is an increased risk that these other characters will not be supported, limiting the number of places this format can be used.
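As a rough illustration of the byte costs mentioned above (a Python snippet purely for demonstration; the format itself has no Python dependency):

```python
# Printable ASCII characters cost 1 byte each in UTF-8.
# Other characters cost 2-4 bytes, and some of those bits are spent on framing.
print(len("A".encode("utf-8")))   # 1 byte
print(len("é".encode("utf-8")))   # 2 bytes, only 11 payload bits (5 bits lost to framing)
print(len("字".encode("utf-8")))  # 3 bytes, only 16 payload bits (8 bits lost to framing)
```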

Special Characters

Due to the format's use of characters such as the backslash and double quote, it is affected by the special meanings of characters that may require escaping to remain representable as text. This can increase the file size slightly. A more limited character set was considered, but the problem becomes deciding which characters to exclude: there are too many potential conflicts across the places the format could be used. For example, the format may conflict with text fields that expect Markdown, or with fields that reformat emojis, quotes, and hyphens. It was therefore decided to go for the general solution and support all the printable ASCII characters.
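As a small illustration of the escaping cost (using Python's json module purely as an example host; the exact overhead depends on where the text is embedded):

```python
import json

data = r'run \ "red" \ ...'      # stand-in for format output containing backslashes and quotes
escaped = json.dumps(data)       # each backslash and double quote gains an escape character
print(len(data), len(escaped))   # the escaped form is longer than the raw text
```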

Additionally, note that some sequences of characters may trigger language filters, which could corrupt the data if those characters are censored. (No sequence of characters is inherently bad; the format's output is effectively pseudorandom and carries no negative context... but a bad-word filter isn't going to know that.)

Lossy Compression

This is not part of the spec. A form of lossy compression can be implemented by using operations that are within a tolerance of the target colour, rather than exactly equal to it. This allows smaller operations to be used more frequently, reducing the file size. The result looks similar to colour quantisation, and this basic implementation will ruin gradients. Better methods could involve storing some form of accumulated error and adjusting the threshold with it.
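A minimal sketch of what such an encoder-side check could look like, assuming an RGB colour model; none of this is defined by the spec, and the names and the error heuristic are illustrative only:

```python
def within_tolerance(candidate, target, tolerance):
    """True if every RGB channel of candidate is within `tolerance` of target."""
    return all(abs(c - t) <= tolerance for c, t in zip(candidate, target))

def choose_colours(pixels, palette, base_tolerance=8):
    """Approximate each pixel with the first close-enough palette colour.

    Accumulated per-channel error tightens the tolerance once earlier
    approximations have drifted, which is one way to avoid ruining gradients.
    """
    error = [0, 0, 0]
    chosen_colours = []
    for target in pixels:
        # Shrink the allowed tolerance as the accumulated error grows.
        tolerance = max(0, base_tolerance - max(abs(e) for e in error) // 2)
        chosen = next(
            (p for p in palette if within_tolerance(p, target, tolerance)),
            target,  # fall back to an exact (lossless) colour
        )
        error = [e + (c - t) for e, c, t in zip(error, chosen, target)]
        chosen_colours.append(chosen)
    return chosen_colours
```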

A data stream dedicated to lossy compression has been considered and may be created in the future.
