Add support for alphabet in {"type": "string"} #72

Open
Liam-DeVoe wants to merge 10 commits into main from str-alphabet

Liam-DeVoe commented Mar 28, 2026

Closes #44.

Here's a problem I ran into: CBOR text strings must be valid UTF-8, and UTF-8 disallows surrogate code points (U+D800 to U+DFFF). We want to be able to generate surrogate code points, because some languages use UTF-16 or raw bytes as their string representation and can therefore hold unpaired surrogates. So the protocol must be able to transport surrogate code points.
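For illustration only (this is not code from the PR), the mismatch can be reproduced in plain Python, whose `str` type can hold a lone surrogate. The helper name `try_utf8` is hypothetical.

```python
def try_utf8(s: str):
    """Return the strict UTF-8 encoding of s, or None if s is not encodable."""
    try:
        return s.encode("utf-8")
    except UnicodeEncodeError:
        # Strict UTF-8 has no encoding for code points U+D800..U+DFFF.
        return None


lone_surrogate = "\ud800"  # a lone high surrogate, legal inside a Python str

# Strict UTF-8 refuses it, so a CBOR text string cannot carry it.
print(try_utf8(lone_surrogate))

# A UTF-16 based representation can hold the same value as a single code unit.
print(lone_surrogate.encode("utf-16-le", "surrogatepass"))
```

Any string that a UTF-16 based client can produce but that fails the first call is a value the protocol could not otherwise transport.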

I chose to change the representation of all strings in the protocol to a new tag 6, whose content is WTF-8. WTF-8 is byte-for-byte identical to UTF-8 for well-formed text, but relaxes the UTF-8 requirement that surrogate code points not appear, so unpaired surrogates can be encoded. Tags 6-15 are currently unassigned in the IANA CBOR tags registry, so tag 6 does not collide with any registered tag.
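A minimal sketch of a WTF-8 codec in Python, assuming the built-in `surrogatepass` error handler is an acceptable way to emit and accept three-byte surrogate sequences. The function names are hypothetical and this is not the implementation in this PR. WTF-8 requires a high surrogate immediately followed by a low surrogate to be encoded as one supplementary code point, so the encoder merges such pairs first.

```python
def combine_surrogate_pairs(s: str) -> str:
    """Merge each high+low surrogate pair into one supplementary code point."""
    out = []
    i = 0
    while i < len(s):
        hi = ord(s[i])
        if 0xD800 <= hi <= 0xDBFF and i + 1 < len(s):
            lo = ord(s[i + 1])
            if 0xDC00 <= lo <= 0xDFFF:
                out.append(chr(0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)))
                i += 2
                continue
        out.append(s[i])  # ordinary character or unpaired surrogate
        i += 1
    return "".join(out)


def wtf8_encode(s: str) -> bytes:
    # After merging pairs, only unpaired surrogates remain; "surrogatepass"
    # writes each of them as a three-byte sequence.
    return combine_surrogate_pairs(s).encode("utf-8", "surrogatepass")


def wtf8_decode(data: bytes) -> str:
    # Lenient: this also accepts a pair encoded as two three-byte sequences,
    # which a strict WTF-8 decoder would reject.
    return data.decode("utf-8", "surrogatepass")
```

For well-formed text the output is identical to `str.encode("utf-8")`, which is the property that makes a single unified representation cheap for clients that never see surrogates.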

Every client library will need to understand this and implement a decoder for tag 6. I considered encoding as tag 6 only when a surrogate is present, but I think unifying the representation and forcing libraries to contend with this early is the right choice.
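To give a sense of what a client-side decoder might involve, here is a hand-rolled sketch with no CBOR library. It assumes the tag 6 content is a CBOR byte string (major type 2) holding WTF-8; the description above does not spell out the wire layout, so treat that as an assumption. All names are hypothetical, indefinite-length items are not handled, and the encoder skips the surrogate-pair merging that strict WTF-8 requires.

```python
import struct

TAG_WTF8 = 6  # tag number from the PR description


def encode_head(major: int, n: int) -> bytes:
    """Encode a CBOR initial byte plus its argument n."""
    if n < 24:
        return bytes([(major << 5) | n])
    if n < 0x100:
        return bytes([(major << 5) | 24, n])
    if n < 0x10000:
        return bytes([(major << 5) | 25]) + struct.pack(">H", n)
    if n < 0x100000000:
        return bytes([(major << 5) | 26]) + struct.pack(">I", n)
    return bytes([(major << 5) | 27]) + struct.pack(">Q", n)


def decode_head(data: bytes, pos: int):
    """Return (major type, argument, position after the head)."""
    initial = data[pos]
    major, info = initial >> 5, initial & 0x1F
    pos += 1
    if info < 24:
        return major, info, pos
    if info > 27:
        raise ValueError("indefinite or reserved length not supported here")
    size = 1 << (info - 24)  # 1, 2, 4 or 8 argument bytes
    return major, int.from_bytes(data[pos:pos + size], "big"), pos + size


def encode_string(s: str) -> bytes:
    payload = s.encode("utf-8", "surrogatepass")
    return encode_head(6, TAG_WTF8) + encode_head(2, len(payload)) + payload


def decode_string(data: bytes) -> str:
    major, tag, pos = decode_head(data, 0)
    if major != 6 or tag != TAG_WTF8:
        raise ValueError("expected CBOR tag 6")
    major, length, pos = decode_head(data, pos)
    if major != 2:
        raise ValueError("expected a byte string inside tag 6")
    payload = data[pos:pos + length]
    if len(payload) != length:
        raise ValueError("truncated payload")
    return payload.decode("utf-8", "surrogatepass")
```

Because every string takes the same path, a client has one decoder to write and test, rather than a second code path that is only exercised when a surrogate happens to be generated.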

Development

Successfully merging this pull request may close these issues.

Add support for configuring alphabet in st.text to the protocol