String encoding breaks in several edge cases #27

The root cause seems to be the string encoder assuming that any character at or above `\ud800` is the first half of a well-formed UTF-16 surrogate pair. That assumption fails in the following cases:

* Code points U+E000 to U+FFFF
* Unpaired surrogates

JavaScript strings are not necessarily well-formed UTF-16. The code needs to handle each character in the range `\ud800` to `\udbff` by checking whether it is followed by a character in the range `\udc00` to `\udfff` and, if not, encoding U+FFFD instead. Any character in the range `\udc00` to `\udfff` that does not follow a high surrogate should also be encoded as U+FFFD.
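The handling described above could look roughly like the following. This is a minimal sketch, not the library's code: `utf8Bytes` is a hypothetical helper name, and it assumes the encoder walks the string one UTF-16 code unit at a time and emits UTF-8 bytes.

```javascript
// Hypothetical helper (not part of the library): encode a JavaScript string
// as UTF-8, replacing unpaired surrogates with U+FFFD.
function utf8Bytes(str) {
  const bytes = [];
  for (let i = 0; i < str.length; i++) {
    let cp = str.charCodeAt(i);
    if (cp >= 0xd800 && cp <= 0xdbff) {
      // High surrogate: only valid when a low surrogate follows.
      const next = i + 1 < str.length ? str.charCodeAt(i + 1) : 0;
      if (next >= 0xdc00 && next <= 0xdfff) {
        cp = 0x10000 + ((cp - 0xd800) << 10) + (next - 0xdc00);
        i++; // consume the low surrogate
      } else {
        cp = 0xfffd; // unpaired high surrogate
      }
    } else if (cp >= 0xdc00 && cp <= 0xdfff) {
      cp = 0xfffd; // low surrogate with no preceding high surrogate
    }
    // Code units from 0xe000 to 0xffff fall through unchanged.
    if (cp < 0x80) {
      bytes.push(cp);
    } else if (cp < 0x800) {
      bytes.push(0xc0 | (cp >> 6), 0x80 | (cp & 0x3f));
    } else if (cp < 0x10000) {
      bytes.push(0xe0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3f), 0x80 | (cp & 0x3f));
    } else {
      bytes.push(
        0xf0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3f),
        0x80 | ((cp >> 6) & 0x3f),
        0x80 | (cp & 0x3f)
      );
    }
  }
  return bytes;
}
```

Where it is available, `new TextEncoder().encode(str)` applies the same U+FFFD replacement to unpaired surrogates, so it may be a simpler option than a hand-written loop.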

For example, `CBOR.encode("\uff08\u9999\u6e2f\uff09")` gives `6bf3928699e6b8aff3929080` rather than the expected `6cefbc88e9a699e6b8afefbc89`.
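The incorrect bytes are consistent with the suspected root cause. The sketch below is an illustrative reconstruction, not the library's actual code: `naiveUtf8Bytes` and `toHex` are hypothetical names, and the pairing arithmetic is an assumption chosen to mirror a typical surrogate-pair decoder. It treats every code unit at or above `0xd800` as a high surrogate and consumes the next code unit as its low half.

```javascript
// Illustrative reconstruction only; not the library's actual code.
function naiveUtf8Bytes(str) {
  const bytes = [];
  for (let i = 0; i < str.length; i++) {
    let cp = str.charCodeAt(i);
    if (cp >= 0xd800) {
      // Faulty assumption: everything at or above 0xd800 starts a pair.
      // Past the end of the string charCodeAt returns NaN, and NaN & 0x3ff is 0.
      const low = str.charCodeAt(++i);
      cp = (((cp & 0x3ff) << 10) | (low & 0x3ff)) + 0x10000;
    }
    if (cp < 0x80) {
      bytes.push(cp);
    } else if (cp < 0x800) {
      bytes.push(0xc0 | (cp >> 6), 0x80 | (cp & 0x3f));
    } else if (cp < 0x10000) {
      bytes.push(0xe0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3f), 0x80 | (cp & 0x3f));
    } else {
      bytes.push(
        0xf0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3f),
        0x80 | ((cp >> 6) & 0x3f),
        0x80 | (cp & 0x3f)
      );
    }
  }
  return bytes;
}

const toHex = (bytes) => bytes.map((b) => b.toString(16).padStart(2, "0")).join("");

// CBOR text string header for short strings: major type 3 (0x60) plus the byte length.
const body = naiveUtf8Bytes("\uff08\u9999\u6e2f\uff09");
console.log(toHex([0x60 + body.length, ...body])); // 6bf3928699e6b8aff3929080
```

Here `\uff08` swallows `\u9999` and becomes the four bytes `f3928699`, and the trailing `\uff09` pairs with nothing and becomes `f3929080`, which gives the 11-byte payload and the `6b` header seen in the report.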
