@JosephDoUrden

Summary

Adds support for the x-user-defined encoding to TextDecoder, as required by the WHATWG Encoding Standard and requested in #6039.

Behavior

  • 0x00–0x7F: Decoded as the same code point (ASCII identity).
  • 0x80–0xFF: Decoded to Unicode Private Use Area U+F780–U+F7FF (i.e. 0xF700 + byte).

This gives a simple, reversible single-byte mapping, useful for legacy binary-over-string use cases that need a one-to-one byte↔code point mapping. latin1 is not suitable here because the Encoding Standard maps that label to windows-1252, which remaps bytes in the 0x80–0x9F range (for example, 0x80 decodes to U+20AC).
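
As a rough illustration of the mapping described above, here is a minimal C++ sketch. The helper names (decodeByte, encodeCodePoint, roundTripsAllBytes) are hypothetical and are not taken from this PR; only the byte ranges and the 0xF700 offset come from the description.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: decode one x-user-defined byte into a code point.
// 0x00-0x7F is the ASCII identity; 0x80-0xFF lands in U+F780-U+F7FF,
// which is the same as 0xF700 + byte.
constexpr char32_t decodeByte(uint8_t b) {
  return b < 0x80 ? static_cast<char32_t>(b)
                  : static_cast<char32_t>(0xF700 + b);
}

// Hypothetical inverse: recover the original byte from a decoded code point.
// Returns -1 for code points the decoder can never produce.
constexpr int encodeCodePoint(char32_t cp) {
  if (cp < 0x80) return static_cast<int>(cp);
  if (cp >= 0xF780 && cp <= 0xF7FF) return static_cast<int>(cp - 0xF700);
  return -1;
}

// Walk all 256 byte values and confirm each one survives a round trip.
constexpr bool roundTripsAllBytes() {
  for (int b = 0; b < 256; ++b) {
    if (encodeCodePoint(decodeByte(static_cast<uint8_t>(b))) != b) return false;
  }
  return true;
}

// Compile-time check that the mapping is reversible for every byte.
static_assert(roundTripsAllBytes(), "x-user-defined mapping should round-trip");
```

The static_assert is what makes the "reversible" claim concrete: every byte value decodes to a distinct code point and can be recovered from it.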

Implementation

  • New XUserDefinedDecoder in encoding.h / encoding.c++, with an ASCII-only fast path and a slow path for bytes ≥ 0x80.
  • Label "x-user-defined" is registered in the encoding label table and handled in the TextDecoder constructor (no ICU).
  • Tests: x-user-defined in allTheDecoders, plus dedicated tests in encoding-test.js for decoding, streaming, and fatal mode.

Tests

  • api/tests/encoding-test.js: xUserDefinedDecode, xUserDefinedFatal, and x-user-defined in allTheDecoders.

Fixes #6039

Implements the x-user-defined decoder per WHATWG Encoding Standard.

- Map bytes 0x00–0x7F to identical ASCII code points
- Map bytes 0x80–0xFF to Unicode PUA U+F780–U+F7FF
- Add dedicated XUserDefinedDecoder with ASCII fast path (no ICU)
- Register "x-user-defined" label
- Wire through TextDecoder constructor, getImpl(), and decodePtr()
- Add unit tests for decoding, streaming, and fatal mode

Fixes cloudflare#6039
@JosephDoUrden JosephDoUrden requested review from a team as code owners February 7, 2026 10:45
@github-actions

github-actions bot commented Feb 7, 2026

All contributors have signed the CLA ✍️ ✅
Posted by the CLA Assistant Lite bot.

@JosephDoUrden
Author

I have read the CLA Document and I hereby sign the CLA

github-actions bot added a commit that referenced this pull request Feb 7, 2026
@danlapid
Collaborator

danlapid commented Feb 7, 2026

Thanks for your contribution!
@jasnell @anonrig I'd appreciate your review, please.

@anonrig
Member

anonrig commented Feb 7, 2026

Linter and some tests seem to be failing. Can you look into it?

@jasnell
Collaborator

jasnell commented Feb 7, 2026

@JosephDoUrden ... to run linting: if you have `just` installed, you can run the linter with a simple `just f` command; otherwise you can use `python3 tools/cross/format.py` (which is what `just f` does).

@jasnell
Collaborator

jasnell commented Feb 7, 2026

@anonrig:

> Linter and some tests seem to be failing. Can you look into it?

I think only the lint failures are a real problem. The test failures appear to have been a CI glitch.

@JosephDoUrden ... the "run internal build" check is one we'll have to run ourselves, just FYI. Thank you for the contribution!

…flare#6039)

Replace manual byte loop with simdutf::validate_ascii() when detecting
high bytes in XUserDefinedDecoder::decode. Fix JSG_REQUIRE line break
in TextDecoder::constructor to satisfy clang-format.
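
To illustrate the kind of change this commit message describes, here is a hedged sketch of a manual high-byte scan next to a library-based check. Both function names are hypothetical, and the manual loop is a reconstruction for illustration, not the code that was removed. std::any_of stands in for simdutf so the sketch compiles without that dependency; the equivalent simdutf call is shown only in a comment.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Reconstructed for illustration: a byte-by-byte scan for any byte >= 0x80.
bool hasHighByteManual(const uint8_t* data, size_t len) {
  for (size_t i = 0; i < len; ++i) {
    if (data[i] & 0x80) return true;
  }
  return false;
}

// Library-based form. With simdutf linked in, the body would be roughly:
//   return !simdutf::validate_ascii(reinterpret_cast<const char*>(data), len);
// std::any_of is used here as a portable stand-in with the same result.
bool hasHighByte(const uint8_t* data, size_t len) {
  return std::any_of(data, data + len, [](uint8_t b) { return b >= 0x80; });
}
```

Both functions answer the same question (is any byte outside ASCII?), so swapping one for the other should not change decoder output, only how quickly the all-ASCII case is detected.
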
@JosephDoUrden
Author

Formatting has been checked with `just f` (clang-format, Prettier, ruff, buildifier). All checks passed.
@jasnell @anonrig

@JosephDoUrden JosephDoUrden requested a review from anonrig February 8, 2026 09:22
Development

Successfully merging this pull request may close these issues.

TextDecoder is missing x-user-defined encoding
