Skip to content

Comments

Extracts UTF-8 valid length position to Bitstring class#719

Draft
mward-sudo wants to merge 6 commits intobartblast:devfrom
mward-sudo:02-18-extracts_valid_utf-8_sequence_length_to_bitstring_class
Draft

Extracts UTF-8 valid length position to Bitstring class#719
mward-sudo wants to merge 6 commits intobartblast:devfrom
mward-sudo:02-18-extracts_valid_utf-8_sequence_length_to_bitstring_class

Conversation

@mward-sudo
Copy link
Contributor

@mward-sudo mward-sudo commented Feb 18, 2026

Closes #703

Dependencies

Please note that this PR includes commits from the PR(s) it is dependent upon. Once the dependent PR(s) are merged to the dev branch, then this PR will be rebased and will then only contain its own commits. This PR will remain in draft until that point.

Summary by CodeRabbit

  • New Features

    • Added UTF-8 decoding and validation utilities to improve handling of multi-byte and truncated sequences.
  • Refactor

    • Centralized UTF-8 validation logic across Unicode operations for consistency and reduced duplication; normalization flows updated for NFC/NFD/NFKC/NFKD.
  • Tests

    • Added extensive UTF-8 tests covering decoding, validation, truncation, edge cases, and boundary conditions.

@coderabbitai
Copy link

coderabbitai bot commented Feb 18, 2026

Note

Reviews paused

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Centralizes UTF-8 validation/decoding into six new static Bitstring methods and refactors assets/js/erlang/unicode.mjs to use them instead of duplicated inline scanners. Adds extensive tests covering the new UTF-8 helpers and edge cases.

Changes

Cohort / File(s) Summary
UTF-8 Utilities
assets/js/bitstring.mjs
Adds six static methods for UTF-8 handling: decodeUtf8CodePoint, getValidUtf8Length, isValidUtf8CodePoint, isValidUtf8ContinuationByte, isValidUtf8Sequence, and isTruncatedUtf8Sequence.
Unicode Refactoring
assets/js/erlang/unicode.mjs
Removes in-file UTF-8 prefix scanning and replaces uses with Bitstring.getValidUtf8Length and Bitstring.isTruncatedUtf8Sequence across multiple characters_* and normalization paths; preserves public APIs.
Tests
test/javascript/bitstring_test.mjs
Adds comprehensive tests for the new UTF-8 helpers (1–4 byte sequences, truncation, overlongs, surrogates, continuations) and adjusts/removes duplicate test blocks.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Suggested reviewers

  • bartblast
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately captures the main purpose of the PR: extracting UTF-8 validation logic into the Bitstring class, which aligns with the changeset.
Linked Issues check ✅ Passed The PR implements all core objectives from issue #703: centralizes UTF-8 validation, finds longest valid prefix, handles edge cases (overlongs, surrogates, invalid continuations), and eliminates duplication.
Out of Scope Changes check ✅ Passed All changes are directly scoped to issue #703: adding UTF-8 utilities to Bitstring, refactoring Unicode functions to use them, and adding comprehensive tests. No unrelated modifications detected.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (3)
assets/js/bitstring.mjs (1)

255-261: Object literals allocated on every call in hot-path helpers.

firstByteMasks in decodeUtf8CodePoint (Line 259) and minValueForLength in isValidUtf8CodePoint (Line 623) are re-created on each invocation. During getValidUtf8Length's scan of a large byte array, these allocations occur once per sequence. Consider hoisting them to module-level or static private fields to avoid repeated allocation.

♻️ Example: hoist to static private fields
 export default class Bitstring {
   static `#decoder` = ERTS.utf8Decoder;
   static `#encoder` = new TextEncoder("utf-8");
+  static `#utf8FirstByteMasks` = [0, 0, 0x1f, 0x0f, 0x07]; // index = length
+  static `#utf8MinCodePoints` = [0, 0, 0x80, 0x800, 0x10000]; // index = length

Then use $._utf8FirstByteMasks[length] and $._utf8MinCodePoints[length] respectively (adjusting for private-field access within the class).

Also applies to: 621-626

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@assets/js/bitstring.mjs` around lines 255 - 261, Hoist the per-call object
literals used for UTF-8 decoding into shared static/module-level constants to
avoid allocating them on every invocation: replace the local firstByteMasks
inside decodeUtf8CodePoint and the minValueForLength inside isValidUtf8CodePoint
with a single shared constant (e.g., BitString._utf8FirstByteMasks and
BitString._utf8MinCodePoints or top-level utf8FirstByteMasks/utf8MinCodePoints)
and update references (decodeUtf8CodePoint, isValidUtf8CodePoint, and any
callers like getValidUtf8Length) to use those static/module constants instead of
recreating the objects each call.
assets/js/erlang/unicode.mjs (1)

280-360: Pre-existing: near-identical scaffolding across four normalization functions.

validateListRest, handleConversionError, and handleInvalidUtf8 are essentially copy-pasted across characters_to_nfc_binary/1, characters_to_nfd_binary/1, characters_to_nfkc_binary/1, and characters_to_nfkd_binary/1, differing only in the normalization form string ("NFC", "NFD", etc.). This isn't introduced by this PR, but a follow-up extracting a shared normalizeCharactersToBinary(data, form) helper would eliminate ~200 lines of duplication.

Also applies to: 512-594, 598-679, 683-763

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@assets/js/erlang/unicode.mjs` around lines 280 - 360, The four functions
characters_to_nfc_binary/1, characters_to_nfd_binary/1,
characters_to_nfkc_binary/1 and characters_to_nfkd_binary/1 duplicate
validateListRest, handleConversionError and handleInvalidUtf8 with only the
Unicode form string differing; extract a single helper
normalizeCharactersToBinary(data, form) that contains those shared helpers and
logic (use the existing validateListRest, handleConversionError and
handleInvalidUtf8 bodies inside it), accept the normalization form
("NFC"/"NFD"/"NFKC"/"NFKD") and return the same results as the current
implementations, then replace each of the four functions to simply call
normalizeCharactersToBinary(data, "<FORM>") (keeping the same behavior and
return shape). Ensure you reference and reuse Type, Bitstring, Erlang_Unicode
calls exactly as in the originals so behavior is preserved.
test/javascript/bitstring_test.mjs (1)

5393-5470: Avoid duplicated 4-byte truncation case; use a distinct length.
Two tests use the same byte sequence, so one can target a different truncation depth to add coverage.

♻️ Suggested tweak
-    it("returns true for 4-byte sequence with 2 valid continuation bytes (truncated)", () => {
-      // 0xF0 (4-byte) with 2 valid continuation bytes available
-      const bytes = new Uint8Array([0xf0, 0x90, 0x8d]);
+    it("returns true for 4-byte sequence with 1 valid continuation byte (truncated)", () => {
+      // 0xF0 (4-byte) with 1 valid continuation byte available
+      const bytes = new Uint8Array([0xf0, 0x90]);
       assert.equal(Bitstring.isTruncatedUtf8Sequence(bytes, 0), true);
     });
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/javascript/bitstring_test.mjs` around lines 5393 - 5470, The test suite
has a duplicated 4-byte-truncation input; update the "Edge case: multiple valid
continuation bytes before truncation" test that references
Bitstring.isTruncatedUtf8Sequence so it uses a distinct truncation depth (e.g.,
change its bytes from the duplicate [0xf0, 0x90, 0x8d] to a shorter truncated
4-byte case like [0xf0, 0x90] to represent only one continuation byte available)
and keep the assertion expecting true.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@assets/js/bitstring.mjs`:
- Around line 661-676: The isTruncatedUtf8Sequence function can read past the
end of the bytes array when start >= bytes.length; add an explicit guard at the
top of isTruncatedUtf8Sequence to return false if start is out of bounds (start
>= bytes.length or start < 0) before accessing bytes[start], so
getUtf8SequenceLength isn't passed undefined and the rest of the logic remains
correct; reference the isTruncatedUtf8Sequence function and
getUtf8SequenceLength and validate the bytes array/start bounds.

---

Nitpick comments:
In `@assets/js/bitstring.mjs`:
- Around line 255-261: Hoist the per-call object literals used for UTF-8
decoding into shared static/module-level constants to avoid allocating them on
every invocation: replace the local firstByteMasks inside decodeUtf8CodePoint
and the minValueForLength inside isValidUtf8CodePoint with a single shared
constant (e.g., BitString._utf8FirstByteMasks and BitString._utf8MinCodePoints
or top-level utf8FirstByteMasks/utf8MinCodePoints) and update references
(decodeUtf8CodePoint, isValidUtf8CodePoint, and any callers like
getValidUtf8Length) to use those static/module constants instead of recreating
the objects each call.

In `@assets/js/erlang/unicode.mjs`:
- Around line 280-360: The four functions characters_to_nfc_binary/1,
characters_to_nfd_binary/1, characters_to_nfkc_binary/1 and
characters_to_nfkd_binary/1 duplicate validateListRest, handleConversionError
and handleInvalidUtf8 with only the Unicode form string differing; extract a
single helper normalizeCharactersToBinary(data, form) that contains those shared
helpers and logic (use the existing validateListRest, handleConversionError and
handleInvalidUtf8 bodies inside it), accept the normalization form
("NFC"/"NFD"/"NFKC"/"NFKD") and return the same results as the current
implementations, then replace each of the four functions to simply call
normalizeCharactersToBinary(data, "<FORM>") (keeping the same behavior and
return shape). Ensure you reference and reuse Type, Bitstring, Erlang_Unicode
calls exactly as in the originals so behavior is preserved.

In `@test/javascript/bitstring_test.mjs`:
- Around line 5393-5470: The test suite has a duplicated 4-byte-truncation
input; update the "Edge case: multiple valid continuation bytes before
truncation" test that references Bitstring.isTruncatedUtf8Sequence so it uses a
distinct truncation depth (e.g., change its bytes from the duplicate [0xf0,
0x90, 0x8d] to a shorter truncated 4-byte case like [0xf0, 0x90] to represent
only one continuation byte available) and keep the assertion expecting true.

@mward-sudo mward-sudo force-pushed the 02-18-extracts_valid_utf-8_sequence_length_to_bitstring_class branch from a9c9f75 to f458bf4 Compare February 18, 2026 23:51
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (4)
assets/js/erlang/unicode.mjs (1)

320-328: Significant structural duplication across NFC/NFD/NFKC/NFKD handleInvalidUtf8 helpers.

All four handleInvalidUtf8 closures (and their sibling handleConversionError / validateListRest) are structurally identical, differing only in the normalization form string ("NFC", "NFD", "NFKC", "NFKD"). A single shared factory or helper parameterized by the normalization form would eliminate ~100 lines of duplication.

This duplication is largely pre-existing (not introduced by this PR), so deferring is fine — but worth tracking.

Also applies to: 553-562, 639-647, 724-732

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@assets/js/erlang/unicode.mjs` around lines 320 - 328, Multiple closures
(handleInvalidUtf8, handleConversionError, validateListRest) are duplicated for
each normalization form ("NFC","NFD","NFKC","NFKD"); extract a single factory
function (e.g., makeNormalizationHandlers or createNormalizer) that takes the
normalization form string and returns the three handlers, then replace the four
per-form closure groups with calls like const { handleInvalidUtf8,
handleConversionError, validateListRest } = makeNormalizationHandlers("NFC");
ensure the factory uses Bitstring.getValidUtf8Length, Bitstring.fromBytes,
Bitstring.toText and Type.bitstring/Type.tuple exactly as current
implementations do so behavior is unchanged.
assets/js/bitstring.mjs (2)

621-633: Same per-call allocation pattern; consider inlining the lookup.

Similar to decodeUtf8CodePoint, minValueForLength is recreated each call. A ternary chain or static array would avoid the allocation.

♻️ Inline the minimum-value lookup
  static isValidUtf8CodePoint(codePoint, encodingLength) {
-   // Check for overlong encodings (security issue)
-   const minValueForLength = {1: 0, 2: 0x80, 3: 0x800, 4: 0x10000};
-
-   // Reject code points that could have been encoded with fewer bytes (overlong)
-   if (codePoint < minValueForLength[encodingLength]) return false;
+   // Reject overlong encodings (security issue)
+   const minValue = encodingLength === 2 ? 0x80 : encodingLength === 3 ? 0x800 : encodingLength === 4 ? 0x10000 : 0;
+   if (codePoint < minValue) return false;
    // Reject UTF-16 surrogates (U+D800–U+DFFF)
    if (codePoint >= 0xd800 && codePoint <= 0xdfff) return false;
    // Reject code points beyond Unicode range (> U+10FFFF)
    if (codePoint > 0x10ffff) return false;

    return true;
  }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@assets/js/bitstring.mjs` around lines 621 - 633, The isValidUtf8CodePoint
function currently allocates a new object minValueForLength on every call;
replace that per-call allocation by inlining the lookup (e.g., a small static
array or ternary/switch) so you return the minimum value for encodingLength
without creating an object each invocation. Update isValidUtf8CodePoint to use
the precomputed array (or a 3-way ternary/switch) to obtain minValueForLength
for the given encodingLength and keep the existing validation checks intact; you
can mirror the approach used in decodeUtf8CodePoint to locate the replacement.

255-269: Consider avoiding per-call object allocation in a hot-path helper.

firstByteMasks is re-created on every invocation. Since getValidUtf8Length calls isValidUtf8SequencedecodeUtf8CodePoint for every character, this adds GC pressure on large inputs.

♻️ Replace object lookup with inline conditional
  static decodeUtf8CodePoint(bytes, start, length) {
    if (length === 1) return bytes[start];

-   // First byte masks: 2-byte=0x1f, 3-byte=0x0f, 4-byte=0x07
-   const firstByteMasks = {2: 0x1f, 3: 0x0f, 4: 0x07};
-
-   let codePoint = bytes[start] & firstByteMasks[length];
+   // First byte mask: 2-byte=0x1f, 3-byte=0x0f, 4-byte=0x07
+   const mask = length === 2 ? 0x1f : length === 3 ? 0x0f : 0x07;
+   let codePoint = bytes[start] & mask;

    // Process continuation bytes (all use 0x3f mask, shift by 6 each)
    for (let i = 1; i < length; i++) {
      codePoint = (codePoint << 6) | (bytes[start + i] & 0x3f);
    }

    return codePoint;
  }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@assets/js/bitstring.mjs` around lines 255 - 269, The helper
decodeUtf8CodePoint currently allocates firstByteMasks on every call; replace
that per-call object with a static constant or inline conditional to avoid
allocation in the hot path (e.g., use a small switch or a module-level
array/const to map length→mask) and update decodeUtf8CodePoint to use that
constant/mapping; also audit callers getValidUtf8Length and isValidUtf8Sequence
to ensure they continue to pass the same start/length semantics.
test/javascript/bitstring_test.mjs (1)

5393-5470: Optional: de-duplicate one truncated-sequence test + align comments.

Two tests use the same input for a truncated 4‑byte sequence, and the first test’s comment mentions a continuation byte that isn’t present. Consider removing the duplicate and clarifying the comment to match the data.

♻️ Suggested cleanup
-    // Happy path: truncated 2-byte sequence
-    it("returns true for truncated 2-byte sequence with valid continuation byte", () => {
-      // 0xC2 requires 2 bytes, but only 1 byte available (0x80 is valid continuation)
+    // Happy path: truncated 2-byte sequence (leader only)
+    it("returns true for truncated 2-byte sequence with leader only", () => {
+      // 0xC2 requires 2 bytes, but only the leader is available
       const bytes = new Uint8Array([0xc2]);
       assert.equal(Bitstring.isTruncatedUtf8Sequence(bytes, 0), true);
     });
@@
-    // Edge case: multiple valid continuation bytes before truncation
-    it("returns true for 4-byte sequence with 2 valid continuation bytes (truncated)", () => {
-      // 0xF0 (4-byte) with 2 valid continuation bytes available
-      const bytes = new Uint8Array([0xf0, 0x90, 0x8d]);
-      assert.equal(Bitstring.isTruncatedUtf8Sequence(bytes, 0), true);
-    });
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/javascript/bitstring_test.mjs` around lines 5393 - 5470, Two tests call
Bitstring.isTruncatedUtf8Sequence with the identical byte array for a truncated
4‑byte sequence; remove the duplicate test (the one titled "returns true for
4-byte sequence with 2 valid continuation bytes (truncated)") and keep a single
clearly-named test (e.g., "returns true for truncated 4-byte sequence with valid
continuation bytes"), and update that test's comment to accurately state how
many continuation bytes are present (leader 0xF0 plus two continuation bytes
0x90, 0x8D, i.e., 3 total bytes available) so the description matches the input.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@assets/js/bitstring.mjs`:
- Around line 663-678: The function isTruncatedUtf8Sequence currently reads
bytes[start] without checking bounds; add an early guard in
isTruncatedUtf8Sequence that returns false when start is out of range (start >=
bytes.length or start < 0) before calling $.getUtf8SequenceLength or accessing
bytes[start], preserving the existing logic for expectedLength and
continuation-byte checks.

---

Nitpick comments:
In `@assets/js/bitstring.mjs`:
- Around line 621-633: The isValidUtf8CodePoint function currently allocates a
new object minValueForLength on every call; replace that per-call allocation by
inlining the lookup (e.g., a small static array or ternary/switch) so you return
the minimum value for encodingLength without creating an object each invocation.
Update isValidUtf8CodePoint to use the precomputed array (or a 3-way
ternary/switch) to obtain minValueForLength for the given encodingLength and
keep the existing validation checks intact; you can mirror the approach used in
decodeUtf8CodePoint to locate the replacement.
- Around line 255-269: The helper decodeUtf8CodePoint currently allocates
firstByteMasks on every call; replace that per-call object with a static
constant or inline conditional to avoid allocation in the hot path (e.g., use a
small switch or a module-level array/const to map length→mask) and update
decodeUtf8CodePoint to use that constant/mapping; also audit callers
getValidUtf8Length and isValidUtf8Sequence to ensure they continue to pass the
same start/length semantics.

In `@assets/js/erlang/unicode.mjs`:
- Around line 320-328: Multiple closures (handleInvalidUtf8,
handleConversionError, validateListRest) are duplicated for each normalization
form ("NFC","NFD","NFKC","NFKD"); extract a single factory function (e.g.,
makeNormalizationHandlers or createNormalizer) that takes the normalization form
string and returns the three handlers, then replace the four per-form closure
groups with calls like const { handleInvalidUtf8, handleConversionError,
validateListRest } = makeNormalizationHandlers("NFC"); ensure the factory uses
Bitstring.getValidUtf8Length, Bitstring.fromBytes, Bitstring.toText and
Type.bitstring/Type.tuple exactly as current implementations do so behavior is
unchanged.

In `@test/javascript/bitstring_test.mjs`:
- Around line 5393-5470: Two tests call Bitstring.isTruncatedUtf8Sequence with
the identical byte array for a truncated 4‑byte sequence; remove the duplicate
test (the one titled "returns true for 4-byte sequence with 2 valid continuation
bytes (truncated)") and keep a single clearly-named test (e.g., "returns true
for truncated 4-byte sequence with valid continuation bytes"), and update that
test's comment to accurately state how many continuation bytes are present
(leader 0xF0 plus two continuation bytes 0x90, 0x8D, i.e., 3 total bytes
available) so the description matches the input.

@mward-sudo
Copy link
Contributor Author

@coderabbitai Don't do any further reviews on this PR while it is in draft.

@coderabbitai
Copy link

coderabbitai bot commented Feb 19, 2026

@mward-sudo: Understood! I'll pause automatic reviews on this PR while it remains in draft status. Feel free to tag me again when you're ready for a review.

✅ Actions performed

Reviews paused.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Extract UTF-8 valid length position to Bitstring class

1 participant