Skip to content

Comments

Extracts UTF-8 code point validation to BitString class#716

Draft
mward-sudo wants to merge 3 commits intobartblast:devfrom
mward-sudo:02-18-extracts_utf-8_code_point_validation_to_bitstring_class
Draft

Extracts UTF-8 code point validation to BitString class#716
mward-sudo wants to merge 3 commits intobartblast:devfrom
mward-sudo:02-18-extracts_utf-8_code_point_validation_to_bitstring_class

Conversation

@mward-sudo
Copy link
Contributor

@mward-sudo mward-sudo commented Feb 18, 2026

Closes #712

Dependencies

Please note that this PR includes commits from the PR(s) it is dependent upon. Once the dependent PR(s) are merged to the dev branch, then this PR will be rebased and will then only contain its own commits. This PR will remain in draft until that point.

Summary by CodeRabbit

  • New Features

    • Enhanced UTF-8 text handling with improved validation and code point decoding capabilities.
    • Better support for Unicode character processing across different normalization formats.
  • Tests

    • Added comprehensive test coverage for UTF-8 utilities, including single-byte and multi-byte character sequences, validation checks, and edge cases.

@coderabbitai
Copy link

coderabbitai bot commented Feb 18, 2026

Note

Reviews paused

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

This PR extracts UTF-8 handling utilities into the Bitstring class and refactors unicode.mjs to use these centralized helpers instead of local implementations. New static methods for decoding code points and validating UTF-8 sequences are added to Bitstring, with corresponding test coverage.

Changes

Cohort / File(s) Summary
Bitstring UTF-8 utilities
assets/js/bitstring.mjs
Adds three new static methods: decodeUtf8CodePoint() for decoding UTF-8 sequences, isValidUtf8CodePoint() for validating code points against surrogates and overlong encodings, and isValidUtf8ContinuationByte() for checking continuation bytes.
Unicode module refactoring
assets/js/erlang/unicode.mjs
Replaces inline UTF-8 validation helpers with calls to Bitstring utility methods; updates normalization targets across NFC, NFD, NFKC, and NFKD conversion paths; reduces code duplication by centralizing UTF-8 logic.
Test coverage
test/javascript/bitstring_test.mjs
Adds comprehensive test suites for new UTF-8 utilities, covering 1–4 byte sequences, continuation byte validation, code point validation including surrogates and overlong encodings, and UTF-8 sequence length detection.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Suggested reviewers

  • bartblast
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Out of Scope Changes check ⚠️ Warning The PR contains duplicate method definitions in bitstring.mjs and unnecessary refactoring in unicode.mjs beyond extracting validation logic to BitString. Remove duplicate method definitions in bitstring.mjs and limit unicode.mjs changes to only replacing inline validation helpers with BitString equivalents.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately reflects the main objective of the PR: extracting UTF-8 code point validation logic to the BitString class.
Linked Issues check ✅ Passed The PR addresses issue #712 by extracting UTF-8 validation methods (isValidUtf8CodePoint, decodeUtf8CodePoint, isValidUtf8ContinuationByte) to the BitString class as required.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (2)
assets/js/erlang/unicode.mjs (2)

332-447: Consider extracting the four normalization-binary functions into a shared parameterized helper.

characters_to_nfc_binary/1, characters_to_nfd_binary/1, characters_to_nfkc_binary/1, and characters_to_nfkd_binary/1 are each ~115 lines of nearly identical code. The only variance is the normalization form string ("NFC", "NFD", "NFKC", "NFKD"). Each duplicates findValidUtf8Length, validateListRest, handleConversionError, handleInvalidUtf8, and the main logic.

Since this PR's theme is centralizing shared UTF-8 logic, this would be a natural follow-up: extract a single charactersToNormalizedBinary(data, form) helper and have each public function delegate to it. This would reduce ~460 duplicated lines to ~130.

♻️ Sketch of the refactored approach
+  // Shared helper for all normalization binary conversions
+  _charactersToNormalizedBinary: (data, normalizationForm) => {
+    const findValidUtf8Length = (bytes) => { /* shared implementation */ };
+    const validateListRest = (rest) => { /* shared implementation */ };
+    const handleConversionError = (tag, prefix, rest) => {
+      const textPrefix = Bitstring.toText(prefix);
+      const normalizedPrefix =
+        textPrefix === false
+          ? prefix
+          : Type.bitstring(textPrefix.normalize(normalizationForm));
+      // ... rest of shared logic
+    };
+    const handleInvalidUtf8 = (bytes) => {
+      // ... uses normalizationForm parameter
+    };
+    // ... shared main logic
+  },
+
   "characters_to_nfc_binary/1": (data) => {
-    // ~115 lines of inline logic
+    return Erlang_Unicode._charactersToNormalizedBinary(data, "NFC");
   },
   "characters_to_nfd_binary/1": (data) => {
-    // ~115 lines of inline logic
+    return Erlang_Unicode._charactersToNormalizedBinary(data, "NFD");
   },
   // ... same for NFKC, NFKD

Also applies to: 600-715, 720-833, 838-953

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@assets/js/erlang/unicode.mjs` around lines 332 - 447, The four nearly
identical functions characters_to_nfc_binary/1, characters_to_nfd_binary/1,
characters_to_nfkc_binary/1, and characters_to_nfkd_binary/1 duplicate the same
UTF‑8 validation and normalization logic; extract a single helper (e.g.,
charactersToNormalizedBinary(data, form)) that contains findValidUtf8Length,
validateListRest, handleConversionError, handleInvalidUtf8, and the shared main
logic, taking the normalization form ("NFC"/"NFD"/"NFKC"/"NFKD") as a parameter,
then have each public wrapper call that helper with the appropriate form string
so the duplicated blocks in characters_to_nfc_binary/1 and the other three
functions are removed and replaced by simple delegations.

854-856: Minor style inconsistency: braces around single-statement return false.

This is the only isValidUtf8ContinuationByte call site that wraps the return false in braces. All other identical call sites (lines 106, 129, 348, 616, 736) use the brace-less form.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@assets/js/erlang/unicode.mjs` around lines 854 - 856, The call site that
checks Bitstring.isValidUtf8ContinuationByte currently uses braces around the
single-statement "return false"; update that conditional to use the brace-less
single-line form (i.e., remove the surrounding { } so it matches the other call
sites that call Bitstring.isValidUtf8ContinuationByte) in the same function
where bytes[start + i] is validated.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@assets/js/bitstring.mjs`:
- Around line 250-269: Add defensive validation at the start of
decodeUtf8CodePoint: verify that bytes is present and has a .length, that length
is an integer between 1 and 4, and that start is a non-negative integer with
start + length <= bytes.length; if any check fails throw a RangeError (with a
clear message including the invalid values). Keep the existing decoding logic
unchanged after these checks so callers still get the same behavior when
arguments are valid.
- Around line 596-617: isValidUtf8CodePoint currently trusts encodingLength and
can silently skip the overlong check if encodingLength is invalid; add a
defensive guard at the start of isValidUtf8CodePoint to verify encodingLength is
an integer 1..4 (or otherwise a valid key for minValueForLength) and immediately
return false for any invalid value (this prevents using undefined
minValueForLength[encodingLength] and ensures callers like getUtf8SequenceLength
cannot cause silent failures).

---

Nitpick comments:
In `@assets/js/erlang/unicode.mjs`:
- Around line 332-447: The four nearly identical functions
characters_to_nfc_binary/1, characters_to_nfd_binary/1,
characters_to_nfkc_binary/1, and characters_to_nfkd_binary/1 duplicate the same
UTF‑8 validation and normalization logic; extract a single helper (e.g.,
charactersToNormalizedBinary(data, form)) that contains findValidUtf8Length,
validateListRest, handleConversionError, handleInvalidUtf8, and the shared main
logic, taking the normalization form ("NFC"/"NFD"/"NFKC"/"NFKD") as a parameter,
then have each public wrapper call that helper with the appropriate form string
so the duplicated blocks in characters_to_nfc_binary/1 and the other three
functions are removed and replaced by simple delegations.
- Around line 854-856: The call site that checks
Bitstring.isValidUtf8ContinuationByte currently uses braces around the
single-statement "return false"; update that conditional to use the brace-less
single-line form (i.e., remove the surrounding { } so it matches the other call
sites that call Bitstring.isValidUtf8ContinuationByte) in the same function
where bytes[start + i] is validated.

@mward-sudo
Copy link
Contributor Author

@coderabbitai Don't do any further reviews on this PR while it is in draft.

@coderabbitai
Copy link

coderabbitai bot commented Feb 19, 2026

@mward-sudo: Understood! I'll pause automatic reviews for this PR while it remains in draft. Feel free to mention me again when you're ready for reviews to resume.

✅ Actions performed

Reviews paused.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Extract UTF-8 code point validation to Bitstring class

1 participant