Extracts UTF-8 code point validation to BitString class by mward-sudo · Pull Request #716 · bartblast/hologram

mward-sudo · 2026-02-18T23:37:09Z

Closes #712

Dependencies

Please note that this PR includes commits from the PR(s) it is dependent upon. Once the dependent PR(s) are merged to the dev branch, then this PR will be rebased and will then only contain its own commits. This PR will remain in draft until that point.

Summary by CodeRabbit

New Features
- Enhanced UTF-8 text handling with improved validation and code point decoding capabilities.
- Better support for Unicode character processing across different normalization formats.
Tests
- Added comprehensive test coverage for UTF-8 utilities, including single-byte and multi-byte character sequences, validation checks, and edge cases.

coderabbitai · 2026-02-18T23:37:27Z

Note

Reviews paused

Walkthrough

This PR extracts UTF-8 handling utilities into the Bitstring class and refactors unicode.mjs to use these centralized helpers instead of local implementations. New static methods for decoding code points and validating UTF-8 sequences are added to Bitstring, with corresponding test coverage.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Bitstring UTF-8 utilities `assets/js/bitstring.mjs` | Adds three new static methods: `decodeUtf8CodePoint()` for decoding UTF-8 sequences, `isValidUtf8CodePoint()` for validating code points against surrogates and overlong encodings, and `isValidUtf8ContinuationByte()` for checking continuation bytes. |
| Unicode module refactoring `assets/js/erlang/unicode.mjs` | Replaces inline UTF-8 validation helpers with calls to Bitstring utility methods; updates normalization targets across NFC, NFD, NFKC, and NFKD conversion paths; reduces code duplication by centralizing UTF-8 logic. |
| Test coverage `test/javascript/bitstring_test.mjs` | Adds comprehensive test suites for new UTF-8 utilities, covering 1–4 byte sequences, continuation byte validation, code point validation including surrogates and overlong encodings, and UTF-8 sequence length detection. |
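For readers unfamiliar with the bit layout these utilities rely on, here is a minimal standalone sketch of continuation-byte checking and code point decoding. It is not the code from `assets/js/bitstring.mjs`: the function names are borrowed from the summary above, but the `(bytes, start, length)` parameter order and the plain-array input are assumptions made for illustration.

```javascript
// Hypothetical standalone sketch; not the implementation from bitstring.mjs.

// A UTF-8 continuation byte always has the bit pattern 10xxxxxx.
function isValidUtf8ContinuationByte(byte) {
  return (byte & 0xc0) === 0x80;
}

// Decodes one code point from `length` bytes beginning at index `start`.
// Assumes the caller already derived `length` (1..4) from the leader byte.
function decodeUtf8CodePoint(bytes, start, length) {
  if (length === 1) return bytes[start];

  // Payload masks for leader bytes: 110xxxxx, 1110xxxx, 11110xxx.
  const leaderMasks = { 2: 0x1f, 3: 0x0f, 4: 0x07 };
  let codePoint = bytes[start] & leaderMasks[length];

  for (let i = 1; i < length; i++) {
    // Each continuation byte contributes its low 6 bits.
    codePoint = (codePoint << 6) | (bytes[start + i] & 0x3f);
  }

  return codePoint;
}

const euro = [0xe2, 0x82, 0xac]; // "€", U+20AC
console.log(decodeUtf8CodePoint(euro, 0, 3).toString(16)); // prints "20ac"
```

Decoding alone does not reject surrogates or overlong encodings, which is presumably why the PR pairs it with a separate `isValidUtf8CodePoint()` check.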

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Extract UTF-8 sequence length detection to Bitstring class #706: Also extracts UTF-8 helper logic to Bitstring and updates unicode.mjs to call Bitstring utilities for centralized UTF-8 handling.

Suggested reviewers

bartblast

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Out of Scope Changes check | ⚠️ Warning | The PR contains duplicate method definitions in bitstring.mjs and unnecessary refactoring in unicode.mjs beyond extracting validation logic to BitString. | Remove duplicate method definitions in bitstring.mjs and limit unicode.mjs changes to only replacing inline validation helpers with BitString equivalents. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately reflects the main objective of the PR: extracting UTF-8 code point validation logic to the BitString class. |
| Linked Issues check | ✅ Passed | The PR addresses issue `#712` by extracting UTF-8 validation methods (isValidUtf8CodePoint, decodeUtf8CodePoint, isValidUtf8ContinuationByte) to the BitString class as required. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (2)

assets/js/erlang/unicode.mjs (2)

332-447: Consider extracting the four normalization-binary functions into a shared parameterized helper.

characters_to_nfc_binary/1, characters_to_nfd_binary/1, characters_to_nfkc_binary/1, and characters_to_nfkd_binary/1 are each ~115 lines of nearly identical code. The only variance is the normalization form string ("NFC", "NFD", "NFKC", "NFKD"). Each duplicates findValidUtf8Length, validateListRest, handleConversionError, handleInvalidUtf8, and the main logic.

Since this PR's theme is centralizing shared UTF-8 logic, this would be a natural follow-up: extract a single charactersToNormalizedBinary(data, form) helper and have each public function delegate to it. This would reduce ~460 duplicated lines to ~130.

♻️ Sketch of the refactored approach

+  // Shared helper for all normalization binary conversions
+  _charactersToNormalizedBinary: (data, normalizationForm) => {
+    const findValidUtf8Length = (bytes) => { /* shared implementation */ };
+    const validateListRest = (rest) => { /* shared implementation */ };
+    const handleConversionError = (tag, prefix, rest) => {
+      const textPrefix = Bitstring.toText(prefix);
+      const normalizedPrefix =
+        textPrefix === false
+          ? prefix
+          : Type.bitstring(textPrefix.normalize(normalizationForm));
+      // ... rest of shared logic
+    };
+    const handleInvalidUtf8 = (bytes) => {
+      // ... uses normalizationForm parameter
+    };
+    // ... shared main logic
+  },
+
   "characters_to_nfc_binary/1": (data) => {
-    // ~115 lines of inline logic
+    return Erlang_Unicode._charactersToNormalizedBinary(data, "NFC");
   },
   "characters_to_nfd_binary/1": (data) => {
-    // ~115 lines of inline logic
+    return Erlang_Unicode._charactersToNormalizedBinary(data, "NFD");
   },
   // ... same for NFKC, NFKD

Also applies to: 600-715, 720-833, 838-953

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@assets/js/erlang/unicode.mjs` around lines 332 - 447, The four nearly
identical functions characters_to_nfc_binary/1, characters_to_nfd_binary/1,
characters_to_nfkc_binary/1, and characters_to_nfkd_binary/1 duplicate the same
UTF‑8 validation and normalization logic; extract a single helper (e.g.,
charactersToNormalizedBinary(data, form)) that contains findValidUtf8Length,
validateListRest, handleConversionError, handleInvalidUtf8, and the shared main
logic, taking the normalization form ("NFC"/"NFD"/"NFKC"/"NFKD") as a parameter,
then have each public wrapper call that helper with the appropriate form string
so the duplicated blocks in characters_to_nfc_binary/1 and the other three
functions are removed and replaced by simple delegations.
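The delegation pattern suggested above can be tried in isolation. The following is only a sketch under simplifying assumptions: it works on plain JavaScript strings rather than Hologram's boxed bitstring and list types, it omits all UTF-8 validation and error handling, and `normalizeText` and `UnicodeSketch` are hypothetical names that do not exist in `unicode.mjs`.

```javascript
// Hypothetical sketch of "one shared helper, four thin wrappers".
// Works on plain strings only; the real functions handle Erlang chardata.

// Shared helper: the normalization form is the only thing that varies.
function normalizeText(text, form) {
  return text.normalize(form);
}

// Each public entry point delegates with its own form string.
const UnicodeSketch = {
  "characters_to_nfc_binary/1": (text) => normalizeText(text, "NFC"),
  "characters_to_nfd_binary/1": (text) => normalizeText(text, "NFD"),
  "characters_to_nfkc_binary/1": (text) => normalizeText(text, "NFKC"),
  "characters_to_nfkd_binary/1": (text) => normalizeText(text, "NFKD"),
};

// "e" followed by a combining acute accent composes to a single "é" under NFC.
const composed = UnicodeSketch["characters_to_nfc_binary/1"]("e\u0301");
console.log(composed === "\u00e9"); // prints true
```

With this shape, a fix to the shared logic lands in one place instead of four, which is the maintenance benefit the nitpick is after.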

854-856: Minor style inconsistency: braces around single-statement return false.

This is the only isValidUtf8ContinuationByte call site that wraps the return false in braces. All other identical call sites (lines 106, 129, 348, 616, 736) use the brace-less form.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@assets/js/erlang/unicode.mjs` around lines 854 - 856, The call site that
checks Bitstring.isValidUtf8ContinuationByte currently uses braces around the
single-statement "return false"; update that conditional to use the brace-less
single-line form (i.e., remove the surrounding { } so it matches the other call
sites that call Bitstring.isValidUtf8ContinuationByte) in the same function
where bytes[start + i] is validated.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@assets/js/bitstring.mjs`:
- Around line 250-269: Add defensive validation at the start of
decodeUtf8CodePoint: verify that bytes is present and has a .length, that length
is an integer between 1 and 4, and that start is a non-negative integer with
start + length <= bytes.length; if any check fails throw a RangeError (with a
clear message including the invalid values). Keep the existing decoding logic
unchanged after these checks so callers still get the same behavior when
arguments are valid.
- Around line 596-617: isValidUtf8CodePoint currently trusts encodingLength and
can silently skip the overlong check if encodingLength is invalid; add a
defensive guard at the start of isValidUtf8CodePoint to verify encodingLength is
an integer 1..4 (or otherwise a valid key for minValueForLength) and immediately
return false for any invalid value (this prevents using undefined
minValueForLength[encodingLength] and ensures callers like getUtf8SequenceLength
cannot cause silent failures).
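
A minimal sketch of the two defensive guards described in the inline comments above, for illustration only. The actual method bodies are not shown on this page, so the `(codePoint, encodingLength)` parameter order, the contents of the `minValueForLength` table, and the separate `assertDecodeArgs` helper are all assumptions.

```javascript
// Hypothetical sketch of the suggested guards; not the code from bitstring.mjs.

// Smallest code point that legitimately needs each encoding length.
const minValueForLength = { 1: 0x0, 2: 0x80, 3: 0x800, 4: 0x10000 };

// Guard for decodeUtf8CodePoint: throws RangeError on unusable arguments.
function assertDecodeArgs(bytes, start, length) {
  const hasLength = bytes != null && Number.isInteger(bytes.length);
  const lengthOk = Number.isInteger(length) && length >= 1 && length <= 4;
  const startOk = Number.isInteger(start) && start >= 0;

  if (!hasLength || !lengthOk || !startOk || start + length > bytes.length) {
    throw new RangeError(
      `Invalid decode arguments: start=${start}, length=${length}, ` +
        `bytes.length=${bytes?.length}`,
    );
  }
}

// Guard for isValidUtf8CodePoint: an invalid encodingLength returns false
// instead of silently skipping the overlong check.
function isValidUtf8CodePoint(codePoint, encodingLength) {
  if (
    !Number.isInteger(encodingLength) ||
    encodingLength < 1 ||
    encodingLength > 4
  ) {
    return false;
  }

  if (codePoint < minValueForLength[encodingLength]) return false; // overlong
  if (codePoint >= 0xd800 && codePoint <= 0xdfff) return false; // surrogate
  return codePoint <= 0x10ffff; // beyond the Unicode range is invalid
}

console.log(isValidUtf8CodePoint(0x2f, 2)); // prints false ("/" encoded in 2 bytes is overlong)
```

Without the first check in `isValidUtf8CodePoint`, a lookup such as `minValueForLength[5]` yields `undefined`, and `codePoint < undefined` is always `false`, so the overlong test would pass by accident.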


mward-sudo · 2026-02-19T00:28:32Z

@coderabbitai Don't do any further reviews on this PR while it is in draft.

coderabbitai · 2026-02-19T00:28:50Z

@mward-sudo: Understood! I'll pause automatic reviews for this PR while it remains in draft. Feel free to mention me again when you're ready for reviews to resume.

✅ Actions performed

Reviews paused.

mward-sudo added 3 commits February 18, 2026 22:39

Extracts UTF-8 code point decoding to the BitString utility class, ad…

9a49bec

…ds parameter validation

Extracts UTF-8 continuation byte validation to BitString class

105d2d9

Extracts Utf-8 code point validation to BitString class

7015625

This was referenced Feb 18, 2026

Extracts UTF-8 sequence validation to BitString class #717

Draft

Extracts truncated UTF-8 sequence validation to BitString class #718

Draft

coderabbitai bot reviewed Feb 18, 2026

mward-sudo mentioned this pull request Feb 18, 2026

Extracts UTF-8 valid length position to Bitstring class #719

Draft

5 tasks

This was referenced Feb 19, 2026

Extracts BitString to code points array to BitString class #721

Draft

Extracts BitString from code point to BitString class #723

Draft

mward-sudo wants to merge 3 commits into bartblast:dev from mward-sudo:02-18-extracts_utf-8_code_point_validation_to_bitstring_class
