Extract UTF-8 continuation byte validation to Bitstring class by mward-sudo · Pull Request #715 · bartblast/hologram

mward-sudo · 2026-02-18T23:33:04Z

Closes #711

Dependencies

Extract UTF-8 code point decoding to Bitstring class #710

Please note that this PR includes commits from the PR(s) it depends on. Once those PR(s) are merged into the dev branch, this PR will be rebased and will then contain only its own commits. This PR will remain in draft until that point.

Summary by CodeRabbit

- New Features
  - Added a public helper to validate UTF‑8 continuation bytes.
- Refactor
  - Centralized continuation-byte validation to reuse the new helper across UTF‑8 decoding paths for consistent handling.
- Tests
  - Added tests covering valid and invalid UTF‑8 continuation-byte patterns and related decoding edge cases.

coderabbitai · 2026-02-18T23:33:23Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bc9948f and a3c80d3.

📒 Files selected for processing (3)

assets/js/bitstring.mjs
assets/js/erlang/unicode.mjs
test/javascript/bitstring_test.mjs

🚧 Files skipped from review as they are similar to previous changes (3)

assets/js/bitstring.mjs
test/javascript/bitstring_test.mjs
assets/js/erlang/unicode.mjs

📝 Walkthrough

Adds a new static helper Bitstring.isValidUtf8ContinuationByte(byte) and replaces in-file UTF‑8 continuation byte checks in the unicode module with calls to that helper; unit tests for the new helper were added.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Bitstring helper `assets/js/bitstring.mjs` | Adds `static isValidUtf8ContinuationByte(byte)` to centralize UTF‑8 continuation‑byte validation. |
| Unicode validation refactor `assets/js/erlang/unicode.mjs` | Replaces local continuation‑byte checks with `Bitstring.isValidUtf8ContinuationByte()` across UTF‑8 validation paths; removed duplicate local helpers. |
| Tests `test/javascript/bitstring_test.mjs` | Adds tests exercising `isValidUtf8ContinuationByte()` for valid `10xxxxxx` patterns and various invalid byte patterns. |
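
For readers unfamiliar with the `10xxxxxx` pattern mentioned above: a UTF‑8 continuation byte is any byte whose top two bits are `10`, which corresponds to the range 0x80 to 0xBF. The sketch below shows one common way to express that check. It is a standalone illustration; the body of the actual `Bitstring.isValidUtf8ContinuationByte()` is not shown in this thread and may differ.

```javascript
// Hypothetical standalone version of the check. The real helper is a static
// method on Bitstring; its implementation is assumed here, not quoted.
function isValidUtf8ContinuationByte(byte) {
  // Keep only the top two bits (mask 0b11000000) and require them to be 0b10.
  return (byte & 0xc0) === 0x80;
}

console.log(isValidUtf8ContinuationByte(0x80)); // true  (10000000, lowest continuation byte)
console.log(isValidUtf8ContinuationByte(0xbf)); // true  (10111111, highest continuation byte)
console.log(isValidUtf8ContinuationByte(0x41)); // false (01000001, ASCII "A")
console.log(isValidUtf8ContinuationByte(0xc3)); // false (11000011, a 2-byte lead byte)
```

A mask comparison is used here instead of a range check (`byte >= 0x80 && byte <= 0xbf`); both describe the same set of bytes for inputs in 0..255.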

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related PRs

Extract UTF-8 sequence length detection to Bitstring class #706 — Both PRs extract UTF‑8 validation logic into Bitstring and update assets/js/erlang/unicode.mjs to use the new Bitstring helper.

Suggested reviewers

bartblast

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the main change: extracting UTF-8 continuation byte validation into the Bitstring class, which is the primary objective of the PR. |
| Linked Issues check | ✅ Passed | The PR successfully implements issue `#711`'s requirement to create BitString.isValidUtf8ContinuationByte(byte) and refactors existing UTF-8 validation to use this new utility method. |
| Out of Scope Changes check | ✅ Passed | All changes are directly scoped to the linked issue: adding the utility method to Bitstring, refactoring existing validation calls, and adding corresponding tests. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |


coderabbitai

🧹 Nitpick comments (3)

assets/js/bitstring.mjs (1)

250-269: The validation suggestion is reasonable but not mandatory. All production calls to decodeUtf8CodePoint() already validate inputs: getUtf8SequenceLength() returns only 1, 2, 3, 4, or false, and callers check start + length > bytes.length before decoding. Adding guards in the helper would be defensive and easier to debug, but is not required for correctness.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@assets/js/bitstring.mjs` around lines 250 - 269, Add defensive input
validation to decodeUtf8CodePoint: verify that the length parameter is one of
1..4 and that start and start+length are within bytes.length; if invalid, throw
a descriptive RangeError (include function name and offending values). Also
ensure the 1-byte fast path still returns bytes[start] after the bounds check.
Callers that rely on getUtf8SequenceLength keep their existing checks, while
decodeUtf8CodePoint becomes self-checking for easier debugging.
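
To make the suggestion above concrete, here is a minimal sketch of a decoder with the proposed guards. The function name `decodeUtf8CodePointChecked` and the error messages are hypothetical, and the decoding body is the textbook UTF‑8 bit layout rather than a copy of the code in `assets/js/bitstring.mjs`, which is not shown in this thread.

```javascript
// Hypothetical guarded decoder; the name and messages are illustrative only.
function decodeUtf8CodePointChecked(bytes, start, length) {
  if (!Number.isInteger(length) || length < 1 || length > 4) {
    throw new RangeError(
      `decodeUtf8CodePointChecked: length must be 1..4, got ${length}`,
    );
  }
  if (!Number.isInteger(start) || start < 0 || start + length > bytes.length) {
    throw new RangeError(
      `decodeUtf8CodePointChecked: start=${start}, length=${length} out of range for ${bytes.length} bytes`,
    );
  }

  // 1-byte fast path: an ASCII byte is its own code point.
  if (length === 1) return bytes[start];

  // Payload bits of the lead byte: 110xxxxx -> 5, 1110xxxx -> 4, 11110xxx -> 3.
  const leadMasks = [0, 0, 0x1f, 0x0f, 0x07];
  let codePoint = bytes[start] & leadMasks[length];

  // Each continuation byte contributes its low 6 bits.
  for (let i = 1; i < length; i++) {
    codePoint = (codePoint << 6) | (bytes[start + i] & 0x3f);
  }
  return codePoint;
}

console.log(decodeUtf8CodePointChecked([0xe2, 0x82, 0xac], 0, 3).toString(16)); // "20ac"
```

Note that this sketch only checks bounds and length. It does not verify that the trailing bytes are continuation bytes; in the PR that remains the caller's job via `Bitstring.isValidUtf8ContinuationByte()`.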

assets/js/erlang/unicode.mjs (2)

944-946: Brace style inconsistency vs. the other four occurrences.

The characters_to_nfkd_binary/1 version uses an explicit block { return false; }, while the equivalent guards at Lines 124-125, 384-385, 670-671, and 808-809 all use the brace-less single-statement form.

♻️ Align with the rest of the file

```diff
         for (let i = 1; i < length; i++) {
-          if (!Bitstring.isValidUtf8ContinuationByte(bytes[start + i])) {
-            return false;
-          }
+          if (!Bitstring.isValidUtf8ContinuationByte(bytes[start + i]))
+            return false;
         }
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@assets/js/erlang/unicode.mjs` around lines 944 - 946, The if-statement inside
characters_to_nfkd_binary/1 that checks
Bitstring.isValidUtf8ContinuationByte(bytes[start + i]) uses a multi-line block
with braces and an explicit return; change it to the brace-less single-statement
form (i.e., replace the block "{ return false; }" with a single "return false;"
statement on the same line as the if) so it matches the other occurrences that
use the concise style around Bitstring.isValidUtf8ContinuationByte.

357-403: Extract the duplicated findValidUtf8Length (and siblings) to module-level helpers.

findValidUtf8Length — along with its nested isValidCodePoint and isValidSequence — is copy-pasted verbatim across all four normalization functions. validateListRest, handleConversionError, and handleInvalidUtf8 are also near-identical, differing only in the normalization form string ("NFC", "NFD", "NFKC", "NFKD"). Now that the internal byte-level logic is Bitstring-backed, all four copies are truly identical, making this the natural follow-on to the current refactoring.

The four normalization functions can collapse to a single parameterized helper:

♻️ Suggested module-level extraction sketch

```diff
+// Scans forward to find the longest valid UTF-8 prefix. Returns the byte length.
+const findValidUtf8Length = (bytes) => {
+  const isValidCodePoint = (codePoint, encodingLength) => {
+    const minValueForLength = [0, 0, 0x80, 0x800, 0x10000];
+    if (codePoint < minValueForLength[encodingLength]) return false;
+    if (codePoint >= 0xd800 && codePoint <= 0xdfff) return false;
+    if (codePoint > 0x10ffff) return false;
+    return true;
+  };
+
+  const isValidSequence = (start, length) => {
+    if (start + length > bytes.length) return false;
+    for (let i = 1; i < length; i++) {
+      if (!Bitstring.isValidUtf8ContinuationByte(bytes[start + i]))
+        return false;
+    }
+    const codePoint = Bitstring.decodeUtf8CodePoint(bytes, start, length);
+    return isValidCodePoint(codePoint, length);
+  };
+
+  let pos = 0;
+  while (pos < bytes.length) {
+    const seqLength = Bitstring.getUtf8SequenceLength(bytes[pos]);
+    if (seqLength === false || !isValidSequence(pos, seqLength)) break;
+    pos += seqLength;
+  }
+  return pos;
+};
+
+// Shared helper for NFC/NFD/NFKC/NFKD normalization functions
+const buildNormalizeBinary = (normForm) => (data) => {
+  // ... handleConversionError / handleInvalidUtf8 parameterised by normForm
+};
```

Each characters_to_nf*_binary/1 then becomes a one-liner delegating to buildNormalizeBinary("NFC"), etc.

Also applies to: 643-688, 781-825, 917-963
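
The "single parameterized helper" idea can be illustrated independently of the Hologram code. The sketch below is a hypothetical factory named `buildNormalizer` that works on plain JavaScript strings with the built-in `String.prototype.normalize`. It deliberately leaves out the bitstring handling and the `handleConversionError` / `handleInvalidUtf8` paths, which are elided in the suggestion above and are not reconstructed here.

```javascript
// Hypothetical factory: one closure per normalization form instead of four
// copy-pasted functions. Plain strings only; not the Hologram implementation.
const NORMALIZATION_FORMS = ["NFC", "NFD", "NFKC", "NFKD"];

const buildNormalizer = (normForm) => {
  if (!NORMALIZATION_FORMS.includes(normForm)) {
    throw new RangeError(`unsupported normalization form: ${normForm}`);
  }
  // The only thing that varies between the four variants is the form string.
  return (text) => text.normalize(normForm);
};

const toNfc = buildNormalizer("NFC");
const toNfd = buildNormalizer("NFD");
const toNfkc = buildNormalizer("NFKC");
const toNfkd = buildNormalizer("NFKD");

console.log(toNfc("e\u0301") === "\u00e9"); // true: "e" + combining acute composes to "é"
console.log(toNfd("\u00e9").length); // 2: "é" decomposes to "e" + combining acute
console.log(toNfkc("\ufb01")); // "fi": the ligature is replaced by its compatibility form
console.log(toNfkd("\ufb01")); // "fi"
```

Capturing the form in a closure keeps each public function a one-liner, which is the shape the review comment proposes for `characters_to_nf*_binary/1`.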

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@assets/js/erlang/unicode.mjs` around lines 357 - 403, The duplicated UTF‑8
validation and normalization logic (findValidUtf8Length and its nested helpers
isValidCodePoint and isValidSequence) plus the near-identical helpers
validateListRest, handleConversionError, and handleInvalidUtf8 should be pulled
out to module-level functions; create a single exported helper (e.g.,
buildNormalizeBinary or normalizeBinaryForForm) that accepts the normalization
form string ("NFC","NFD","NFKC","NFKD") and reuses module-level functions for
UTF‑8 checks (findValidUtf8Length, isValidCodePoint, isValidSequence) and
error/restore handling; then replace the four characters_to_nf*_binary/1
implementations with one-liners delegating to that helper so the byte-level
Bitstring-backed logic is shared across all forms.

mward-sudo · 2026-02-19T00:26:59Z

@coderabbitai Don't do any further reviews on this PR while it is in draft.

coderabbitai · 2026-02-19T00:27:20Z

@mward-sudo: Understood! I'll pause automatic reviews on this PR while it remains in draft. Feel free to mention me when you're ready for me to resume reviews.

✅ Actions performed

Reviews paused.

mward-sudo · 2026-02-19T23:57:05Z

@coderabbitai resume, full review and update summary

coderabbitai · 2026-02-19T23:57:22Z

@mward-sudo: I'll resume automatic reviews, conduct a comprehensive review of all the changes, and regenerate the summary for this PR.

🧠 Learnings used

Learnt from: mward-sudo
Repo: bartblast/hologram PR: 0
File: :0-0
Timestamp: 2026-02-19T01:23:42.278Z
Learning: User mward-sudo prefers that draft PRs should not be automatically reviewed by CodeRabbit until the draft status is removed.

✅ Actions performed

Reviews resumed.

Summary regeneration triggered.

coderabbitai

🧹 Nitpick comments (1)

assets/js/erlang/unicode.mjs (1)

944-947: Minor style inconsistency: braces differ from the other four identical call sites.

The isValidSequence in characters_to_nfkd_binary/1 uses block braces while all four other equivalent implementations (lines 124-125, 384-385, 670-671, 808-809) are brace-less. Either style is fine, but the deviation is unnecessary.

🔧 Proposed fix to align with the other four implementations

```diff
         for (let i = 1; i < length; i++) {
-          if (!Bitstring.isValidUtf8ContinuationByte(bytes[start + i])) {
-            return false;
-          }
+          if (!Bitstring.isValidUtf8ContinuationByte(bytes[start + i]))
+            return false;
         }
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@assets/js/erlang/unicode.mjs` around lines 944 - 947, In the isValidSequence
implementation used by characters_to_nfkd_binary/1, change the if statement that
currently uses block braces around the single-line body to the brace-less style
used at the other call sites (i.e. replace "if
(!Bitstring.isValidUtf8ContinuationByte(bytes[start + i])) { return false; }"
with the single-line form "if
(!Bitstring.isValidUtf8ContinuationByte(bytes[start + i])) return false;") so
the style matches the other four equivalent implementations.

mward-sudo mentioned this pull request Feb 18, 2026

Extracts UTF-8 code point validation to BitString class #716

Draft

2 tasks

coderabbitai bot reviewed Feb 18, 2026

This was referenced Feb 18, 2026

Extracts UTF-8 sequence validation to BitString class #717

Draft

Extracts truncated UTF-8 sequence validation to BitString class #718

Draft

Extracts UTF-8 valid length position to Bitstring class #719

Draft

This was referenced Feb 19, 2026

Extracts BitString to code points array to BitString class #721

Draft

Extracts BitString from code point to BitString class #723

Draft

mward-sudo force-pushed the 02-18-extracts_utf-8_continuation_byte_validation_to_bitstring_class branch from 105d2d9 to bc9948f on February 19, 2026 23:55

coderabbitai bot reviewed Feb 20, 2026

mward-sudo marked this pull request as ready for review February 20, 2026 00:04

bartblast changed the title ~~Extracts UTF-8 continuation byte validation to BitString class~~ Extract UTF-8 continuation byte validation to Bitstring class Feb 21, 2026

bartblast requested changes Feb 23, 2026

test/javascript/bitstring_test.mjs (resolved)

Extracts UTF-8 continuation byte validation to BitString class

a3c80d3

mward-sudo force-pushed the 02-18-extracts_utf-8_continuation_byte_validation_to_bitstring_class branch from bc9948f to a3c80d3 on February 24, 2026 00:56

mward-sudo requested a review from bartblast February 24, 2026 01:22

mward-sudo wants to merge 1 commit into bartblast:dev from mward-sudo:02-18-extracts_utf-8_continuation_byte_validation_to_bitstring_class

2 participants
