Extract UTF-8 continuation byte validation to Bitstring class#715

Open
mward-sudo wants to merge 1 commit into bartblast:dev from mward-sudo:02-18-extracts_utf-8_continuation_byte_validation_to_bitstring_class

Conversation

@mward-sudo (Contributor) commented Feb 18, 2026

Closes #711

Dependencies

Please note that this PR includes commits from the PR(s) it depends on. Once those PR(s) are merged into the dev branch, this PR will be rebased so that it contains only its own commits. It will remain in draft until then.

Summary by CodeRabbit

  • New Features

    • Added a public helper to validate UTF‑8 continuation bytes.
  • Refactor

    • Centralized continuation-byte validation to reuse the new helper across UTF‑8 decoding paths for consistent handling.
  • Tests

    • Added tests covering valid and invalid UTF‑8 continuation-byte patterns and related decoding edge cases.

@coderabbitai (bot) commented Feb 18, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bc9948f and a3c80d3.

📒 Files selected for processing (3)
  • assets/js/bitstring.mjs
  • assets/js/erlang/unicode.mjs
  • test/javascript/bitstring_test.mjs
🚧 Files skipped from review as they are similar to previous changes (3)
  • assets/js/bitstring.mjs
  • test/javascript/bitstring_test.mjs
  • assets/js/erlang/unicode.mjs

📝 Walkthrough

Adds a new static helper Bitstring.isValidUtf8ContinuationByte(byte) and replaces in-file UTF‑8 continuation byte checks in the unicode module with calls to that helper; unit tests for the new helper were added.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Bitstring helper (assets/js/bitstring.mjs) | Adds static isValidUtf8ContinuationByte(byte) to centralize UTF‑8 continuation‑byte validation. |
| Unicode validation refactor (assets/js/erlang/unicode.mjs) | Replaces local continuation‑byte checks with Bitstring.isValidUtf8ContinuationByte() across UTF‑8 validation paths; removed duplicate local helpers. |
| Tests (test/javascript/bitstring_test.mjs) | Adds tests exercising isValidUtf8ContinuationByte() for valid 10xxxxxx patterns and various invalid byte patterns. |
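
For readers unfamiliar with the 10xxxxxx pattern mentioned above: a UTF‑8 continuation byte has its top two bits set to 10, which puts it in the range 0x80 to 0xBF. The following is a minimal sketch of such a check, not the code from assets/js/bitstring.mjs. The standalone function shape and the masking approach are assumptions for illustration; the PR's actual method body is not shown in this conversation and may differ.

```javascript
// Hypothetical standalone version of the check. The PR adds it as a static
// method on Bitstring; this free function is only an illustration.
function isValidUtf8ContinuationByte(byte) {
  // Keep the top two bits (mask 0b11000000) and require them to be 0b10.
  return (byte & 0xc0) === 0x80;
}

console.log(isValidUtf8ContinuationByte(0x80)); // true  (10000000, lowest continuation byte)
console.log(isValidUtf8ContinuationByte(0xbf)); // true  (10111111, highest continuation byte)
console.log(isValidUtf8ContinuationByte(0x7f)); // false (01111111, ASCII)
console.log(isValidUtf8ContinuationByte(0xc0)); // false (11000000, lead-byte pattern)
```

With this formulation an out-of-range read such as `bytes[start + i]` evaluating to `undefined` also yields `false`, because `undefined & 0xc0` is `0`.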

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related PRs

Suggested reviewers

  • bartblast
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the main change: extracting UTF-8 continuation byte validation into the Bitstring class, which is the primary objective of the PR. |
| Linked Issues check | ✅ Passed | The PR successfully implements issue #711's requirement to create BitString.isValidUtf8ContinuationByte(byte) and refactors existing UTF-8 validation to use this new utility method. |
| Out of Scope Changes check | ✅ Passed | All changes are directly scoped to the linked issue: adding the utility method to Bitstring, refactoring existing validation calls, and adding corresponding tests. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |

@coderabbitai coderabbitai bot left a comment

🧹 Nitpick comments (3)
assets/js/bitstring.mjs (1)

250-269: The validation suggestion is reasonable but not mandatory. All production calls to decodeUtf8CodePoint() already validate inputs: getUtf8SequenceLength() returns only 1, 2, 3, 4, or false, and callers check start + length > bytes.length before decoding. Adding guards in the helper would be defensive and easier to debug, but is not required for correctness.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@assets/js/bitstring.mjs` around lines 250 - 269, Add defensive input
validation to decodeUtf8CodePoint: verify that the length parameter is one of
1..4 and that start and start+length are within bytes.length; if invalid, throw
a descriptive RangeError (include function name and offending values). Also
ensure the 1-byte fast path still returns bytes[start] after bounds check. This
uses the existing helpers getUtf8SequenceLength callers but makes
decodeUtf8CodePoint self-checking for easier debugging.
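
To make the suggestion above concrete, here is a rough sketch of what a self-checking decoder could look like. It is an assumption-laden illustration, not the code at lines 250-269 of assets/js/bitstring.mjs: the signature (bytes, start, length) is taken from the calls quoted in this review, while the decoding body and the error messages are invented for the example.

```javascript
// Hypothetical sketch; the real Bitstring.decodeUtf8CodePoint may differ.
function decodeUtf8CodePoint(bytes, start, length) {
  // Guard 1: a UTF-8 sequence is 1 to 4 bytes long.
  if (!Number.isInteger(length) || length < 1 || length > 4) {
    throw new RangeError(
      `decodeUtf8CodePoint: length must be 1..4, got ${length}`,
    );
  }
  // Guard 2: the whole sequence must lie inside the byte array.
  if (!Number.isInteger(start) || start < 0 || start + length > bytes.length) {
    throw new RangeError(
      `decodeUtf8CodePoint: range [${start}, ${start + length}) exceeds ${bytes.length} bytes`,
    );
  }

  // 1-byte fast path (ASCII), reached only after the bounds check.
  if (length === 1) return bytes[start];

  // Lead byte payload: 5 bits (2-byte), 4 bits (3-byte), 3 bits (4-byte).
  let codePoint = bytes[start] & (0xff >> (length + 1));
  // Each continuation byte contributes its low 6 bits.
  for (let i = 1; i < length; i++) {
    codePoint = (codePoint << 6) | (bytes[start + i] & 0x3f);
  }
  return codePoint;
}

console.log(decodeUtf8CodePoint([0xc3, 0xa9], 0, 2)); // 233 (U+00E9)
console.log(decodeUtf8CodePoint([0xe2, 0x82, 0xac], 0, 3)); // 8364 (U+20AC)

try {
  decodeUtf8CodePoint([0xe2, 0x82], 0, 3); // sequence runs past the end
} catch (error) {
  console.log(error instanceof RangeError); // true
}
```

Note that the guards only cover argument sanity. Continuation-byte, overlong, and surrogate checks would still belong to the callers, as they do in the validation paths discussed in this PR.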
assets/js/erlang/unicode.mjs (2)

944-946: Brace style inconsistency vs. the other four occurrences.

The characters_to_nfkd_binary/1 version uses an explicit block { return false; }, while the equivalent guards at lines 124-125, 384-385, 670-671, and 808-809 all use the brace-less single-statement form.

♻️ Align with the rest of the file
         for (let i = 1; i < length; i++) {
-          if (!Bitstring.isValidUtf8ContinuationByte(bytes[start + i])) {
-            return false;
-          }
+          if (!Bitstring.isValidUtf8ContinuationByte(bytes[start + i]))
+            return false;
         }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@assets/js/erlang/unicode.mjs` around lines 944 - 946, The if-statement inside
characters_to_nfkd_binary/1 that checks
Bitstring.isValidUtf8ContinuationByte(bytes[start + i]) uses a multi-line block
with braces and an explicit return; change it to the brace-less single-statement
form (i.e., replace the block "{ return false; }" with a single "return false;"
statement on the same line as the if) so it matches the other occurrences that
use the concise style around Bitstring.isValidUtf8ContinuationByte.

357-403: Extract the duplicated findValidUtf8Length (and siblings) to module-level helpers.

findValidUtf8Length — along with its nested isValidCodePoint and isValidSequence — is copy-pasted verbatim across all four normalization functions. validateListRest, handleConversionError, and handleInvalidUtf8 are also near-identical, differing only in the normalization form string ("NFC", "NFD", "NFKC", "NFKD"). Now that the internal byte-level logic is Bitstring-backed, all four copies are truly identical, making this the natural follow-on to the current refactoring.

The four normalization functions can collapse to a single parameterized helper:

♻️ Suggested module-level extraction sketch
+// Scans forward to find the longest valid UTF-8 prefix. Returns the byte length.
+const findValidUtf8Length = (bytes) => {
+  const isValidCodePoint = (codePoint, encodingLength) => {
+    const minValueForLength = [0, 0, 0x80, 0x800, 0x10000];
+    if (codePoint < minValueForLength[encodingLength]) return false;
+    if (codePoint >= 0xd800 && codePoint <= 0xdfff) return false;
+    if (codePoint > 0x10ffff) return false;
+    return true;
+  };
+
+  const isValidSequence = (start, length) => {
+    if (start + length > bytes.length) return false;
+    for (let i = 1; i < length; i++) {
+      if (!Bitstring.isValidUtf8ContinuationByte(bytes[start + i]))
+        return false;
+    }
+    const codePoint = Bitstring.decodeUtf8CodePoint(bytes, start, length);
+    return isValidCodePoint(codePoint, length);
+  };
+
+  let pos = 0;
+  while (pos < bytes.length) {
+    const seqLength = Bitstring.getUtf8SequenceLength(bytes[pos]);
+    if (seqLength === false || !isValidSequence(pos, seqLength)) break;
+    pos += seqLength;
+  }
+  return pos;
+};
+
+// Shared helper for NFC/NFD/NFKC/NFKD normalization functions
+const buildNormalizeBinary = (normForm) => (data) => {
+  // ... handleConversionError / handleInvalidUtf8 parameterised by normForm
+};

Each characters_to_nf*_binary/1 then becomes a one-liner delegating to buildNormalizeBinary("NFC"), etc.

Also applies to: 643-688, 781-825, 917-963
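
To show how the extracted prefix scan would behave on malformed input, the sketch above can be made runnable on its own by substituting small stand-ins for the Bitstring helpers. The three stand-in functions below are assumptions written for this example, not copies of the code in assets/js/bitstring.mjs; only the structure of findValidUtf8Length follows the suggestion.

```javascript
// Stand-ins for the Bitstring helpers (hypothetical, for illustration only).
const getUtf8SequenceLength = (leadByte) => {
  if ((leadByte & 0x80) === 0x00) return 1; // 0xxxxxxx
  if ((leadByte & 0xe0) === 0xc0) return 2; // 110xxxxx
  if ((leadByte & 0xf0) === 0xe0) return 3; // 1110xxxx
  if ((leadByte & 0xf8) === 0xf0) return 4; // 11110xxx
  return false; // continuation byte or invalid lead byte
};
const isValidUtf8ContinuationByte = (byte) => (byte & 0xc0) === 0x80;
const decodeUtf8CodePoint = (bytes, start, length) => {
  if (length === 1) return bytes[start];
  let codePoint = bytes[start] & (0xff >> (length + 1));
  for (let i = 1; i < length; i++) {
    codePoint = (codePoint << 6) | (bytes[start + i] & 0x3f);
  }
  return codePoint;
};

// Module-level helper: byte length of the longest valid UTF-8 prefix.
const findValidUtf8Length = (bytes) => {
  const isValidCodePoint = (codePoint, encodingLength) => {
    const minValueForLength = [0, 0, 0x80, 0x800, 0x10000];
    if (codePoint < minValueForLength[encodingLength]) return false; // overlong
    if (codePoint >= 0xd800 && codePoint <= 0xdfff) return false; // surrogate
    return codePoint <= 0x10ffff;
  };

  const isValidSequence = (start, length) => {
    if (start + length > bytes.length) return false; // truncated sequence
    for (let i = 1; i < length; i++) {
      if (!isValidUtf8ContinuationByte(bytes[start + i])) return false;
    }
    return isValidCodePoint(decodeUtf8CodePoint(bytes, start, length), length);
  };

  let pos = 0;
  while (pos < bytes.length) {
    const seqLength = getUtf8SequenceLength(bytes[pos]);
    if (seqLength === false || !isValidSequence(pos, seqLength)) break;
    pos += seqLength;
  }
  return pos;
};

// "a", "é", then a 3-byte sequence cut off after two bytes.
console.log(findValidUtf8Length([0x61, 0xc3, 0xa9, 0xe2, 0x82])); // 3
// Overlong encoding of U+0000 is rejected at the first byte.
console.log(findValidUtf8Length([0xc0, 0x80])); // 0
```

Because the scan takes only the byte array, nothing in it depends on the normalization form, which is what makes a single module-level copy possible.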

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@assets/js/erlang/unicode.mjs` around lines 357 - 403, The duplicated UTF‑8
validation and normalization logic (findValidUtf8Length and its nested helpers
isValidCodePoint and isValidSequence) plus the near-identical helpers
validateListRest, handleConversionError, and handleInvalidUtf8 should be pulled
out to module-level functions; create a single exported helper (e.g.,
buildNormalizeBinary or normalizeBinaryForForm) that accepts the normalization
form string ("NFC","NFD","NFKC","NFKD") and reuses module-level functions for
UTF‑8 checks (findValidUtf8Length, isValidCodePoint, isValidSequence) and
error/restore handling; then replace the four characters_to_nf*_binary/1
implementations with one-liners delegating to that helper so the byte-level
Bitstring-backed logic is shared across all forms.

@mward-sudo (Contributor, Author) commented

@coderabbitai Don't do any further reviews on this PR while it is in draft.

@coderabbitai (bot) commented Feb 19, 2026

@mward-sudo: Understood! I'll pause automatic reviews on this PR while it remains in draft. Feel free to mention me when you're ready for me to resume reviews.

✅ Actions performed

Reviews paused.

@mward-sudo force-pushed the 02-18-extracts_utf-8_continuation_byte_validation_to_bitstring_class branch from 105d2d9 to bc9948f on February 19, 2026 23:55
@mward-sudo (Contributor, Author) commented

@coderabbitai resume, full review and update summary

@coderabbitai (bot) commented Feb 19, 2026

@mward-sudo: I'll resume automatic reviews, conduct a comprehensive review of all the changes, and regenerate the summary for this PR.


🧠 Learnings used
Learnt from: mward-sudo
Repo: bartblast/hologram PR: 0
File: :0-0
Timestamp: 2026-02-19T01:23:42.278Z
Learning: User mward-sudo prefers that draft PRs should not be automatically reviewed by CodeRabbit until the draft status is removed.
✅ Actions performed

Reviews resumed.


Summary regeneration triggered.

@coderabbitai coderabbitai bot left a comment

🧹 Nitpick comments (1)
assets/js/erlang/unicode.mjs (1)

944-947: Minor style inconsistency: braces differ from the other four identical call sites.

The isValidSequence in characters_to_nfkd_binary/1 uses block braces while all four other equivalent implementations (lines 124-125, 384-385, 670-671, 808-809) are brace-less. Either style is fine, but the deviation is unnecessary.

🔧 Proposed fix to align with the other four implementations
         for (let i = 1; i < length; i++) {
-          if (!Bitstring.isValidUtf8ContinuationByte(bytes[start + i])) {
-            return false;
-          }
+          if (!Bitstring.isValidUtf8ContinuationByte(bytes[start + i]))
+            return false;
         }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@assets/js/erlang/unicode.mjs` around lines 944 - 947, In the isValidSequence
implementation used by characters_to_nfkd_binary/1, change the if statement that
currently uses block braces around the single-line body to the brace-less style
used at the other call sites (i.e. replace "if
(!Bitstring.isValidUtf8ContinuationByte(bytes[start + i])) { return false; }"
with the single-line form "if
(!Bitstring.isValidUtf8ContinuationByte(bytes[start + i])) return false;") so
the style matches the other four equivalent implementations.

@mward-sudo marked this pull request as ready for review on February 20, 2026 00:04
@bartblast changed the title from "Extracts UTF-8 continuation byte validation to BitString class" to "Extract UTF-8 continuation byte validation to Bitstring class" on Feb 21, 2026
@mward-sudo force-pushed the 02-18-extracts_utf-8_continuation_byte_validation_to_bitstring_class branch from bc9948f to a3c80d3 on February 24, 2026 00:56
@mward-sudo requested a review from bartblast on February 24, 2026 01:22