Extract UTF-8 continuation byte validation to Bitstring class#715
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review infoConfiguration used: Path: .coderabbit.yml Review profile: CHILL Plan: Pro 📒 Files selected for processing (3)
🚧 Files skipped from review as they are similar to previous changes (3)
📝 WalkthroughWalkthroughAdds a new static helper Bitstring.isValidUtf8ContinuationByte(byte) and replaces in-file UTF‑8 continuation byte checks in the unicode module with calls to that helper; unit tests for the new helper were added. Changes
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (3)
assets/js/bitstring.mjs (1)
250-269: The validation suggestion is reasonable but not mandatory. All production calls todecodeUtf8CodePoint()already validate inputs:getUtf8SequenceLength()returns only 1, 2, 3, 4, or false, and callers checkstart + length > bytes.lengthbefore decoding. Adding guards in the helper would be defensive and easier to debug, but is not required for correctness.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@assets/js/bitstring.mjs` around lines 250 - 269, Add defensive input validation to decodeUtf8CodePoint: verify that the length parameter is one of 1..4 and that start and start+length are within bytes.length; if invalid, throw a descriptive RangeError (include function name and offending values). Also ensure the 1-byte fast path still returns bytes[start] after bounds check. This uses the existing helpers getUtf8SequenceLength callers but makes decodeUtf8CodePoint self-checking for easier debugging.assets/js/erlang/unicode.mjs (2)
944-946: Brace style inconsistency vs. the other three occurrences.The
characters_to_nfkd_binary/1version uses an explicit block{ return false; }, while the equivalent guards at Lines 124-125, 384-385, 670-671, and 808-809 all use the brace-less single-statement form.♻️ Align with the rest of the file
for (let i = 1; i < length; i++) { - if (!Bitstring.isValidUtf8ContinuationByte(bytes[start + i])) { - return false; - } + if (!Bitstring.isValidUtf8ContinuationByte(bytes[start + i])) + return false; }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@assets/js/erlang/unicode.mjs` around lines 944 - 946, The if-statement inside characters_to_nfkd_binary/1 that checks Bitstring.isValidUtf8ContinuationByte(bytes[start + i]) uses a multi-line block with braces and an explicit return; change it to the brace-less single-statement form (i.e., replace the block "{ return false; }" with a single "return false;" statement on the same line as the if) so it matches the other occurrences that use the concise style around Bitstring.isValidUtf8ContinuationByte.
357-403: Extract the duplicatedfindValidUtf8Length(and siblings) to module-level helpers.
findValidUtf8Length— along with its nestedisValidCodePointandisValidSequence— is copy-pasted verbatim across all four normalization functions.validateListRest,handleConversionError, andhandleInvalidUtf8are also near-identical, differing only in the normalization form string ("NFC","NFD","NFKC","NFKD"). Now that the internal byte-level logic is Bitstring-backed, all four copies are truly identical, making this the natural follow-on to the current refactoring.The four normalization functions can collapse to a single parameterized helper:
♻️ Suggested module-level extraction sketch
+// Scans forward to find the longest valid UTF-8 prefix. Returns the byte length. +const findValidUtf8Length = (bytes) => { + const isValidCodePoint = (codePoint, encodingLength) => { + const minValueForLength = [0, 0, 0x80, 0x800, 0x10000]; + if (codePoint < minValueForLength[encodingLength]) return false; + if (codePoint >= 0xd800 && codePoint <= 0xdfff) return false; + if (codePoint > 0x10ffff) return false; + return true; + }; + + const isValidSequence = (start, length) => { + if (start + length > bytes.length) return false; + for (let i = 1; i < length; i++) { + if (!Bitstring.isValidUtf8ContinuationByte(bytes[start + i])) + return false; + } + const codePoint = Bitstring.decodeUtf8CodePoint(bytes, start, length); + return isValidCodePoint(codePoint, length); + }; + + let pos = 0; + while (pos < bytes.length) { + const seqLength = Bitstring.getUtf8SequenceLength(bytes[pos]); + if (seqLength === false || !isValidSequence(pos, seqLength)) break; + pos += seqLength; + } + return pos; +}; + +// Shared helper for NFC/NFD/NFKC/NFKD normalization functions +const buildNormalizeBinary = (normForm) => (data) => { + // ... handleConversionError / handleInvalidUtf8 parameterised by normForm +};Each
characters_to_nf*_binary/1then becomes a one-liner delegating tobuildNormalizeBinary("NFC"), etc.Also applies to: 643-688, 781-825, 917-963
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@assets/js/erlang/unicode.mjs` around lines 357 - 403, The duplicated UTF‑8 validation and normalization logic (findValidUtf8Length and its nested helpers isValidCodePoint and isValidSequence) plus the near-identical helpers validateListRest, handleConversionError, and handleInvalidUtf8 should be pulled out to module-level functions; create a single exported helper (e.g., buildNormalizeBinary or normalizeBinaryForForm) that accepts the normalization form string ("NFC","NFD","NFKC","NFKD") and reuses module-level functions for UTF‑8 checks (findValidUtf8Length, isValidCodePoint, isValidSequence) and error/restore handling; then replace the four characters_to_nf*_binary/1 implementations with one-liners delegating to that helper so the byte-level Bitstring-backed logic is shared across all forms.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@assets/js/bitstring.mjs`:
- Around line 250-269: Add defensive input validation to decodeUtf8CodePoint:
verify that the length parameter is one of 1..4 and that start and start+length
are within bytes.length; if invalid, throw a descriptive RangeError (include
function name and offending values). Also ensure the 1-byte fast path still
returns bytes[start] after bounds check. This uses the existing helpers
getUtf8SequenceLength callers but makes decodeUtf8CodePoint self-checking for
easier debugging.
In `@assets/js/erlang/unicode.mjs`:
- Around line 944-946: The if-statement inside characters_to_nfkd_binary/1 that
checks Bitstring.isValidUtf8ContinuationByte(bytes[start + i]) uses a multi-line
block with braces and an explicit return; change it to the brace-less
single-statement form (i.e., replace the block "{ return false; }" with a single
"return false;" statement on the same line as the if) so it matches the other
occurrences that use the concise style around
Bitstring.isValidUtf8ContinuationByte.
- Around line 357-403: The duplicated UTF‑8 validation and normalization logic
(findValidUtf8Length and its nested helpers isValidCodePoint and
isValidSequence) plus the near-identical helpers validateListRest,
handleConversionError, and handleInvalidUtf8 should be pulled out to
module-level functions; create a single exported helper (e.g.,
buildNormalizeBinary or normalizeBinaryForForm) that accepts the normalization
form string ("NFC","NFD","NFKC","NFKD") and reuses module-level functions for
UTF‑8 checks (findValidUtf8Length, isValidCodePoint, isValidSequence) and
error/restore handling; then replace the four characters_to_nf*_binary/1
implementations with one-liners delegating to that helper so the byte-level
Bitstring-backed logic is shared across all forms.
|
@coderabbitai Don't do any further reviews on this PR while it is in draft, |
|
✅ Actions performedReviews paused. |
105d2d9 to
bc9948f
Compare
|
@coderabbitai resume, full review and update summary |
|
🧠 Learnings used✅ Actions performedReviews resumed. Summary regeneration triggered. |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
assets/js/erlang/unicode.mjs (1)
944-947: Minor style inconsistency: braces differ from the other four identical call sites.The
isValidSequenceincharacters_to_nfkd_binary/1uses block braces while all four other equivalent implementations (lines 124-125, 384-385, 670-671, 808-809) are brace-less. Either style is fine, but the deviation is unnecessary.🔧 Proposed fix to align with the other four implementations
for (let i = 1; i < length; i++) { - if (!Bitstring.isValidUtf8ContinuationByte(bytes[start + i])) { - return false; - } + if (!Bitstring.isValidUtf8ContinuationByte(bytes[start + i])) + return false; }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@assets/js/erlang/unicode.mjs` around lines 944 - 947, In the isValidSequence implementation used by characters_to_nfkd_binary/1, change the if statement that currently uses block braces around the single-line body to the brace-less style used at the other call sites (i.e. replace "if (!Bitstring.isValidUtf8ContinuationByte(bytes[start + i])) { return false; }" with the single-line form "if (!Bitstring.isValidUtf8ContinuationByte(bytes[start + i])) return false;") so the style matches the other four equivalent implementations.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@assets/js/erlang/unicode.mjs`:
- Around line 944-947: In the isValidSequence implementation used by
characters_to_nfkd_binary/1, change the if statement that currently uses block
braces around the single-line body to the brace-less style used at the other
call sites (i.e. replace "if (!Bitstring.isValidUtf8ContinuationByte(bytes[start
+ i])) { return false; }" with the single-line form "if
(!Bitstring.isValidUtf8ContinuationByte(bytes[start + i])) return false;") so
the style matches the other four equivalent implementations.
bc9948f to
a3c80d3
Compare
Closes #711
Dependencies
Please note that this PR includes commits from the PR(s) it is dependent upon. Once the dependant PR(s) are merged to dev branch then this PR will be rebased and will then only contain its own commits. This PR will remain in draft until that point.
Summary by CodeRabbit
New Features
Refactor
Tests