Extracts UTF-8 code point validation to BitString class#716
Extracts UTF-8 code point validation to BitString class#716mward-sudo wants to merge 3 commits intobartblast:devfrom
Conversation
|
Note Reviews pausedUse the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughThis PR extracts UTF-8 handling utilities into the Bitstring class and refactors unicode.mjs to use these centralized helpers instead of local implementations. New static methods for decoding code points and validating UTF-8 sequences are added to Bitstring, with corresponding test coverage. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (2)
assets/js/erlang/unicode.mjs (2)
332-447: Consider extracting the four normalization-binary functions into a shared parameterized helper.
characters_to_nfc_binary/1,characters_to_nfd_binary/1,characters_to_nfkc_binary/1, andcharacters_to_nfkd_binary/1are each ~115 lines of nearly identical code. The only variance is the normalization form string ("NFC","NFD","NFKC","NFKD"). Each duplicatesfindValidUtf8Length,validateListRest,handleConversionError,handleInvalidUtf8, and the main logic.Since this PR's theme is centralizing shared UTF-8 logic, this would be a natural follow-up: extract a single
charactersToNormalizedBinary(data, form)helper and have each public function delegate to it. This would reduce ~460 duplicated lines to ~130.♻️ Sketch of the refactored approach
+ // Shared helper for all normalization binary conversions + _charactersToNormalizedBinary: (data, normalizationForm) => { + const findValidUtf8Length = (bytes) => { /* shared implementation */ }; + const validateListRest = (rest) => { /* shared implementation */ }; + const handleConversionError = (tag, prefix, rest) => { + const textPrefix = Bitstring.toText(prefix); + const normalizedPrefix = + textPrefix === false + ? prefix + : Type.bitstring(textPrefix.normalize(normalizationForm)); + // ... rest of shared logic + }; + const handleInvalidUtf8 = (bytes) => { + // ... uses normalizationForm parameter + }; + // ... shared main logic + }, + "characters_to_nfc_binary/1": (data) => { - // ~115 lines of inline logic + return Erlang_Unicode._charactersToNormalizedBinary(data, "NFC"); }, "characters_to_nfd_binary/1": (data) => { - // ~115 lines of inline logic + return Erlang_Unicode._charactersToNormalizedBinary(data, "NFD"); }, // ... same for NFKC, NFKDAlso applies to: 600-715, 720-833, 838-953
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@assets/js/erlang/unicode.mjs` around lines 332 - 447, The four nearly identical functions characters_to_nfc_binary/1, characters_to_nfd_binary/1, characters_to_nfkc_binary/1, and characters_to_nfkd_binary/1 duplicate the same UTF‑8 validation and normalization logic; extract a single helper (e.g., charactersToNormalizedBinary(data, form)) that contains findValidUtf8Length, validateListRest, handleConversionError, handleInvalidUtf8, and the shared main logic, taking the normalization form ("NFC"/"NFD"/"NFKC"/"NFKD") as a parameter, then have each public wrapper call that helper with the appropriate form string so the duplicated blocks in characters_to_nfc_binary/1 and the other three functions are removed and replaced by simple delegations.
854-856: Minor style inconsistency: braces around single-statementreturn false.This is the only
isValidUtf8ContinuationBytecall site that wraps thereturn falsein braces. All other identical call sites (lines 106, 129, 348, 616, 736) use the brace-less form.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@assets/js/erlang/unicode.mjs` around lines 854 - 856, The call site that checks Bitstring.isValidUtf8ContinuationByte currently uses braces around the single-statement "return false"; update that conditional to use the brace-less single-line form (i.e., remove the surrounding { } so it matches the other call sites that call Bitstring.isValidUtf8ContinuationByte) in the same function where bytes[start + i] is validated.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@assets/js/bitstring.mjs`:
- Around line 250-269: Add defensive validation at the start of
decodeUtf8CodePoint: verify that bytes is present and has a .length, that length
is an integer between 1 and 4, and that start is a non-negative integer with
start + length <= bytes.length; if any check fails throw a RangeError (with a
clear message including the invalid values). Keep the existing decoding logic
unchanged after these checks so callers still get the same behavior when
arguments are valid.
- Around line 596-617: isValidUtf8CodePoint currently trusts encodingLength and
can silently skip the overlong check if encodingLength is invalid; add a
defensive guard at the start of isValidUtf8CodePoint to verify encodingLength is
an integer 1..4 (or otherwise a valid key for minValueForLength) and immediately
return false for any invalid value (this prevents using undefined
minValueForLength[encodingLength] and ensures callers like getUtf8SequenceLength
cannot cause silent failures).
---
Nitpick comments:
In `@assets/js/erlang/unicode.mjs`:
- Around line 332-447: The four nearly identical functions
characters_to_nfc_binary/1, characters_to_nfd_binary/1,
characters_to_nfkc_binary/1, and characters_to_nfkd_binary/1 duplicate the same
UTF‑8 validation and normalization logic; extract a single helper (e.g.,
charactersToNormalizedBinary(data, form)) that contains findValidUtf8Length,
validateListRest, handleConversionError, handleInvalidUtf8, and the shared main
logic, taking the normalization form ("NFC"/"NFD"/"NFKC"/"NFKD") as a parameter,
then have each public wrapper call that helper with the appropriate form string
so the duplicated blocks in characters_to_nfc_binary/1 and the other three
functions are removed and replaced by simple delegations.
- Around line 854-856: The call site that checks
Bitstring.isValidUtf8ContinuationByte currently uses braces around the
single-statement "return false"; update that conditional to use the brace-less
single-line form (i.e., remove the surrounding { } so it matches the other call
sites that call Bitstring.isValidUtf8ContinuationByte) in the same function
where bytes[start + i] is validated.
|
@coderabbitai Don't do any further reviews on this PR while it is in draft. |
|
✅ Actions performedReviews paused. |
Closes #712
Dependencies
Please note that this PR includes commits from the PR(s) it is dependent upon. Once the dependent PR(s) are merged to the dev branch, then this PR will be rebased and will then only contain its own commits. This PR will remain in draft until that point.
Summary by CodeRabbit
New Features
Tests