Extracts UTF-8 valid length position to Bitstring class#719
Extracts UTF-8 valid length position to Bitstring class#719mward-sudo wants to merge 6 commits intobartblast:devfrom
Conversation
|
Note Reviews pausedUse the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughCentralizes UTF-8 validation/decoding into six new static Bitstring methods and refactors assets/js/erlang/unicode.mjs to use them instead of duplicated inline scanners. Adds extensive tests covering the new UTF-8 helpers and edge cases. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (3)
assets/js/bitstring.mjs (1)
255-261: Object literals allocated on every call in hot-path helpers.
firstByteMasksindecodeUtf8CodePoint(Line 259) andminValueForLengthinisValidUtf8CodePoint(Line 623) are re-created on each invocation. DuringgetValidUtf8Length's scan of a large byte array, these allocations occur once per sequence. Consider hoisting them to module-level or static private fields to avoid repeated allocation.♻️ Example: hoist to static private fields
export default class Bitstring { static `#decoder` = ERTS.utf8Decoder; static `#encoder` = new TextEncoder("utf-8"); + static `#utf8FirstByteMasks` = [0, 0, 0x1f, 0x0f, 0x07]; // index = length + static `#utf8MinCodePoints` = [0, 0, 0x80, 0x800, 0x10000]; // index = lengthThen use
$._utf8FirstByteMasks[length]and$._utf8MinCodePoints[length]respectively (adjusting for private-field access within the class).Also applies to: 621-626
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@assets/js/bitstring.mjs` around lines 255 - 261, Hoist the per-call object literals used for UTF-8 decoding into shared static/module-level constants to avoid allocating them on every invocation: replace the local firstByteMasks inside decodeUtf8CodePoint and the minValueForLength inside isValidUtf8CodePoint with a single shared constant (e.g., BitString._utf8FirstByteMasks and BitString._utf8MinCodePoints or top-level utf8FirstByteMasks/utf8MinCodePoints) and update references (decodeUtf8CodePoint, isValidUtf8CodePoint, and any callers like getValidUtf8Length) to use those static/module constants instead of recreating the objects each call.assets/js/erlang/unicode.mjs (1)
280-360: Pre-existing: near-identical scaffolding across four normalization functions.
validateListRest,handleConversionError, andhandleInvalidUtf8are essentially copy-pasted acrosscharacters_to_nfc_binary/1,characters_to_nfd_binary/1,characters_to_nfkc_binary/1, andcharacters_to_nfkd_binary/1, differing only in the normalization form string ("NFC","NFD", etc.). This isn't introduced by this PR, but a follow-up extracting a sharednormalizeCharactersToBinary(data, form)helper would eliminate ~200 lines of duplication.Also applies to: 512-594, 598-679, 683-763
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@assets/js/erlang/unicode.mjs` around lines 280 - 360, The four functions characters_to_nfc_binary/1, characters_to_nfd_binary/1, characters_to_nfkc_binary/1 and characters_to_nfkd_binary/1 duplicate validateListRest, handleConversionError and handleInvalidUtf8 with only the Unicode form string differing; extract a single helper normalizeCharactersToBinary(data, form) that contains those shared helpers and logic (use the existing validateListRest, handleConversionError and handleInvalidUtf8 bodies inside it), accept the normalization form ("NFC"/"NFD"/"NFKC"/"NFKD") and return the same results as the current implementations, then replace each of the four functions to simply call normalizeCharactersToBinary(data, "<FORM>") (keeping the same behavior and return shape). Ensure you reference and reuse Type, Bitstring, Erlang_Unicode calls exactly as in the originals so behavior is preserved.test/javascript/bitstring_test.mjs (1)
5393-5470: Avoid duplicated 4-byte truncation case; use a distinct length.
Two tests use the same byte sequence, so one can target a different truncation depth to add coverage.♻️ Suggested tweak
- it("returns true for 4-byte sequence with 2 valid continuation bytes (truncated)", () => { - // 0xF0 (4-byte) with 2 valid continuation bytes available - const bytes = new Uint8Array([0xf0, 0x90, 0x8d]); + it("returns true for 4-byte sequence with 1 valid continuation byte (truncated)", () => { + // 0xF0 (4-byte) with 1 valid continuation byte available + const bytes = new Uint8Array([0xf0, 0x90]); assert.equal(Bitstring.isTruncatedUtf8Sequence(bytes, 0), true); });🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/javascript/bitstring_test.mjs` around lines 5393 - 5470, The test suite has a duplicated 4-byte-truncation input; update the "Edge case: multiple valid continuation bytes before truncation" test that references Bitstring.isTruncatedUtf8Sequence so it uses a distinct truncation depth (e.g., change its bytes from the duplicate [0xf0, 0x90, 0x8d] to a shorter truncated 4-byte case like [0xf0, 0x90] to represent only one continuation byte available) and keep the assertion expecting true.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@assets/js/bitstring.mjs`:
- Around line 661-676: The isTruncatedUtf8Sequence function can read past the
end of the bytes array when start >= bytes.length; add an explicit guard at the
top of isTruncatedUtf8Sequence to return false if start is out of bounds (start
>= bytes.length or start < 0) before accessing bytes[start], so
getUtf8SequenceLength isn't passed undefined and the rest of the logic remains
correct; reference the isTruncatedUtf8Sequence function and
getUtf8SequenceLength and validate the bytes array/start bounds.
---
Nitpick comments:
In `@assets/js/bitstring.mjs`:
- Around line 255-261: Hoist the per-call object literals used for UTF-8
decoding into shared static/module-level constants to avoid allocating them on
every invocation: replace the local firstByteMasks inside decodeUtf8CodePoint
and the minValueForLength inside isValidUtf8CodePoint with a single shared
constant (e.g., BitString._utf8FirstByteMasks and BitString._utf8MinCodePoints
or top-level utf8FirstByteMasks/utf8MinCodePoints) and update references
(decodeUtf8CodePoint, isValidUtf8CodePoint, and any callers like
getValidUtf8Length) to use those static/module constants instead of recreating
the objects each call.
In `@assets/js/erlang/unicode.mjs`:
- Around line 280-360: The four functions characters_to_nfc_binary/1,
characters_to_nfd_binary/1, characters_to_nfkc_binary/1 and
characters_to_nfkd_binary/1 duplicate validateListRest, handleConversionError
and handleInvalidUtf8 with only the Unicode form string differing; extract a
single helper normalizeCharactersToBinary(data, form) that contains those shared
helpers and logic (use the existing validateListRest, handleConversionError and
handleInvalidUtf8 bodies inside it), accept the normalization form
("NFC"/"NFD"/"NFKC"/"NFKD") and return the same results as the current
implementations, then replace each of the four functions to simply call
normalizeCharactersToBinary(data, "<FORM>") (keeping the same behavior and
return shape). Ensure you reference and reuse Type, Bitstring, Erlang_Unicode
calls exactly as in the originals so behavior is preserved.
In `@test/javascript/bitstring_test.mjs`:
- Around line 5393-5470: The test suite has a duplicated 4-byte-truncation
input; update the "Edge case: multiple valid continuation bytes before
truncation" test that references Bitstring.isTruncatedUtf8Sequence so it uses a
distinct truncation depth (e.g., change its bytes from the duplicate [0xf0,
0x90, 0x8d] to a shorter truncated 4-byte case like [0xf0, 0x90] to represent
only one continuation byte available) and keep the assertion expecting true.
a9c9f75 to
f458bf4
Compare
There was a problem hiding this comment.
🧹 Nitpick comments (4)
assets/js/erlang/unicode.mjs (1)
320-328: Significant structural duplication across NFC/NFD/NFKC/NFKDhandleInvalidUtf8helpers.All four
handleInvalidUtf8closures (and their siblinghandleConversionError/validateListRest) are structurally identical, differing only in the normalization form string ("NFC","NFD","NFKC","NFKD"). A single shared factory or helper parameterized by the normalization form would eliminate ~100 lines of duplication.This duplication is largely pre-existing (not introduced by this PR), so deferring is fine — but worth tracking.
Also applies to: 553-562, 639-647, 724-732
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@assets/js/erlang/unicode.mjs` around lines 320 - 328, Multiple closures (handleInvalidUtf8, handleConversionError, validateListRest) are duplicated for each normalization form ("NFC","NFD","NFKC","NFKD"); extract a single factory function (e.g., makeNormalizationHandlers or createNormalizer) that takes the normalization form string and returns the three handlers, then replace the four per-form closure groups with calls like const { handleInvalidUtf8, handleConversionError, validateListRest } = makeNormalizationHandlers("NFC"); ensure the factory uses Bitstring.getValidUtf8Length, Bitstring.fromBytes, Bitstring.toText and Type.bitstring/Type.tuple exactly as current implementations do so behavior is unchanged.assets/js/bitstring.mjs (2)
621-633: Same per-call allocation pattern; consider inlining the lookup.Similar to
decodeUtf8CodePoint,minValueForLengthis recreated each call. A ternary chain or static array would avoid the allocation.♻️ Inline the minimum-value lookup
static isValidUtf8CodePoint(codePoint, encodingLength) { - // Check for overlong encodings (security issue) - const minValueForLength = {1: 0, 2: 0x80, 3: 0x800, 4: 0x10000}; - - // Reject code points that could have been encoded with fewer bytes (overlong) - if (codePoint < minValueForLength[encodingLength]) return false; + // Reject overlong encodings (security issue) + const minValue = encodingLength === 2 ? 0x80 : encodingLength === 3 ? 0x800 : encodingLength === 4 ? 0x10000 : 0; + if (codePoint < minValue) return false; // Reject UTF-16 surrogates (U+D800–U+DFFF) if (codePoint >= 0xd800 && codePoint <= 0xdfff) return false; // Reject code points beyond Unicode range (> U+10FFFF) if (codePoint > 0x10ffff) return false; return true; }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@assets/js/bitstring.mjs` around lines 621 - 633, The isValidUtf8CodePoint function currently allocates a new object minValueForLength on every call; replace that per-call allocation by inlining the lookup (e.g., a small static array or ternary/switch) so you return the minimum value for encodingLength without creating an object each invocation. Update isValidUtf8CodePoint to use the precomputed array (or a 3-way ternary/switch) to obtain minValueForLength for the given encodingLength and keep the existing validation checks intact; you can mirror the approach used in decodeUtf8CodePoint to locate the replacement.
255-269: Consider avoiding per-call object allocation in a hot-path helper.
firstByteMasksis re-created on every invocation. SincegetValidUtf8LengthcallsisValidUtf8Sequence→decodeUtf8CodePointfor every character, this adds GC pressure on large inputs.♻️ Replace object lookup with inline conditional
static decodeUtf8CodePoint(bytes, start, length) { if (length === 1) return bytes[start]; - // First byte masks: 2-byte=0x1f, 3-byte=0x0f, 4-byte=0x07 - const firstByteMasks = {2: 0x1f, 3: 0x0f, 4: 0x07}; - - let codePoint = bytes[start] & firstByteMasks[length]; + // First byte mask: 2-byte=0x1f, 3-byte=0x0f, 4-byte=0x07 + const mask = length === 2 ? 0x1f : length === 3 ? 0x0f : 0x07; + let codePoint = bytes[start] & mask; // Process continuation bytes (all use 0x3f mask, shift by 6 each) for (let i = 1; i < length; i++) { codePoint = (codePoint << 6) | (bytes[start + i] & 0x3f); } return codePoint; }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@assets/js/bitstring.mjs` around lines 255 - 269, The helper decodeUtf8CodePoint currently allocates firstByteMasks on every call; replace that per-call object with a static constant or inline conditional to avoid allocation in the hot path (e.g., use a small switch or a module-level array/const to map length→mask) and update decodeUtf8CodePoint to use that constant/mapping; also audit callers getValidUtf8Length and isValidUtf8Sequence to ensure they continue to pass the same start/length semantics.test/javascript/bitstring_test.mjs (1)
5393-5470: Optional: de-duplicate one truncated-sequence test + align comments.Two tests use the same input for a truncated 4‑byte sequence, and the first test’s comment mentions a continuation byte that isn’t present. Consider removing the duplicate and clarifying the comment to match the data.
♻️ Suggested cleanup
- // Happy path: truncated 2-byte sequence - it("returns true for truncated 2-byte sequence with valid continuation byte", () => { - // 0xC2 requires 2 bytes, but only 1 byte available (0x80 is valid continuation) + // Happy path: truncated 2-byte sequence (leader only) + it("returns true for truncated 2-byte sequence with leader only", () => { + // 0xC2 requires 2 bytes, but only the leader is available const bytes = new Uint8Array([0xc2]); assert.equal(Bitstring.isTruncatedUtf8Sequence(bytes, 0), true); }); @@ - // Edge case: multiple valid continuation bytes before truncation - it("returns true for 4-byte sequence with 2 valid continuation bytes (truncated)", () => { - // 0xF0 (4-byte) with 2 valid continuation bytes available - const bytes = new Uint8Array([0xf0, 0x90, 0x8d]); - assert.equal(Bitstring.isTruncatedUtf8Sequence(bytes, 0), true); - });🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/javascript/bitstring_test.mjs` around lines 5393 - 5470, Two tests call Bitstring.isTruncatedUtf8Sequence with the identical byte array for a truncated 4‑byte sequence; remove the duplicate test (the one titled "returns true for 4-byte sequence with 2 valid continuation bytes (truncated)") and keep a single clearly-named test (e.g., "returns true for truncated 4-byte sequence with valid continuation bytes"), and update that test's comment to accurately state how many continuation bytes are present (leader 0xF0 plus two continuation bytes 0x90, 0x8D, i.e., 3 total bytes available) so the description matches the input.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@assets/js/bitstring.mjs`:
- Around line 663-678: The function isTruncatedUtf8Sequence currently reads
bytes[start] without checking bounds; add an early guard in
isTruncatedUtf8Sequence that returns false when start is out of range (start >=
bytes.length or start < 0) before calling $.getUtf8SequenceLength or accessing
bytes[start], preserving the existing logic for expectedLength and
continuation-byte checks.
---
Nitpick comments:
In `@assets/js/bitstring.mjs`:
- Around line 621-633: The isValidUtf8CodePoint function currently allocates a
new object minValueForLength on every call; replace that per-call allocation by
inlining the lookup (e.g., a small static array or ternary/switch) so you return
the minimum value for encodingLength without creating an object each invocation.
Update isValidUtf8CodePoint to use the precomputed array (or a 3-way
ternary/switch) to obtain minValueForLength for the given encodingLength and
keep the existing validation checks intact; you can mirror the approach used in
decodeUtf8CodePoint to locate the replacement.
- Around line 255-269: The helper decodeUtf8CodePoint currently allocates
firstByteMasks on every call; replace that per-call object with a static
constant or inline conditional to avoid allocation in the hot path (e.g., use a
small switch or a module-level array/const to map length→mask) and update
decodeUtf8CodePoint to use that constant/mapping; also audit callers
getValidUtf8Length and isValidUtf8Sequence to ensure they continue to pass the
same start/length semantics.
In `@assets/js/erlang/unicode.mjs`:
- Around line 320-328: Multiple closures (handleInvalidUtf8,
handleConversionError, validateListRest) are duplicated for each normalization
form ("NFC","NFD","NFKC","NFKD"); extract a single factory function (e.g.,
makeNormalizationHandlers or createNormalizer) that takes the normalization form
string and returns the three handlers, then replace the four per-form closure
groups with calls like const { handleInvalidUtf8, handleConversionError,
validateListRest } = makeNormalizationHandlers("NFC"); ensure the factory uses
Bitstring.getValidUtf8Length, Bitstring.fromBytes, Bitstring.toText and
Type.bitstring/Type.tuple exactly as current implementations do so behavior is
unchanged.
In `@test/javascript/bitstring_test.mjs`:
- Around line 5393-5470: Two tests call Bitstring.isTruncatedUtf8Sequence with
the identical byte array for a truncated 4‑byte sequence; remove the duplicate
test (the one titled "returns true for 4-byte sequence with 2 valid continuation
bytes (truncated)") and keep a single clearly-named test (e.g., "returns true
for truncated 4-byte sequence with valid continuation bytes"), and update that
test's comment to accurately state how many continuation bytes are present
(leader 0xF0 plus two continuation bytes 0x90, 0x8D, i.e., 3 total bytes
available) so the description matches the input.
|
@coderabbitai Don't do any further reviews on this PR while it is in draft. |
|
✅ Actions performedReviews paused. |
Closes #703
Dependencies
Please note that this PR includes commits from the PR(s) it is dependent upon. Once the dependent PR(s) are merged to the dev branch, then this PR will be rebased and will then only contain its own commits. This PR will remain in draft until that point.
Summary by CodeRabbit
New Features
Refactor
Tests