Extracts truncated UTF-8 sequence validation to BitString class#718
Conversation
|
Note Reviews pausedUse the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds UTF‑8 decoding, validation, and truncation detection utilities to Bitstring and refactors the Erlang unicode module to use those utilities; duplicates of the new utilities appear in bitstring.mjs and tests for the utilities are added. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 inconclusive)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (3)
assets/js/bitstring.mjs (1)
258-259: Consider hoisting constant lookup objects out of the method bodies.
firstByteMasks(Line 259) andminValueForLength(Line 602) are re-created on every call. If these methods are called in a tight loop (e.g., scanning long byte arrays), the repeated object allocation is unnecessary overhead. Hoisting them to module-level or static class-level constants eliminates this.♻️ Suggested change
+// Module-level constants for UTF-8 decoding +const UTF8_FIRST_BYTE_MASKS = {2: 0x1f, 3: 0x0f, 4: 0x07}; +const UTF8_MIN_VALUE_FOR_LENGTH = {1: 0, 2: 0x80, 3: 0x800, 4: 0x10000}; + export default class Bitstring {Then use them in the respective methods:
static decodeUtf8CodePoint(bytes, start, length) { if (length === 1) return bytes[start]; - const firstByteMasks = {2: 0x1f, 3: 0x0f, 4: 0x07}; - let codePoint = bytes[start] & firstByteMasks[length]; + let codePoint = bytes[start] & UTF8_FIRST_BYTE_MASKS[length];static isValidUtf8CodePoint(codePoint, encodingLength) { - const minValueForLength = {1: 0, 2: 0x80, 3: 0x800, 4: 0x10000}; - if (codePoint < minValueForLength[encodingLength]) return false; + if (codePoint < UTF8_MIN_VALUE_FOR_LENGTH[encodingLength]) return false;Also applies to: 601-602
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@assets/js/bitstring.mjs` around lines 258 - 259, The objects firstByteMasks and minValueForLength are being re-created inside function bodies on every call causing unnecessary allocations; move them to module-level constants (e.g., define const FIRST_BYTE_MASKS = {2:0x1f,3:0x0f,4:0x07} and const MIN_VALUE_FOR_LENGTH = {...}) and replace the in-function usages of firstByteMasks and minValueForLength with these hoisted constants in the functions that reference them (search for identifiers firstByteMasks and minValueForLength to update all occurrences).assets/js/erlang/unicode.mjs (1)
670-672: Nit: inconsistent comment casing.Line 670 uses lowercase "scan forward" while the other four copies (Lines 98, 309, 563, 776) use "Scan forward". Trivial, but one more reason to extract the function.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@assets/js/erlang/unicode.mjs` around lines 670 - 672, The comment "scan forward, validating each sequence" is inconsistently cased compared to other occurrences; update this comment to "Scan forward, validating each sequence" to match the other instances (or better, extract the repeated logic into a helper function used by the blocks that declare let pos = 0 to remove duplicated comments). Locate the block containing the variable pos (the scanning loop) and either change the comment text to title case or refactor the scan logic into a single function (e.g., scanForward/validateSequence) and replace the duplicated comments with a single well-named function call.test/javascript/bitstring_test.mjs (1)
5267-5344: Clarify and de‑duplicate truncated‑sequence tests.
Two 4‑byte truncation tests use identical input, and the first test’s comment mentions a continuation byte that isn’t present. Consider removing the duplicate and aligning the comment or sample bytes for clarity.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/javascript/bitstring_test.mjs` around lines 5267 - 5344, The two tests asserting truncation for 4-byte sequences duplicate the same input ([0xf0, 0x90, 0x8d]) and the first test's comment is inconsistent; remove the duplicate test (either the one titled "returns true for truncated 4-byte sequence with valid continuation bytes" or "returns true for 4-byte sequence with 2 valid continuation bytes (truncated)") and keep a single test that uses Bitstring.isTruncatedUtf8Sequence with bytes [0xf0, 0x90, 0x8d], updating its comment to state "only 3 bytes available (2 valid continuation bytes)" so the test name, comment and sample bytes are aligned.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@assets/js/erlang/unicode.mjs`:
- Around line 97-112: The helper findValidUtf8Length is duplicated in five
places; move it into the Bitstring utility and call it from unicode.mjs instead
of inlining; specifically, add a method (e.g.,
Bitstring.findValidUtf8Length(bytes)) to bitstring.mjs that performs the same
loop using Bitstring.getUtf8SequenceLength and Bitstring.isValidUtf8Sequence,
export/attach it to the Bitstring class, then in unicode.mjs replace each
in-file findValidUtf8Length definition with direct calls to
Bitstring.findValidUtf8Length(bytes) from the functions characters_to_list/1,
characters_to_nfc_binary/1, characters_to_nfd_binary/1,
characters_to_nfkc_binary/1, and characters_to_nfkd_binary/1.
---
Nitpick comments:
In `@assets/js/bitstring.mjs`:
- Around line 258-259: The objects firstByteMasks and minValueForLength are
being re-created inside function bodies on every call causing unnecessary
allocations; move them to module-level constants (e.g., define const
FIRST_BYTE_MASKS = {2:0x1f,3:0x0f,4:0x07} and const MIN_VALUE_FOR_LENGTH =
{...}) and replace the in-function usages of firstByteMasks and
minValueForLength with these hoisted constants in the functions that reference
them (search for identifiers firstByteMasks and minValueForLength to update all
occurrences).
In `@assets/js/erlang/unicode.mjs`:
- Around line 670-672: The comment "scan forward, validating each sequence" is
inconsistently cased compared to other occurrences; update this comment to "Scan
forward, validating each sequence" to match the other instances (or better,
extract the repeated logic into a helper function used by the blocks that
declare let pos = 0 to remove duplicated comments). Locate the block containing
the variable pos (the scanning loop) and either change the comment text to title
case or refactor the scan logic into a single function (e.g.,
scanForward/validateSequence) and replace the duplicated comments with a single
well-named function call.
In `@test/javascript/bitstring_test.mjs`:
- Around line 5267-5344: The two tests asserting truncation for 4-byte sequences
duplicate the same input ([0xf0, 0x90, 0x8d]) and the first test's comment is
inconsistent; remove the duplicate test (either the one titled "returns true for
truncated 4-byte sequence with valid continuation bytes" or "returns true for
4-byte sequence with 2 valid continuation bytes (truncated)") and keep a single
test that uses Bitstring.isTruncatedUtf8Sequence with bytes [0xf0, 0x90, 0x8d],
updating its comment to state "only 3 bytes available (2 valid continuation
bytes)" so the test name, comment and sample bytes are aligned.
6d289bb to
9ab7bc3
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
assets/js/bitstring.mjs (1)
255-269: Object literalsfirstByteMasksandminValueForLengthare re-allocated on every call.Both
decodeUtf8CodePoint(line 259) andisValidUtf8CodePoint(line 602) construct a new plain-object lookup table on each invocation. These tables are constant and could be hoisted to module-level or class-levelconstvalues to avoid repeated allocation, especially since these methods are called inside tight loops during UTF-8 scanning.♻️ Proposed refactor (illustrative — apply equivalent changes to both methods)
+const FIRST_BYTE_MASKS = {2: 0x1f, 3: 0x0f, 4: 0x07}; +const MIN_VALUE_FOR_LENGTH = {1: 0, 2: 0x80, 3: 0x800, 4: 0x10000}; static decodeUtf8CodePoint(bytes, start, length) { if (length === 1) return bytes[start]; - const firstByteMasks = {2: 0x1f, 3: 0x0f, 4: 0x07}; - let codePoint = bytes[start] & firstByteMasks[length]; + let codePoint = bytes[start] & FIRST_BYTE_MASKS[length]; ... } static isValidUtf8CodePoint(codePoint, encodingLength) { - const minValueForLength = {1: 0, 2: 0x80, 3: 0x800, 4: 0x10000}; - if (codePoint < minValueForLength[encodingLength]) return false; + if (codePoint < MIN_VALUE_FOR_LENGTH[encodingLength]) return false; ... }Also applies to: 596-612
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@assets/js/bitstring.mjs` around lines 255 - 269, Hoist the per-call lookup objects into shared constants to avoid re-allocation: move the firstByteMasks and minValueForLength objects used in decodeUtf8CodePoint and isValidUtf8CodePoint out of the functions and define them once at module (or class static) scope as const FIRST_BYTE_MASKS and MIN_VALUE_FOR_LENGTH, then replace the in-function object literals with references to those constants; ensure you update both decodeUtf8CodePoint and isValidUtf8CodePoint to use the new constant names (and keep behavior identical).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@assets/js/bitstring.mjs`:
- Around line 638-657: The function isTruncatedUtf8Sequence currently reads
bytes[start] without bounds checking, causing out-of-range starts to be treated
as a valid truncated UTF-8; add an explicit bounds guard at the top of
isTruncatedUtf8Sequence to immediately return false when start is outside the
buffer (start < 0 or start >= bytes.length) before calling
$.getUtf8SequenceLength or $.isValidUtf8ContinuationByte, keeping the rest of
the logic intact.
---
Duplicate comments:
In `@assets/js/erlang/unicode.mjs`:
- Around line 97-112: The function findValidUtf8Length is duplicated across the
five Unicode helper blocks; extract a single shared implementation (keep the
current behavior that returns a numeric pos and uses
Bitstring.getUtf8SequenceLength and Bitstring.isValidUtf8Sequence) into one
reusable helper (e.g., a top-level/exported findValidUtf8Length in the module)
and replace the other verbatim copies by calling that shared helper; ensure the
helper name and signature remain findValidUtf8Length(bytes) returning a number
so callers in the existing Unicode helper functions continue to work without
further changes.
---
Nitpick comments:
In `@assets/js/bitstring.mjs`:
- Around line 255-269: Hoist the per-call lookup objects into shared constants
to avoid re-allocation: move the firstByteMasks and minValueForLength objects
used in decodeUtf8CodePoint and isValidUtf8CodePoint out of the functions and
define them once at module (or class static) scope as const FIRST_BYTE_MASKS and
MIN_VALUE_FOR_LENGTH, then replace the in-function object literals with
references to those constants; ensure you update both decodeUtf8CodePoint and
isValidUtf8CodePoint to use the new constant names (and keep behavior
identical).
|
@coderabbitai Don't do any further reviews on this PR while it is in draft. |
|
✅ Actions performedReviews paused. |
Closes #714
Dependencies
Please note that this PR includes commits from the PR(s) it is dependent upon. Once the dependent PR(s) are merged to the dev branch, then this PR will be rebased and will then only contain its own commits. This PR will remain in draft until that point.
Summary by CodeRabbit
New Features
Refactor
Tests