-
Notifications
You must be signed in to change notification settings - Fork 3
Open
Description
Lacking api documentation OR bug
When specifying an offset alongside a string, I expected the offset to refer to character positions—but it’s actually interpreted as byte positions.
Referring to API at:
- https://github.com/crispthinking/IronRe2/blob/b0281fde906b4606784b582ca47ca132564c6445/src/IronRe2/Regex.cs#L188C3-L188C51
Line 288 in b0281fd
public Captures Captures(string haystack, int offset)
Expected Behavior
- Offsets supplied for strings should be interpreted as character unit offsets.
- The regex engine should correctly map character offsets to the starting boundaries of UTF-8 code points.
Impact
- No issues when all characters are ASCII (single-byte UTF-8).
- For code points encoded as multiple bytes, the byte-based offset unintendedly may fall in the middle of a code point, causing the match to fail.
Proposed Solution
Mapping character offsets to byte offsets
public Match Find(string haystack, int offset)
{
var hayBytes = Encoding.UTF8.GetBytes(haystack);
var byteOffset = Encoding.UTF8.GetByteCount(text.AsSpan(0, offset));
return Find(hayBytes, offset);
}Improving documentation
Explicitly note and state warning that the offset refers to bytes:
/// <param name="haystack">The string to search for the pattern</param>
/// <param name="offset">The offest to start searching from **in the UTF-8 encoded haystack**</param>
/// <returns>The captures data</returns>
public Match Find(string haystack, int offset)Removing unintuitive api
Since the api that allows string input is just a wrapper for the api using bytes, disallowing the string input would only increase the code needed to call it by one more line, being the string to byte conversion: Encoding.UTF8.GetBytes(haystack).
Having to do the encoding to byte array themselves would make it clear what the offsets are referring to.
Reactions are currently unavailable
Metadata
Metadata
Assignees
Labels
No labels