It's not clear that offsets in a string haystack refer to byte offsets #13

@aHilsberg

Description

Lacking API documentation OR bug

When specifying an offset alongside a string, I expected the offset to refer to character positions, but it is actually interpreted as a byte position in the UTF-8 encoded haystack.

Referring to API at:

Expected Behavior

  • Offsets supplied for strings should be interpreted as character unit offsets.
  • The regex engine should correctly map character offsets to the starting boundaries of UTF-8 code points.

Impact

  • No issues when all characters are ASCII (single-byte UTF-8).
  • For code points encoded as multiple bytes, an offset meant as a character position but interpreted as a byte position may fall in the middle of a code point, causing the match to fail.
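
To make the impact concrete, here is a minimal sketch of how the two kinds of offset diverge. The class and method names (`OffsetDemo`, `ByteOffsetOf`, `IsCodePointBoundary`) are hypothetical helpers written for this illustration and are not part of the library:

```csharp
using System;
using System.Text;

public static class OffsetDemo
{
    // Number of UTF-8 bytes that precede the given character (UTF-16 code unit) index.
    public static int ByteOffsetOf(string text, int charOffset) =>
        Encoding.UTF8.GetByteCount(text.AsSpan(0, charOffset));

    // True when byteOffset points at the first byte of a code point, or at the end
    // of the buffer. UTF-8 continuation bytes always have the bit pattern 10xxxxxx.
    public static bool IsCodePointBoundary(byte[] utf8, int byteOffset) =>
        byteOffset == utf8.Length || (utf8[byteOffset] & 0xC0) != 0x80;
}
```

For the haystack `"h\u00E9llo"` ("héllo"), the character offset 2 points at the first `l`, but `ByteOffsetOf` returns 3 because `é` occupies two bytes. Passing 2 directly as a byte offset lands on the second byte of `é`, where `IsCodePointBoundary` returns `false`.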

Proposed Solution

Mapping character offsets to byte offsets

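    public Match Find(string haystack, int offset)
    {
        // Encode the haystack once; the byte-based overload does the actual search.
        var hayBytes = Encoding.UTF8.GetBytes(haystack);
        // Translate the character offset into the number of UTF-8 bytes preceding it.
        var byteOffset = Encoding.UTF8.GetByteCount(haystack.AsSpan(0, offset));
        return Find(hayBytes, byteOffset);
    }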

Improving documentation

Explicitly state that the offset refers to bytes:

    /// <param name="haystack">The string to search for the pattern</param>
    /// <param name="offset">The byte offset to start searching from **in the UTF-8 encoded haystack**</param>
    /// <returns>The captures data</returns>
    public Match Find(string haystack, int offset)

Removing the unintuitive API

Since the string overload is just a wrapper around the byte-based API, removing it would cost callers only one extra line: the string-to-byte conversion, Encoding.UTF8.GetBytes(haystack).
Having to encode the string to a byte array themselves would make it clear to callers what the offsets refer to.
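
As a rough sketch of what calling code could look like under this option: the helper below does the conversion on the caller's side. `CallerSideEncoding` and `FindFromChar` are hypothetical names, and the `findBytes` delegate only stands in for the library's byte-based `Find` overload, whose exact signature is assumed here:

```csharp
using System;
using System.Text;

public static class CallerSideEncoding
{
    // findBytes is a stand-in for the byte-based Find(byte[], int) overload (assumed shape).
    public static TResult FindFromChar<TResult>(
        Func<byte[], int, TResult> findBytes, string haystack, int charOffset)
    {
        // The caller encodes explicitly, so it is visible that the search runs on bytes.
        byte[] hayBytes = Encoding.UTF8.GetBytes(haystack);
        // Convert the character offset to the count of UTF-8 bytes that precede it.
        int byteOffset = Encoding.UTF8.GetByteCount(haystack.AsSpan(0, charOffset));
        return findBytes(hayBytes, byteOffset);
    }
}
```

A caller who already thinks in byte offsets would skip the helper and pass `hayBytes` and the byte offset directly.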
