Fix COPY FROM encoding error double-counting and enable SREH for transcoding #1597

Merged
avamingli merged 1 commit into apache:main from avamingli:issue_1425
Mar 3, 2026

@avamingli
Contributor

COPY FROM with SEGMENT REJECT LIMIT had two bugs when encountering invalid multi-byte encoding sequences:

  1. Encoding errors were double-counted: HandleCopyError() incremented rejectcount, then RemoveInvalidDataInBuf() incremented it again for the same error. This caused the reject limit to be reached twice as fast as expected.

  2. SREH (Single Row Error Handling) was completely disabled when transcoding was required (file encoding != database encoding). Any encoding error during transcoding would raise an ERROR instead of skipping the bad row.
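
The effect of the first bug can be illustrated with a small simulation. This is only a sketch, not the project's code: the function name, the row representation, and the `rejectcount >= reject_limit` abort condition are assumptions made for illustration.

```python
def simulate_copy(rows, reject_limit, double_count):
    """Simulate reject accounting for a COPY-like load.

    rows: iterable of booleans, True for a good row, False for a row
          with an invalid encoding sequence.
    Returns (rows_accepted, aborted).
    NOTE: hypothetical model; the abort condition is an assumption.
    """
    rejectcount = 0
    accepted = 0
    for row_ok in rows:
        if row_ok:
            accepted += 1
            continue
        rejectcount += 1          # counted once when the error is handled
        if double_count:
            rejectcount += 1      # pre-fix behavior: counted again during buffer cleanup
        if rejectcount >= reject_limit:
            return accepted, True  # limit reached, load aborts
    return accepted, False


rows = [True, False, True, False, True, False, True]
print(simulate_copy(rows, reject_limit=5, double_count=False))  # load completes
print(simulate_copy(rows, reject_limit=5, double_count=True))   # load aborts early
```

With three bad rows and a limit of 5, single counting lets the load finish, while double counting reaches the limit on the third bad row.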

Fix by removing the duplicate rejectcount++ from RemoveInvalidDataInBuf(), removing the !need_transcoding guard that blocked SREH for transcoding, and adding proper buffer cleanup for the transcoding case (advance raw_buf past the bad line using FindEolInUnverifyRawBuf).
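
The buffer cleanup for the transcoding case amounts to scanning the not-yet-verified raw bytes for the end of the bad line and resuming after it. The sketch below shows that idea only; `skip_bad_line` is a hypothetical helper, and the real signature and behavior of FindEolInUnverifyRawBuf are not shown in this PR description.

```python
def skip_bad_line(raw_buf: bytes, start: int) -> int:
    """Return the index just past the end of the line beginning at `start`.

    Hypothetical illustration of advancing a raw buffer past a bad line.
    Handles LF, CR, and CRLF line endings; if no end-of-line is found,
    the rest of the buffer is treated as the bad line.
    """
    i = start
    n = len(raw_buf)
    while i < n:
        b = raw_buf[i]
        if b == 0x0A:                      # LF
            return i + 1
        if b == 0x0D:                      # CR, possibly followed by LF
            if i + 1 < n and raw_buf[i + 1] == 0x0A:
                return i + 2
            return i + 1
        i += 1
    return n


buf = b"good\n\xb0\xff bad\nnext\n"
resume_at = skip_bad_line(buf, 5)   # the line starting at offset 5 holds invalid bytes
print(buf[resume_at:])              # remaining data after the bad line
```

Scanning raw bytes works here because the search does not need to decode the invalid sequence; it only looks for the line terminator.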

Add regression tests covering both non-transcoding (invalid UTF-8) and transcoding (invalid EUC_CN to UTF-8) cases with various reject limits.
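
To give a feel for the kind of input the transcoding tests exercise, the sketch below decodes rows from EUC_CN one at a time and records the rows that fail instead of aborting. It is an illustration in Python, not the regression tests themselves; `load_rows` is a hypothetical name, and Python's `gb2312` codec is used as a stand-in for EUC_CN.

```python
def load_rows(lines, file_encoding="gb2312"):
    """Transcode each raw line to text, skipping lines that fail to decode.

    Returns (loaded_rows, rejected_line_numbers).
    Hypothetical sketch of per-row error handling during transcoding.
    """
    loaded, rejected = [], []
    for lineno, raw in enumerate(lines, start=1):
        try:
            loaded.append(raw.decode(file_encoding))
        except UnicodeDecodeError:
            rejected.append(lineno)   # bad row is skipped, load continues
    return loaded, rejected


lines = [
    "\u4e2d\u6587".encode("gb2312"),  # valid two-character EUC_CN row
    b"\xd6\xff",                      # invalid: 0xFF is not a legal trail byte
    b"abc",                           # plain ASCII row
]
loaded, rejected = load_rows(lines)
print(loaded, rejected)
```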

Fixes #1425


@avamingli avamingli merged commit c5a298d into apache:main Mar 3, 2026
41 checks passed
@avamingli avamingli deleted the issue_1425 branch March 3, 2026 10:57

Development

Successfully merging this pull request may close these issues.

[Bug] Foreign table(COPY FROM) can't skip lines for invalid multi-byte-encoding text

3 participants