fix: comprehensive CJK support for pattern detection #22
Closed
matsukaze-minden wants to merge 2 commits into BayramAnnakov:main
Extends #19's scope to cover Japanese, Chinese, and Korean correction detection, not just false positive filtering. Three areas of improvement:

1. False positive prevention (superset of #19):
   - Full-width question mark (?) and CJK question particles (嗎吗呢か까)
   - Short message rejection with CJK-aware threshold (2 chars for CJK vs 4 for ASCII)
   - Non-correction English phrases ("No problem", "don't worry", "never mind", etc.) that previously triggered CORRECTION_PATTERNS in mixed CJK-English text

2. CJK correction pattern detection (new):
   - Japanese: いや、違う、そうじゃなくて、間違ってる、じゃなくて〜にして、やめて、って言った
   - Chinese: 不是、错了/錯了、不要X要Y
   - Korean: 아니、틀렸

3. Test coverage:
   - 22 new tests in TestCJKPatternDetection class
   - Short message rejection, question particles, non-correction phrases
   - All CJK correction patterns with confidence assertions
   - English pattern regression tests

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
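The false positive rules listed in this commit message can be illustrated with a minimal sketch. The names `CJK_CHAR`, `QUESTION_END`, and `is_false_positive` are hypothetical, the phrase list contains only the three phrases quoted above, and the CJK character ranges are an assumption; the actual code in reflect_utils.py may be structured differently.

```python
import re

# Assumption: "CJK" here means Hiragana/Katakana, CJK Unified Ideographs,
# and Hangul syllables. The real check in reflect_utils.py may differ.
CJK_CHAR = re.compile(r"[\u3040-\u30ff\u4e00-\u9fff\uac00-\ud7a3]")

# ASCII or full-width question mark, or a trailing CJK question particle.
QUESTION_END = re.compile(r"[?\uff1f嗎吗呢か까]$")

# Only the phrases quoted in the commit message; the PR's list has more.
NON_CORRECTION_PHRASES = ("no problem", "don't worry", "never mind")


def is_false_positive(message: str) -> bool:
    """Return True when the message should not be treated as a correction."""
    text = message.strip()
    # CJK-aware short-message threshold: 2 chars for CJK, 4 for ASCII.
    limit = 2 if CJK_CHAR.search(text) else 4
    if len(text) <= limit:
        return True
    # Questions are not corrections.
    if QUESTION_END.search(text):
        return True
    # English openers that look like corrections but are not.
    return text.lower().startswith(NON_CORRECTION_PHRASES)
```

With these assumptions, a mixed message such as "No problem, 次に進もう" is filtered out before any English correction pattern sees it, while "違う、useStateじゃなくて" passes through to correction detection.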
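- Also detect いや違う→ as a correction
- Exclude noun-modifier uses such as 違うアプローチ ("a different approach"); match 違う only when followed by punctuation
- Add "No way!" to the non-correction phrases

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>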
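The punctuation-only rule for 違う from this commit can be sketched as follows. This is a rough illustration, not the PR's CJK_CORRECTION_PATTERNS: the name `CJK_CORRECTION_SKETCH`, the punctuation set, the decision to also accept end of message, and applying the same rule to 不是 and 아니 are all assumptions.

```python
import re
from typing import Optional

# Assumption: punctuation that may follow a correction opener.
PUNCT = "、。,,.!!"

# Hypothetical subset; the PR describes 13 patterns whose regexes are not shown here.
CJK_CORRECTION_SKETCH = {
    # 違う, optionally preceded by いや, must be followed by punctuation,
    # whitespace, or end of message, so 違うアプローチ (noun modifier) is skipped.
    "ja": re.compile(rf"^(?:いや[、,]?\s*)?違う(?=[{PUNCT}\s]|$)"),
    "zh": re.compile(rf"^不是(?=[{PUNCT}\s]|$)"),
    "ko": re.compile(rf"^아니(?=[{PUNCT}\s]|$)"),
}


def detect_cjk_correction(message: str) -> Optional[str]:
    """Return the language key of the first matching pattern, or None."""
    text = message.strip()
    for lang, pattern in CJK_CORRECTION_SKETCH.items():
        if pattern.search(text):
            return lang
    return None
```

The lookahead keeps the punctuation out of the match itself, which would matter if the caller later extracts the text after the opener.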
Contributor
Closing: will resubmit from correct account.

Author
Closing: resubmitting from correct account (shohu).

Contributor
Could you please delete this PR? It was submitted from the wrong account by mistake. The final version is #24. Sorry for the clutter.
Summary
Extends #19's scope to provide comprehensive CJK (Chinese, Japanese, Korean) support for pattern detection — covering both false positive prevention and correction pattern detection.
- Full-width ? and CJK question particles (嗎吗呢か까) in false positive filter
- Non-correction English phrases (No problem, don't worry, never mind, etc.) to prevent false positives in mixed CJK-English text

Problem
The existing pattern detection is English-centric, causing two classes of issues for CJK users:
1. False positives (English patterns triggering on non-corrections)
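| Message | Pattern |
| --- | --- |
| No problem, 次に進もう | ^no[,. ]+ |
| don't worry、大丈夫 | ^don't\b |
| Never mind、別の方法でやろう | ^never\b |
| 何ですか? | \?$ (ASCII ? only, so the full-width ? is not filtered) |

2. False negatives (CJK corrections not detected)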
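- いや、そっちじゃなくて ("No, not that one")
- 違う、useStateじゃなくて ("Wrong, not useState")
- それ間違ってる ("That's wrong")
- 不是,应该用另一个方法 ("No, you should use another method")

Changes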
scripts/lib/reflect_utils.py

False positive prevention:
- FALSE_POSITIVE_PATTERNS: \?$ → [?\uff1f]$ + CJK question particles
- NON_CORRECTION_PHRASES: New list of 8 English phrases that look like correction openers but aren't (No problem, don't worry, never mind, etc.)
- detect_patterns(): CJK-aware short message threshold (≤2 chars for CJK, ≤4 for ASCII)

CJK correction detection:
- CJK_CORRECTION_PATTERNS: 13 patterns across 3 languages
- Added to the detect_patterns() flow between the false positive check and the English correction check

tests/test_reflect_utils.py

- TestCJKPatternDetection class with 22 tests

Relationship to #19
This PR is a superset of #19. All three changes from #19 are included:
- ? support

Plus additional improvements:
Test plan
🤖 Generated with Claude Code
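
The ordering described under Changes (false positive check first, then the CJK correction patterns, then the English correction check) might look roughly like the sketch below. The function `classify`, its return labels, and the single stand-in CJK regex are hypothetical; only the three English patterns are taken from the Problem section, and detect_patterns() itself is not reproduced here.

```python
import re
from typing import Optional

# English patterns quoted in the Problem section.
ENGLISH_CORRECTION = [
    re.compile(p, re.IGNORECASE)
    for p in (r"^no[,. ]+", r"^don't\b", r"^never\b")
]

# Assumption: one rough stand-in for the PR's 13 CJK patterns.
CJK_CORRECTION = [re.compile(r"^(?:いや|違う|不是|아니)[、,,\s]")]

# Only the phrases quoted in the description; the PR's list has 8.
NON_CORRECTION_PHRASES = ("no problem", "don't worry", "never mind")
QUESTION_END = re.compile(r"[?\uff1f]$")


def classify(message: str) -> Optional[str]:
    """Hypothetical three-step flow; labels are illustrative only."""
    text = message.strip()
    # 1. False positive check runs first, so "No problem, ..." never
    #    reaches the ^no[,. ]+ pattern below.
    if QUESTION_END.search(text) or text.lower().startswith(NON_CORRECTION_PHRASES):
        return None
    # 2. CJK correction patterns.
    if any(p.search(text) for p in CJK_CORRECTION):
        return "cjk_correction"
    # 3. Existing English correction patterns.
    if any(p.search(text) for p in ENGLISH_CORRECTION):
        return "correction"
    return None
```

Running the false positive check before either pattern set means a single filter protects both the new CJK patterns and the existing English ones.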