fix: comprehensive CJK support for pattern detection #22

Closed
matsukaze-minden wants to merge 2 commits into BayramAnnakov:main from
matsukaze-minden:fix/cjk-comprehensive-support

Conversation

@matsukaze-minden

Summary

Extends #19's scope to provide comprehensive CJK (Chinese, Japanese, Korean) support for pattern detection — covering both false positive prevention and correction pattern detection.

  • Support the full-width question mark (?) and CJK question particles (嗎吗呢か까) in the false positive filter
  • Add CJK-aware short message threshold (2 chars for CJK vs 4 for ASCII)
  • Add non-correction English phrase filter (No problem, don't worry, never mind, etc.) to prevent false positives in mixed CJK-English text
  • Add 13 CJK correction patterns: Japanese (8), Chinese (3), Korean (2)
  • 22 new tests covering all changes

Problem

The existing pattern detection is English-centric, causing two classes of issues for CJK users:

1. False positives (English patterns triggering on non-corrections)

| Input | Matched pattern | Actual intent |
|---|---|---|
| No problem, 次に進もう | ^no[,. ]+ | Agreement, not correction |
| don't worry、大丈夫 | ^don't\b | Reassurance, not correction |
| Never mind、別の方法でやろう | ^never\b | Dismissal, not correction |
| 何ですか? | (not matched by \?$) | Question ending in a full-width question mark |
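The false positives in the table can be reproduced with a minimal sketch. The three opener regexes and `\?$` are copied from the table; the `re.IGNORECASE` flag, the list name `CORRECTION_OPENERS`, and the helper `matched_openers` are assumptions made for illustration and are not taken from `reflect_utils.py`.

```python
import re

# Opener regexes quoted in the table; case-insensitive matching is an assumption.
CORRECTION_OPENERS = [
    re.compile(r"^no[,. ]+", re.IGNORECASE),
    re.compile(r"^don't\b", re.IGNORECASE),
    re.compile(r"^never\b", re.IGNORECASE),
]

# ASCII-only question filter, as quoted in the last table row.
ASCII_QUESTION = re.compile(r"\?$")


def matched_openers(text: str) -> list[str]:
    """Return the source of every opener regex that matches the text (hypothetical helper)."""
    return [p.pattern for p in CORRECTION_OPENERS if p.search(text)]


if __name__ == "__main__":
    for sample in ["No problem, 次に進もう", "don't worry、大丈夫", "Never mind、別の方法でやろう"]:
        print(sample, "->", matched_openers(sample))
    # The full-width question mark (U+FF1F) slips past the ASCII-only filter.
    print(bool(ASCII_QUESTION.search("何ですか?")))
```

Each of the first three inputs trips exactly one opener even though none is a correction, and the full-width question is not recognized as a question at all.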

2. False negatives (CJK corrections not detected)

| Input | Expected detection | Actual |
|---|---|---|
| いや、そっちじゃなくて | Correction ("no, not that") | Not detected |
| 違う、useStateじゃなくて | Correction ("wrong, not useState") | Not detected |
| それ間違ってる | Correction ("that's wrong") | Not detected |
| 不是,应该用另一个方法 | Correction ("no, use another method") | Not detected |

Changes

scripts/lib/reflect_utils.py

False positive prevention:

  • FALSE_POSITIVE_PATTERNS: \?$ → [?\uff1f]$ + CJK question particles
  • NON_CORRECTION_PHRASES: New list of 8 English phrases that look like correction openers but aren't (No problem, don't worry, never mind, etc.)
  • detect_patterns(): CJK-aware short message threshold (≤2 chars for CJK, ≤4 for ASCII)
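The three false positive checks listed above could look roughly like the following sketch. The regex contents, the Unicode ranges used to decide whether a message counts as CJK, and the helper names `is_too_short` and `is_false_positive` are assumptions; the PR text gives only the list names, the `[?\uff1f]$` pattern, the particles, a few example phrases, and the two thresholds.

```python
import re

# Assumed stand-ins for the real lists in scripts/lib/reflect_utils.py.
# A message ending in ? or ? or in one of the listed question particles.
QUESTION_END = re.compile(r"(?:[?\uff1f]|[嗎吗呢か까])$")

# Only the phrases named in the PR text; the full list of 8 is not shown there.
NON_CORRECTION_PHRASES = re.compile(
    r"^(?:no problem|no way|don't worry|never mind)\b", re.IGNORECASE
)

# Assumption: a message is "CJK" if it contains kana, a CJK ideograph, or a Hangul syllable.
CJK_CHAR = re.compile(r"[\u3040-\u30ff\u4e00-\u9fff\uac00-\ud7af]")


def is_too_short(text: str) -> bool:
    """Reject messages of <=2 chars when CJK is present, <=4 chars otherwise."""
    stripped = text.strip()
    limit = 2 if CJK_CHAR.search(stripped) else 4
    return len(stripped) <= limit


def is_false_positive(text: str) -> bool:
    """True if the message should be skipped before any correction matching."""
    stripped = text.strip()
    return bool(
        is_too_short(stripped)
        or QUESTION_END.search(stripped)
        or NON_CORRECTION_PHRASES.match(stripped)
    )
```

The lower CJK threshold reflects that a few CJK characters can carry a complete phrase, whereas four ASCII characters rarely do.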

CJK correction detection:

  • CJK_CORRECTION_PATTERNS: 13 patterns across 3 languages
    • Japanese (8): いや、違う、そうじゃなくて、間違ってる、じゃなくて〜にして、やめて、そうじゃない、って言った
    • Chinese (3): 不是、错了/錯了、不要X要Y
    • Korean (2): 아니、틀렸
  • Integrated into detect_patterns() flow between false positive check and English correction check
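As a rough illustration of the detection order described above (false positive check, then CJK corrections, then English corrections), here is a minimal sketch. The regexes are built from the phrases listed in the bullets and are not the expressions used in the PR; the function names `is_cjk_correction` and `classify`, the string labels they return, and the simplified one-pattern false positive and English checks are all hypothetical.

```python
import re

# Illustrative regexes derived from the listed phrases (assumed, not the PR's own).
CJK_CORRECTION_PATTERNS = [
    # Japanese
    re.compile(r"^いや[、,,\s]"),
    re.compile(r"^違う(?=[、。,,!!\s]|$)"),  # only before punctuation, not 違うアプローチ
    re.compile(r"そうじゃな(?:くて|い)"),
    re.compile(r"間違って"),
    re.compile(r"じゃなくて"),
    re.compile(r"やめて"),
    re.compile(r"って言った"),
    # Chinese
    re.compile(r"^不是(?=[,,、\s]|$)"),
    re.compile(r"[错錯]了"),
    re.compile(r"不要.+要"),
    # Korean
    re.compile(r"^아니"),
    re.compile(r"틀렸"),
]

QUESTION_END = re.compile(r"[?\uff1f]$")                  # simplified false positive check
ENGLISH_CORRECTION = re.compile(r"^no[,. ]+", re.IGNORECASE)  # one English pattern only


def is_cjk_correction(text: str) -> bool:
    """True if any illustrative CJK correction regex matches."""
    stripped = text.strip()
    return any(p.search(stripped) for p in CJK_CORRECTION_PATTERNS)


def classify(text: str):
    """Hypothetical ordering: false positives first, then CJK, then English."""
    stripped = text.strip()
    if QUESTION_END.search(stripped):
        return None
    if is_cjk_correction(stripped):
        return "cjk_correction"
    if ENGLISH_CORRECTION.search(stripped):
        return "english_correction"
    return None
```

Running the CJK check before the English one means a mixed message such as 違う、useStateじゃなくて is classified by its CJK opener rather than by any English text that follows.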

tests/test_reflect_utils.py

  • New TestCJKPatternDetection class with 22 tests:
    • Short message rejection (ASCII and CJK-aware thresholds)
    • Full-width question mark and CJK question particles (ja/zh/ko)
    • Non-correction English phrases in mixed text
    • All 13 CJK correction patterns
    • English pattern regression tests

Relationship to #19

This PR is a superset of #19. All three changes from #19 are included:

  • ✅ Full-width question mark (?) support
  • ✅ CJK question particles
  • ✅ Short message rejection

Plus additional improvements: the non-correction English phrase filter and the 13 CJK correction patterns.

Test plan

  • All 94 tests pass (72 existing + 22 new)
  • No behavior change for existing English detection patterns
  • Verified with real-world mixed Japanese-English prompts
  • CJK-aware threshold correctly handles single-character CJK vs multi-char ASCII

🤖 Generated with Claude Code

matsukaze-minden and others added 2 commits March 1, 2026 08:13
Extends #19's scope to cover Japanese, Chinese, and Korean correction
detection — not just false positive filtering.

Three areas of improvement:

1. False positive prevention (superset of #19):
   - Full-width question mark (?) and CJK question particles (嗎吗呢か까)
   - Short message rejection with CJK-aware threshold (2 chars for CJK vs 4 for ASCII)
   - Non-correction English phrases ("No problem", "don't worry", "never mind", etc.)
     that previously triggered CORRECTION_PATTERNS in mixed CJK-English text

2. CJK correction pattern detection (new):
   - Japanese: いや、違う、そうじゃなくて、間違ってる、じゃなくて〜にして、やめて、って言った
   - Chinese: 不是、错了/錯了、不要X要Y
   - Korean: 아니、틀렸

3. Test coverage:
   - 22 new tests in TestCJKPatternDetection class
   - Short message rejection, question particles, non-correction phrases
   - All CJK correction patterns with confidence assertions
   - English pattern regression tests

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
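- Also detect いや違う→ as a correction
- Exclude noun-modifying uses such as 違うアプローチ ("a different approach"); match only when followed by punctuation
- Add No way! to the non-correction phrases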

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@shohu
Contributor

shohu commented Feb 28, 2026

Closing: will resubmit from correct account.

@matsukaze-minden
Author

Closing: resubmitting from correct account (shohu).

@matsukaze-minden matsukaze-minden deleted the fix/cjk-comprehensive-support branch February 28, 2026 23:23
@shohu
Contributor

shohu commented Feb 28, 2026

Could you please delete this PR? It was submitted from the wrong account by mistake. The final version is #24. Sorry for the clutter.
