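- In clean_kubejs_from_raw_impl, add reverse_index dedup before pending_en is written
- If an English text (value) already appears in final/zh_tw.json (under a different key), skip it and do not write it to pending
- Avoids translating the same English source text more than once just because it appears under different keys

- Add ColorCharError dataclass to record illegal color-character errors
- Implement COLOR_PATTERN regex: & may only be followed by a-v (excluding w), 0-9, space, \, #
- check_color_chars(): check a single string for illegal color characters
- check_json_file(): read a JSON file and recursively check all string values
- check_directory(): recursively check all .json files under a directory
- Follows the existing checkers' Generator yield pattern

…ipelines
Commit 1: feat(rich-text-shield) - add core module
- New module: translation_tool/plugins/shared/rich_text_shield.py
- ShieldPiece / ShieldedText dataclasses
- shield_text(): extracts 7 kinds of format fragments that must not be translated (color codes / item IDs / URLs / escaped \& / images / event JSON / page turns)
- unshield_text(): restores all placeholders
- add_escape_quotes(): JSON escape fix (ported from FTBQL)
- Updated shared/__init__.py to export the new module
Commit 2: fix(kubejs-lm) - integrate shield/unshield into LM translation pipeline
- collect_items_from_mapping(): writes the shield_text() result to _shielded; keeps the original text when skip_reason is not None
- on_translated_item(): unshield_text() restores the translated text
Commit 3: fix(kubejs-clean) - integrate shield into s2t pipeline (Phase 2)
- kubejs_translator_clean.py: _shielded_convert() helper protects the safe_convert_text_fn call sites: deep_merge_3way_flat_impl() and client_scripts processing
- Prevents the OpenCC s2t conversion from corrupting KubeJS format markers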
- md_extract_qa: shield_text before writing pending JSON
- md_inject_qa: unshield_text before writing back to MD
- Item dataclass: add _shields field for shield restoration
- _shield_item helper for clean shield/unshield roundtrip
prevent ShieldedText object from leaking into json.dumps()
Greptile Summary
This PR introduces … The core design is sound and the KubeJS/MD integrations are largely correct, but two integrations contain bugs that prevent the shielding from working.
Confidence Score: 2/5
Important Files Changed
Sequence Diagram
sequenceDiagram
participant Caller
participant shield_text
participant LM as LM Translator
participant unshield_text
Note over Caller,unshield_text: ✅ KubeJS / MD pipeline (correct)
Caller->>shield_text: shield_text(src)
shield_text-->>Caller: ShieldedText(clean, shields, skip_reason)
alt skip_reason is not None
Caller->>Caller: keep original, skip LM
else needs translation
Caller->>LM: translate(ShieldedText.clean)
LM-->>Caller: translated_text (with placeholders)
Caller->>unshield_text: unshield_text(translated_text, shielded.shields)
unshield_text-->>Caller: final restored text
end
Note over Caller,unshield_text: ❌ FTBQuests pipeline (broken)
Caller->>LM: translate(src_text) ← original, NOT shielded
LM-->>Caller: translated_text (no placeholders)
Caller->>shield_text: shield_text(src_text) ← re-shield AFTER translation
shield_text-->>Caller: ShieldedText (new counters, new placeholders)
Caller->>unshield_text: unshield_text(translated_text, shielded_src) ← ShieldedText not .shields
Note right of unshield_text: TypeError caught silently,<br/>unshielding never runs
Reviews (1): Last reviewed commit: "fix(md): strip _shielded before JSON ser..."
if isinstance(p, str) and isinstance(t, str):
    try:
        shielded_src = shield_text(src_text)
        t = unshield_text(t, shielded_src)
ShieldedText passed instead of .shields — unshielding silently never runs
unshield_text expects list[ShieldPiece] as its second argument, but shielded_src here is a ShieldedText dataclass. When sorted() tries to iterate over a ShieldedText instance it raises TypeError: 'ShieldedText' object is not iterable, which is then silently swallowed by the broad except Exception: pass. The net result is that every FTBQuests translation callback leaves color/item-ID placeholders un-restored in t.
Even if the type were fixed, this approach has a second problem: the source text is never shielded before being sent to the LM, so t will never contain placeholders like $C0$ to restore. The shielding must happen at item-collection time (as in the KubeJS path) and the resulting ShieldedText must travel with the item to the callback.
Suggested change:
- t = unshield_text(t, shielded_src)
+ shielded_src = shield_text(src_text)
+ t = unshield_text(t, shielded_src.shields)
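The flow described in this comment (shield at item-collection time, carry the ShieldedText with the item, unshield in the callback) can be sketched as follows. This is a minimal sketch under stated assumptions: the shield_text / unshield_text bodies below are simplified stand-ins that only handle color codes, not the PR's real implementations, and collect_item, on_translated, and fake_translate are hypothetical names used for illustration only.

```python
import re
from dataclasses import dataclass, field

# Stand-in pattern: only color codes are shielded in this sketch.
COLOR = re.compile(r"&[0-9a-fk-or]", re.IGNORECASE)


@dataclass
class ShieldPiece:
    placeholder: str
    original: str


@dataclass
class ShieldedText:
    clean: str
    shields: list = field(default_factory=list)


def shield_text(src: str) -> ShieldedText:
    """Simplified stand-in: replace each color code with a $C<n>$ placeholder."""
    shields: list = []

    def repl(match: re.Match) -> str:
        placeholder = f"$C{len(shields)}$"
        shields.append(ShieldPiece(placeholder, match.group(0)))
        return placeholder

    return ShieldedText(COLOR.sub(repl, src), shields)


def unshield_text(text: str, shields: list) -> str:
    """Simplified stand-in: put every original fragment back."""
    for piece in shields:
        text = text.replace(piece.placeholder, piece.original)
    return text


def collect_item(src: str) -> dict:
    """Hypothetical collector: shield BEFORE the text is sent to the LM."""
    shielded = shield_text(src)
    return {"source_text": src, "text": shielded.clean, "_shielded": shielded}


def on_translated(item: dict, translated: str) -> str:
    """Hypothetical callback: pass .shields (a list), not the ShieldedText itself."""
    return unshield_text(translated, item["_shielded"].shields)


def fake_translate(text: str) -> str:
    """Toy translator that leaves placeholders untouched."""
    return text.replace("Hello", "你好")


item = collect_item("&aHello &rworld")
restored = on_translated(item, fake_translate(item["text"]))
```

Because the placeholders are created before translation and the same ShieldedText is reused in the callback, the restored text lines up with what the LM actually saw.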
if shielded.skip_reason is not None:
    # Must not be translated (image/URL/event/blank): write the original text directly, bypassing the translation pipeline
    items.append(
        {
            "file": file_hint,
            "path": k,
            "source_text": v,
            "text": v,  # keep the original text
            "cache_type": "kubejs",
            "_shielded": shielded,  # for unshield lookup (no restoration needed in this case)
            "_skip_reason": shielded.skip_reason,
        }
    )
skip_reason items still flow into the LM translation pipeline
The comment on line 88 reads "直接寫入原文不經翻譯管線" (write original directly, bypassing the translation pipeline), but these items are appended to the same items list that is returned, fed into fast_split_items_by_cache, and — on a cache miss — added to all_miss_items which is sent to the LM. The _skip_reason flag is set on the dict but is never checked anywhere downstream before dispatching.
In practice this means URLs, image paths, and event JSON fragments are still sent to the LM for translation, wasting API calls and risking corruption of those values if the model translates them.
To honour the skip intent, either filter these items out of all_miss_items before the LM call, or handle them as immediate "cache hits" so they bypass the translation loop entirely.
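One way to apply the first option is to partition the collected items on the _skip_reason flag before anything is dispatched. The helper below is a hypothetical sketch (split_skipped is not a function from the PR); it assumes only that each item is a dict and that a non-None "_skip_reason" marks an item that must keep its original text.

```python
def split_skipped(items: list) -> tuple:
    """Separate items that need translation from items flagged to be skipped.

    Hypothetical helper: skipped items keep their original "text" and should
    be written out directly instead of joining the LM batch.
    """
    to_translate, skipped = [], []
    for item in items:
        if item.get("_skip_reason") is not None:
            skipped.append(item)
        else:
            to_translate.append(item)
    return to_translate, skipped


items = [
    {"path": "a", "text": "Hello", "_skip_reason": None},
    {"path": "b", "text": "https://example.com", "_skip_reason": "url"},
    {"path": "c", "text": "World"},  # no flag at all: treated as translatable
]
to_translate, skipped = split_skipped(items)
```

Only to_translate would then go through the cache split and the LM call; skipped can be merged back into the output unchanged.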
_counter_color: int = 0
_counter_item: int = 0
_counter_escaped: int = 0
Global mutable counters are not thread-safe
_counter_color, _counter_item, and _counter_escaped are module-level integers incremented inside shield_text without any locking. If two threads call shield_text concurrently (which is plausible in a multi-file batch translation scenario), they will race on the same counter values, producing identical placeholders in two separate texts.
When unshield_text then runs on both results, it restores the wrong originals: e.g. $C0$ in file A's translated text could be replaced with file B's original color code.
Consider either:
- Using a threading.Lock to guard the counter increments, or
- Replacing the global counters with a per-call local counter inside shield_text (the placeholder only needs to be unique within a single text, not globally).
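The second option could look roughly like the sketch below. It is an illustration under assumptions, not the PR's code: shield_text_local is a hypothetical name, only two of the shielded fragment kinds are handled, the $I<n>$ placeholder format for item IDs is assumed (only $C0$ appears in this review), and the color pattern already includes r.

```python
import re
from collections import Counter

COLOR_CODE = re.compile(r"&[0-9a-fk-or]", re.IGNORECASE)
ITEM_ID = re.compile(r"#[a-z0-9_.\-]+[:/][a-z0-9_.\-]+", re.IGNORECASE)


def shield_text_local(src: str) -> tuple:
    """Shield item IDs and color codes using counters local to this call."""
    counters = Counter()  # created per call, so no state is shared across threads
    shields = []          # (placeholder, original) pairs for later restoration

    def make_repl(prefix: str):
        def repl(match: re.Match) -> str:
            placeholder = f"${prefix}{counters[prefix]}$"
            counters[prefix] += 1
            shields.append((placeholder, match.group(0)))
            return placeholder
        return repl

    clean = ITEM_ID.sub(make_repl("I"), src)
    clean = COLOR_CODE.sub(make_repl("C"), clean)
    return clean, shields


first_clean, first_shields = shield_text_local("&aGet #minecraft:stone &rnow")
second_clean, second_shields = shield_text_local("&bHi")
```

Every call starts again at index 0, which is safe as long as each translated text is unshielded with the shields list produced for that same text.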
ITEM_ID_PATTERN = re.compile(r"#[a-z0-9_.\-]+[:/][a-z0-9_.\-]+", re.IGNORECASE)

# Standard color codes: &a ~ &o (hexadecimal format codes, excluding k)
COLOR_CODE_PATTERN = re.compile(r"&[a-f0-9k-o]", re.IGNORECASE)
&r (reset code) is not shielded
The comment says "標準彩色碼:&a ~ &o" ("Standard color codes: &a ~ &o"), but [a-f0-9k-o] omits r, which is the standard Minecraft reset code (§r / &r). Meanwhile color_char_checker.py's COLOR_PATTERN uses [^a-vz0-9\s\\#], treating &r as a legal character. This mismatch means &r passes the checker as legal but is not protected by the shield, so it would be left in the "clean" text for the LM to potentially corrupt.
Suggested change:
- COLOR_CODE_PATTERN = re.compile(r"&[a-f0-9k-o]", re.IGNORECASE)
+ COLOR_CODE_PATTERN = re.compile(r"&[a-f0-9k-or]", re.IGNORECASE)
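The difference between the two character classes is easy to check in isolation. The snippet below copies both patterns from this comment; the names OLD_PATTERN and NEW_PATTERN and the sample string are made up for illustration.

```python
import re

OLD_PATTERN = re.compile(r"&[a-f0-9k-o]", re.IGNORECASE)   # current pattern
NEW_PATTERN = re.compile(r"&[a-f0-9k-or]", re.IGNORECASE)  # suggested pattern

sample = "&lBold&r plain &cred"

old_hits = OLD_PATTERN.findall(sample)  # &r is not matched
new_hits = NEW_PATTERN.findall(sample)  # &r is matched as well
```

With the current pattern, the &r in the sample would stay in the text handed to the LM; with the suggested one, all three codes are candidates for shielding.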
_shields: list = None

def __post_init__(self):
    if self._shields is None:
        self._shields = []
Non-idiomatic mutable default in dataclass field
Using = None with a __post_init__ guard works, but Python dataclasses provide field(default_factory=list) for exactly this purpose, which is cleaner and the standard pattern:
Suggested change:
- _shields: list = None
- def __post_init__(self):
-     if self._shields is None:
-         self._shields = []
+ _shields: list = field(default_factory=list)
With field(default_factory=list) the __post_init__ method is no longer needed and can be removed.
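As a quick illustration of the behaviour, the sketch below uses a stripped-down Item with a single extra field; the real Item dataclass in the PR has other fields that are not shown in this review, so the text field here is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    text: str  # assumed field, for illustration only
    _shields: list = field(default_factory=list)  # fresh list for every instance


a = Item("first")
b = Item("second")
a._shields.append("$C0$")  # mutating one instance must not affect the other
```

Each instance gets its own list, which is the same guarantee the __post_init__ guard provided, with less code.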
014b988 to a264928
Implementation summary
What changed
ich_text_shield.py shared module
Why it changed
How it was verified
File changes