feat(shared): 新增 rich_text_shield 脫殼模組 by jlin53882 · Pull Request #42 · jlin53882/Minecraft-translate

jlin53882 · 2026-03-23T06:27:36Z

實作摘要

改了什麼

新增
ich_text_shield.py 共享模組
提供 shield_text() / unshield_text() 配對函式
支援：物品ID (#namespace/item)、彩色碼 (&a~f / &#RRGGBB)、超連結、逸出 \&、圖片副檔名、特殊事件 JSON、翻頁事件
整合進 kubejs_tooltip_lmtranslator.py 和 kubejs_translator_clean.py 管線

為什麼要改

KubeJS 翻譯時，部分格式片段不應被翻譯（如物品ID、顏色碼）
需要在翻譯前「脫殼」保護這些片段，翻譯後「還原」

怎麼驗證的

單元測試驗證 shield/unshield 往返一致性
本地翻譯測試確認整合進管線後輸出正確

檔案變更

ranslation_tool/plugins/shared/rich_text_shield.py (+266 行)
ranslation_tool/plugins/shared/init.py (+12 行)
ranslation_tool/plugins/kubejs/kubejs_tooltip_lmtranslator.py (+59 行)
ranslation_tool/core/kubejs_translator_clean.py (整合)

- 在 clean_kubejs_from_raw_impl 中，pending_en 寫入前新增 reverse_index dedup - 若某英文文字（value）已出現在 final/zh_tw.json（不同 key），則跳過不寫入 pending - 避免同一英文原文因不同 key 而重複翻譯

- 新增 ColorCharError dataclass，記錄非法顏色字元錯誤 - 實作 COLOR_PATTERN 正則：& 後只能接 a-v（不含 w）、0-9、空格、\、# - check_color_chars()：檢查單一字串中的非法顏色字元 - check_json_file()：讀取 JSON 並遞迴檢查所有字串值 - check_directory()：遞迴檢查目錄下所有 .json 檔 - 遵循現有 checkers Generator yield 模式

…ipelines Commit 1: feat(rich-text-shield) - add core module - New module: translation_tool/plugins/shared/rich_text_shield.py - ShieldPiece / ShieldedText dataclasses - shield_text(): 抽出7種不應翻譯的格式片段（彩色碼/物品ID/URL/逸出\&/圖片/事件JSON/翻頁） - unshield_text(): 還原所有佔位符 - add_escape_quotes(): JSON逸出修補（移植自FTBQL） - Updated shared/__init__.py 導出新模組 Commit 2: fix(kubejs-lm) - integrate shield/unshield into LM translation pipeline - collect_items_from_mapping(): shield_text() 寫入 _shielded，skip_reason 非None時保留原文 - on_translated_item(): unshield_text() 還原翻譯後文字 Commit 3: fix(kubejs-clean) - integrate shield into s2t pipeline (Phase 2) - kubejs_translator_clean.py: _shielded_convert() helper，保護 safe_convert_text_fn 呼叫點：deep_merge_3way_flat_impl() 和 client_scripts 處理 - 避免 OpenCC s2t 轉換時破壞 KubeJS 格式標記

chatgpt-codex-connector · 2026-03-23T06:27:42Z

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.
To continue using code reviews, you can upgrade your account or add credits to your account and enable them for code reviews in your settings.

- md_extract_qa: shield_text before writing pending JSON - md_inject_qa: unshield_text before writing back to MD - Item dataclass: add _shields field for shield restoration - _shield_item helper for clean shield/unshield roundtrip

prevent ShieldedText object from leaking into json.dumps()

greptile-apps · 2026-03-24T12:48:25Z

Greptile Summary

This PR introduces rich_text_shield.py, a shield/unshield layer that extracts un-translatable KubeJS rich-text tokens (color codes, item IDs, URLs, etc.) before translation and restores them afterward. The module is integrated into the KubeJS tooltip, KubeJS core (s2t OpenCC), Markdown, and FTBQuests translation pipelines.

The core design is sound and the KubeJS/MD integrations are largely correct, but two integrations contain bugs that prevent the shielding from working:

P0 — FTBQuests unshield call passes the wrong type (ftbquests_lmtranslator.py:474): unshield_text(t, shielded_src) passes a ShieldedText object where list[ShieldPiece] is expected. This causes a TypeError that is silently swallowed, so unshielding never runs. Additionally, the FTBQuests source text is never pre-shielded before being sent to the LM, meaning placeholders are never inserted into the text that the LM sees — so even a type-fixed call would be a structural no-op.
P1 — KubeJS skip-reason items still reach the LM (kubejs_tooltip_lmtranslator.py:87-99): Items marked with _skip_reason (URLs, images, event JSON) are appended to the shared items list and travel all the way to all_miss_items, which is sent to the LM. The _skip_reason flag is never consumed downstream to filter them out.
P1 — Global counters are not thread-safe (rich_text_shield.py:88-90): Module-level _counter_color, _counter_item, _counter_escaped are incremented without locking. Concurrent calls from a multi-file batch can race and generate the same placeholder IDs, causing unshield_text to restore the wrong originals.
P2 — &r reset code missing from COLOR_CODE_PATTERN (rich_text_shield.py:23): The pattern [a-f0-9k-o] does not include r, so the standard Minecraft reset token &r is not shielded and may be corrupted by the LM.

Confidence Score: 2/5

Not safe to merge — the FTBQuests integration has a P0 bug that causes silent unshield failures on every translation, and skip-reason items are still sent to the LM in the KubeJS pipeline.
The core module and Markdown/KubeJS-core integrations are well-structured, but two of the four pipeline integrations have correctness bugs (wrong type passed to unshield_text, skip-reason items not filtered before LM dispatch) and the global counter design introduces a thread-safety hazard. These issues are not cosmetic — they directly undermine the feature's stated goal of protecting format tokens from LM translation.
translation_tool/plugins/ftbquests/ftbquests_lmtranslator.py and translation_tool/plugins/kubejs/kubejs_tooltip_lmtranslator.py need the most attention; translation_tool/plugins/shared/rich_text_shield.py also needs the thread-safety and &r fixes.

Important Files Changed

Filename	Overview
translation_tool/plugins/shared/rich_text_shield.py	New core module implementing shield/unshield for KubeJS rich-text tokens; logic is sound but global mutable counters are not thread-safe and `&r` reset code is missing from COLOR_CODE_PATTERN.
translation_tool/plugins/ftbquests/ftbquests_lmtranslator.py	P0 bug: `unshield_text(t, shielded_src)` passes a `ShieldedText` object instead of `shielded_src.shields`, causing a silently-caught TypeError so unshielding never executes; additionally, source text is never pre-shielded before the LM call, so the entire approach is structurally incomplete.
translation_tool/plugins/kubejs/kubejs_tooltip_lmtranslator.py	Items with `skip_reason` (URLs, images, event JSON) are appended to the same `items` list and reach the LM translation loop despite comments claiming they bypass the pipeline; `_skip_reason` is never consumed downstream.
translation_tool/core/kubejs_translator_clean.py	Correctly integrates `_shielded_convert` for OpenCC s2t conversion; shield/unshield pattern is properly applied and a reverse-index dedup block is added for pending_en filtering.
translation_tool/plugins/md/md_extract_qa.py	Adds `_shield_item` helper that serialises `ShieldPiece` dicts into the pending JSON; straightforward and correct given `asdict` was already imported.
translation_tool/plugins/md/md_inject_qa.py	Deserialises `_shields` into `ShieldPiece` objects on load and correctly calls `unshield_text` at apply time; minor style issue with `_shields: list = None` instead of `field(default_factory=list)`.
translation_tool/plugins/md/md_lmtranslator.py	Shield/unshield integration looks correct; `_shielded` key is filtered from dry-run preview serialisation and unshielding is applied in both the cache-hit path and the `on_result` callback.
translation_tool/plugins/shared/init.py	Re-exports new public symbols from `rich_text_shield`; straightforward, no issues.
translation_tool/checkers/color_char_checker.py	New standalone checker for illegal `&` color codes in JSON files; logic is clear and exception handling in `check_json_file` correctly avoids blocking on parse failures.
workspace/PR_designs/PR42c_md_shield_impl_summary.md	Design document summarising the Markdown shield integration; no code changes.

Sequence Diagram

sequenceDiagram
    participant Caller
    participant shield_text
    participant LM as LM Translator
    participant unshield_text

    Note over Caller,unshield_text: ✅ KubeJS / MD pipeline (correct)
    Caller->>shield_text: shield_text(src)
    shield_text-->>Caller: ShieldedText(clean, shields, skip_reason)
    alt skip_reason is not None
        Caller->>Caller: keep original, skip LM
    else needs translation
        Caller->>LM: translate(ShieldedText.clean)
        LM-->>Caller: translated_text (with placeholders)
        Caller->>unshield_text: unshield_text(translated_text, shielded.shields)
        unshield_text-->>Caller: final restored text
    end

    Note over Caller,unshield_text: ❌ FTBQuests pipeline (broken)
    Caller->>LM: translate(src_text) ← original, NOT shielded
    LM-->>Caller: translated_text (no placeholders)
    Caller->>shield_text: shield_text(src_text) ← re-shield AFTER translation
    shield_text-->>Caller: ShieldedText (new counters, new placeholders)
    Caller->>unshield_text: unshield_text(translated_text, shielded_src) ← ShieldedText not .shields
    Note right of unshield_text: TypeError caught silently,<br/>unshielding never runs

_{Reviews (1): Last reviewed commit: "fix(md): strip _shielded before JSON ser..." | Re-trigger Greptile}

greptile-apps · 2026-03-24T12:48:28Z

translation_tool/plugins/ftbquests/ftbquests_lmtranslator.py

            if isinstance(p, str) and isinstance(t, str):
+                try:
+                    shielded_src = shield_text(src_text)
+                    t = unshield_text(t, shielded_src)


ShieldedText passed instead of .shields — unshielding silently never runs

unshield_text expects list[ShieldPiece] as its second argument, but shielded_src here is a ShieldedText dataclass. When sorted() tries to iterate over a ShieldedText instance it raises TypeError: 'ShieldedText' object is not iterable, which is then silently swallowed by the bare except Exception: pass. The net result is that every FTBQuests translation callback leaves color/item-ID placeholders un-restored in t.

Even if the type were fixed, this approach has a second problem: the source text is never shielded before being sent to the LM, so t will never contain placeholders like $C0$ to restore. The shielding must happen at item-collection time (as in the KubeJS path) and the resulting ShieldedText must travel with the item to the callback.

Suggested change

t = unshield_text(t, shielded_src)

shielded_src = shield_text(src_text)

t = unshield_text(t, shielded_src.shields)

greptile-apps · 2026-03-24T12:48:29Z

translation_tool/plugins/kubejs/kubejs_tooltip_lmtranslator.py

+        if shielded.skip_reason is not None:
+            # 不應翻譯（圖片/URL/事件/空白），直接寫入原文不經翻譯管線
+            items.append(
+                {
+                    "file": file_hint,
+                    "path": k,
+                    "source_text": v,
+                    "text": v,  # 保持原文
+                    "cache_type": "kubejs",
+                    "_shielded": shielded,  # 供 unshield 回查（此情境無需還原）
+                    "_skip_reason": shielded.skip_reason,
+                }
+            )


skip_reason items still flow into the LM translation pipeline

The comment on line 88 reads "直接寫入原文不經翻譯管線" (write original directly, bypassing the translation pipeline), but these items are appended to the same items list that is returned, fed into fast_split_items_by_cache, and — on a cache miss — added to all_miss_items which is sent to the LM. The _skip_reason flag is set on the dict but is never checked anywhere downstream before dispatching.

In practice this means URLs, image paths, and event JSON fragments are still sent to the LM for translation, wasting API calls and risking corruption of those values if the model translates them.

To honour the skip intent, either filter these items out of all_miss_items before the LM call, or handle them as immediate "cache hits" so they bypass the translation loop entirely.

greptile-apps · 2026-03-24T12:48:30Z

translation_tool/plugins/shared/rich_text_shield.py

+_counter_color: int = 0
+_counter_item: int = 0
+_counter_escaped: int = 0


Global mutable counters are not thread-safe

_counter_color, _counter_item, and _counter_escaped are module-level integers incremented inside shield_text without any locking. If two threads call shield_text concurrently (which is plausible in a multi-file batch translation scenario), they will race on the same counter values, producing identical placeholders in two separate texts.

When unshield_text then runs on both results, it restores the wrong originals: e.g. $C0$ in file A's translated text could be replaced with file B's original color code.

Consider either:

Using a threading.Lock to guard the counter increments, or

Replacing the global counters with a per-call local counter inside shield_text (the placeholder only needs to be unique within a single text, not globally).

greptile-apps · 2026-03-24T12:48:31Z

translation_tool/plugins/shared/rich_text_shield.py

+ITEM_ID_PATTERN = re.compile(r"#[a-z0-9_.\-]+[:/][a-z0-9_.\-]+", re.IGNORECASE)
+
+# 標準彩色碼：&a ~ &o（不含 k 的 16 進位格式碼）
+COLOR_CODE_PATTERN = re.compile(r"&[a-f0-9k-o]", re.IGNORECASE)


&r (reset code) is not shielded

The comment says "標準彩色碼：&a ~ &o" but [a-f0-9k-o] omits r, which is the standard Minecraft reset code (§r / &r). Meanwhile color_char_checker.py's COLOR_PATTERN uses [^a-vz0-9\s\\#], treating &r as a legal character. This mismatch means &r passes the checker as legal but is not protected by the shield — it would be left in the "clean" text for the LM to potentially corrupt.

Suggested change

COLOR_CODE_PATTERN = re.compile(r"&[a-f0-9k-o]", re.IGNORECASE)

COLOR_CODE_PATTERN = re.compile(r"&[a-f0-9k-or]", re.IGNORECASE)

greptile-apps · 2026-03-24T12:48:32Z

translation_tool/plugins/md/md_inject_qa.py

+    _shields: list = None
+
+    def __post_init__(self):
+        if self._shields is None:
+            self._shields = []


Non-idiomatic mutable default in dataclass field

Using = None with a __post_init__ guard works, but Python dataclasses provide field(default_factory=list) for exactly this purpose, which is cleaner and the standard pattern:

Suggested change

_shields: list = None

def __post_init__(self):

if self._shields is None:

self._shields = []

_shields: list = field(default_factory=list)

With field(default_factory=list) the __post_init__ method is no longer needed and can be removed.

_{Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!}

jlin53882 added 3 commits March 23, 2026 14:10

jlin53882 added 5 commits March 23, 2026 14:55

style: apply ruff format

0b07775

feat(shared): extend rich text shield to FTB and MD translators

5678eba

fix(md): integrate rich text shield in markdown translator

3f52edd

fix(md): strip _shielded before JSON serialization in dry-run preview

a264928

prevent ShieldedText object from leaking into json.dumps()

greptile-apps bot reviewed Mar 24, 2026

View reviewed changes

jlin53882 force-pushed the main branch from c9836ca to e87226d Compare March 27, 2026 16:01

jlin53882 force-pushed the pr/rich-text-shield branch from 014b988 to a264928 Compare March 27, 2026 16:01

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(shared): 新增 rich_text_shield 脫殼模組#42

feat(shared): 新增 rich_text_shield 脫殼模組#42
jlin53882 wants to merge 8 commits intomainfrom
pr/rich-text-shield

jlin53882 commented Mar 23, 2026

Uh oh!

chatgpt-codex-connector bot commented Mar 23, 2026

Uh oh!

greptile-apps bot commented Mar 24, 2026

Uh oh!

greptile-apps bot Mar 24, 2026

Uh oh!

greptile-apps bot Mar 24, 2026

Uh oh!

greptile-apps bot Mar 24, 2026

Uh oh!

greptile-apps bot Mar 24, 2026

Uh oh!

greptile-apps bot Mar 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

	t = unshield_text(t, shielded_src)
	shielded_src = shield_text(src_text)
	t = unshield_text(t, shielded_src.shields)

	COLOR_CODE_PATTERN = re.compile(r"&[a-f0-9k-o]", re.IGNORECASE)
	COLOR_CODE_PATTERN = re.compile(r"&[a-f0-9k-or]", re.IGNORECASE)

Conversation

jlin53882 commented Mar 23, 2026

實作摘要

改了什麼

為什麼要改

怎麼驗證的

檔案變更

Uh oh!

chatgpt-codex-connector bot commented Mar 23, 2026

Uh oh!

greptile-apps bot commented Mar 24, 2026

Greptile Summary

Confidence Score: 2/5

Important Files Changed

Sequence Diagram

Uh oh!

greptile-apps bot Mar 24, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps bot Mar 24, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps bot Mar 24, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps bot Mar 24, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps bot Mar 24, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant