Skip to content

feat(shared): 新增 rich_text_shield 脫殼模組#42

Open
jlin53882 wants to merge 8 commits intomainfrom
pr/rich-text-shield
Open

feat(shared): 新增 rich_text_shield 脫殼模組#42
jlin53882 wants to merge 8 commits intomainfrom
pr/rich-text-shield

Conversation

@jlin53882
Copy link
Copy Markdown
Owner

實作摘要

改了什麼

  • 新增
    ich_text_shield.py 共享模組
  • 提供 shield_text() / unshield_text() 配對函式
  • 支援:物品ID (#namespace/item)、彩色碼 (&a~f / &#RRGGBB)、超連結、逸出 \&、圖片副檔名、特殊事件 JSON、翻頁事件
  • 整合進 kubejs_tooltip_lmtranslator.py 和 kubejs_translator_clean.py 管線

為什麼要改

  • KubeJS 翻譯時,部分格式片段不應被翻譯(如物品ID、顏色碼)
  • 需要在翻譯前「脫殼」保護這些片段,翻譯後「還原」

怎麼驗證的

  • 單元測試驗證 shield/unshield 往返一致性
  • 本地翻譯測試確認整合進管線後輸出正確

檔案變更

  • ranslation_tool/plugins/shared/rich_text_shield.py (+266 行)
  • ranslation_tool/plugins/shared/init.py (+12 行)
  • ranslation_tool/plugins/kubejs/kubejs_tooltip_lmtranslator.py (+59 行)
  • ranslation_tool/core/kubejs_translator_clean.py (整合)

- 在 clean_kubejs_from_raw_impl 中,pending_en 寫入前新增 reverse_index dedup
- 若某英文文字(value)已出現在 final/zh_tw.json(不同 key),則跳過不寫入 pending
- 避免同一英文原文因不同 key 而重複翻譯
- 新增 ColorCharError dataclass,記錄非法顏色字元錯誤
- 實作 COLOR_PATTERN 正則:& 後只能接 a-v(不含 w)、0-9、空格、\、#
- check_color_chars():檢查單一字串中的非法顏色字元
- check_json_file():讀取 JSON 並遞迴檢查所有字串值
- check_directory():遞迴檢查目錄下所有 .json 檔
- 遵循現有 checkers Generator yield 模式
…ipelines

Commit 1: feat(rich-text-shield) - add core module
- New module: translation_tool/plugins/shared/rich_text_shield.py
  - ShieldPiece / ShieldedText dataclasses
  - shield_text(): 抽出7種不應翻譯的格式片段(彩色碼/物品ID/URL/逸出\&/圖片/事件JSON/翻頁)
  - unshield_text(): 還原所有佔位符
  - add_escape_quotes(): JSON逸出修補(移植自FTBQL)
- Updated shared/__init__.py 導出新模組

Commit 2: fix(kubejs-lm) - integrate shield/unshield into LM translation pipeline
- collect_items_from_mapping(): shield_text() 寫入 _shielded,skip_reason 非None時保留原文
- on_translated_item(): unshield_text() 還原翻譯後文字

Commit 3: fix(kubejs-clean) - integrate shield into s2t pipeline (Phase 2)
- kubejs_translator_clean.py: _shielded_convert() helper,保護 safe_convert_text_fn
  呼叫點:deep_merge_3way_flat_impl() 和 client_scripts 處理
- 避免 OpenCC s2t 轉換時破壞 KubeJS 格式標記
@chatgpt-codex-connector
Copy link
Copy Markdown

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.
To continue using code reviews, you can upgrade your account or add credits to your account and enable them for code reviews in your settings.

- md_extract_qa: shield_text before writing pending JSON
- md_inject_qa: unshield_text before writing back to MD
- Item dataclass: add _shields field for shield restoration
- _shield_item helper for clean shield/unshield roundtrip
prevent ShieldedText object from leaking into json.dumps()
@greptile-apps
Copy link
Copy Markdown

greptile-apps bot commented Mar 24, 2026

Greptile Summary

This PR introduces rich_text_shield.py, a shield/unshield layer that extracts un-translatable KubeJS rich-text tokens (color codes, item IDs, URLs, etc.) before translation and restores them afterward. The module is integrated into the KubeJS tooltip, KubeJS core (s2t OpenCC), Markdown, and FTBQuests translation pipelines.

The core design is sound and the KubeJS/MD integrations are largely correct, but two integrations contain bugs that prevent the shielding from working:

  • P0 — FTBQuests unshield call passes the wrong type (ftbquests_lmtranslator.py:474): unshield_text(t, shielded_src) passes a ShieldedText object where list[ShieldPiece] is expected. This causes a TypeError that is silently swallowed, so unshielding never runs. Additionally, the FTBQuests source text is never pre-shielded before being sent to the LM, meaning placeholders are never inserted into the text that the LM sees — so even a type-fixed call would be a structural no-op.
  • P1 — KubeJS skip-reason items still reach the LM (kubejs_tooltip_lmtranslator.py:87-99): Items marked with _skip_reason (URLs, images, event JSON) are appended to the shared items list and travel all the way to all_miss_items, which is sent to the LM. The _skip_reason flag is never consumed downstream to filter them out.
  • P1 — Global counters are not thread-safe (rich_text_shield.py:88-90): Module-level _counter_color, _counter_item, _counter_escaped are incremented without locking. Concurrent calls from a multi-file batch can race and generate the same placeholder IDs, causing unshield_text to restore the wrong originals.
  • P2 — &r reset code missing from COLOR_CODE_PATTERN (rich_text_shield.py:23): The pattern [a-f0-9k-o] does not include r, so the standard Minecraft reset token &r is not shielded and may be corrupted by the LM.

Confidence Score: 2/5

  • Not safe to merge — the FTBQuests integration has a P0 bug that causes silent unshield failures on every translation, and skip-reason items are still sent to the LM in the KubeJS pipeline.
  • The core module and Markdown/KubeJS-core integrations are well-structured, but two of the four pipeline integrations have correctness bugs (wrong type passed to unshield_text, skip-reason items not filtered before LM dispatch) and the global counter design introduces a thread-safety hazard. These issues are not cosmetic — they directly undermine the feature's stated goal of protecting format tokens from LM translation.
  • translation_tool/plugins/ftbquests/ftbquests_lmtranslator.py and translation_tool/plugins/kubejs/kubejs_tooltip_lmtranslator.py need the most attention; translation_tool/plugins/shared/rich_text_shield.py also needs the thread-safety and &r fixes.

Important Files Changed

Filename Overview
translation_tool/plugins/shared/rich_text_shield.py New core module implementing shield/unshield for KubeJS rich-text tokens; logic is sound but global mutable counters are not thread-safe and &r reset code is missing from COLOR_CODE_PATTERN.
translation_tool/plugins/ftbquests/ftbquests_lmtranslator.py P0 bug: unshield_text(t, shielded_src) passes a ShieldedText object instead of shielded_src.shields, causing a silently-caught TypeError so unshielding never executes; additionally, source text is never pre-shielded before the LM call, so the entire approach is structurally incomplete.
translation_tool/plugins/kubejs/kubejs_tooltip_lmtranslator.py Items with skip_reason (URLs, images, event JSON) are appended to the same items list and reach the LM translation loop despite comments claiming they bypass the pipeline; _skip_reason is never consumed downstream.
translation_tool/core/kubejs_translator_clean.py Correctly integrates _shielded_convert for OpenCC s2t conversion; shield/unshield pattern is properly applied and a reverse-index dedup block is added for pending_en filtering.
translation_tool/plugins/md/md_extract_qa.py Adds _shield_item helper that serialises ShieldPiece dicts into the pending JSON; straightforward and correct given asdict was already imported.
translation_tool/plugins/md/md_inject_qa.py Deserialises _shields into ShieldPiece objects on load and correctly calls unshield_text at apply time; minor style issue with _shields: list = None instead of field(default_factory=list).
translation_tool/plugins/md/md_lmtranslator.py Shield/unshield integration looks correct; _shielded key is filtered from dry-run preview serialisation and unshielding is applied in both the cache-hit path and the on_result callback.
translation_tool/plugins/shared/init.py Re-exports new public symbols from rich_text_shield; straightforward, no issues.
translation_tool/checkers/color_char_checker.py New standalone checker for illegal & color codes in JSON files; logic is clear and exception handling in check_json_file correctly avoids blocking on parse failures.
workspace/PR_designs/PR42c_md_shield_impl_summary.md Design document summarising the Markdown shield integration; no code changes.

Sequence Diagram

sequenceDiagram
    participant Caller
    participant shield_text
    participant LM as LM Translator
    participant unshield_text

    Note over Caller,unshield_text: ✅ KubeJS / MD pipeline (correct)
    Caller->>shield_text: shield_text(src)
    shield_text-->>Caller: ShieldedText(clean, shields, skip_reason)
    alt skip_reason is not None
        Caller->>Caller: keep original, skip LM
    else needs translation
        Caller->>LM: translate(ShieldedText.clean)
        LM-->>Caller: translated_text (with placeholders)
        Caller->>unshield_text: unshield_text(translated_text, shielded.shields)
        unshield_text-->>Caller: final restored text
    end

    Note over Caller,unshield_text: ❌ FTBQuests pipeline (broken)
    Caller->>LM: translate(src_text) ← original, NOT shielded
    LM-->>Caller: translated_text (no placeholders)
    Caller->>shield_text: shield_text(src_text) ← re-shield AFTER translation
    shield_text-->>Caller: ShieldedText (new counters, new placeholders)
    Caller->>unshield_text: unshield_text(translated_text, shielded_src) ← ShieldedText not .shields
    Note right of unshield_text: TypeError caught silently,<br/>unshielding never runs
Loading

Reviews (1): Last reviewed commit: "fix(md): strip _shielded before JSON ser..." | Re-trigger Greptile

if isinstance(p, str) and isinstance(t, str):
try:
shielded_src = shield_text(src_text)
t = unshield_text(t, shielded_src)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P0 ShieldedText passed instead of .shields — unshielding silently never runs

unshield_text expects list[ShieldPiece] as its second argument, but shielded_src here is a ShieldedText dataclass. When sorted() tries to iterate over a ShieldedText instance it raises TypeError: 'ShieldedText' object is not iterable, which is then silently swallowed by the bare except Exception: pass. The net result is that every FTBQuests translation callback leaves color/item-ID placeholders un-restored in t.

Even if the type were fixed, this approach has a second problem: the source text is never shielded before being sent to the LM, so t will never contain placeholders like $C0$ to restore. The shielding must happen at item-collection time (as in the KubeJS path) and the resulting ShieldedText must travel with the item to the callback.

Suggested change
t = unshield_text(t, shielded_src)
shielded_src = shield_text(src_text)
t = unshield_text(t, shielded_src.shields)

Comment on lines +87 to +99
if shielded.skip_reason is not None:
# 不應翻譯(圖片/URL/事件/空白),直接寫入原文不經翻譯管線
items.append(
{
"file": file_hint,
"path": k,
"source_text": v,
"text": v, # 保持原文
"cache_type": "kubejs",
"_shielded": shielded, # 供 unshield 回查(此情境無需還原)
"_skip_reason": shielded.skip_reason,
}
)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 skip_reason items still flow into the LM translation pipeline

The comment on line 88 reads "直接寫入原文不經翻譯管線" (write original directly, bypassing the translation pipeline), but these items are appended to the same items list that is returned, fed into fast_split_items_by_cache, and — on a cache miss — added to all_miss_items which is sent to the LM. The _skip_reason flag is set on the dict but is never checked anywhere downstream before dispatching.

In practice this means URLs, image paths, and event JSON fragments are still sent to the LM for translation, wasting API calls and risking corruption of those values if the model translates them.

To honour the skip intent, either filter these items out of all_miss_items before the LM call, or handle them as immediate "cache hits" so they bypass the translation loop entirely.

Comment on lines +88 to +90
_counter_color: int = 0
_counter_item: int = 0
_counter_escaped: int = 0
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Global mutable counters are not thread-safe

_counter_color, _counter_item, and _counter_escaped are module-level integers incremented inside shield_text without any locking. If two threads call shield_text concurrently (which is plausible in a multi-file batch translation scenario), they will race on the same counter values, producing identical placeholders in two separate texts.

When unshield_text then runs on both results, it restores the wrong originals: e.g. $C0$ in file A's translated text could be replaced with file B's original color code.

Consider either:

  • Using a threading.Lock to guard the counter increments, or
  • Replacing the global counters with a per-call local counter inside shield_text (the placeholder only needs to be unique within a single text, not globally).

ITEM_ID_PATTERN = re.compile(r"#[a-z0-9_.\-]+[:/][a-z0-9_.\-]+", re.IGNORECASE)

# 標準彩色碼:&a ~ &o(不含 k 的 16 進位格式碼)
COLOR_CODE_PATTERN = re.compile(r"&[a-f0-9k-o]", re.IGNORECASE)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 &r (reset code) is not shielded

The comment says "標準彩色碼:&a ~ &o" but [a-f0-9k-o] omits r, which is the standard Minecraft reset code (§r / &r). Meanwhile color_char_checker.py's COLOR_PATTERN uses [^a-vz0-9\s\\#], treating &r as a legal character. This mismatch means &r passes the checker as legal but is not protected by the shield — it would be left in the "clean" text for the LM to potentially corrupt.

Suggested change
COLOR_CODE_PATTERN = re.compile(r"&[a-f0-9k-o]", re.IGNORECASE)
COLOR_CODE_PATTERN = re.compile(r"&[a-f0-9k-or]", re.IGNORECASE)

Comment on lines +234 to +238
_shields: list = None

def __post_init__(self):
if self._shields is None:
self._shields = []
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Non-idiomatic mutable default in dataclass field

Using = None with a __post_init__ guard works, but Python dataclasses provide field(default_factory=list) for exactly this purpose, which is cleaner and the standard pattern:

Suggested change
_shields: list = None
def __post_init__(self):
if self._shields is None:
self._shields = []
_shields: list = field(default_factory=list)

With field(default_factory=list) the __post_init__ method is no longer needed and can be removed.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant