fix: Code Review 全量修復 (28 issues, 2026-03-25)#43
Conversation
- 在 clean_kubejs_from_raw_impl 中,pending_en 寫入前新增 reverse_index dedup - 若某英文文字(value)已出現在 final/zh_tw.json(不同 key),則跳過不寫入 pending - 避免同一英文原文因不同 key 而重複翻譯
- 新增 ColorCharError dataclass,記錄非法顏色字元錯誤 - 實作 COLOR_PATTERN 正則:& 後只能接 a-v(不含 w)、0-9、空格、\、# - check_color_chars():檢查單一字串中的非法顏色字元 - check_json_file():讀取 JSON 並遞迴檢查所有字串值 - check_directory():遞迴檢查目錄下所有 .json 檔 - 遵循現有 checkers Generator yield 模式
…ipelines Commit 1: feat(rich-text-shield) - add core module - New module: translation_tool/plugins/shared/rich_text_shield.py - ShieldPiece / ShieldedText dataclasses - shield_text(): 抽出7種不應翻譯的格式片段(彩色碼/物品ID/URL/逸出\&/圖片/事件JSON/翻頁) - unshield_text(): 還原所有佔位符 - add_escape_quotes(): JSON逸出修補(移植自FTBQL) - Updated shared/__init__.py 導出新模組 Commit 2: fix(kubejs-lm) - integrate shield/unshield into LM translation pipeline - collect_items_from_mapping(): shield_text() 寫入 _shielded,skip_reason 非None時保留原文 - on_translated_item(): unshield_text() 還原翻譯後文字 Commit 3: fix(kubejs-clean) - integrate shield into s2t pipeline (Phase 2) - kubejs_translator_clean.py: _shielded_convert() helper,保護 safe_convert_text_fn 呼叫點:deep_merge_3way_flat_impl() 和 client_scripts 處理 - 避免 OpenCC s2t 轉換時破壞 KubeJS 格式標記
- md_extract_qa: shield_text before writing pending JSON - md_inject_qa: unshield_text before writing back to MD - Item dataclass: add _shields field for shield restoration - _shield_item helper for clean shield/unshield roundtrip
prevent ShieldedText object from leaking into json.dumps()
- Previously session.finish() was only called in the success path - Now it's in the finally block to ensure it always executes - This prevents session leaks when exceptions occur
- 移除 if save_path.exists() 檢查,改用 try-except 捕捉 FileNotFoundError - 避免 TOCTOU (Time-of-Check to Time-of-Use) 問題:檢查與讀取之間檔案可能被刪除
- lm_service.py: added flush() after session.set_error() in exception handler - extract_service.py: added flush() in both run_lang_extraction_service and run_book_extraction_service exception handlers - This ensures buffered logs are flushed even when exceptions occur
- 新增 _counter_lock (threading.Lock) 保護模組層級計數器 - _next_color_placeholder / _next_item_placeholder / _next_escaped_placeholder 的讀取-遞增-寫入操作以鎖保護,避免 race condition
- RE_LANG_SEG 只定義一次(在第 61 行) - 移除第 87 行的重複定義,保留單一宣告
…s shielded_src.shields The unshield_text() function expects shields (list[ShieldPiece]), not the whole ShieldedText object. Fixed line 78 to pass shielded_src.shields. Issue: #1b
The cache hit path was missing unshield_text() call while the cache miss path properly called unshield_text(). Added shield handling to cache hit path to unify both code paths. Issue: #18
Added explicit check for empty shard files (0 bytes) in load_shard_file(). Previously empty files would cause JSONDecodeError and be logged as generic failure. Now they are specifically detected and logged as empty shard warnings. Issue: #19
… add_to_cache Added add_to_cache_batch() function that takes multiple entries and acquires the lock once for all entries, rather than acquiring/releasing for each entry. This reduces lock contention when many cache entries are added rapidly. Issue: #23
#13 - except Exception: pass 改為 logger.warning 記錄錯誤 - ftbquests_lmtranslator.py: 7處 - kubejs_tooltip_lmtranslator.py: 6處 - md_lmtranslator.py: 2處 - lm_translator_shared_loop.py: 5處 - cache_loader.py: 1處 - cache_search.py: 4處 #14 - cache_manager.py 初始化鎖定 - reload_translation_cache() now holds lock during reset+reload #15 - cache_shards.py TOCTOU Race 防護 - _save_entries_to_active_shards() 加入檔案鎖保護讀取-修改-寫入循環 #16 - cache_shards.py fsync 已存在(確認無需修改) #17 - md_lmtranslator.py skip_reason 處理 - 新增 skip_skipped 計數器 - skip_reason 非 None 的項目直接視為 cache hit
…t lock - lm_api_client.py: move API key from URL query to Authorization Bearer header (security) - lm_translator_main.py: convert dict system prompt to string before API call - cache_manager.py: add cache_lock to initialize_translation_cache (race condition fix)
- Issue #7: ftbquests_lmtranslator.py - Cache JSON mappings to avoid duplicate reads - Issue #7b: kubejs_tooltip_lmtranslator.py - Same fix for duplicate JSON reads - Issue #8: ftbquests_lmtranslator.py - Move callbacks outside loop (factory functions) - Issue #11: md_lmtranslator.py - Fix re-shielding logic (remove invalid else branch) - Issue #12: lm_response_parser.py - Use non-greedy regex to avoid matching invalid JSON # Conflicts: # translation_tool/plugins/ftbquests/ftbquests_lmtranslator.py # translation_tool/plugins/kubejs/kubejs_tooltip_lmtranslator.py # translation_tool/plugins/md/md_lmtranslator.py
|
You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard. |
Greptile SummaryThis PR claims to fix 28 issues identified in a code review report, spanning critical security fixes, logic corrections, and a broad shield/unshield rich-text protection layer integrated into the KubeJS, FTBQuests, and Markdown translation pipelines. While many lower-priority fixes (exception logging, Issues from previous review threads that remain unresolved:
New issues introduced by this PR:
Confidence Score: 1/5
Important Files Changed
Flowchart%%{init: {'theme': 'neutral'}}%%
flowchart TD
A["Source Text\n(e.g. 'Hello &a World')"] --> B["md_extract_qa.py\n_shield_item()"]
B --> |"shield_text() → $C0$\nstores _shields in JSON"| C["Pending JSON\ntext: 'Hello $C0$ World'\n_shields: [$C0$→&a]"]
C --> D["md_lmtranslator.py\nhash_to_src[hash] = it.text"]
D --> E["shield_text(already-shielded)\n⚠️ No patterns match\nshields = [] ← REDUNDANT"]
E --> F["LLM receives\n'Hello $C0$ World'"]
F --> G["LLM returns\n'你好 $C0$ 世界'"]
G --> H{"_shielded.shields\nempty?"}
H --> |"Yes (always)\nelse branch"| I["shield_text(src_text)\n⚠️ No-op — fresh\nplaceholders not in dst"]
H --> |"No (never reached)"| J["unshield_text(dst, shields)"]
I --> K["hash_to_dst[h] = '你好 $C0$ 世界'"]
K --> L["md_inject_qa.py\napply_item_to_md_lines()"]
L --> M["unshield_text(item.text, item._shields)\n✅ $C0$ → &a\n= '你好 &a 世界'"]
style E fill:#ffe0b2,stroke:#e65100
style I fill:#ffe0b2,stroke:#e65100
style H fill:#ffe0b2,stroke:#e65100
|
|
|
||
| matches = re.findall(r"\{[\s\S]*\}", text) | ||
| # ✅ Issue #12 修復:使用 non-greedy regex 避免匹配無效的多重 JSON | ||
| matches = re.findall(r"\{[\s\S]*?\}", text) |
There was a problem hiding this comment.
Non-greedy regex breaks nested JSON objects
The switch from greedy \{[\s\S]*\} to non-greedy \{[\s\S]*?\} introduces a regression for the fallback path. The non-greedy quantifier stops at the first } it finds, which for any nested JSON object will produce an incomplete, invalid fragment.
For example, given the text:
Here is my translation: {"key1": "value", "nested": {"k": "v"}} done
- Greedy regex matches:
{"key1": "value", "nested": {"k": "v"}}✅ (valid JSON) - Non-greedy regex matches:
{"key1": "value", "nested": {"k": "v"}❌ (stops at inner}, missing outer})
Since json.loads then fails on that partial match, and re.findall has already consumed the position, no valid JSON is returned and the function throws RuntimeError("JSON 解析失敗").
LLM responses that embed translation dictionaries (which this project relies on) commonly use nested objects. The correct fix for the original issue is to keep the greedy match but additionally try the largest valid candidate first:
| matches = re.findall(r"\{[\s\S]*?\}", text) | |
| matches = re.findall(r"\{[\s\S]*\}", text) |
| 使用 fsync 確保資料寫入磁碟,避免作業系統緩衝區未 flush | ||
| 就執行 os.replace() 導致資料遺失。 | ||
| """ | ||
| import msvcrt |
There was a problem hiding this comment.
Windows-only
msvcrt module used without cross-platform fallback
msvcrt is a Windows-only standard library module. Importing it on Linux or macOS raises ModuleNotFoundError: No module named 'msvcrt' at runtime, making the entire cache_shards.py module — and by extension the whole translation cache — unusable on non-Windows systems.
import msvcrt appears in four places in this file:
_write_json_atomic(line 25) — unused in that function (the actual call isos.fsync(), which is POSIX)_rotate_shard_if_needed(line 90 insidetry, line 113 insidefinally)_save_entries_to_active_shards(line 134)
For true cross-platform file locking the standard approach is to use fcntl.flock on POSIX and msvcrt.locking on Windows, with a runtime check:
import sys
if sys.platform == "win32":
import msvcrt
def _lock_fd(fd): msvcrt.locking(fd, msvcrt.LK_LOCK, 1)
def _unlock_fd(fd): msvcrt.locking(fd, msvcrt.LK_UNLCK, 1)
else:
import fcntl
def _lock_fd(fd): fcntl.flock(fd, fcntl.LOCK_EX)
def _unlock_fd(fd): fcntl.flock(fd, fcntl.LOCK_UN)Additionally, the import msvcrt on line 25 (_write_json_atomic) serves no purpose since that function only calls os.fsync() and should be removed regardless.
| try: | ||
| shielded_src = shield_text(src_text) | ||
| t = unshield_text(t, shielded_src.shields) | ||
| except Exception as e: | ||
| log_info(f"[FTB-LM] unshield 失敗: {e}") |
There was a problem hiding this comment.
Shield/unshield is a no-op — FTB items are never shielded before translation
The on_translated_item callback calls shield_text(src_text) after the LLM has already translated t. This freshly-created ShieldedText contains brand-new globally-counted placeholders (e.g. $C42$, $P7$) that were never injected into the source text sent to the model, so they cannot exist in the translated result t. The subsequent unshield_text(t, shielded_src.shields) therefore always returns t unchanged — the entire shield operation is a silent no-op.
Compare this to the correct implementation in kubejs_tooltip_lmtranslator.py, where shielding happens during item collection in collect_items_from_mapping and the ShieldedText object is stored in the item dict under _shielded:
# kubejs — correct pattern:
shielded = shield_text(v)
item["text"] = shielded.clean # shielded text sent to LM
item["_shielded"] = shielded # stored for later unshield
# in on_translated_item:
shielded = it.get("_shielded")
if shielded and shielded.shields:
final_text = unshield_text(t, shielded.shields)To properly protect FTB source text, shielding needs to be applied in the item-collection helper (around line 96–110 in ftbquests_lmtranslator.py) so that item["text"] carries the cleaned text and item["_shielded"] carries the shield list — matching the KubeJS approach.
| headers = { | ||
| "Content-Type": "application/json", | ||
| "Authorization": f"Bearer {api_key}", | ||
| } |
There was a problem hiding this comment.
Wrong authentication header breaks all Gemini API calls
The Google Gemini REST API requires plain API keys to be passed via the x-goog-api-key header, as shown in the official API reference. The Authorization: Bearer pattern is the OAuth2 flow for short-lived access tokens, not static API keys — these are entirely different credential types.
Sending a static API key in an Authorization: Bearer header will produce HTTP 401 Unauthorized on every request, silently breaking all translation calls.
The previous URL query-string approach was functionally correct. The improvement here should be using x-goog-api-key as the header name (without any Bearer prefix), per the Gemini documentation.
- lm_config_rules.py: 將 [_-a-zA-Z0-9] 改為 [a-zA-Z0-9_-],避免 - 被解讀為 range operator - test_lm_config_rules.py: 將假 key 改為 40 字合規格式 - test_pipeline_services_error_handling.py: 改為檢查 calls[-2] 預期 set_error
| if is_filled_text_impl(v): | ||
| reverse_index.setdefault(v, []).append(k) | ||
|
|
||
| # 過濾 pending_en:跳過那些「英文文字已存在於 final」的 key | ||
| pending_en = { | ||
| k: v | ||
| for k, v in pending_en.items() | ||
| if not ( | ||
| is_filled_text_impl(v) | ||
| and v in reverse_index | ||
| and k != reverse_index[v][0] | ||
| ) | ||
| } | ||
| # ── 雙軌去重 end ─────────────────────────────────────────────── | ||
|
|
||
| if pending_en: | ||
| dst_en = pending_root_p / rel_group / "en_us.json" | ||
| write_json_fn(dst_en, pending_en) |
There was a problem hiding this comment.
Dedup reverse index compares English values against Traditional Chinese values — always a no-op
final_tw_lookup is loaded from zh_tw.json files. In this project's translation pipeline, zh_tw.json files contain Traditional Chinese text as values (the translation output). The reverse_index is therefore a {TW_value → [keys]} mapping.
However, the dedup filter then checks v in reverse_index where v is sourced from pending_en — an English text. Since English text will almost never equal a Traditional Chinese translation, the condition v in reverse_index will virtually never be True. The entire dedup block is effectively dead code.
The comment describes the intent as "若某英文文字(value)已出現在 final/zh_tw.json" — if that intent is to skip English source texts that already have an existing translation, the index needs to be built from en_us.json files (or from the keys of zh_tw.json mapped to their corresponding source English text):
# Build reverse index: {en_text → key} from already-translated pairs
reverse_index: dict[str, str] = {}
for en_file in final_root_p.rglob("en_us.json"):
en_data = read_json_dict_fn(en_file)
if en_data:
for k, en_v in en_data.items():
if is_filled_text_impl(en_v):
reverse_index.setdefault(en_v, []).append(k)Or alternatively, if the intent is simply to deduplicate by English value within pending_en itself, no lookup against final_tw_lookup is needed at all.
問題: - reverse_index[v][0] 取字典第一個 key,依賴 rglob 迭代順序 - rglob 在不同環境/執行之間迭代順序可能不同 - 導致相同英文文字的去重判斷結果不穩定 修復: - reverse_index 類型從 dict[str, list[str]] 改為 dict[str, str](直接存確定的 key) - 選擇策略: 1. 優先取「已翻譯的 key」(zh_tw 值與 key 名不同,表示有真正翻譯) 2. 同優先級則取字母序最小者(確定性 tiebreaker) 3. 無已翻譯時,取字母序最小者 - 消除 [0] 的非確定依賴
問題:雙軌去重的 filter 條件 k != reverse_index[v] 中: - k 來自 raw/pending 的 key(英文原文 key) - reverse_index[v] 來自 final/zh_tw 的 key(同樣是英文原文 key) 兩者來自不同命名空間,直接比對 key 幾乎不會成立,導致去重邏輯形同 no-op。 修復:移除 k != reverse_index[v] 判斷,直接以 v in reverse_index 來決定 是否跳過。若同一個翻譯結果 v 已出現在 final(即 v in reverse_index), 就視為已處理,直接跳過不送 pending。 Note: reverse_index 的非確定性問題(同一英文文字對應多個 key 時 rglob 迭代順序不穩定)在 f3a5814 中已透過 rev_candidates + sorted tiebreaker 修復,本次僅處理 cross-namespace 比對失效問題。
…dict prompt - lm_response_parser.py: 改用 brace-counting parser 取代 non-greedy regex,正確處理巢狀 JSON - lm_translator_main.py: System Prompt dict 支援 content/text key 萃取 - 新增測試:test_cache_manager, test_cache_store, test_ftbquests_unshield_logic, test_kubejs_translator_clean, test_lm_api_client, test_lm_response_parser, test_lm_translator_main_prompts
…ty 順序 - tests/test_lm_translator_main.py: 移除 test_patchouli_prompt_dict_converted_to_string(module-level 常數無法被 mock,修復後與 test_lm_translator_main_prompts.py 重疊) - cache_manager.py: clear_dirty() 移至 _save_entries_to_active_shards() 成功後執行,確保 crash 時 dirty flag 正確
- Issue #7: ftbquests + kubejs 雙重 JSON 讀取 → src_mapping_cache 緩存 - Issue #8: ftbquests callbacks 改為工廠函式 make_on_translated_item/batch_flushed/progress - Issue #7b: kubejs 同樣加入 src_mapping_cache - kubejs_translator_clean.py: reverse_index 確定性 + cross-namespace 去重 - 確認 unshield_text 傳入 .shields 而非整個 ShieldedText
…ld.shields) - ftbquests_lmtranslator.py: src_mapping_cache 緩存 JSON mapping,callback 工廠函式 - kubejs_tooltip_lmtranslator.py: src_mapping_cache 緩存,確認 unshield.shields - 驗證:1161 tests passed
7a8351a to
c2a1a60
Compare
Code Review 全量修復 PR
修復內容
本 PR 修復
CODE_REVIEW_REPORT_20260325_BY_PATH.md中記錄的所有 28 個問題。CRITICAL(需立即合併)
ftbquests_lmtranslator.pyunshield_text()傳入ShieldedText而非.shieldspatch_md_lmtranslator.pylm_api_client.pylm_translator_main.pyHIGH
_task_runner.pysession.finish()在 exception path 未呼叫lm_service.py/extract_service.pyflush()cache_manager.pyclear_dirty()在寫入前執行ftbquests_lmtranslator.py/kubejs_tooltip_lmtranslator.pyftbquests_lmtranslator.pylm_config_rules.pylm_translator_shared_recording.pymd_lmtranslator.pylm_response_parser.pyMEDIUM + LOW
所有 MEDIUM (#13-#23) 及 LOW (#24-#27) 問題均已修復。
驗證
python -m py_compile語法檢查.bak備份