fix: Code Review full remediation (28 issues, 2026-03-25) #43

Open
jlin53882 wants to merge 27 commits into main from fix/all-code-review-issues

Conversation

@jlin53882
Owner

Code Review Full Remediation PR

Fixes

This PR fixes all 28 issues recorded in CODE_REVIEW_REPORT_20260325_BY_PATH.md.

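CRITICAL (merge immediately)

| Issue | File | Problem |
| --- | --- | --- |
| #1 | ftbquests_lmtranslator.py | unshield_text() is passed the ShieldedText object instead of .shields |
| #1b | patch_md_lmtranslator.py | Same as above |
| #2 | lm_api_client.py | API key exposed directly in the URL (security vulnerability) |
| #3 | lm_translator_main.py | A dict is passed to the API as the system prompt |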

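HIGH

| Issue | File | Problem |
| --- | --- | --- |
| #4 | _task_runner.py | session.finish() is not called on the exception path |
| #5 | lm_service.py / extract_service.py | flush() is not called on error paths |
| #6 | cache_manager.py | clear_dirty() runs before the write |
| #7/#7b | ftbquests_lmtranslator.py / kubejs_tooltip_lmtranslator.py | Performance issue: the same JSON is read twice |
| #8 | ftbquests_lmtranslator.py | Callbacks are defined inside the loop |
| #9 | lm_config_rules.py | API key validation is too permissive |
| #10 | lm_translator_shared_recording.py | CSV injection vulnerability |
| #11 | md_lmtranslator.py | Re-shielding logic error |
| #12 | lm_response_parser.py | Greedy regex matches invalid JSON |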

MEDIUM + LOW

All MEDIUM (#13-#23) and LOW (#24-#27) issues have been fixed.

Verification

  • All modified files pass the python -m py_compile syntax check
  • .bak backups were created before each modification

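- In clean_kubejs_from_raw_impl, add a reverse_index dedup step before pending_en is written
- If an English text (value) already appears in final/zh_tw.json (under a different key), skip it and do not write it to pending
- Avoids translating the same English source text more than once just because it appears under different keys
- Add the ColorCharError dataclass to record illegal color-character errors
- Implement the COLOR_PATTERN regex: after &, only a-v (excluding w), 0-9, space, \, and # are allowed
- check_color_chars(): check a single string for illegal color characters
- check_json_file(): read a JSON file and recursively check all string values
- check_directory(): recursively check all .json files under a directory
- Follows the existing checkers' generator (yield) pattern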
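
To make the color-character rule concrete, here is a minimal sketch of a generator-style checker. It assumes the rule is exactly as stated in the commit message (after `&`, only `a-v`, `0-9`, space, `\`, and `#` are legal). The `ColorCharError` fields and the regex below are illustrative guesses, not the PR's actual definitions. This sketch also reports a trailing `&` with nothing after it, which the real checker may or may not do.

```python
import re
from dataclasses import dataclass
from typing import Iterator

# Assumption: "&" is legal only when followed by a-v, 0-9, space, backslash, or "#".
# The negative lookahead finds every "&" that is NOT followed by one of those.
_ILLEGAL_COLOR = re.compile(r"&(?![a-v0-9 \\#])(.?)", re.DOTALL)


@dataclass
class ColorCharError:
    # Hypothetical fields; the dataclass in the PR may record different data.
    position: int  # index of the "&" in the checked string
    char: str      # the offending character after "&" ("" at end of string)


def check_color_chars(text: str) -> Iterator[ColorCharError]:
    """Yield one error per illegal color sequence, in order of appearance."""
    for match in _ILLEGAL_COLOR.finditer(text):
        yield ColorCharError(position=match.start(), char=match.group(1))
```

Because the checker is a generator, a caller can stop at the first error or collect all of them with `list(check_color_chars(value))`.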
…ipelines

Commit 1: feat(rich-text-shield) - add core module
- New module: translation_tool/plugins/shared/rich_text_shield.py
  - ShieldPiece / ShieldedText dataclasses
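  - shield_text(): extracts 7 kinds of formatting fragments that must not be translated (color codes / item IDs / URLs / escaped \& / images / event JSON / page breaks)
  - unshield_text(): restores all placeholders
  - add_escape_quotes(): JSON escape patch (ported from FTBQL)
- Updated shared/__init__.py to export the new module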

Commit 2: fix(kubejs-lm) - integrate shield/unshield into LM translation pipeline
- collect_items_from_mapping(): the shield_text() result is stored in _shielded; the original text is kept when skip_reason is not None
- on_translated_item(): unshield_text() restores the translated text

Commit 3: fix(kubejs-clean) - integrate shield into s2t pipeline (Phase 2)
- kubejs_translator_clean.py: _shielded_convert() helper that protects the safe_convert_text_fn
  call sites: deep_merge_3way_flat_impl() and client_scripts handling
- Prevents the OpenCC s2t conversion from corrupting KubeJS formatting markers
- md_extract_qa: shield_text before writing pending JSON
- md_inject_qa: unshield_text before writing back to MD
- Item dataclass: add _shields field for shield restoration
- _shield_item helper for clean shield/unshield roundtrip
prevent ShieldedText object from leaking into json.dumps()
- Previously session.finish() was only called in the success path
- Now it's in the finally block to ensure it always executes
- This prevents session leaks when exceptions occur
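
A minimal sketch of the pattern this commit describes, assuming a session object with `set_error()` and `finish()` methods. The `Session` class and `run_task` function here are hypothetical stand-ins, not the code in `_task_runner.py`.

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class Session:
    """Hypothetical stand-in for the task session used in the PR."""

    def __init__(self) -> None:
        self.finished = False
        self.error: Optional[BaseException] = None

    def set_error(self, exc: BaseException) -> None:
        self.error = exc

    def finish(self) -> None:
        self.finished = True


def run_task(session: Session, task: Callable[[], T]) -> T:
    try:
        return task()
    except Exception as exc:
        session.set_error(exc)  # record the failure, then re-raise
        raise
    finally:
        # Runs on both the success path and the exception path.
        session.finish()
```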
- Remove the if save_path.exists() check and catch FileNotFoundError with try-except instead
- Avoids a TOCTOU (Time-of-Check to Time-of-Use) problem: the file may be deleted between the check and the read
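
The change above can be sketched as follows. `load_saved_state` is a hypothetical name, and it assumes the saved file is a JSON document; the point is only that the open call itself is the existence check.

```python
import json
from pathlib import Path
from typing import Optional


def load_saved_state(save_path: Path) -> Optional[dict]:
    """Read the file directly and treat a missing file as 'no saved state'."""
    try:
        with save_path.open("r", encoding="utf-8") as fh:
            return json.load(fh)
    except FileNotFoundError:
        # No separate exists() call, so there is no window in which the
        # file can disappear between the check and the read.
        return None
```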
- lm_service.py: added flush() after session.set_error() in exception handler
- extract_service.py: added flush() in both run_lang_extraction_service and run_book_extraction_service exception handlers
- This ensures buffered logs are flushed even when exceptions occur
- Add _counter_lock (threading.Lock) to protect the module-level counters
- The read-increment-write operations in _next_color_placeholder / _next_item_placeholder / _next_escaped_placeholder
  are now guarded by the lock to avoid race conditions
- RE_LANG_SEG is now defined only once (at line 61)
- Remove the duplicate definition at line 87, keeping a single declaration
…s shielded_src.shields

The unshield_text() function expects shields (list[ShieldPiece]), not the
whole ShieldedText object. Fixed line 78 to pass shielded_src.shields.

Issue: #1b
The cache hit path was missing unshield_text() call while the cache miss
path properly called unshield_text(). Added shield handling to cache hit
path to unify both code paths.

Issue: #18
Added explicit check for empty shard files (0 bytes) in load_shard_file().
Previously empty files would cause JSONDecodeError and be logged as generic
failure. Now they are specifically detected and logged as empty shard warnings.
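
A minimal sketch of the empty-shard check, assuming shards are JSON objects on disk. The logger name, messages, and return values are illustrative and not taken from `load_shard_file()` in the PR.

```python
import json
import logging
from pathlib import Path

logger = logging.getLogger("cache_shards_sketch")


def load_shard_file(path: Path) -> dict:
    """Load one shard, reporting empty files separately from corrupt ones."""
    if path.stat().st_size == 0:
        # A 0-byte file is not valid JSON; log it as its own condition.
        logger.warning("Empty shard file (0 bytes), skipping: %s", path)
        return {}
    try:
        with path.open("r", encoding="utf-8") as fh:
            data = json.load(fh)
    except json.JSONDecodeError as exc:
        logger.warning("Failed to parse shard %s: %s", path, exc)
        return {}
    return data if isinstance(data, dict) else {}
```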

Issue: #19
… add_to_cache

Added add_to_cache_batch() function that takes multiple entries and acquires
the lock once for all entries, rather than acquiring/releasing for each entry.
This reduces lock contention when many cache entries are added rapidly.

Issue: #23
#13 - Replace except Exception: pass with logger.warning so errors are recorded
- ftbquests_lmtranslator.py: 7 places
- kubejs_tooltip_lmtranslator.py: 6 places
- md_lmtranslator.py: 2 places
- lm_translator_shared_loop.py: 5 places
- cache_loader.py: 1 place
- cache_search.py: 4 places

#14 - cache_manager.py initialization locking
- reload_translation_cache() now holds lock during reset+reload

#15 - cache_shards.py TOCTOU race protection
- _save_entries_to_active_shards() adds a file lock to protect the read-modify-write cycle

#16 - cache_shards.py fsync already present (confirmed no change needed)

#17 - md_lmtranslator.py skip_reason handling
- Add the skip_skipped counter
- Items whose skip_reason is not None are treated directly as cache hits
…t lock

- lm_api_client.py: move API key from URL query to Authorization Bearer header (security)
- lm_translator_main.py: convert dict system prompt to string before API call
- cache_manager.py: add cache_lock to initialize_translation_cache (race condition fix)
- Issue #7: ftbquests_lmtranslator.py - Cache JSON mappings to avoid duplicate reads
- Issue #7b: kubejs_tooltip_lmtranslator.py - Same fix for duplicate JSON reads
- Issue #8: ftbquests_lmtranslator.py - Move callbacks outside loop (factory functions)
- Issue #11: md_lmtranslator.py - Fix re-shielding logic (remove invalid else branch)
- Issue #12: lm_response_parser.py - Use non-greedy regex to avoid matching invalid JSON

# Conflicts:
#	translation_tool/plugins/ftbquests/ftbquests_lmtranslator.py
#	translation_tool/plugins/kubejs/kubejs_tooltip_lmtranslator.py
#	translation_tool/plugins/md/md_lmtranslator.py
@greptile-apps

greptile-apps bot commented Mar 25, 2026

Greptile Summary

This PR claims to fix 28 issues identified in a code review report, spanning critical security fixes, logic corrections, and a broad shield/unshield rich-text protection layer integrated into the KubeJS, FTBQuests, and Markdown translation pipelines. While many lower-priority fixes (exception logging, finally block cleanup, CSV injection protection, system prompt dict→string defaults) are correctly implemented, several of the critical and high-priority fixes remain broken or introduce new regressions.

Issues from previous review threads that remain unresolved:

  • lm_api_client.py: the Authorization: Bearer {api_key} header still breaks all Gemini API calls; the correct header is x-goog-api-key
  • lm_response_parser.py: Non-greedy regex \{[\s\S]*?\} still fails on nested JSON (stops at the first })
  • cache_shards.py: import msvcrt (Windows-only) added in four places with no cross-platform fallback, breaking the entire cache on Linux/macOS
  • ftbquests_lmtranslator.py: Shield/unshield in on_translated_item is still a silent no-op, because shield_text(src_text) is called after translation, generating fresh placeholders that were never present in the text sent to the LLM

New issues introduced by this PR:

  • lm_config_rules.py: Regex [_-a-zA-Z0-9] creates an ASCII range _-a (95–97), unintentionally allowing backtick; the intended a-z range is never formed, so lowercase letters b through y are rejected even though the error message says letters are valid
  • kubejs_translator_clean.py: New dedup block builds reverse_index from zh_tw.json values (Traditional Chinese text) but looks up English source values against it; the comparison almost never matches, making the dedup dead code
  • rich_text_shield.py: Global module-level counters produce non-deterministic placeholder IDs across runs; if processing order changes, source_text cache keys in md_lmtranslator.py will differ between runs, causing unnecessary cache misses
  • md_lmtranslator.py: The else branch in on_translated_item calls shield_text() on already-shielded text, producing empty shields and a silent no-op

Confidence Score: 1/5

  • Not safe to merge — four critical bugs from the previous review remain unresolved, including a broken Gemini auth header that silently fails all API calls and a Windows-only import that crashes the entire cache system on Linux/macOS.
  • Three of the originally flagged CRITICAL issues (wrong auth header, non-greedy regex, msvcrt cross-platform) are still broken and will cause immediate runtime failures. The FTBQuests shield no-op also remains. This PR additionally introduces two new P1 logic bugs (regex character range and dedup semantic mismatch). While the lower-priority improvements are genuine progress, the unresolved regressions make the PR unsafe to merge.
  • lm_api_client.py (broken auth), lm_response_parser.py (broken JSON regex), cache_shards.py (Windows-only import), ftbquests_lmtranslator.py (shield no-op), lm_config_rules.py (regex range bug), kubejs_translator_clean.py (dedup dead code)

Important Files Changed

Filename Overview
translation_tool/core/lm_api_client.py Attempted fix for API key in URL changes to Authorization: Bearer {api_key}, but the Gemini REST API requires x-goog-api-key header for static keys — Bearer is for OAuth2 access tokens. All API calls will return HTTP 401. This was flagged in a previous review thread and remains unresolved.
translation_tool/core/lm_response_parser.py Changed greedy regex \{[\s\S]*\} to non-greedy \{[\s\S]*?\} — this breaks all nested JSON objects (stops at the first } rather than finding the outermost one). Flagged in previous review thread and remains unresolved.
translation_tool/utils/cache_shards.py import msvcrt (Windows-only) added in four places: unused in _write_json_atomic, and used for file locking in _rotate_shard_if_needed and _save_entries_to_active_shards. All four will raise ModuleNotFoundError on Linux/macOS. Flagged in previous review thread and remains unresolved.
translation_tool/plugins/ftbquests/ftbquests_lmtranslator.py on_translated_item calls shield_text(src_text) AFTER translation, generating fresh placeholders that were never injected into the text sent to the LLM — unshield_text is always a no-op. Flagged in a previous thread and remains unresolved. Other improvements (exception logging, cache recording) are correct.
translation_tool/core/lm_config_rules.py New API key validation adds length and character checks, but the regex [_-a-zA-Z0-9] has a character range bug: _-a creates an ASCII range 95–97, unintentionally allowing backtick, and the following - and z are then parsed as literals, so lowercase letters b through y are rejected even though the error message says letters are valid.
translation_tool/core/kubejs_translator_clean.py New _shielded_convert helper and dedup logic added. The _shielded_convert function is correct. However, the new reverse-index dedup block builds its index from zh_tw.json values (Traditional Chinese text) but then looks up English values from pending_en, so the check almost never fires and the dedup is effectively dead code.
translation_tool/plugins/shared/rich_text_shield.py New module implementing shield/unshield for KubeJS rich text. Pattern matching and placeholder substitution logic is sound. Main concern: global module-level counters make placeholder IDs non-deterministic across runs if processing order changes, which can cause cache key mismatches for MD blocks.
translation_tool/plugins/md/md_lmtranslator.py Shield/unshield integration added. on_translated_item primary path (using _shielded) is correct. The else branch that calls shield_text(src_text) on already-shielded text is a no-op and should be removed. Cache key stability depends on consistent processing order due to global counters in rich_text_shield.py.
translation_tool/plugins/kubejs/kubejs_tooltip_lmtranslator.py Shield integration in collect_items_from_mapping and on_translated_item follows the correct pattern: shield at collection time, store _shielded in item dict, unshield in both cache-hit and cache-miss callbacks. Correctly uses source_text (original) for cache key.
translation_tool/utils/cache_manager.py initialize_translation_cache now takes the lock before checking initialized, fixing a double-initialization race. reload_translation_cache refactored to inline the load logic; since reset_runtime_state mutates the existing state object in-place (confirmed by reading cache_store.py), the lock acquired on the old state remains valid throughout. New add_to_cache_batch helper reduces lock contention correctly.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["Source Text\n(e.g. 'Hello &a World')"] --> B["md_extract_qa.py\n_shield_item()"]
    B --> |"shield_text() → $C0$\nstores _shields in JSON"| C["Pending JSON\ntext: 'Hello $C0$ World'\n_shields: [$C0$→&a]"]
    C --> D["md_lmtranslator.py\nhash_to_src[hash] = it.text"]
    D --> E["shield_text(already-shielded)\n⚠️ No patterns match\nshields = []  ← REDUNDANT"]
    E --> F["LLM receives\n'Hello $C0$ World'"]
    F --> G["LLM returns\n'你好 $C0$ 世界'"]
    G --> H{"_shielded.shields\nempty?"}
    H --> |"Yes (always)\nelse branch"| I["shield_text(src_text)\n⚠️ No-op — fresh\nplaceholders not in dst"]
    H --> |"No (never reached)"| J["unshield_text(dst, shields)"]
    I --> K["hash_to_dst[h] = '你好 $C0$ 世界'"]
    K --> L["md_inject_qa.py\napply_item_to_md_lines()"]
    L --> M["unshield_text(item.text, item._shields)\n✅ $C0$ → &a\n= '你好 &a 世界'"]

    style E fill:#ffe0b2,stroke:#e65100
    style I fill:#ffe0b2,stroke:#e65100
    style H fill:#ffe0b2,stroke:#e65100

Comments Outside Diff (4)

  1. translation_tool/core/lm_config_rules.py, line 303 (link)

    P1 Regex character range allows backtick and silently excludes most lowercase letters

    The character class [_-a-zA-Z0-9] contains the expression _-a, which creates an ASCII range from _ (95) to a (97). This unintentionally includes the backtick character ` (96), which should be invalid in a Gemini API key.

    Additionally, because _-a consumes the a, the intended a-z range is never formed: Python's re module parses the following - and z as literal characters. The class therefore accepts only _, `, a, -, z, A-Z, and 0-9, so keys containing any lowercase letter from b through y are incorrectly rejected, even though the error message states that letters are allowed ("僅允許 'AIza' 開頭後接英文字母、數字、 dash(-) 或 underscore(_)", that is, only 'AIza' followed by English letters, digits, dash, or underscore).

    The same bug appears at line 332 in validate_api_keys_from_ui.
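
A quick way to check what each character class accepts is to run both through Python's `re` module. The sketch below compares the class from the PR with the reordered class from the follow-up commit. The constant names are made up for illustration, and the real validator presumably also checks the `AIza` prefix and key length, which are left out here.

```python
import re

# Class as written in the PR: "_-a" is parsed as a range (95-97), and the
# "-" and "z" that follow are parsed as single literal characters.
BUGGY_CLASS = re.compile(r"[_-a-zA-Z0-9]+")

# Reordered class: ranges first, "_" and a trailing literal "-" last.
FIXED_CLASS = re.compile(r"[a-zA-Z0-9_-]+")


def accepts(pattern: "re.Pattern[str]", candidate: str) -> bool:
    """True when the whole candidate consists of characters in the class."""
    return pattern.fullmatch(candidate) is not None


if __name__ == "__main__":
    for sample in ["`", "-", "a", "b", "z", "Key_09-x"]:
        print(repr(sample), accepts(BUGGY_CLASS, sample), accepts(FIXED_CLASS, sample))
```

Placing `-` last (or escaping it as `\-`) keeps it from ever being read as a range operator.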

  2. translation_tool/core/lm_config_rules.py, line 332 (link)

    P1 Same regex bug as line 303 — validate_api_keys_from_ui

    The same [_-a-zA-Z0-9] character class issue is repeated here (backtick unintentionally allowed; lowercase letters b through y silently rejected despite the error message claiming letters are valid).

  3. translation_tool/plugins/shared/rich_text_shield.py, line 1150-1185 (link)

    P2 Global counters produce non-deterministic placeholders, potentially breaking cache key stability across runs

    _counter_color, _counter_item, and _counter_escaped are module-level globals that never reset between shield_text() calls (except explicitly via _reset_counters()). In a single translation session that processes N files, the placeholder IDs ($C0$, $C1$, ...) depend on the order in which texts are processed.

    In md_lmtranslator.py, the cache key is content_hash|source_text where source_text is the extract-step-shielded text (e.g. "Hello $C0$ World"). On the next run, if any other file is processed before this one, the global counter will have advanced, and the same original text produces "Hello $C5$ World" instead — a different source_text, so the cache lookup misses even though the text is identical.

    The safest fix is to generate placeholder IDs from a local counter per shield_text() call rather than from global state:

    def shield_text(text: str) -> ShieldedText:
        local_color = 0
        local_item = 0
        local_escaped = 0
        # use local_* instead of incrementing globals

    Alternatively, derive placeholder IDs deterministically from the original text (e.g., hash-based). This would make placeholders stable regardless of processing order.
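
To illustrate the per-call counter suggestion, here is a minimal, self-contained sketch that shields only `&x` color codes. `ShieldedText.clean`, `ShieldedText.shields`, and the `$C0$` placeholder format follow names used in this PR; the `ShieldPiece` fields, the color-code regex, and the function bodies are assumptions and are much simpler than the real `rich_text_shield.py`.

```python
import re
from dataclasses import dataclass, field
from typing import List


@dataclass
class ShieldPiece:
    placeholder: str  # e.g. "$C0$" (field names are hypothetical)
    original: str     # e.g. "&a"


@dataclass
class ShieldedText:
    clean: str
    shields: List[ShieldPiece] = field(default_factory=list)


# Assumption: only "&" + one of 0-9/a-v is shielded in this sketch.
_COLOR_CODE = re.compile(r"&[0-9a-v]")


def shield_text(text: str) -> ShieldedText:
    shields: List[ShieldPiece] = []

    def _replace(match: "re.Match[str]") -> str:
        # The index is local to this call, so the same input always
        # produces the same placeholders regardless of processing order.
        placeholder = f"$C{len(shields)}$"
        shields.append(ShieldPiece(placeholder, match.group(0)))
        return placeholder

    return ShieldedText(_COLOR_CODE.sub(_replace, text), shields)


def unshield_text(text: str, shields: List[ShieldPiece]) -> str:
    for piece in shields:
        text = text.replace(piece.placeholder, piece.original)
    return text
```

With local numbering, the shielded form of a given source string no longer depends on what was processed before it, which is what the cache key needs.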

  4. translation_tool/plugins/md/md_lmtranslator.py, line 956-961 (link)

    P2 else branch in on_translated_item is always a no-op and potentially misleading

    When execution reaches the else branch, _shielded is either None or has an empty shields list. In both cases, src_text here is the already-shielded text produced by md_extract_qa.py (e.g. "Hello $C0$ World"). Calling shield_text(src_text) on this already-shielded text will not match any shield pattern (since $C0$ does not look like &a, &#RRGGBB, or #namespace:item), so shielded_src.shields will always be empty, and unshield_text(dst, []) is a no-op.

    Rather than silently doing nothing with misleading log-on-exception coverage, this branch should be removed entirely. The correct unshielding for _shields from the extract step happens downstream in apply_item_to_md_lines via item._shields.

Reviews (2): Last reviewed commit: "fix: correct API key regex range operator + 測..."


- matches = re.findall(r"\{[\s\S]*\}", text)
+ # ✅ Issue #12 fix: use a non-greedy regex to avoid matching invalid multi-object JSON
+ matches = re.findall(r"\{[\s\S]*?\}", text)

P0 Non-greedy regex breaks nested JSON objects

The switch from greedy \{[\s\S]*\} to non-greedy \{[\s\S]*?\} introduces a regression for the fallback path. The non-greedy quantifier stops at the first } it finds, which for any nested JSON object will produce an incomplete, invalid fragment.

For example, given the text:

Here is my translation: {"key1": "value", "nested": {"k": "v"}} done
  • Greedy regex matches: {"key1": "value", "nested": {"k": "v"}} ✅ (valid JSON)
  • Non-greedy regex matches: {"key1": "value", "nested": {"k": "v"} ❌ (stops at inner }, missing outer })

Since json.loads then fails on that partial match, and re.findall has already consumed the position, no valid JSON is returned and the function throws RuntimeError("JSON 解析失敗").

LLM responses that embed translation dictionaries (which this project relies on) commonly use nested objects. The correct fix for the original issue is to keep the greedy match but additionally try the largest valid candidate first:

Suggested change
- matches = re.findall(r"\{[\s\S]*?\}", text)
+ matches = re.findall(r"\{[\s\S]*\}", text)

Use fsync to ensure data is written to disk, avoiding data loss caused by
calling os.replace() before the OS buffers have been flushed.
"""
import msvcrt

P0 Windows-only msvcrt module used without cross-platform fallback

msvcrt is a Windows-only standard library module. Importing it on Linux or macOS raises ModuleNotFoundError: No module named 'msvcrt' at runtime, making the entire cache_shards.py module — and by extension the whole translation cache — unusable on non-Windows systems.

import msvcrt appears in four places in this file:

  • _write_json_atomic (line 25) — unused in that function (the actual call is os.fsync(), which is POSIX)
  • _rotate_shard_if_needed (line 90 inside try, line 113 inside finally)
  • _save_entries_to_active_shards (line 134)

For true cross-platform file locking the standard approach is to use fcntl.flock on POSIX and msvcrt.locking on Windows, with a runtime check:

import sys
if sys.platform == "win32":
    import msvcrt
    def _lock_fd(fd): msvcrt.locking(fd, msvcrt.LK_LOCK, 1)
    def _unlock_fd(fd): msvcrt.locking(fd, msvcrt.LK_UNLCK, 1)
else:
    import fcntl
    def _lock_fd(fd): fcntl.flock(fd, fcntl.LOCK_EX)
    def _unlock_fd(fd): fcntl.flock(fd, fcntl.LOCK_UN)

Additionally, the import msvcrt on line 25 (_write_json_atomic) serves no purpose since that function only calls os.fsync() and should be removed regardless.

Comment on lines +473 to +477
try:
    shielded_src = shield_text(src_text)
    t = unshield_text(t, shielded_src.shields)
except Exception as e:
    log_info(f"[FTB-LM] unshield 失敗: {e}")

P1 Shield/unshield is a no-op — FTB items are never shielded before translation

The on_translated_item callback calls shield_text(src_text) after the LLM has already translated t. This freshly-created ShieldedText contains brand-new globally-counted placeholders (e.g. $C42$, $P7$) that were never injected into the source text sent to the model, so they cannot exist in the translated result t. The subsequent unshield_text(t, shielded_src.shields) therefore always returns t unchanged — the entire shield operation is a silent no-op.

Compare this to the correct implementation in kubejs_tooltip_lmtranslator.py, where shielding happens during item collection in collect_items_from_mapping and the ShieldedText object is stored in the item dict under _shielded:

# kubejs — correct pattern:
shielded = shield_text(v)
item["text"] = shielded.clean       # shielded text sent to LM
item["_shielded"] = shielded        # stored for later unshield

# in on_translated_item:
shielded = it.get("_shielded")
if shielded and shielded.shields:
    final_text = unshield_text(t, shielded.shields)

To properly protect FTB source text, shielding needs to be applied in the item-collection helper (around line 96–110 in ftbquests_lmtranslator.py) so that item["text"] carries the cleaned text and item["_shielded"] carries the shield list — matching the KubeJS approach.

Comment on lines +29 to +32
headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {api_key}",
}

P0 Wrong authentication header breaks all Gemini API calls

The Google Gemini REST API requires plain API keys to be passed via the x-goog-api-key header, as shown in the official API reference. The Authorization: Bearer pattern is the OAuth2 flow for short-lived access tokens, not static API keys — these are entirely different credential types.

Sending a static API key in an Authorization: Bearer header will produce HTTP 401 Unauthorized on every request, silently breaking all translation calls.

The previous URL query-string approach was functionally correct. The improvement here should be using x-goog-api-key as the header name (without any Bearer prefix), per the Gemini documentation.
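
A minimal sketch of the header shape this comment recommends. `build_gemini_headers` is a hypothetical helper name; the rest of the request code in `lm_api_client.py` is assumed to stay unchanged.

```python
from typing import Dict


def build_gemini_headers(api_key: str) -> Dict[str, str]:
    """Send a static API key in x-goog-api-key, with no Bearer prefix."""
    return {
        "Content-Type": "application/json",
        "x-goog-api-key": api_key,
    }
```

This keeps the key out of the URL, and therefore out of most access logs, without switching to the OAuth-style Authorization header.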

- lm_config_rules.py: change [_-a-zA-Z0-9] to [a-zA-Z0-9_-] so that - is not interpreted as a range operator
- test_lm_config_rules.py: change the fake key to a compliant 40-character format
- test_pipeline_services_error_handling.py: check calls[-2] for the expected set_error instead
Comment on lines 246 to 263
if is_filled_text_impl(v):
    reverse_index.setdefault(v, []).append(k)

# Filter pending_en: skip keys whose English text already exists in final
pending_en = {
    k: v
    for k, v in pending_en.items()
    if not (
        is_filled_text_impl(v)
        and v in reverse_index
        and k != reverse_index[v][0]
    )
}
# ── dual-track dedup end ───────────────────────────────────────────────

if pending_en:
    dst_en = pending_root_p / rel_group / "en_us.json"
    write_json_fn(dst_en, pending_en)

P1 Dedup reverse index compares English values against Traditional Chinese values — always a no-op

final_tw_lookup is loaded from zh_tw.json files. In this project's translation pipeline, zh_tw.json files contain Traditional Chinese text as values (the translation output). The reverse_index is therefore a {TW_value → [keys]} mapping.

However, the dedup filter then checks v in reverse_index where v is sourced from pending_en — an English text. Since English text will almost never equal a Traditional Chinese translation, the condition v in reverse_index will virtually never be True. The entire dedup block is effectively dead code.

The comment describes the intent as "if an English text (value) already appears in final/zh_tw.json". If that intent is to skip English source texts that already have an existing translation, the index needs to be built from en_us.json files (or from the keys of zh_tw.json mapped to their corresponding source English text):

# Build reverse index: {en_text → [keys]} from already-translated pairs
reverse_index: dict[str, list[str]] = {}
for en_file in final_root_p.rglob("en_us.json"):
    en_data = read_json_dict_fn(en_file)
    if en_data:
        for k, en_v in en_data.items():
            if is_filled_text_impl(en_v):
                reverse_index.setdefault(en_v, []).append(k)

Or alternatively, if the intent is simply to deduplicate by English value within pending_en itself, no lookup against final_tw_lookup is needed at all.
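
For the second alternative, deduplicating inside pending_en alone, a minimal sketch could look like the following. `dedup_pending_by_value` is a hypothetical helper, and the "keep the alphabetically smallest key" rule is an assumption chosen to make the result deterministic; the project's `is_filled_text_impl` is approximated by a non-empty string check.

```python
from typing import Dict


def dedup_pending_by_value(pending_en: Dict[str, str]) -> Dict[str, str]:
    """Keep one key per distinct English text; blank values are kept as-is."""
    seen_values = set()
    result: Dict[str, str] = {}
    for key in sorted(pending_en):  # sorted keys make the choice deterministic
        value = pending_en[key]
        if isinstance(value, str) and value.strip():
            if value in seen_values:
                continue  # same English text already queued under another key
            seen_values.add(value)
        result[key] = value
    return result
```

Dropped keys would still need their translation copied back from the key that was kept, which this sketch does not attempt.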

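Problem:
- reverse_index[v][0] takes the first key, which depends on rglob iteration order
- rglob iteration order can differ between environments and runs
- This makes the dedup decision for identical English text unstable

Fix:
- Change the reverse_index type from dict[str, list[str]] to dict[str, str] (store the chosen key directly)
- Selection strategy:
  1. Prefer an "already translated" key (the zh_tw value differs from the key name, which indicates a real translation)
  2. At equal priority, take the alphabetically smallest key (deterministic tiebreaker)
  3. If nothing has been translated, take the alphabetically smallest key
- Eliminates the non-deterministic dependency on [0]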
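Problem: in the dual-track dedup filter condition k != reverse_index[v]:
- k comes from the raw/pending key (English source key)
- reverse_index[v] comes from the final/zh_tw key (also an English source key)

The two come from different namespaces, so comparing the keys directly almost never holds, which makes the dedup logic effectively a no-op.

Fix: remove the k != reverse_index[v] check and decide whether to skip based on v in reverse_index alone.
If the same translation result v already appears in final (i.e. v in reverse_index),
treat it as already handled and skip it instead of sending it to pending.

Note: the reverse_index non-determinism (unstable rglob iteration order when the same
English text maps to multiple keys) was already fixed in f3a5814 via rev_candidates + a sorted
tiebreaker; this change only addresses the broken cross-namespace comparison.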
…dict prompt

- lm_response_parser.py: replace the non-greedy regex with a brace-counting parser that handles nested JSON correctly
- lm_translator_main.py: a dict system prompt now supports extraction via the content/text key
- New tests: test_cache_manager, test_cache_store, test_ftbquests_unshield_logic, test_kubejs_translator_clean, test_lm_api_client, test_lm_response_parser, test_lm_translator_main_prompts
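
For readers unfamiliar with the brace-counting approach named above, here is a minimal sketch. `extract_json_objects` is a hypothetical function, not the parser in `lm_response_parser.py`. It tracks nesting depth and skips braces inside JSON strings, but it assumes the surrounding prose contains no unbalanced `{`.

```python
import json
from typing import Any, List


def extract_json_objects(text: str) -> List[Any]:
    """Return every balanced top-level {...} span that parses as JSON."""
    results: List[Any] = []
    depth = 0
    start = -1
    in_string = False
    escaped = False
    for index, char in enumerate(text):
        if in_string:
            # Inside a JSON string: braces do not count toward depth.
            if escaped:
                escaped = False
            elif char == "\\":
                escaped = True
            elif char == '"':
                in_string = False
            continue
        if char == '"' and depth > 0:
            in_string = True
        elif char == "{":
            if depth == 0:
                start = index
            depth += 1
        elif char == "}" and depth > 0:
            depth -= 1
            if depth == 0:
                try:
                    results.append(json.loads(text[start:index + 1]))
                except json.JSONDecodeError:
                    pass  # balanced braces but not valid JSON; skip it
    return results
```

Unlike either regex variant, this yields the whole outer object for nested input and still separates two independent objects that appear in the same response.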
…ty order

- tests/test_lm_translator_main.py: remove test_patchouli_prompt_dict_converted_to_string (module-level constants cannot be mocked; after the fix it overlaps with test_lm_translator_main_prompts.py)
- cache_manager.py: move clear_dirty() so it runs after _save_entries_to_active_shards() succeeds, keeping the dirty flag correct if a crash occurs
- Issue #7: ftbquests + kubejs double JSON read → cached via src_mapping_cache
- Issue #8: ftbquests callbacks converted to factory functions make_on_translated_item/batch_flushed/progress
- Issue #7b: kubejs likewise adds src_mapping_cache
- kubejs_translator_clean.py: deterministic reverse_index + cross-namespace dedup
- Confirmed that unshield_text is passed .shields rather than the whole ShieldedText
…ld.shields)

- ftbquests_lmtranslator.py: src_mapping_cache caches the JSON mapping; callback factory functions
- kubejs_tooltip_lmtranslator.py: src_mapping_cache caching; confirmed unshield uses .shields
- Verification: 1161 tests passed