Conversation
- 在 clean_kubejs_from_raw_impl 中,pending_en 寫入前新增 reverse_index dedup - 若某英文文字(value)已出現在 final/zh_tw.json(不同 key),則跳過不寫入 pending - 避免同一英文原文因不同 key 而重複翻譯
|
You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard. |
Greptile SummaryThis PR adds a "dual-track reverse-index dedup" block to Key changes:
Issues found:
Confidence Score: 2/5
Important Files Changed
Flowchart%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[Start group iteration] --> B{pending_en\nnot empty?}
B -- No --> G[Skip dedup]
B -- Yes --> C{final_root_p\nexists?}
C -- No --> G
C -- Yes --> D["Read ALL zh_tw.json files\nfrom final_root_p\n⚠️ Done on EVERY iteration"]
D --> E{final_tw_lookup\nnot empty?}
E -- No --> G
E -- Yes --> F["Build reverse_index\nzh_tw_value → list of keys\n⚠️ Values are Chinese, not English"]
F --> H["Filter pending_en:\nremove k if:\n v in reverse_index\n AND k != reverse_index[v][0]\n⚠️ [0] is non-deterministic order"]
H --> I[Updated pending_en\nalmost unchanged for\ntranslated entries]
G --> J[Write pending en_us.json\nif pending_en not empty]
I --> J
Reviews (1): Last reviewed commit: "feat(kubejs_translator_clean): 新增雙軌 reve..." | Re-trigger Greptile |
| if pending_en and final_root_p.exists(): | ||
| # 從 final/zh_tw.json 建立 final_tw_lookup(key → 原文) | ||
| final_tw_lookup: dict[str, str] = {} | ||
| for tw_file in final_root_p.rglob("zh_tw.json"): | ||
| tw_data = read_json_dict_fn(tw_file) | ||
| if tw_data: | ||
| final_tw_lookup.update(tw_data) | ||
|
|
||
| if final_tw_lookup: | ||
| # 建立 reverse_index(英文文字 → 對應 key 列表) | ||
| reverse_index: dict[str, list[str]] = {} | ||
| for k, v in final_tw_lookup.items(): | ||
| if is_filled_text_impl(v): | ||
| reverse_index.setdefault(v, []).append(k) | ||
|
|
||
| # 過濾 pending_en:跳過那些「英文文字已存在於 final」的 key | ||
| pending_en = { | ||
| k: v | ||
| for k, v in pending_en.items() | ||
| if not ( | ||
| is_filled_text_impl(v) | ||
| and v in reverse_index | ||
| and k != reverse_index[v][0] | ||
| ) | ||
| } |
There was a problem hiding this comment.
final_tw_lookup rebuilt on every group iteration
final_tw_lookup (and the derived reverse_index) are constructed by iterating over all zh_tw.json files under final_root_p inside the for group_dir, files_map in groups.items() loop. For a modpack with N groups, this means all final files are read N times from disk — an O(N×files) I/O cost that should be O(files).
Both structures are purely derived from final_root_p and do not depend on the current group, so they should be computed once before the loop begins, similar to how tw_lookup is built at lines 123–137.
# Build once, before the loop
final_tw_lookup: dict[str, str] = {}
if final_root_p.exists():
for tw_file in final_root_p.rglob("zh_tw.json"):
tw_data = read_json_dict_fn(tw_file)
if tw_data:
final_tw_lookup.update(tw_data)
reverse_index: dict[str, list[str]] = {}
for k, v in final_tw_lookup.items():
if is_filled_text_impl(v):
reverse_index.setdefault(v, []).append(k)
# Then inside the loop, use `reverse_index` directly| pending_en = { | ||
| k: v | ||
| for k, v in pending_en.items() | ||
| if not ( | ||
| is_filled_text_impl(v) | ||
| and v in reverse_index | ||
| and k != reverse_index[v][0] | ||
| ) | ||
| } |
There was a problem hiding this comment.
Non-deterministic dedup via
reverse_index[v][0]
The condition k != reverse_index[v][0] picks the first key that was inserted for value v — but final_root_p.rglob() returns files in filesystem-traversal order, which is not guaranteed to be stable across OS/filesystem types. If the same value maps to multiple keys (e.g., ["mod1:foo", "mod2:foo"]), the "representative" key at index [0] may differ between runs, making the dedup non-deterministic.
Additionally, when the current k does appear somewhere in reverse_index[v] (just not at position 0), the condition k != reverse_index[v][0] evaluates to True and incorrectly drops k from pending_en.
The intent appears to be: "keep the entry if the exact same key already exists in the final directory (meaning it doesn't need re-translation)." That is better expressed as membership in the full list:
pending_en = {
k: v
for k, v in pending_en.items()
if not (
is_filled_text_impl(v)
and v in reverse_index
and k not in reverse_index[v] # check full list, not just [0]
)
}| # ── 雙軌去重(reverse_index dedup)─────────────────────────────── | ||
| # 目的:若某英文文字(value)已出現在 final/zh_tw.json(不同 key), | ||
| # 表示該英文原文已有翻譯,不需要再送 pending。 | ||
| # 建立 reverse_index:{英文文字: [key1, key2, ...]} | ||
| if pending_en and final_root_p.exists(): | ||
| # 從 final/zh_tw.json 建立 final_tw_lookup(key → 原文) | ||
| final_tw_lookup: dict[str, str] = {} | ||
| for tw_file in final_root_p.rglob("zh_tw.json"): | ||
| tw_data = read_json_dict_fn(tw_file) | ||
| if tw_data: | ||
| final_tw_lookup.update(tw_data) | ||
|
|
||
| if final_tw_lookup: | ||
| # 建立 reverse_index(英文文字 → 對應 key 列表) | ||
| reverse_index: dict[str, list[str]] = {} | ||
| for k, v in final_tw_lookup.items(): | ||
| if is_filled_text_impl(v): | ||
| reverse_index.setdefault(v, []).append(k) |
There was a problem hiding this comment.
reverse_index values are zh_tw (Chinese) text, not English originals
The block comment states:
建立 reverse_index:{英文文字: [key1, key2, ...]}
However, final_tw_lookup is loaded exclusively from zh_tw.json files in the completed directory — meaning its values are Chinese Traditional translations, not English source strings. The reverse_index therefore maps zh_tw_value → [keys].
When the dedup check v in reverse_index runs, v is an English value from pending_en. A match only occurs if zh_tw.json happens to store an untranslated English passthrough value for some key. Entries with real Chinese translations will never match — making the dedup a no-op for the vast majority of already-translated content, contrary to what the description ("英文文字已存在於 final") implies.
If the actual intent is to compare English originals, a separate en_us.json lookup from the final directory is needed, or the function signature should be extended to carry the original English source for each already-translated key.
| reverse_index: dict[str, list[str]] = {} | ||
| for k, v in final_tw_lookup.items(): | ||
| if is_filled_text_impl(v): | ||
| reverse_index.setdefault(v, []).append(k) |
There was a problem hiding this comment.
PR description says
defaultdict(list) but code uses setdefault
The PR description states "使用 defaultdict(list) 建立 reverse mapping", but the actual implementation uses dict.setdefault(v, []).append(k) (a plain dict). While functionally equivalent, the discrepancy means the PR description is inaccurate. If defaultdict was intended (which is the more idiomatic Python approach for this pattern), the import and construction should be:
from collections import defaultdict
reverse_index: defaultdict[str, list[str]] = defaultdict(list)
for k, v in final_tw_lookup.items():
if is_filled_text_impl(v):
reverse_index[v].append(k)Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
f15ad45 to
8624cbc
Compare
實作摘要
改了什麼
everse_index 去重邏輯
為什麼要改
怎麼驗證的
檔案變更