
feat(kubejs): implement dual-track reverse_index dedup #40

Open
jlin53882 wants to merge 1 commit into main from
pr/dual-track-dedup


@jlin53882
Owner

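Implementation summary

What changed

  • Added dual-track reverse_index dedup logic to kubejs_translator_clean.py
  • Uses defaultdict(list) to build the reverse mapping, avoiding duplicate translation

Why

  • Fixes identical content being processed repeatedly during KubeJS translation
  • Improves translation pipeline efficiency

How it was verified

  • Unit tests verify that the dedup logic is correct
  • A local translation run confirms the output is correct

Files changed

  • translation_tool/core/kubejs_translator_clean.py (+31 lines)

- In clean_kubejs_from_raw_impl, added reverse_index dedup before pending_en is written
- If an English text (value) already appears in final/zh_tw.json (under a different key), it is skipped and not written to pending
- Avoids translating the same English source text more than once because it appears under different keys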
@chatgpt-codex-connector

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.
To continue using code reviews, you can upgrade your account or add credits to your account and enable them for code reviews in your settings.

@greptile-apps

greptile-apps bot commented Mar 24, 2026

Greptile Summary

This PR adds a "dual-track reverse-index dedup" block to kubejs_translator_clean.py with the goal of preventing the same English source text from being re-queued for translation when an equivalent entry already exists in the final/zh_tw.json output. While the intent is sound, the implementation has three significant correctness and performance issues that should be addressed before merging.

Key changes:

  • A new final_tw_lookup dict is built by reading all zh_tw.json files under final_root_p
  • A reverse_index (zh_tw_value → [keys]) is derived from that lookup
  • A dict-comprehension filter is applied to pending_en, removing entries whose English value already appears in reverse_index

Issues found:

  • Performance regression: final_tw_lookup and reverse_index are rebuilt from disk on every iteration of the for group_dir, files_map in groups.items() loop, causing O(N × files) filesystem reads. Both structures are independent of the current group and should be computed once before the loop, consistent with how tw_lookup is handled earlier in the function.
  • Non-deterministic filtering: The dedup check k != reverse_index[v][0] relies on the first key at index [0], whose position depends on rglob() traversal order (not stable across filesystems). If k happens to be at index [1], it will be incorrectly dropped; the membership check should use k not in reverse_index[v].
  • Semantic mismatch: The comment states the reverse index maps "英文文字 (English text) → keys", but final_tw_lookup is loaded from zh_tw.json files whose values are Chinese translations. The dedup is therefore a near-no-op for any entry that is properly translated, since an English string from pending_en will almost never match a Chinese value in the index. The PR description's stated goal of "avoiding duplicate translation of the same English content" is not achieved for the common case.

Confidence Score: 2/5

  • This PR introduces a dedup mechanism that is largely a no-op due to a semantic mismatch, and has a performance regression and non-deterministic filtering that could silently drop valid translation entries.
  • Three issues combine to lower confidence significantly: (1) the core semantic error means zh_tw Chinese values are compared against English strings, so the feature rarely works as intended; (2) rebuilding lookups on every loop iteration introduces a measurable performance regression for large modpacks; and (3) the [0]-index-based dedup is non-deterministic and can incorrectly filter valid entries depending on filesystem traversal order. None of these are data-loss catastrophic, but the combination of a near-no-op feature, a performance regression, and non-deterministic behavior warrants revision.
  • translation_tool/core/kubejs_translator_clean.py — specifically the new dedup block at lines 206–235

Important Files Changed

| Filename | Overview |
| --- | --- |
| translation_tool/core/kubejs_translator_clean.py | Adds a reverse-index dedup block inside the per-group loop. Contains three notable issues: (1) final_tw_lookup and reverse_index are rebuilt from disk on every loop iteration instead of once before the loop, causing O(N×files) I/O; (2) reverse_index[v][0] picks the "canonical" key based on non-deterministic filesystem traversal order, potentially causing different entries to be silently dropped on each run; (3) the index is built from zh_tw values (Chinese), not English originals, so the dedup is a near-no-op for fully-translated content and contradicts the block comment. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Start group iteration] --> B{pending_en\nnot empty?}
    B -- No --> G[Skip dedup]
    B -- Yes --> C{final_root_p\nexists?}
    C -- No --> G
    C -- Yes --> D["Read ALL zh_tw.json files\nfrom final_root_p\n⚠️ Done on EVERY iteration"]
    D --> E{final_tw_lookup\nnot empty?}
    E -- No --> G
    E -- Yes --> F["Build reverse_index\nzh_tw_value → list of keys\n⚠️ Values are Chinese, not English"]
    F --> H["Filter pending_en:\nremove k if:\n  v in reverse_index\n  AND k != reverse_index[v][0]\n⚠️ [0] is non-deterministic order"]
    H --> I[Updated pending_en\nalmost unchanged for\ntranslated entries]
    G --> J[Write pending en_us.json\nif pending_en not empty]
    I --> J

Reviews (1): Last reviewed commit: "feat(kubejs_translator_clean): add dual-track reve..."

Comment on lines +210 to +234
if pending_en and final_root_p.exists():
    # Build final_tw_lookup (key → source text) from final/zh_tw.json
    final_tw_lookup: dict[str, str] = {}
    for tw_file in final_root_p.rglob("zh_tw.json"):
        tw_data = read_json_dict_fn(tw_file)
        if tw_data:
            final_tw_lookup.update(tw_data)

    if final_tw_lookup:
        # Build reverse_index (English text → list of corresponding keys)
        reverse_index: dict[str, list[str]] = {}
        for k, v in final_tw_lookup.items():
            if is_filled_text_impl(v):
                reverse_index.setdefault(v, []).append(k)

        # Filter pending_en: skip keys whose English text already exists in final
        pending_en = {
            k: v
            for k, v in pending_en.items()
            if not (
                is_filled_text_impl(v)
                and v in reverse_index
                and k != reverse_index[v][0]
            )
        }

P1 final_tw_lookup rebuilt on every group iteration

final_tw_lookup (and the derived reverse_index) are constructed by iterating over all zh_tw.json files under final_root_p inside the for group_dir, files_map in groups.items() loop. For a modpack with N groups, this means all final files are read N times from disk — an O(N×files) I/O cost that should be O(files).

Both structures are purely derived from final_root_p and do not depend on the current group, so they should be computed once before the loop begins, similar to how tw_lookup is built at lines 123–137.

# Build once, before the loop
final_tw_lookup: dict[str, str] = {}
if final_root_p.exists():
    for tw_file in final_root_p.rglob("zh_tw.json"):
        tw_data = read_json_dict_fn(tw_file)
        if tw_data:
            final_tw_lookup.update(tw_data)

reverse_index: dict[str, list[str]] = {}
for k, v in final_tw_lookup.items():
    if is_filled_text_impl(v):
        reverse_index.setdefault(v, []).append(k)

# Then inside the loop, use `reverse_index` directly
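
The I/O difference can be made concrete with a small counting experiment. This is only a sketch: `build_reverse_index`, the in-memory `FAKE_FILES` mapping, and `counting_reader` are hypothetical stand-ins for the real function body and `read_json_dict_fn`, and `value.strip()` approximates `is_filled_text_impl`, whose exact behavior is not shown in this PR.

```python
from typing import Callable


def build_reverse_index(
    files: list[str],
    read_json: Callable[[str], dict[str, str]],
) -> dict[str, list[str]]:
    """Read every file once and map each non-empty value to its keys."""
    lookup: dict[str, str] = {}
    for path in files:
        lookup.update(read_json(path))
    index: dict[str, list[str]] = {}
    for key, value in lookup.items():
        if value.strip():  # assumed stand-in for is_filled_text_impl
            index.setdefault(value, []).append(key)
    return index


# Hypothetical in-memory "filesystem"; every read is recorded in `reads`.
FAKE_FILES = {
    "a/zh_tw.json": {"mod1:foo": "鐵錠"},
    "b/zh_tw.json": {"mod2:bar": "銅錠"},
}
reads: list[str] = []


def counting_reader(path: str) -> dict[str, str]:
    reads.append(path)
    return FAKE_FILES[path]


groups = ["g1", "g2", "g3"]

# Rebuilding inside the loop: one full scan per group.
reads.clear()
for _ in groups:
    build_reverse_index(list(FAKE_FILES), counting_reader)
reads_inside_loop = len(reads)

# Building once before the loop: one full scan in total.
reads.clear()
reverse_index = build_reverse_index(list(FAKE_FILES), counting_reader)
for _ in groups:
    pass  # per-group work would reuse reverse_index here
reads_hoisted = len(reads)

print(reads_inside_loop, reads_hoisted)
```

With three groups and two files, the in-loop variant performs one read per (group, file) pair, while the hoisted variant performs one read per file.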

Comment on lines +226 to +234
pending_en = {
    k: v
    for k, v in pending_en.items()
    if not (
        is_filled_text_impl(v)
        and v in reverse_index
        and k != reverse_index[v][0]
    )
}

P1 Non-deterministic dedup via reverse_index[v][0]

The condition k != reverse_index[v][0] picks the first key that was inserted for value v — but final_root_p.rglob() returns files in filesystem-traversal order, which is not guaranteed to be stable across OS/filesystem types. If the same value maps to multiple keys (e.g., ["mod1:foo", "mod2:foo"]), the "representative" key at index [0] may differ between runs, making the dedup non-deterministic.
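
One possible way to remove the traversal-order dependence, independent of how the filter itself is written, is to sort the paths before reading them. The sketch below is an assumption, not the project's code: `load_final_lookup` is a hypothetical helper and it uses `json.loads` directly in place of `read_json_dict_fn`.

```python
import json
import tempfile
from pathlib import Path


def load_final_lookup(final_root: Path) -> dict[str, str]:
    """Merge every zh_tw.json under final_root in sorted path order."""
    lookup: dict[str, str] = {}
    # sorted() makes the merge order independent of filesystem traversal order.
    for tw_file in sorted(final_root.rglob("zh_tw.json")):
        data = json.loads(tw_file.read_text(encoding="utf-8"))
        if isinstance(data, dict):
            lookup.update(data)
    return lookup


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Create the directories in reverse alphabetical order on purpose.
    for mod in ("b_mod", "a_mod"):
        (root / mod).mkdir()
        (root / mod / "zh_tw.json").write_text(
            json.dumps({f"{mod}:foo": "鐵錠"}), encoding="utf-8"
        )
    lookup = load_final_lookup(root)

key_order = list(lookup)
print(key_order)
```

Because dicts preserve insertion order, a sorted traversal also fixes which key lands at index 0 of each reverse_index list.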

Additionally, when the current k does appear somewhere in reverse_index[v] (just not at position 0), the condition k != reverse_index[v][0] evaluates to True and incorrectly drops k from pending_en.

The intent appears to be: "keep the entry if the exact same key already exists in the final directory (meaning it doesn't need re-translation)." That is better expressed as membership in the full list:

pending_en = {
    k: v
    for k, v in pending_en.items()
    if not (
        is_filled_text_impl(v)
        and v in reverse_index
        and k not in reverse_index[v]   # check full list, not just [0]
    )
}
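
The difference between the two conditions can be seen on a toy input. This is an illustrative sketch with made-up keys; it assumes the index happens to contain English passthrough values so that a match occurs at all, and it omits the `is_filled_text_impl` check for brevity.

```python
def filter_first_key(pending: dict[str, str], index: dict[str, list[str]]) -> dict[str, str]:
    """Original condition: keep k only if it is the first key recorded for v."""
    return {k: v for k, v in pending.items() if not (v in index and k != index[v][0])}


def filter_membership(pending: dict[str, str], index: dict[str, list[str]]) -> dict[str, str]:
    """Suggested condition: keep k if it appears anywhere in the list for v."""
    return {k: v for k, v in pending.items() if not (v in index and k not in index[v])}


pending_en = {
    "mod1:iron": "Iron Ingot",
    "mod2:iron": "Iron Ingot",
    "mod3:iron": "Iron Ingot",
}
index_a = {"Iron Ingot": ["mod1:iron", "mod2:iron"]}
index_b = {"Iron Ingot": ["mod2:iron", "mod1:iron"]}  # same keys, other order

first_a = filter_first_key(pending_en, index_a)
first_b = filter_first_key(pending_en, index_b)
member_a = filter_membership(pending_en, index_a)
member_b = filter_membership(pending_en, index_b)

print(sorted(first_a), sorted(first_b))
print(sorted(member_a), sorted(member_b))
```

The `[0]` variant keeps a different single key depending on list order, while the membership variant keeps the same two keys for both orderings.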

Comment on lines +206 to +223
# ── Dual-track dedup (reverse_index dedup) ───────────────────────────────
# Purpose: if an English text (value) already appears in final/zh_tw.json (under a different key),
# the English source already has a translation and does not need to be sent to pending again.
# Build reverse_index: {English text: [key1, key2, ...]}
if pending_en and final_root_p.exists():
    # Build final_tw_lookup (key → source text) from final/zh_tw.json
    final_tw_lookup: dict[str, str] = {}
    for tw_file in final_root_p.rglob("zh_tw.json"):
        tw_data = read_json_dict_fn(tw_file)
        if tw_data:
            final_tw_lookup.update(tw_data)

    if final_tw_lookup:
        # Build reverse_index (English text → list of corresponding keys)
        reverse_index: dict[str, list[str]] = {}
        for k, v in final_tw_lookup.items():
            if is_filled_text_impl(v):
                reverse_index.setdefault(v, []).append(k)

P1 reverse_index values are zh_tw (Chinese) text, not English originals

The block comment states:

Build reverse_index: {English text: [key1, key2, ...]}

However, final_tw_lookup is loaded exclusively from zh_tw.json files in the completed directory, meaning its values are Traditional Chinese translations, not English source strings. The reverse_index therefore maps zh_tw_value → [keys].

When the dedup check v in reverse_index runs, v is an English value from pending_en. A match only occurs if zh_tw.json happens to store an untranslated English passthrough value for some key. Entries with real Chinese translations will never match, making the dedup a no-op for the vast majority of already-translated content, contrary to what the description ("English text already exists in final") implies.

If the actual intent is to compare English originals, a separate en_us.json lookup from the final directory is needed, or the function signature should be extended to carry the original English source for each already-translated key.
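
A minimal sketch of the first option, under the assumption that the final directory also provides an English lookup (for example, loaded from en_us.json files, which this PR does not confirm exist there). `build_english_reverse_index` is a hypothetical helper, and `.strip()` again approximates `is_filled_text_impl`.

```python
def build_english_reverse_index(
    final_en: dict[str, str],
    final_tw: dict[str, str],
) -> dict[str, list[str]]:
    """Map English source text to the keys that already have a zh_tw translation."""
    index: dict[str, list[str]] = {}
    for key, en_text in final_en.items():
        # Only count keys whose English source and Chinese translation are both non-empty.
        if en_text.strip() and final_tw.get(key, "").strip():
            index.setdefault(en_text, []).append(key)
    return index


# Made-up data: one key is translated, one is not.
final_en = {"mod1:iron": "Iron Ingot", "mod1:gold": "Gold Ingot"}
final_tw = {"mod1:iron": "鐵錠"}
index = build_english_reverse_index(final_en, final_tw)

pending_en = {"mod2:iron": "Iron Ingot", "mod2:tin": "Tin Ingot"}
# Entries whose English text already has a translation under some other key.
already_translated = {k: v for k, v in pending_en.items() if v in index}
still_pending = {k: v for k, v in pending_en.items() if v not in index}

print(already_translated, still_pending)
```

Keyed on English text, the index matches pending entries even when the final translation is real Chinese, which is the case the zh_tw-valued index misses.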

Comment on lines +220 to +223
reverse_index: dict[str, list[str]] = {}
for k, v in final_tw_lookup.items():
    if is_filled_text_impl(v):
        reverse_index.setdefault(v, []).append(k)

P2 PR description says defaultdict(list) but code uses setdefault

The PR description states "Uses defaultdict(list) to build the reverse mapping", but the actual implementation uses dict.setdefault(v, []).append(k) (a plain dict). While functionally equivalent, the discrepancy means the PR description is inaccurate. If defaultdict was intended (which is the more idiomatic Python approach for this pattern), the import and construction should be:

from collections import defaultdict
reverse_index: defaultdict[str, list[str]] = defaultdict(list)
for k, v in final_tw_lookup.items():
    if is_filled_text_impl(v):
        reverse_index[v].append(k)
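
The equivalence, and one behavioral difference worth knowing, can be checked with a short sketch. The sample data is made up and `v.strip()` stands in for `is_filled_text_impl`.

```python
from collections import defaultdict

final_tw_lookup = {"mod1:foo": "鐵錠", "mod2:foo": "鐵錠", "mod1:bar": ""}

# Plain dict with setdefault, as in the PR's code.
plain: dict[str, list[str]] = {}
for k, v in final_tw_lookup.items():
    if v.strip():
        plain.setdefault(v, []).append(k)

# defaultdict(list), as in the PR description.
dd: defaultdict[str, list[str]] = defaultdict(list)
for k, v in final_tw_lookup.items():
    if v.strip():
        dd[v].append(k)

same_contents = dict(dd) == plain

# Difference: subscripting a missing key inserts it into a defaultdict,
# whereas a membership test (`"x" in dd`) never does.
in_before = "missing" in dd
_ = dd["missing"]
in_after = "missing" in dd

print(same_contents, in_before, in_after)
```

Since the filter guards each `reverse_index[v]` access with `v in reverse_index`, either container should behave the same there; the difference only matters if the index is subscripted without that guard.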

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
