feat(kubejs): 實作雙軌 reverse_index 去重 by jlin53882 · Pull Request #40 · jlin53882/Minecraft-translate

jlin53882 · 2026-03-23T06:27:06Z

實作摘要

改了什麼

在 kubejs_translator_clean.py 新增雙軌
everse_index 去重邏輯
使用 defaultdict(list) 建立 reverse mapping，避免重複翻譯

為什麼要改

解決 KubeJS 翻譯時相同內容被重複處理的問題
提升翻譯管線效率

怎麼驗證的

單元測試驗證去重邏輯正確
本地翻譯測試確認輸出正確

檔案變更

ranslation_tool/core/kubejs_translator_clean.py (+31 行)

- 在 clean_kubejs_from_raw_impl 中，pending_en 寫入前新增 reverse_index dedup - 若某英文文字（value）已出現在 final/zh_tw.json（不同 key），則跳過不寫入 pending - 避免同一英文原文因不同 key 而重複翻譯

chatgpt-codex-connector · 2026-03-23T06:27:13Z

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.
To continue using code reviews, you can upgrade your account or add credits to your account and enable them for code reviews in your settings.

greptile-apps · 2026-03-24T12:46:13Z

Greptile Summary

This PR adds a "dual-track reverse-index dedup" block to kubejs_translator_clean.py with the goal of preventing the same English source text from being re-queued for translation when an equivalent entry already exists in the final/zh_tw.json output. While the intent is sound, the implementation has three significant correctness and performance issues that should be addressed before merging.

Key changes:

A new final_tw_lookup dict is built by reading all zh_tw.json files under final_root_p
A reverse_index (zh_tw_value → [keys]) is derived from that lookup
A dict-comprehension filter is applied to pending_en, removing entries whose English value already appears in reverse_index

Issues found:

Performance regression: final_tw_lookup and reverse_index are rebuilt from disk on every iteration of the for group_dir, files_map in groups.items() loop, causing O(N × files) filesystem reads. Both structures are independent of the current group and should be computed once before the loop, consistent with how tw_lookup is handled earlier in the function.
Non-deterministic filtering: The dedup check k != reverse_index[v][0] relies on the first key at index [0], whose position depends on rglob() traversal order (not stable across filesystems). If k happens to be at index [1], it will be incorrectly dropped; the membership check should use k not in reverse_index[v].
Semantic mismatch: The comment states the reverse index maps "英文文字 (English text) → keys", but final_tw_lookup is loaded from zh_tw.json files whose values are Chinese translations. The dedup is therefore a near-no-op for any entry that is properly translated, since an English string from pending_en will almost never match a Chinese value in the index. The PR description's stated goal of "avoiding duplicate translation of the same English content" is not achieved for the common case.

Confidence Score: 2/5

This PR introduces a dedup mechanism that is largely a no-op due to a semantic mismatch, and has a performance regression and non-deterministic filtering that could silently drop valid translation entries.
Three issues combine to lower confidence significantly: (1) the core semantic error means zh_tw Chinese values are compared against English strings, so the feature rarely works as intended; (2) rebuilding lookups on every loop iteration introduces a measurable performance regression for large modpacks; and (3) the [0]-index-based dedup is non-deterministic and can incorrectly filter valid entries depending on filesystem traversal order. None of these are data-loss catastrophic, but the combination of a near-no-op feature, a performance regression, and non-deterministic behavior warrants revision.
translation_tool/core/kubejs_translator_clean.py — specifically the new dedup block at lines 206–235

Important Files Changed

Filename	Overview
translation_tool/core/kubejs_translator_clean.py	Adds a reverse-index dedup block inside the per-group loop. Contains three notable issues: (1) final_tw_lookup and reverse_index are rebuilt from disk on every loop iteration instead of once before the loop, causing O(N×files) I/O; (2) reverse_index[v][0] picks the "canonical" key based on non-deterministic filesystem traversal order, potentially causing different entries to be silently dropped on each run; (3) the index is built from zh_tw values (Chinese), not English originals, so the dedup is a near-no-op for fully-translated content and contradicts the block comment.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Start group iteration] --> B{pending_en\nnot empty?}
    B -- No --> G[Skip dedup]
    B -- Yes --> C{final_root_p\nexists?}
    C -- No --> G
    C -- Yes --> D["Read ALL zh_tw.json files\nfrom final_root_p\n⚠️ Done on EVERY iteration"]
    D --> E{final_tw_lookup\nnot empty?}
    E -- No --> G
    E -- Yes --> F["Build reverse_index\nzh_tw_value → list of keys\n⚠️ Values are Chinese, not English"]
    F --> H["Filter pending_en:\nremove k if:\n  v in reverse_index\n  AND k != reverse_index[v][0]\n⚠️ [0] is non-deterministic order"]
    H --> I[Updated pending_en\nalmost unchanged for\ntranslated entries]
    G --> J[Write pending en_us.json\nif pending_en not empty]
    I --> J

_{Reviews (1): Last reviewed commit: "feat(kubejs_translator_clean): 新增雙軌 reve..." | Re-trigger Greptile}

greptile-apps · 2026-03-24T12:46:17Z

translation_tool/core/kubejs_translator_clean.py

+            if pending_en and final_root_p.exists():
+                # 從 final/zh_tw.json 建立 final_tw_lookup（key → 原文）
+                final_tw_lookup: dict[str, str] = {}
+                for tw_file in final_root_p.rglob("zh_tw.json"):
+                    tw_data = read_json_dict_fn(tw_file)
+                    if tw_data:
+                        final_tw_lookup.update(tw_data)
+
+                if final_tw_lookup:
+                    # 建立 reverse_index（英文文字 → 對應 key 列表）
+                    reverse_index: dict[str, list[str]] = {}
+                    for k, v in final_tw_lookup.items():
+                        if is_filled_text_impl(v):
+                            reverse_index.setdefault(v, []).append(k)
+
+                    # 過濾 pending_en：跳過那些「英文文字已存在於 final」的 key
+                    pending_en = {
+                        k: v
+                        for k, v in pending_en.items()
+                        if not (
+                            is_filled_text_impl(v)
+                            and v in reverse_index
+                            and k != reverse_index[v][0]
+                        )
+                    }


final_tw_lookup rebuilt on every group iteration

final_tw_lookup (and the derived reverse_index) are constructed by iterating over all zh_tw.json files under final_root_p inside the for group_dir, files_map in groups.items() loop. For a modpack with N groups, this means all final files are read N times from disk — an O(N×files) I/O cost that should be O(files).

Both structures are purely derived from final_root_p and do not depend on the current group, so they should be computed once before the loop begins, similar to how tw_lookup is built at lines 123–137.

# Build once, before the loop final_tw_lookup: dict[str, str] = {} if final_root_p.exists(): for tw_file in final_root_p.rglob("zh_tw.json"): tw_data = read_json_dict_fn(tw_file) if tw_data: final_tw_lookup.update(tw_data) reverse_index: dict[str, list[str]] = {} for k, v in final_tw_lookup.items(): if is_filled_text_impl(v): reverse_index.setdefault(v, []).append(k) # Then inside the loop, use `reverse_index` directly

greptile-apps · 2026-03-24T12:46:18Z

translation_tool/core/kubejs_translator_clean.py

+                    pending_en = {
+                        k: v
+                        for k, v in pending_en.items()
+                        if not (
+                            is_filled_text_impl(v)
+                            and v in reverse_index
+                            and k != reverse_index[v][0]
+                        )
+                    }


Non-deterministic dedup via reverse_index[v][0]

The condition k != reverse_index[v][0] picks the first key that was inserted for value v — but final_root_p.rglob() returns files in filesystem-traversal order, which is not guaranteed to be stable across OS/filesystem types. If the same value maps to multiple keys (e.g., ["mod1:foo", "mod2:foo"]), the "representative" key at index [0] may differ between runs, making the dedup non-deterministic.

Additionally, when the current k does appear somewhere in reverse_index[v] (just not at position 0), the condition k != reverse_index[v][0] evaluates to True and incorrectly drops k from pending_en.

The intent appears to be: "keep the entry if the exact same key already exists in the final directory (meaning it doesn't need re-translation)." That is better expressed as membership in the full list:

pending_en = { k: v for k, v in pending_en.items() if not ( is_filled_text_impl(v) and v in reverse_index and k not in reverse_index[v] # check full list, not just [0] ) }

greptile-apps · 2026-03-24T12:46:19Z

translation_tool/core/kubejs_translator_clean.py

+            # ── 雙軌去重（reverse_index dedup）───────────────────────────────
+            # 目的：若某英文文字（value）已出現在 final/zh_tw.json（不同 key），
+            #       表示該英文原文已有翻譯，不需要再送 pending。
+            # 建立 reverse_index：{英文文字: [key1, key2, ...]}
+            if pending_en and final_root_p.exists():
+                # 從 final/zh_tw.json 建立 final_tw_lookup（key → 原文）
+                final_tw_lookup: dict[str, str] = {}
+                for tw_file in final_root_p.rglob("zh_tw.json"):
+                    tw_data = read_json_dict_fn(tw_file)
+                    if tw_data:
+                        final_tw_lookup.update(tw_data)
+
+                if final_tw_lookup:
+                    # 建立 reverse_index（英文文字 → 對應 key 列表）
+                    reverse_index: dict[str, list[str]] = {}
+                    for k, v in final_tw_lookup.items():
+                        if is_filled_text_impl(v):
+                            reverse_index.setdefault(v, []).append(k)


reverse_index values are zh_tw (Chinese) text, not English originals

The block comment states:

建立 reverse_index：{英文文字: [key1, key2, ...]}

However, final_tw_lookup is loaded exclusively from zh_tw.json files in the completed directory — meaning its values are Chinese Traditional translations, not English source strings. The reverse_index therefore maps zh_tw_value → [keys].

When the dedup check v in reverse_index runs, v is an English value from pending_en. A match only occurs if zh_tw.json happens to store an untranslated English passthrough value for some key. Entries with real Chinese translations will never match — making the dedup a no-op for the vast majority of already-translated content, contrary to what the description ("英文文字已存在於 final") implies.

If the actual intent is to compare English originals, a separate en_us.json lookup from the final directory is needed, or the function signature should be extended to carry the original English source for each already-translated key.

greptile-apps · 2026-03-24T12:46:20Z

translation_tool/core/kubejs_translator_clean.py

+                    reverse_index: dict[str, list[str]] = {}
+                    for k, v in final_tw_lookup.items():
+                        if is_filled_text_impl(v):
+                            reverse_index.setdefault(v, []).append(k)


PR description says defaultdict(list) but code uses setdefault

The PR description states "使用 defaultdict(list) 建立 reverse mapping", but the actual implementation uses dict.setdefault(v, []).append(k) (a plain dict). While functionally equivalent, the discrepancy means the PR description is inaccurate. If defaultdict was intended (which is the more idiomatic Python approach for this pattern), the import and construction should be:

from collections import defaultdict reverse_index: defaultdict[str, list[str]] = defaultdict(list) for k, v in final_tw_lookup.items(): if is_filled_text_impl(v): reverse_index[v].append(k)

_{Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!}

greptile-apps bot reviewed Mar 24, 2026

View reviewed changes

jlin53882 force-pushed the main branch from c9836ca to e87226d Compare March 27, 2026 16:01

jlin53882 force-pushed the pr/dual-track-dedup branch from f15ad45 to 8624cbc Compare March 27, 2026 16:01

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(kubejs): 實作雙軌 reverse_index 去重#40

feat(kubejs): 實作雙軌 reverse_index 去重#40
jlin53882 wants to merge 1 commit intomainfrom
pr/dual-track-dedup

jlin53882 commented Mar 23, 2026

Uh oh!

chatgpt-codex-connector bot commented Mar 23, 2026

Uh oh!

greptile-apps bot commented Mar 24, 2026

Uh oh!

greptile-apps bot Mar 24, 2026

Uh oh!

greptile-apps bot Mar 24, 2026

Uh oh!

greptile-apps bot Mar 24, 2026

Uh oh!

greptile-apps bot Mar 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

jlin53882 commented Mar 23, 2026

實作摘要

改了什麼

為什麼要改

怎麼驗證的

檔案變更

Uh oh!

chatgpt-codex-connector bot commented Mar 23, 2026

Uh oh!

greptile-apps bot commented Mar 24, 2026

Greptile Summary

Confidence Score: 2/5

Important Files Changed

Flowchart

Uh oh!

greptile-apps bot Mar 24, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps bot Mar 24, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps bot Mar 24, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps bot Mar 24, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant