feat(checkers): 新增顏色字元校驗工具 by jlin53882 · Pull Request #41 · jlin53882/Minecraft-translate

jlin53882 · 2026-03-23T06:27:21Z

實作摘要

改了什麼

新增 color_char_checker.py 模組
提供 check_color_chars() 函式，檢查翻譯檔案中的非法顏色字元
定義 ColorCharError dataclass 記錄錯誤详情

為什麼要改

Minecraft 顏色代碼格式嚴謹，& 後只能接特定字元
提前發現非法顏色字元可避免翻譯輸出錯誤

怎麼驗證的

單元測試驗證合法/非法顏色字元判斷正確
測試涵蓋：標準彩色碼、十六進位顏色、逸出字元等

檔案變更

ranslation_tool/checkers/color_char_checker.py (+115 行)

- 在 clean_kubejs_from_raw_impl 中，pending_en 寫入前新增 reverse_index dedup - 若某英文文字（value）已出現在 final/zh_tw.json（不同 key），則跳過不寫入 pending - 避免同一英文原文因不同 key 而重複翻譯

- 新增 ColorCharError dataclass，記錄非法顏色字元錯誤 - 實作 COLOR_PATTERN 正則：& 後只能接 a-v（不含 w）、0-9、空格、\、# - check_color_chars()：檢查單一字串中的非法顏色字元 - check_json_file()：讀取 JSON 並遞迴檢查所有字串值 - check_directory()：遞迴檢查目錄下所有 .json 檔 - 遵循現有 checkers Generator yield 模式

…t color code)

greptile-apps · 2026-03-24T12:47:03Z

Greptile Summary

This PR introduces a new color_char_checker.py module to detect illegal Minecraft colour-code characters in translation files, and adds a reverse-index deduplication block inside kubejs_translator_clean.py to skip pending entries whose English source text already appears in the final output directory.

Key issues found:

COLOR_PATTERN whitelist too broad (color_char_checker.py line 16): the character class [^a-v0-9\s\\#] treats the entire range a–v as legal, so invalid codes like &g, &h, &j, &p, &q, &s, &t, &u, &v are silently accepted. The valid Minecraft set is much narrower (0-9, a-f, k-o, r, optionally x).
Performance regression (kubejs_translator_clean.py lines 210–234): final_tw_lookup and reverse_index are fully rebuilt on every iteration of the groups loop by re-scanning all zh_tw.json files under final_root_p. This should be computed once before the loop.
Cross-namespace key comparison (kubejs_translator_clean.py lines 225–234): the dedup filter uses k != reverse_index[v][0], where k is a key from the raw/pending source files and reverse_index[v][0] is a key from the final directory. Because these come from different file namespaces they rarely match, meaning any pending entry whose English value also appears anywhere in final_tw_lookup is unconditionally dropped — likely more aggressive than intended.
Minor style: check_color_chars returns None instead of [] on success (inconsistent return type), and nested-dict traversal discards the parent-key path from error messages.

Confidence Score: 2/5

This PR has two logic bugs that could cause silent data loss and a performance regression; not safe to merge without fixes.
The regex whitelist being too broad defeats the main purpose of the new checker module. The deduplication logic in kubejs_translator_clean.py can silently drop valid pending-translation entries due to a cross-namespace key comparison, which is a correctness bug. Additionally, rebuilding the full reverse-index on every group is a performance problem that will worsen as the translation repository grows.
Both changed files need attention: translation_tool/core/kubejs_translator_clean.py for the deduplication correctness and performance issues, and translation_tool/checkers/color_char_checker.py for the overly permissive regex.

Important Files Changed

Filename	Overview
translation_tool/checkers/color_char_checker.py	New checker module with a regex whitelist (`a-v`) that is significantly broader than the valid Minecraft color-code set, causing the checker to silently miss many invalid codes (e.g. `&g`, `&p`, `&v`). Minor issues: inconsistent `list
translation_tool/core/kubejs_translator_clean.py	New deduplication block rebuilds `final_tw_lookup` and `reverse_index` on every group iteration (O(N×files) scan). The dedup filter condition also compares keys from different namespaces (raw vs. final), which can silently over-drop pending entries.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A([check_directory]) -->|os.walk .json files| B[check_json_file]
    B -->|json.load| C{root is dict?}
    C -- Yes --> D[_check_value per key]
    C -- No --> E([silently skipped])
    D -->|str| F[check_color_chars\nregex: &amp;&#91;^a-v0-9\\s\\\\#&#93;]
    F --> G{match found?}
    G -- Yes --> H[yield ColorCharError\nfile_path='', key='']
    G -- No --> I([return None])
    H --> J[_check_value patches\nfile_path & key]
    D -->|dict| K[recurse with child key\nparent key lost]
    D -->|list| L[recurse with key&#91;i&#93;\nparent path preserved]

    subgraph kubejs_translator_clean [kubejs_translator_clean — reverse_index dedup]
        M([for each group]) --> N{pending_en\n& final_root exists?}
        N -- Yes --> O[rglob zh_tw.json\n⚠️ repeated per group]
        O --> P[build final_tw_lookup]
        P --> Q[build reverse_index\nvalue → keys]
        Q --> R[filter pending_en:\ndrop if v in reverse_index\n& k != reverse_index&#91;v&#93;&#91;0&#93;\n⚠️ cross-namespace key compare]
        R --> S([write en_us.json])
    end

_{Reviews (1): Last reviewed commit: "style: apply ruff format" | Re-trigger Greptile}

greptile-apps · 2026-03-24T12:47:07Z

translation_tool/core/kubejs_translator_clean.py

+            if pending_en and final_root_p.exists():
+                # 從 final/zh_tw.json 建立 final_tw_lookup（key → 原文）
+                final_tw_lookup: dict[str, str] = {}
+                for tw_file in final_root_p.rglob("zh_tw.json"):
+                    tw_data = read_json_dict_fn(tw_file)
+                    if tw_data:
+                        final_tw_lookup.update(tw_data)
+
+                if final_tw_lookup:
+                    # 建立 reverse_index（英文文字 → 對應 key 列表）
+                    reverse_index: dict[str, list[str]] = {}
+                    for k, v in final_tw_lookup.items():
+                        if is_filled_text_impl(v):
+                            reverse_index.setdefault(v, []).append(k)
+
+                    # 過濾 pending_en：跳過那些「英文文字已存在於 final」的 key
+                    pending_en = {
+                        k: v
+                        for k, v in pending_en.items()
+                        if not (
+                            is_filled_text_impl(v)
+                            and v in reverse_index
+                            and k != reverse_index[v][0]
+                        )
+                    }


final_tw_lookup rebuilt on every group iteration

The entire final_root_p.rglob("zh_tw.json") scan and reverse_index construction happen inside the for group_dir, files_map in groups.items() loop (line 185). For N groups this becomes O(N × number-of-zh_tw-files) — all zh_tw.json files are read and merged from scratch on every group, even though the final directory does not change between iterations.

Move both final_tw_lookup and reverse_index outside the loop so they are built once:

# Build once before the loop final_tw_lookup: dict[str, str] = {} if final_root_p.exists(): for tw_file in final_root_p.rglob("zh_tw.json"): tw_data = read_json_dict_fn(tw_file) if tw_data: final_tw_lookup.update(tw_data) reverse_index: dict[str, list[str]] = {} for k, v in final_tw_lookup.items(): if is_filled_text_impl(v): reverse_index.setdefault(v, []).append(k) for group_dir, files_map in groups.items(): ... if pending_en and reverse_index: pending_en = { k: v for k, v in pending_en.items() if not ( is_filled_text_impl(v) and v in reverse_index and k != reverse_index[v][0] ) }

greptile-apps · 2026-03-24T12:47:08Z

translation_tool/core/kubejs_translator_clean.py

+                    # 過濾 pending_en：跳過那些「英文文字已存在於 final」的 key
+                    pending_en = {
+                        k: v
+                        for k, v in pending_en.items()
+                        if not (
+                            is_filled_text_impl(v)
+                            and v in reverse_index
+                            and k != reverse_index[v][0]
+                        )
+                    }


Cross-namespace key comparison may silently drop entries

reverse_index is keyed on values from final_tw_lookup (zh_tw translations), and its lists contain keys from the final directory (e.g. "item.mymod.foo"). The dedup filter checks k != reverse_index[v][0], where k is a key from the raw/pending source file — a completely different namespace.

If pending_en has {"item.othermod.bar": "Hello World"} and reverse_index["Hello World"] == ["item.mymod.foo"], the condition becomes "item.othermod.bar" != "item.mymod.foo" → True, so the entry is filtered out. The intended behaviour — keeping the entry when the same key already exists in the final output — is therefore never triggered for keys that don't coincidentally match a final-directory key name.

In practice this means any pending entry whose English value also appears anywhere in final_tw_lookup (regardless of key) will be dropped, which is likely more aggressive deduplication than intended. You may want to check for key equality within the same key namespace, or reconsider the dedup condition entirely.

greptile-apps · 2026-03-24T12:47:09Z

translation_tool/checkers/color_char_checker.py

+# 核心檢查：& 後只能接 a-v（不含 w）、0-9、空格、\、#
+# 合法字元：a-v, 0-9, whitespace, backslash, hash
+# 違法：& 後面接了 a-v 與 0-9、空格、\、# 以外的任何字元
+COLOR_PATTERN = re.compile(r"&([^a-v0-9\s\\#])")


COLOR_PATTERN whitelist is too broad — misses many invalid codes

The character class [^a-v0-9\s\\#] treats every letter a–v as a legal format code. The actual valid Minecraft format characters are:

Category Characters

Color 0–9, a–f

Formatting k, l, m, n, o, r

Hex prefix (Paper/Spigot) x

Characters such as g, h, i, j, p, q, s, t, u, v are all inside the a-v range and will not be flagged as illegal — even though &g, &p, &v, etc. have no defined meaning in vanilla Minecraft and will render incorrectly.

If the intent is strict validation:

# Legal: 0-9, a-f (colour), k-o + r (format), optional x (hex), whitespace, \, # COLOR_PATTERN = re.compile(r"&([^0-9a-fk-orx\s\\#])", re.IGNORECASE)

If some non-standard mod codes (e.g. g–j) must be kept, the allowed set should be explicitly documented.

greptile-apps · 2026-03-24T12:47:10Z

translation_tool/checkers/color_char_checker.py

+def check_color_chars(value: str) -> list[ColorCharError] | None:
+    """檢查單一字串中的非法顏色字元。
+
+    Args:
+        value: 要檢查的字串。
+
+    Returns:
+        錯誤列表（若無錯誤則回傳 None）。
+    """
+    errors: list[ColorCharError] = []
+    for match in COLOR_PATTERN.finditer(value):
+        illegal_char = match.group(1)
+        pos = match.start()
+        errors.append(
+            ColorCharError(
+                file_path="",
+                key="",
+                value=value,
+                illegal_char=illegal_char,
+                position=pos,
+                message=f"在位置 {pos} 發現非法顏色字元 '&{illegal_char}'，"
+                f"& 後只能接 a-v（不含 w）、0-9、空格、\\、#。",
+            )
+        )
+    return errors if errors else None


Inconsistent return type (list | None) instead of list

check_color_chars returns None on success and a non-empty list on failure. Returning an empty list on success is more Pythonic and avoids callers needing a None guard — it also makes the declared return type truthful (right now the annotation says list[ColorCharError] | None but an empty list is never actually returned).

Suggested change

def check_color_chars(value: str) -> list[ColorCharError] | None:

"""檢查單一字串中的非法顏色字元。

Args:

value: 要檢查的字串。

Returns:

錯誤列表（若無錯誤則回傳 None）。

"""

errors: list[ColorCharError] = []

for match in COLOR_PATTERN.finditer(value):

illegal_char = match.group(1)

pos = match.start()

errors.append(

ColorCharError(

file_path="",

key="",

value=value,

illegal_char=illegal_char,

position=pos,

message=f"在位置 {pos} 發現非法顏色字元 '&{illegal_char}'，"

f"& 後只能接 a-v（不含 w）、0-9、空格、\\、#。",

)

)

return errors if errors else None

def check_color_chars(value: str) -> list[ColorCharError]:

"""檢查單一字串中的非法顏色字元。

Args:

value: 要檢查的字串。

Returns:

錯誤列表（若無錯誤則回傳空列表）。

"""

errors: list[ColorCharError] = []

for match in COLOR_PATTERN.finditer(value):

illegal_char = match.group(1)

pos = match.start()

errors.append(

ColorCharError(

file_path="",

key="",

value=value,

illegal_char=illegal_char,

position=pos,

message=f"在位置 {pos} 發現非法顏色字元 '&{illegal_char}'，"

f"& 後只能接 a-v（不含 w）、0-9、空格、\\、#。",

)

)

return errors

The caller _check_value can then simply do if errors := check_color_chars(value):.

greptile-apps · 2026-03-24T12:47:11Z

translation_tool/checkers/color_char_checker.py

+    elif isinstance(value, dict):
+        for k, v in value.items():
+            yield from _check_value(file_path, k, v)


Nested-dict path context is lost

When value is a dict, the function recurses with only the child key k, discarding the parent entirely. For a nested structure like {"parent": {"child": "bad&Wvalue"}}, the resulting ColorCharError will report the leaf key alone (child) instead of the full dotted path (parent.child). By contrast, the list branch correctly preserves context with f"{key}[{i}]".

Consider building a dotted path for dict keys to keep errors self-describing:

Suggested change

elif isinstance(value, dict):

for k, v in value.items():

yield from _check_value(file_path, k, v)

elif isinstance(value, dict):

for k, v in value.items():

nested_key = f"{key}.{k}" if key else k

yield from _check_value(file_path, nested_key, v)

jlin53882 added 4 commits March 23, 2026 14:10

fix(checkers): remove z from COLOR_PATTERN (z is not a valid minecraf…

7f54d0c

…t color code)

style: apply ruff format

8da65a6

greptile-apps bot reviewed Mar 24, 2026

View reviewed changes

jlin53882 force-pushed the main branch from c9836ca to e87226d Compare March 27, 2026 16:01

jlin53882 force-pushed the pr/color-checker branch from 7806250 to 8da65a6 Compare March 27, 2026 16:01

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(checkers): 新增顏色字元校驗工具#41

feat(checkers): 新增顏色字元校驗工具#41
jlin53882 wants to merge 4 commits intomainfrom
pr/color-checker

jlin53882 commented Mar 23, 2026

Uh oh!

greptile-apps bot commented Mar 24, 2026

Uh oh!

greptile-apps bot Mar 24, 2026

Uh oh!

greptile-apps bot Mar 24, 2026

Uh oh!

greptile-apps bot Mar 24, 2026

Uh oh!

greptile-apps bot Mar 24, 2026

Uh oh!

greptile-apps bot Mar 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Category	Characters
Color	`0`–`9`, `a`–`f`
Formatting	`k`, `l`, `m`, `n`, `o`, `r`
Hex prefix (Paper/Spigot)	`x`

Conversation

jlin53882 commented Mar 23, 2026

實作摘要

改了什麼

為什麼要改

怎麼驗證的

檔案變更

Uh oh!

greptile-apps bot commented Mar 24, 2026

Greptile Summary

Confidence Score: 2/5

Important Files Changed

Flowchart

Uh oh!

greptile-apps bot Mar 24, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps bot Mar 24, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps bot Mar 24, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps bot Mar 24, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps bot Mar 24, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant