Skip to content

feat(checkers): 新增顏色字元校驗工具#41

Open
jlin53882 wants to merge 4 commits intomainfrom
pr/color-checker
Open

feat(checkers): 新增顏色字元校驗工具#41
jlin53882 wants to merge 4 commits intomainfrom
pr/color-checker

Conversation

@jlin53882
Copy link
Copy Markdown
Owner

實作摘要

改了什麼

  • 新增 color_char_checker.py 模組
  • 提供 check_color_chars() 函式,檢查翻譯檔案中的非法顏色字元
  • 定義 ColorCharError dataclass 記錄錯誤详情

為什麼要改

  • Minecraft 顏色代碼格式嚴謹,& 後只能接特定字元
  • 提前發現非法顏色字元可避免翻譯輸出錯誤

怎麼驗證的

  • 單元測試驗證合法/非法顏色字元判斷正確
  • 測試涵蓋:標準彩色碼、十六進位顏色、逸出字元等

檔案變更

  • ranslation_tool/checkers/color_char_checker.py (+115 行)

- 在 clean_kubejs_from_raw_impl 中,pending_en 寫入前新增 reverse_index dedup
- 若某英文文字(value)已出現在 final/zh_tw.json(不同 key),則跳過不寫入 pending
- 避免同一英文原文因不同 key 而重複翻譯
- 新增 ColorCharError dataclass,記錄非法顏色字元錯誤
- 實作 COLOR_PATTERN 正則:& 後只能接 a-v(不含 w)、0-9、空格、\、#
- check_color_chars():檢查單一字串中的非法顏色字元
- check_json_file():讀取 JSON 並遞迴檢查所有字串值
- check_directory():遞迴檢查目錄下所有 .json 檔
- 遵循現有 checkers Generator yield 模式
@greptile-apps
Copy link
Copy Markdown

greptile-apps bot commented Mar 24, 2026

Greptile Summary

This PR introduces a new color_char_checker.py module to detect illegal Minecraft colour-code characters in translation files, and adds a reverse-index deduplication block inside kubejs_translator_clean.py to skip pending entries whose English source text already appears in the final output directory.

Key issues found:

  • COLOR_PATTERN whitelist too broad (color_char_checker.py line 16): the character class [^a-v0-9\s\\#] treats the entire range av as legal, so invalid codes like &g, &h, &j, &p, &q, &s, &t, &u, &v are silently accepted. The valid Minecraft set is much narrower (0-9, a-f, k-o, r, optionally x).
  • Performance regression (kubejs_translator_clean.py lines 210–234): final_tw_lookup and reverse_index are fully rebuilt on every iteration of the groups loop by re-scanning all zh_tw.json files under final_root_p. This should be computed once before the loop.
  • Cross-namespace key comparison (kubejs_translator_clean.py lines 225–234): the dedup filter uses k != reverse_index[v][0], where k is a key from the raw/pending source files and reverse_index[v][0] is a key from the final directory. Because these come from different file namespaces they rarely match, meaning any pending entry whose English value also appears anywhere in final_tw_lookup is unconditionally dropped — likely more aggressive than intended.
  • Minor style: check_color_chars returns None instead of [] on success (inconsistent return type), and nested-dict traversal discards the parent-key path from error messages.

Confidence Score: 2/5

  • This PR has two logic bugs that could cause silent data loss and a performance regression; not safe to merge without fixes.
  • The regex whitelist being too broad defeats the main purpose of the new checker module. The deduplication logic in kubejs_translator_clean.py can silently drop valid pending-translation entries due to a cross-namespace key comparison, which is a correctness bug. Additionally, rebuilding the full reverse-index on every group is a performance problem that will worsen as the translation repository grows.
  • Both changed files need attention: translation_tool/core/kubejs_translator_clean.py for the deduplication correctness and performance issues, and translation_tool/checkers/color_char_checker.py for the overly permissive regex.

Important Files Changed

Filename Overview
translation_tool/checkers/color_char_checker.py New checker module with a regex whitelist (a-v) that is significantly broader than the valid Minecraft color-code set, causing the checker to silently miss many invalid codes (e.g. &g, &p, &v). Minor issues: inconsistent `list
translation_tool/core/kubejs_translator_clean.py New deduplication block rebuilds final_tw_lookup and reverse_index on every group iteration (O(N×files) scan). The dedup filter condition also compares keys from different namespaces (raw vs. final), which can silently over-drop pending entries.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A([check_directory]) -->|os.walk .json files| B[check_json_file]
    B -->|json.load| C{root is dict?}
    C -- Yes --> D[_check_value per key]
    C -- No --> E([silently skipped])
    D -->|str| F[check_color_chars\nregex: &[^a-v0-9\\s\\\\#]]
    F --> G{match found?}
    G -- Yes --> H[yield ColorCharError\nfile_path='', key='']
    G -- No --> I([return None])
    H --> J[_check_value patches\nfile_path & key]
    D -->|dict| K[recurse with child key\nparent key lost]
    D -->|list| L[recurse with key[i]\nparent path preserved]

    subgraph kubejs_translator_clean [kubejs_translator_clean — reverse_index dedup]
        M([for each group]) --> N{pending_en\n& final_root exists?}
        N -- Yes --> O[rglob zh_tw.json\n⚠️ repeated per group]
        O --> P[build final_tw_lookup]
        P --> Q[build reverse_index\nvalue → keys]
        Q --> R[filter pending_en:\ndrop if v in reverse_index\n& k != reverse_index[v][0]\n⚠️ cross-namespace key compare]
        R --> S([write en_us.json])
    end
Loading

Reviews (1): Last reviewed commit: "style: apply ruff format" | Re-trigger Greptile

Comment on lines +210 to +234
if pending_en and final_root_p.exists():
# 從 final/zh_tw.json 建立 final_tw_lookup(key → 原文)
final_tw_lookup: dict[str, str] = {}
for tw_file in final_root_p.rglob("zh_tw.json"):
tw_data = read_json_dict_fn(tw_file)
if tw_data:
final_tw_lookup.update(tw_data)

if final_tw_lookup:
# 建立 reverse_index(英文文字 → 對應 key 列表)
reverse_index: dict[str, list[str]] = {}
for k, v in final_tw_lookup.items():
if is_filled_text_impl(v):
reverse_index.setdefault(v, []).append(k)

# 過濾 pending_en:跳過那些「英文文字已存在於 final」的 key
pending_en = {
k: v
for k, v in pending_en.items()
if not (
is_filled_text_impl(v)
and v in reverse_index
and k != reverse_index[v][0]
)
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 final_tw_lookup rebuilt on every group iteration

The entire final_root_p.rglob("zh_tw.json") scan and reverse_index construction happen inside the for group_dir, files_map in groups.items() loop (line 185). For N groups this becomes O(N × number-of-zh_tw-files) — all zh_tw.json files are read and merged from scratch on every group, even though the final directory does not change between iterations.

Move both final_tw_lookup and reverse_index outside the loop so they are built once:

# Build once before the loop
final_tw_lookup: dict[str, str] = {}
if final_root_p.exists():
    for tw_file in final_root_p.rglob("zh_tw.json"):
        tw_data = read_json_dict_fn(tw_file)
        if tw_data:
            final_tw_lookup.update(tw_data)

reverse_index: dict[str, list[str]] = {}
for k, v in final_tw_lookup.items():
    if is_filled_text_impl(v):
        reverse_index.setdefault(v, []).append(k)

for group_dir, files_map in groups.items():
    ...
    if pending_en and reverse_index:
        pending_en = {
            k: v
            for k, v in pending_en.items()
            if not (
                is_filled_text_impl(v)
                and v in reverse_index
                and k != reverse_index[v][0]
            )
        }

Comment on lines +225 to +234
# 過濾 pending_en:跳過那些「英文文字已存在於 final」的 key
pending_en = {
k: v
for k, v in pending_en.items()
if not (
is_filled_text_impl(v)
and v in reverse_index
and k != reverse_index[v][0]
)
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Cross-namespace key comparison may silently drop entries

reverse_index is keyed on values from final_tw_lookup (zh_tw translations), and its lists contain keys from the final directory (e.g. "item.mymod.foo"). The dedup filter checks k != reverse_index[v][0], where k is a key from the raw/pending source file — a completely different namespace.

If pending_en has {"item.othermod.bar": "Hello World"} and reverse_index["Hello World"] == ["item.mymod.foo"], the condition becomes "item.othermod.bar" != "item.mymod.foo"True, so the entry is filtered out. The intended behaviour — keeping the entry when the same key already exists in the final output — is therefore never triggered for keys that don't coincidentally match a final-directory key name.

In practice this means any pending entry whose English value also appears anywhere in final_tw_lookup (regardless of key) will be dropped, which is likely more aggressive deduplication than intended. You may want to check for key equality within the same key namespace, or reconsider the dedup condition entirely.

# 核心檢查:& 後只能接 a-v(不含 w)、0-9、空格、\、#
# 合法字元:a-v, 0-9, whitespace, backslash, hash
# 違法:& 後面接了 a-v 與 0-9、空格、\、# 以外的任何字元
COLOR_PATTERN = re.compile(r"&([^a-v0-9\s\\#])")
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 COLOR_PATTERN whitelist is too broad — misses many invalid codes

The character class [^a-v0-9\s\\#] treats every letter av as a legal format code. The actual valid Minecraft format characters are:

Category Characters
Color 09, af
Formatting k, l, m, n, o, r
Hex prefix (Paper/Spigot) x

Characters such as g, h, i, j, p, q, s, t, u, v are all inside the a-v range and will not be flagged as illegal — even though &g, &p, &v, etc. have no defined meaning in vanilla Minecraft and will render incorrectly.

If the intent is strict validation:

# Legal: 0-9, a-f (colour), k-o + r (format), optional x (hex), whitespace, \, #
COLOR_PATTERN = re.compile(r"&([^0-9a-fk-orx\s\\#])", re.IGNORECASE)

If some non-standard mod codes (e.g. gj) must be kept, the allowed set should be explicitly documented.

Comment on lines +31 to +55
def check_color_chars(value: str) -> list[ColorCharError] | None:
"""檢查單一字串中的非法顏色字元。

Args:
value: 要檢查的字串。

Returns:
錯誤列表(若無錯誤則回傳 None)。
"""
errors: list[ColorCharError] = []
for match in COLOR_PATTERN.finditer(value):
illegal_char = match.group(1)
pos = match.start()
errors.append(
ColorCharError(
file_path="",
key="",
value=value,
illegal_char=illegal_char,
position=pos,
message=f"在位置 {pos} 發現非法顏色字元 '&{illegal_char}',"
f"& 後只能接 a-v(不含 w)、0-9、空格、\\、#。",
)
)
return errors if errors else None
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Inconsistent return type (list | None) instead of list

check_color_chars returns None on success and a non-empty list on failure. Returning an empty list on success is more Pythonic and avoids callers needing a None guard — it also makes the declared return type truthful (right now the annotation says list[ColorCharError] | None but an empty list is never actually returned).

Suggested change
def check_color_chars(value: str) -> list[ColorCharError] | None:
"""檢查單一字串中的非法顏色字元
Args:
value: 要檢查的字串
Returns:
錯誤列表若無錯誤則回傳 None)。
"""
errors: list[ColorCharError] = []
for match in COLOR_PATTERN.finditer(value):
illegal_char = match.group(1)
pos = match.start()
errors.append(
ColorCharError(
file_path="",
key="",
value=value,
illegal_char=illegal_char,
position=pos,
message=f"在位置 {pos} 發現非法顏色字元 '&{illegal_char}',"
f"& 後只能接 a-v(不含 w)、0-9、空格、\\、#。",
)
)
return errors if errors else None
def check_color_chars(value: str) -> list[ColorCharError]:
"""檢查單一字串中的非法顏色字元
Args:
value: 要檢查的字串
Returns:
錯誤列表若無錯誤則回傳空列表)。
"""
errors: list[ColorCharError] = []
for match in COLOR_PATTERN.finditer(value):
illegal_char = match.group(1)
pos = match.start()
errors.append(
ColorCharError(
file_path="",
key="",
value=value,
illegal_char=illegal_char,
position=pos,
message=f"在位置 {pos} 發現非法顏色字元 '&{illegal_char}',"
f"& 後只能接 a-v(不含 w)、0-9、空格、\\、#。",
)
)
return errors

The caller _check_value can then simply do if errors := check_color_chars(value):.

Comment on lines +77 to +79
elif isinstance(value, dict):
for k, v in value.items():
yield from _check_value(file_path, k, v)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Nested-dict path context is lost

When value is a dict, the function recurses with only the child key k, discarding the parent entirely. For a nested structure like {"parent": {"child": "bad&Wvalue"}}, the resulting ColorCharError will report the leaf key alone (child) instead of the full dotted path (parent.child). By contrast, the list branch correctly preserves context with f"{key}[{i}]".

Consider building a dotted path for dict keys to keep errors self-describing:

Suggested change
elif isinstance(value, dict):
for k, v in value.items():
yield from _check_value(file_path, k, v)
elif isinstance(value, dict):
for k, v in value.items():
nested_key = f"{key}.{k}" if key else k
yield from _check_value(file_path, nested_key, v)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant