@@ -390,11 +390,71 @@ let bash = Bash::builder()
390390// Commits use virtual identity, never host ~/.gitconfig
391391```
392392
393+ ### Unicode Security (TM-UNI-*)
394+
395+ Unicode input from untrusted scripts creates attack surface across the parser, builtins,
396+ and virtual filesystem. AI agents frequently generate multi-byte Unicode (box-drawing,
397+ emoji, CJK) that exercises these code paths.
398+
399+ **Byte-Boundary Safety (TM-UNI-001/002/015/016/017):**
400+
401+ Several builtins mix byte offsets with character indices and panic on multi-byte
402+ input. Each panic is caught by `catch_unwind` (TM-INT-001), which prevents a process
403+ crash, but the affected builtin fails silently.
404+
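To make the failure mode concrete, here is a minimal sketch of a byte/char mix-up guarded by `std::panic::catch_unwind`. The names `naive_substr` and `run_guarded` are hypothetical and do not come from the Bashkit codebase; they only illustrate why the process survives while the builtin produces no output.

```rust
use std::panic;

/// Hypothetical builtin helper that wrongly treats a character index
/// as a byte offset. Slicing panics if either end of the range falls
/// inside a multi-byte character.
fn naive_substr(s: &str, start: usize, len: usize) -> String {
    s[start..start + len].to_string()
}

/// Hypothetical guard: runs the helper behind `catch_unwind` so a panic
/// becomes an error value instead of taking down the process.
fn run_guarded(s: &str, start: usize, len: usize) -> Result<String, String> {
    let input = s.to_owned();
    panic::catch_unwind(move || naive_substr(&input, start, len))
        .map_err(|_| "builtin panicked on non-ASCII input".to_string())
}
```

With ASCII input, `run_guarded("cafe", 3, 1)` returns the expected character. With `"café"`, the byte range `3..4` ends in the middle of the two-byte `é`, so the slice panics and the caller only sees an error. This assumes the default `panic = "unwind"` setting; `catch_unwind` cannot catch panics that abort.
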
405+ | Threat | Attack Example | Mitigation | Status |
406+ | --------| ---------------| ------------| --------|
407+ | Awk byte-boundary panic (TM-UNI-001) | Multi-byte chars in awk input | `catch_unwind` catches panic | PARTIAL |
408+ | Sed byte-boundary panic (TM-UNI-002) | Box-drawing chars in sed pattern | `catch_unwind` catches panic | PARTIAL |
409+ | Expr substr panic (TM-UNI-015) | `expr substr "café" 4 1` | `catch_unwind` catches panic | PARTIAL |
410+ | Printf precision panic (TM-UNI-016) | `printf "%.1s" "é"` | `catch_unwind` catches panic | PARTIAL |
411+ | Cut/tr byte-level parsing (TM-UNI-017) | `tr 'é' 'e'` (multi-byte in char set) | `catch_unwind` catches; silent data loss | PARTIAL |
412+
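One possible direction for the pending fixes is to index by characters rather than bytes. The sketch below is an assumption about what such helpers could look like, not the planned implementation; `substr_chars` and `truncate_chars` are hypothetical names. Whether `printf` precision should count bytes or characters is itself a design decision, and this sketch counts characters.

```rust
/// Char-aware helper in the style of `expr substr STRING POS LENGTH`.
/// `pos` is 1-based and counted in characters, not bytes.
fn substr_chars(s: &str, pos: usize, len: usize) -> String {
    if pos == 0 {
        return String::new();
    }
    s.chars().skip(pos - 1).take(len).collect()
}

/// Char-aware truncation in the style of `printf "%.Ns"`.
/// Returns a prefix holding at most `max_chars` characters.
fn truncate_chars(s: &str, max_chars: usize) -> &str {
    // `char_indices` yields the byte offset where each character starts,
    // so the slice end is always a valid character boundary.
    match s.char_indices().nth(max_chars) {
        Some((byte_idx, _)) => &s[..byte_idx],
        None => s,
    }
}
```

Both helpers stay on character boundaries by construction, so the inputs from the table above (`expr substr "café" 4 1` and `printf "%.1s" "é"`) would yield `é` instead of panicking.
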
413+ **Additional Byte/Char Confusion:**
414+
415+ | Threat | Attack Example | Mitigation | Status |
416+ | --------| ---------------| ------------| --------|
417+ | Interpreter arithmetic (TM-UNI-018) | Multi-byte before `=` in arithmetic | Wrong operator detection; no panic | PARTIAL |
418+ | Network allowlist (TM-UNI-019) | Multi-byte in allowlist URL path | Wrong path boundary check | PARTIAL |
419+ | Zero-width in filenames (TM-UNI-003) | Invisible chars create confusable names | Path validation (planned) | UNMITIGATED |
420+ | Homoglyph confusion (TM-UNI-006) | Cyrillic 'а' vs Latin 'a' in filenames | Accepted risk | ACCEPTED |
421+ | Normalization bypass (TM-UNI-008) | NFC vs NFD create distinct files | Matches Linux FS behavior | ACCEPTED |
422+ | Bidi in script source (TM-UNI-014) | RTL overrides hide malicious code | Scripts untrusted by design | ACCEPTED |
423+
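The homoglyph and normalization rows both follow from byte-exact name comparison. As a rough illustration, the sketch below uses a `HashSet` as a stand-in for a filesystem that compares names byte for byte; `distinct_names` is a hypothetical helper and says nothing about how the virtual filesystem is actually implemented.

```rust
use std::collections::HashSet;

/// Counts distinct names the way a byte-exact filesystem would:
/// no Unicode normalization and no confusable folding.
fn distinct_names(names: &[&str]) -> usize {
    names.iter().copied().collect::<HashSet<&str>>().len()
}
```

Under this model, `"caf\u{e9}.txt"` (NFC) and `"cafe\u{301}.txt"` (NFD) count as two files even though they render identically, and so do `"p\u{430}y.sh"` (Cyrillic а) and `"pay.sh"` (Latin a).
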
424+ **Safe Components (confirmed by full codebase audit):**
425+ - Lexer: `Chars` iterator with `ch.len_utf8()` tracking
426+ - wc: Correct `.len()` vs `.chars().count()` usage
427+ - grep/jq: Delegate to Unicode-aware regex/jaq crates
428+ - sort/uniq: String comparison, no byte indexing
429+ - logging: Uses `is_char_boundary()` correctly
430+ - python: Shebang strip via `find('\n')` (ASCII delimiter, safe)
431+ - Python bindings (bashkit-python): PyO3 `String` extraction, no manual byte/char ops
432+ - eval harness: `chars().take()`, `from_utf8_lossy()` (all safe patterns)
433+ - curl/bc/export/date/comm/echo/archive/base64: All `.find()` calls use ASCII delimiters only
434+ - scripted_tool: No byte/char patterns
435+
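The `is_char_boundary()` pattern mentioned for logging can be sketched as follows. This is a generic illustration of the technique, assuming a byte-length limit on a message; `truncate_at_boundary` is a hypothetical name, not a function from the codebase.

```rust
/// Truncates `s` to at most `max_bytes` bytes without splitting a character.
fn truncate_at_boundary(s: &str, max_bytes: usize) -> &str {
    if max_bytes >= s.len() {
        return s;
    }
    // Walk back to the nearest boundary. Offset 0 is always a boundary,
    // so the loop terminates.
    let mut end = max_bytes;
    while !s.is_char_boundary(end) {
        end -= 1;
    }
    &s[..end]
}
```

For example, truncating `"café"` (5 bytes) to 4 bytes yields `"caf"` rather than panicking on the split `é`.
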
436+ **Path Validation:**
437+ 
438+ Filenames are validated by `find_unsafe_path_char()`, which rejects:
439+ - ASCII control characters (U+0000-U+001F, U+007F)
440+ - C1 control characters (U+0080-U+009F)
441+ - Bidi override characters (U+202A-U+202E, U+2066-U+2069)
442+
443+ Normal Unicode (accented, CJK, emoji) is allowed in filenames and script content.
444+
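The rejection rules above can be sketched as a single character scan. The function name matches the one in the text, but the signature and return type here are assumptions made for illustration; the real function may differ.

```rust
/// Illustrative stand-in for the validator described above.
/// Returns the first rejected character, or `None` if the path is acceptable.
/// The `Option<char>` return type is an assumption, not the confirmed API.
fn find_unsafe_path_char(path: &str) -> Option<char> {
    path.chars().find(|&c| {
        matches!(
            c,
            '\u{0000}'..='\u{001F}'      // ASCII control characters
                | '\u{007F}'             // DEL
                | '\u{0080}'..='\u{009F}' // C1 control characters
                | '\u{202A}'..='\u{202E}' // bidi embeddings and overrides
                | '\u{2066}'..='\u{2069}' // bidi isolates
        )
    })
}
```

Accented, CJK, and emoji names pass this check. Zero-width characters such as U+200B also pass, which is consistent with TM-UNI-003 being listed as unmitigated.
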
445+ **Caller Responsibility:**
446+ - Strip zero-width/invisible characters from filenames before displaying to users
447+ - Apply confusable-character detection (UTS #39) if showing filenames to humans
448+ - Strip bidi overrides from script source before displaying to code reviewers
449+ - Be aware that expr/printf/cut/tr may fail on non-ASCII input until fixes land
450+ - Use ASCII in network allowlist URL patterns until byte/char fix lands
451+
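A caller-side display filter for the first and third items could look like the sketch below. Both function names are hypothetical, and the character set is an assumption covering common zero-width and bidi control characters; it is not exhaustive. Confusable detection per UTS #39 needs the Unicode confusables data and is out of scope here.

```rust
/// Returns true for common invisible or bidi control characters.
/// The set is illustrative and not exhaustive.
fn is_invisible_or_bidi(c: char) -> bool {
    matches!(
        c,
        '\u{200B}'..='\u{200F}'      // ZWSP, ZWNJ, ZWJ, LRM, RLM
            | '\u{202A}'..='\u{202E}' // bidi embeddings and overrides
            | '\u{2060}'             // word joiner
            | '\u{2066}'..='\u{2069}' // bidi isolates
            | '\u{FEFF}'             // zero-width no-break space (BOM)
    )
}

/// Removes invisible and bidi control characters before showing a
/// filename or script source to a human.
fn sanitize_for_display(s: &str) -> String {
    s.chars().filter(|&c| !is_invisible_or_bidi(c)).collect()
}
```

Apply this only to the displayed copy, never to the stored name or script. Stripping U+200D also breaks emoji joined with zero-width joiners, so a caller that must render such emoji faithfully may want a narrower set.
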
393452## Security Testing
394453
395454Bashkit includes comprehensive security tests:
396455
397456- **Threat Model Tests**: [`tests/threat_model_tests.rs`][threat_tests] - 117 tests
457+ - **Unicode Security Tests**: `tests/unicode_security_tests.rs` - TM-UNI-* tests
398458- **Nesting Depth Tests**: 18 tests covering positive, negative, misconfiguration,
399459 and regression scenarios for parser depth attacks
400460- **Fail-Point Tests**: [`tests/security_failpoint_tests.rs`][failpoint_tests] - 14 tests
@@ -425,6 +485,7 @@ All threats use stable IDs in the format `TM-<CATEGORY>-<NUMBER>`:
425485| TM-LOG | Logging Security |
426486| TM-GIT | Git Security |
427487| TM-PY | Python/Monty Security |
488+ | TM-UNI | Unicode Security |
428489
429490Full threat analysis: [`specs/006-threat-model.md`][spec]
430491