Commit 8afde72

chore(security): comprehensive Unicode byte-boundary threat analysis (TM-UNI)
Add TM-UNI threat category (TM-UNI-001 through TM-UNI-019). 68 security tests, full codebase audit, issues #434-#438 filed.
1 parent 6556e8c commit 8afde72

4 files changed: +1464 -2 lines changed

crates/bashkit/docs/threat-model.md

Lines changed: 61 additions & 0 deletions
@@ -390,11 +390,71 @@ let bash = Bash::builder()
// Commits use virtual identity, never host ~/.gitconfig
```

### Unicode Security (TM-UNI-*)

Unicode input from untrusted scripts creates attack surface across the parser, builtins,
and virtual filesystem. AI agents frequently generate multi-byte Unicode (box-drawing,
emoji, CJK) that exercises these code paths.

**Byte-Boundary Safety (TM-UNI-001/002/015/016/017):**

Multiple builtins mix byte offsets with character indices, causing panics on multi-byte
input. All are caught by `catch_unwind` (TM-INT-001) preventing process crash, but the
builtin silently fails.

| Threat | Attack Example | Mitigation | Status |
|--------|---------------|------------|--------|
| Awk byte-boundary panic (TM-UNI-001) | Multi-byte chars in awk input | `catch_unwind` catches panic | PARTIAL |
| Sed byte-boundary panic (TM-UNI-002) | Box-drawing chars in sed pattern | `catch_unwind` catches panic | PARTIAL |
| Expr substr panic (TM-UNI-015) | `expr substr "café" 4 1` | `catch_unwind` catches panic | PARTIAL |
| Printf precision panic (TM-UNI-016) | `printf "%.1s" "é"` | `catch_unwind` catches panic | PARTIAL |
| Cut/tr byte-level parsing (TM-UNI-017) | `tr 'é' 'e'` (multi-byte in char set) | `catch_unwind` catches; silent data loss | PARTIAL |

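The byte/character mix-up behind these entries can be illustrated with a minimal sketch. This is not bashkit code: `substr_chars` and `slice_bytes_checked` are hypothetical helpers, and the 1-based position convention is assumed from `expr substr`.

```rust
/// Hypothetical helper (not bashkit's implementation): take `len` characters
/// starting at 1-based character position `pos`, in the style of `expr substr`.
fn substr_chars(s: &str, pos: usize, len: usize) -> String {
    if pos == 0 {
        return String::new();
    }
    s.chars().skip(pos - 1).take(len).collect()
}

/// Byte-range slicing that returns `None` instead of panicking when an
/// offset is out of range or falls inside a multi-byte character.
fn slice_bytes_checked(s: &str, start: usize, end: usize) -> Option<&str> {
    s.get(start..end)
}

fn main() {
    // "café" with a precomposed 'é': 4 characters, 5 bytes.
    let s = "caf\u{e9}";
    assert_eq!(s.chars().count(), 4);
    assert_eq!(s.len(), 5);

    // Using the character position as a byte offset asks for bytes 3..4,
    // which ends in the middle of 'é'. `&s[3..4]` would panic here.
    assert_eq!(slice_bytes_checked(s, 3, 4), None);

    // Indexing by characters returns the fourth character as expected.
    assert_eq!(substr_chars(s, 4, 1), "\u{e9}");
}
```

Using `str::get` or a `chars()`-based walk turns the panic into an ordinary `None` or a correct result, so the builtin can report an error instead of failing silently.
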
**Additional Byte/Char Confusion:**

| Threat | Attack Example | Mitigation | Status |
|--------|---------------|------------|--------|
| Interpreter arithmetic (TM-UNI-018) | Multi-byte before `=` in arithmetic | Wrong operator detection; no panic | PARTIAL |
| Network allowlist (TM-UNI-019) | Multi-byte in allowlist URL path | Wrong path boundary check | PARTIAL |
| Zero-width in filenames (TM-UNI-003) | Invisible chars create confusable names | Path validation (planned) | UNMITIGATED |
| Homoglyph confusion (TM-UNI-006) | Cyrillic 'а' vs Latin 'a' in filenames | Accepted risk | ACCEPTED |
| Normalization bypass (TM-UNI-008) | NFC vs NFD create distinct files | Matches Linux FS behavior | ACCEPTED |
| Bidi in script source (TM-UNI-014) | RTL overrides hide malicious code | Scripts untrusted by design | ACCEPTED |

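To make the homoglyph and normalization rows concrete, the sketch below shows that names which render identically can be different strings. It uses only the standard library; `code_points` is a hypothetical inspection helper, and the filenames are illustrative.

```rust
/// List the Unicode scalar values of a string, for inspection.
fn code_points(s: &str) -> Vec<u32> {
    s.chars().map(|c| c as u32).collect()
}

fn main() {
    // Homoglyphs (TM-UNI-006): Latin 'a' (U+0061) vs Cyrillic 'а' (U+0430).
    let latin = "a.txt";
    let cyrillic = "\u{430}.txt";
    assert_ne!(latin, cyrillic);

    // Normalization (TM-UNI-008): precomposed 'é' (NFC) vs 'e' followed by
    // a combining acute accent (NFD). Byte-wise comparison sees two names.
    let nfc = "caf\u{e9}";
    let nfd = "cafe\u{301}";
    assert_ne!(nfc, nfd);
    assert_eq!(code_points(nfc), vec![0x63, 0x61, 0x66, 0xe9]);
    assert_eq!(code_points(nfd), vec![0x63, 0x61, 0x66, 0x65, 0x301]);
    assert_eq!(nfc.len(), 5);
    assert_eq!(nfd.len(), 6);
}
```
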
**Safe Components (confirmed by full codebase audit):**
- Lexer: `Chars` iterator with `ch.len_utf8()` tracking
- wc: Correct `.len()` vs `.chars().count()` usage
- grep/jq: Delegate to Unicode-aware regex/jaq crates
- sort/uniq: String comparison, no byte indexing
- logging: Uses `is_char_boundary()` correctly
- python: Shebang strip via `find('\n')`; ASCII delimiter, safe
- Python bindings (bashkit-python): PyO3 `String` extraction, no manual byte/char ops
- eval harness: `chars().take()`, `from_utf8_lossy()`, all safe patterns
- curl/bc/export/date/comm/echo/archive/base64: All `.find()` use ASCII delimiters only
- scripted_tool: No byte/char patterns

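Two of the safe patterns named in this list can be sketched as follows. These helpers are hypothetical illustrations of `is_char_boundary()` and `len_utf8()` usage, not excerpts from the bashkit lexer or logging code.

```rust
/// Truncate `s` to at most `max_bytes` bytes without splitting a character
/// (hypothetical helper illustrating the `is_char_boundary()` pattern).
fn truncate_at_boundary(s: &str, max_bytes: usize) -> &str {
    if s.len() <= max_bytes {
        return s;
    }
    let mut end = max_bytes;
    // Index 0 is always a character boundary, so this loop terminates.
    while !s.is_char_boundary(end) {
        end -= 1;
    }
    &s[..end]
}

/// Byte offset of the first occurrence of `target`, tracked by adding
/// `len_utf8()` for each character consumed (the lexer-style pattern).
fn byte_offset_of(s: &str, target: char) -> Option<usize> {
    let mut offset = 0;
    for ch in s.chars() {
        if ch == target {
            return Some(offset);
        }
        offset += ch.len_utf8();
    }
    None
}

fn main() {
    // 'é' occupies bytes 1..3, so a 2-byte limit backs off to 1 byte.
    assert_eq!(truncate_at_boundary("h\u{e9}llo", 2), "h");
    // The box-drawing character U+2500 is 3 bytes long.
    assert_eq!(byte_offset_of("\u{2500}x", 'x'), Some(3));
}
```

Both helpers keep every offset on a character boundary, which is why slicing with those offsets cannot panic.
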
**Path Validation:**

Filenames are validated by `find_unsafe_path_char()` which rejects:
- ASCII control characters (U+0000-U+001F, U+007F)
- C1 control characters (U+0080-U+009F)
- Bidi override characters (U+202A-U+202E, U+2066-U+2069)

Normal Unicode (accented, CJK, emoji) is allowed in filenames and script content.

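A check over the ranges listed above might look like the sketch below. The name `find_unsafe_char_sketch` and its signature are assumptions made for illustration; the actual `find_unsafe_path_char()` in bashkit may differ in signature and behavior.

```rust
/// Return the first character of `name` that falls in one of the rejected
/// ranges, or `None` if the name is acceptable. Hypothetical sketch only.
fn find_unsafe_char_sketch(name: &str) -> Option<char> {
    name.chars().find(|&c| {
        matches!(
            c,
            '\u{0000}'..='\u{001F}' | '\u{007F}'   // ASCII control characters
            | '\u{0080}'..='\u{009F}'              // C1 control characters
            | '\u{202A}'..='\u{202E}'              // bidi embeddings and overrides
            | '\u{2066}'..='\u{2069}'              // bidi isolates
        )
    })
}

fn main() {
    // Accented, CJK, and emoji names pass.
    assert_eq!(find_unsafe_char_sketch("r\u{e9}sum\u{e9}.txt"), None);
    assert_eq!(find_unsafe_char_sketch("\u{65E5}\u{672C}\u{8A9E}.md"), None);
    assert_eq!(find_unsafe_char_sketch("\u{1F600}.png"), None);

    // A right-to-left override is reported.
    assert_eq!(
        find_unsafe_char_sketch("evil\u{202E}txt.exe"),
        Some('\u{202E}')
    );

    // Zero-width space (U+200B) is outside the listed ranges, so it passes,
    // consistent with TM-UNI-003 being marked UNMITIGATED.
    assert_eq!(find_unsafe_char_sketch("a\u{200B}b"), None);
}
```
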
**Caller Responsibility:**
- Strip zero-width/invisible characters from filenames before displaying to users
- Apply confusable-character detection (UTS #39) if showing filenames to humans
- Strip bidi overrides from script source before displaying to code reviewers
- Be aware that expr/printf/cut/tr may fail on non-ASCII input until fixes land
- Use ASCII in network allowlist URL patterns until byte/char fix lands

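For the display-time stripping described in this list, a caller-side filter could be sketched as below. The function names are hypothetical and the character set is an illustrative assumption, not an exhaustive list of invisible or bidi characters; confusable detection per UTS #39 needs a dedicated library and is not shown.

```rust
/// True for the zero-width and bidi control characters this sketch removes.
/// The set is an assumption for illustration and is not exhaustive.
fn is_invisible_or_bidi(c: char) -> bool {
    matches!(
        c,
        '\u{200B}'..='\u{200D}'     // zero-width space, non-joiner, joiner
        | '\u{2060}'                // word joiner
        | '\u{FEFF}'                // zero-width no-break space (BOM)
        | '\u{202A}'..='\u{202E}'   // bidi embeddings and overrides
        | '\u{2066}'..='\u{2069}'   // bidi isolates
    )
}

/// Hypothetical display-time filter: drop invisible and bidi control
/// characters before showing a filename or script line to a human.
fn strip_for_display(s: &str) -> String {
    s.chars().filter(|&c| !is_invisible_or_bidi(c)).collect()
}

fn main() {
    assert_eq!(strip_for_display("pay\u{200B}load.sh"), "payload.sh");
    assert_eq!(strip_for_display("a\u{202E}b"), "ab");
    // Ordinary non-ASCII text is left untouched.
    assert_eq!(strip_for_display("caf\u{e9}"), "caf\u{e9}");
}
```

Removing the zero-width joiner also breaks up joined emoji sequences, so a filter like this is suited to display and review, not to rewriting stored names.
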
## Security Testing

Bashkit includes comprehensive security tests:

- **Threat Model Tests**: [`tests/threat_model_tests.rs`][threat_tests] - 117 tests
- **Unicode Security Tests**: `tests/unicode_security_tests.rs` - TM-UNI-* tests
- **Nesting Depth Tests**: 18 tests covering positive, negative, misconfiguration,
  and regression scenarios for parser depth attacks
- **Fail-Point Tests**: [`tests/security_failpoint_tests.rs`][failpoint_tests] - 14 tests
@@ -425,6 +485,7 @@ All threats use stable IDs in the format `TM-<CATEGORY>-<NUMBER>`:
| TM-LOG | Logging Security |
| TM-GIT | Git Security |
| TM-PY | Python/Monty Security |
| TM-UNI | Unicode Security |

Full threat analysis: [`specs/006-threat-model.md`][spec]
