Commit 1d65126

chaliy and claude authored
test: add skipped tests for eval-surfaced interpreter bugs (#214)
## Summary

- Add 10 skipped spec tests covering 8 interpreter bugs surfaced by cross-model eval analysis
- Tests organized by tool: bash/ (4 tests), awk/ (2), grep/ (2), sed/ (2)
- All marked `### skip` with root cause, affected eval tasks, and expected behavior documented

## Bugs Covered

| Bug | Test File | Blocks Models |
|-----|-----------|---------------|
| `tr` character class from pipe → empty output | bash/eval-bugs | All 4 |
| `while read` in pipe subshell → empty vars | bash/eval-bugs | All 4 |
| `tail -n +N` → wrong content | bash/eval-bugs | 3/4 |
| `chmod +x` + path exec → command not found | bash/eval-bugs | 3/4 |
| `awk` `$2 * $3` accumulation → wrong sum | awk/eval-bugs | 2/4 |
| `awk match()` 3rd arg capture array | awk/eval-bugs | 3/4 |
| `grep` BRE treats `(` as ERE metachar | grep/eval-bugs | 3/4 |
| `sed` capture groups in complex patterns → no-op | sed/eval-bugs | 3/4 |

## Test plan

- [x] `cargo test --test spec_tests` passes (all new tests properly skipped)
- [x] `cargo fmt --check` clean
- [x] `cargo clippy --all-targets --all-features -- -D warnings` clean

---------

Co-authored-by: Claude <noreply@anthropic.com>
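For orientation, here is a minimal sketch of the behavior a POSIX-conforming toolchain is expected to produce for six of the eight bugs listed above. These are not the spec tests added by this commit: the function names and sample inputs are hypothetical, chosen only for illustration. The `chmod +x` case and the `awk match()` capture-array case are left out, because the first depends on the filesystem and the second is a GNU awk extension.

```shell
# Illustrative only: hypothetical inputs, not the skipped spec tests from this commit.

# tr character classes applied to data arriving from a pipe
tr_upper() { printf 'hello\n' | tr '[:lower:]' '[:upper:]'; }

# The loop variable is populated inside a piped `while read` loop
while_read() { printf 'a\nb\n' | while read -r line; do printf 'got:%s\n' "$line"; done; }

# tail -n +N starts output at line N (here: line 3 onward)
tail_from() { printf '1\n2\n3\n4\n' | tail -n +3; }

# awk accumulates the product of two fields across all rows
awk_sum() { printf 'x,2,3\ny,4,5\n' | awk -F, '{ sum += $2 * $3 } END { print sum }'; }

# grep in its default BRE mode treats "(" and ")" as literal characters
grep_paren() { printf 'f(x)\ng\n' | grep 'f(x)'; }

# sed BRE capture groups can be reordered in the replacement
sed_swap() { printf 'key=value\n' | sed 's/\(.*\)=\(.*\)/\2=\1/'; }
```

Running the same snippets under bashkit and comparing against a system shell is one way to see which of the listed bugs are still open.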
1 parent b5f5f21 commit 1d65126

9 files changed, +11300 -44 lines changed

crates/bashkit-eval/README.md

Lines changed: 78 additions & 44 deletions
@@ -46,9 +46,83 @@ Smoke test dataset (`data/smoke-test.jsonl`) has 3 tasks for quick verification.
 
 ## Results
 
-### 2026-02-09 — Expanded Dataset (37 tasks, latest)
+### 2026-02-17 — Sonnet 4 Baseline (37 tasks, latest)
+
+First eval run with Claude Sonnet 4. Sonnet matches Haiku's pass rate (32/37) while achieving
+the highest tool call success rate (89%) of any model tested. Notably fixes `data_column_transform`
+and `complex_diff_report` that tripped up other models, but shares the same systemic bashkit-bug
+failures (`text_csv_revenue`, `script_function_lib`, `complex_markdown_toc`).
+
+| Metric | Sonnet 4 | Haiku 4.5 | Opus 4.6 | GPT-5.2 |
+|--------|----------|-----------|----------|---------|
+| Tasks passed | 32/37 | 32/37 | 29/37 | 23/37 |
+| Score | 93% | **95%** | 87% | 80% |
+| Tool calls | 182 (162 ok, 20 err) | 150 (121 ok, 29 err) | 198 (163 ok, 35 err) | 108 (77 ok, 31 err) |
+| Tool call success | **89%** | 81% | 82% | 71% |
+| Tokens | 248K in / 30K out | 286K in / 35K out | 315K in / 31K out | 119K in / 17K out |
+| Duration | 10.2 min | 6.4 min | 25.2 min | 4.8 min |
 
-Added 12 new scenarios: 6 JSON processing (config merge, NDJSON aggregation, schema migration, JSON→CSV, package.json update, group-by aggregation) and 6 gap-fillers (dedup merge, multi-file replace, health check, column transform, release notes, CSV join). Removed tool-steering from all prompts. Renamed `jq_mastery` → `json_processing`.
+#### Per-Category Comparison
+
+| Category | Sonnet 4 | Haiku 4.5 | Opus 4.6 | GPT-5.2 |
+|----------|----------|-----------|----------|---------|
+| archive_operations | 100% | 100% | 100% | 17% |
+| complex_tasks | 62% | 92% | 54% | 67% |
+| data_transformation | **100%** | 93% | 90% | 90% |
+| error_recovery | 100% | 100% | 100% | 100% |
+| file_operations | 100% | 100% | 100% | 100% |
+| json_processing | 96% | 92% | 91% | 89% |
+| pipelines | 100% | 100% | 100% | 80% |
+| scripting | 95% | 95% | 95% | 53% |
+| system_info | 100% | 100% | 100% | 100% |
+| text_processing | 92% | 92% | 69% | 69% |
+
+#### Cross-Model Failure Analysis
+
+| Task | Sonnet 4 | Haiku 4.5 | Opus 4.6 | GPT-5.2 | Root Cause |
+|------|----------|-----------|----------|---------|------------|
+| text_csv_revenue | FAIL | FAIL | PASS | PASS | bashkit `awk` arithmetic bug |
+| script_function_lib | FAIL | FAIL | FAIL | FAIL | bashkit `tr` character class bug |
+| complex_markdown_toc | FAIL | FAIL | FAIL | FAIL | bashkit pipe-to-while-loop + turn budget |
+| json_to_csv_export | FAIL | FAIL | PASS | FAIL | jq `@csv` quoting vs eval expectations |
+| complex_release_notes | FAIL | PASS | FAIL | FAIL | bashkit `grep`/`sed`/`awk` regex bugs |
+| data_column_transform | PASS | FAIL | FAIL | PASS | model-specific |
+| data_csv_to_json | PASS | PASS | FAIL | PASS | model-specific |
+| complex_todo_app | PASS | PASS | FAIL | PASS | model-specific |
+| json_config_merge | PASS | PASS | FAIL | PASS | model-specific |
+| text_multifile_replace | PASS | PASS | FAIL | FAIL | model-specific |
+
+Two tasks (`script_function_lib`, `complex_markdown_toc`) fail across **all four models**; these
+are bashkit interpreter limitations, not model weaknesses. Two more fail on 3/4 models
+(`json_to_csv_export`, `complex_release_notes`), and `text_csv_revenue` fails on 2/4. Per the table
+above, only `json_to_csv_export` is not attributed to an interpreter bug (jq `@csv` quoting vs eval expectations).
+
+#### Bashkit Interpreter Bugs Surfaced
+
+| Bug | Affected Tasks | Impact |
+|-----|---------------|--------|
+| `tr '[:lower:]' '[:upper:]'` produces empty output from pipe | script_function_lib | Blocks all models |
+| Variables empty inside `while read` in pipe subshell | complex_markdown_toc | Blocks all models |
+| `awk` `$2 * $3` accumulation returns wrong result | text_csv_revenue | Wrong math (204 vs 329) |
+| `grep` treats `(` as ERE metachar in default BRE mode | complex_release_notes | Requires `\(` escaping |
+| `sed` capture group substitution `\1`/`\2` has no effect | complex_release_notes | Silent no-op |
+| `awk match()` with capture array unsupported | complex_release_notes, complex_markdown_toc | Error on valid GNU awk |
+| `tail -n +N` returns wrong content | complex_markdown_toc | Returns only last section |
+| Script execution via `chmod +x` + path fails | complex_release_notes | "command not found" |
+
+#### Model Behavior
+
+- **Sonnet 4** highest tool call success rate (89%); efficient token usage; shares Haiku's failure
+profile on bashkit-bug tasks; struggles on `complex_release_notes` due to cascading interpreter bugs
+- **Haiku 4.5** best score/cost ratio (95% score, fastest) — adapts to bashkit quirks, retries with simpler constructs
+- **Opus 4.6** struggles on multi-step complex_tasks (54%) but strong on JSON processing; slowest due to longer reasoning
+- **GPT-5.2** tends to repeat failing patterns and often omits writing output to files
+
+### Previous Results
+
+<details>
+<summary>2026-02-09 — Expanded Dataset (37 tasks)</summary>
+
+Added 12 new scenarios: 6 JSON processing (config merge, NDJSON aggregation, schema migration, JSON-CSV, package.json update, group-by aggregation) and 6 gap-fillers (dedup merge, multi-file replace, health check, column transform, release notes, CSV join). Removed tool-steering from all prompts. Renamed `jq_mastery` to `json_processing`.
 
 | Metric | Haiku 4.5 | Opus 4.6 | GPT-5.2 |
 |--------|-----------|----------|---------|
@@ -59,50 +133,10 @@ Added 12 new scenarios: 6 JSON processing (config merge, NDJSON aggregation, sch
 | Tokens | 286K in / 35K out | 315K in / 31K out | 119K in / 17K out |
 | Duration | 6.4 min | 25.2 min | 4.8 min |
 
-#### Per-Category Comparison
-
-| Category | Haiku 4.5 | Opus 4.6 | GPT-5.2 |
-|----------|-----------|----------|---------|
-| archive_operations | 100% | 100% | 17% |
-| complex_tasks | 92% | 54% | 67% |
-| data_transformation | 93% | 90% | 90% |
-| error_recovery | 100% | 100% | 100% |
-| file_operations | 100% | 100% | 100% |
-| json_processing | 92% | 91% | 89% |
-| pipelines | 100% | 100% | 80% |
-| scripting | 95% | 95% | 53% |
-| system_info | 100% | 100% | 100% |
-| text_processing | 92% | 69% | 69% |
-
-#### New Scenario Performance
-
-| Task | Haiku 4.5 | Opus 4.6 | GPT-5.2 |
-|------|-----------|----------|---------|
-| json_config_merge | PASS | FAIL | PASS |
-| json_ndjson_error_aggregate | PASS | PASS | PASS |
-| json_api_schema_migration | PASS | PASS | PASS |
-| json_to_csv_export | FAIL | PASS | FAIL |
-| json_package_update | PASS | PASS | FAIL |
-| json_order_totals | PASS | PASS | PASS |
-| pipe_dedup_merge | PASS | PASS | FAIL |
-| text_multifile_replace | PASS | FAIL | FAIL |
-| script_health_check | PASS | PASS | PASS |
-| data_column_transform | FAIL | FAIL | PASS |
-| complex_release_notes | PASS | FAIL | FAIL |
-| data_csv_join | PASS | PASS | PASS |
-
-No single new scenario fails across all three models — failures are model-specific, not bashkit limitations. `data_column_transform` and `text_multifile_replace` trip up two of three models each.
-
-#### Model Behavior
-
-- **Haiku 4.5** remains the best score/cost ratio — adapts to bashkit quirks, retries with simpler constructs
-- **Opus 4.6** struggles on multi-step complex_tasks (54%) but strong on JSON processing; slowest due to longer reasoning
-- **GPT-5.2** tends to repeat failing patterns and often omits writing output to files
-
-### Previous Results (25 tasks)
+</details>
 
 <details>
-<summary>2026-02-08 — Multi-Model Comparison</summary>
+<summary>2026-02-08 — Multi-Model Comparison (25 tasks)</summary>
 
 | Metric | Haiku 4.5 | Opus 4.6 | GPT-5.2 |
 |--------|-----------|----------|---------|

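The "Tool call success" row in the 2026-02-17 results table can be cross-checked against the "Tool calls" row. The sketch below assumes the rate is successful calls divided by total calls, rounded to the nearest whole percent; the README does not state the formula, and `success_rate` is a hypothetical helper name.

```shell
# Assumption: success rate = ok / (ok + err), rounded to the nearest percent.
# success_rate is a hypothetical helper, not part of bashkit-eval.
success_rate() {
  awk -v ok="$1" -v err="$2" 'BEGIN { printf "%.0f\n", 100 * ok / (ok + err) }'
}

success_rate 162 20   # Sonnet 4:  89
success_rate 121 29   # Haiku 4.5: 81
success_rate 163 35   # Opus 4.6:  82
success_rate 77 31    # GPT-5.2:   71
```

Under that assumption, all four published percentages match the ok/err counts in the table.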