@@ -46,9 +46,83 @@ Smoke test dataset (`data/smoke-test.jsonl`) has 3 tasks for quick verification.
46 46
47 47 ## Results
48 48
49- ### 2026-02-09 — Expanded Dataset (37 tasks, latest)
49+ ### 2026-02-17 — Sonnet 4 Baseline (37 tasks, latest)
50+
51+ First eval run with Claude Sonnet 4. Sonnet 4 matches Haiku 4.5's pass rate (32/37) while achieving
52+ the highest tool call success rate (89%) of any model tested. It passes `data_column_transform`
53+ and `complex_diff_report`, which tripped up other models, but shares Haiku's bashkit-bug
54+ failures (`text_csv_revenue`, `script_function_lib`, `complex_markdown_toc`).
55+
56+ | Metric | Sonnet 4 | Haiku 4.5 | Opus 4.6 | GPT-5.2 |
57+ | --------| ----------| -----------| ----------| ---------|
58+ | Tasks passed | 32/37 | 32/37 | 29/37 | 23/37 |
59+ | Score | 93% | **95%** | 87% | 80% |
60+ | Tool calls | 182 (162 ok, 20 err) | 150 (121 ok, 29 err) | 198 (163 ok, 35 err) | 108 (77 ok, 31 err) |
61+ | Tool call success | **89%** | 81% | 82% | 71% |
62+ | Tokens | 248K in / 30K out | 286K in / 35K out | 315K in / 31K out | 119K in / 17K out |
63+ | Duration | 10.2 min | 6.4 min | 25.2 min | 4.8 min |
50 64
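
The "Tool call success" row can be rechecked from the ok/total counts in the "Tool calls" row. The sketch below is an illustration only: `success_rate` is a hypothetical helper, not part of the eval harness, and it assumes a standard shell with a POSIX `awk` rather than bashkit itself.

```shell
# Recompute "Tool call success" from the ok/total counts in the metrics table.
# success_rate is a hypothetical helper written for this example.
success_rate() {
  # $1 = successful tool calls, $2 = total tool calls; prints a rounded percentage
  awk -v ok="$1" -v total="$2" 'BEGIN { printf "%.0f%%\n", 100 * ok / total }'
}

success_rate 162 182   # Sonnet 4  -> 89%
success_rate 121 150   # Haiku 4.5 -> 81%
success_rate 163 198   # Opus 4.6  -> 82%
success_rate 77 108    # GPT-5.2   -> 71%
```

The four results agree with the percentages reported in the table.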
51- Added 12 new scenarios: 6 JSON processing (config merge, NDJSON aggregation, schema migration, JSON→CSV, package.json update, group-by aggregation) and 6 gap-fillers (dedup merge, multi-file replace, health check, column transform, release notes, CSV join). Removed tool-steering from all prompts. Renamed ` jq_mastery ` → ` json_processing ` .
65+ #### Per-Category Comparison
66+
67+ | Category | Sonnet 4 | Haiku 4.5 | Opus 4.6 | GPT-5.2 |
68+ | ----------| ----------| -----------| ----------| ---------|
69+ | archive_operations | 100% | 100% | 100% | 17% |
70+ | complex_tasks | 62% | 92% | 54% | 67% |
71+ | data_transformation | **100%** | 93% | 90% | 90% |
72+ | error_recovery | 100% | 100% | 100% | 100% |
73+ | file_operations | 100% | 100% | 100% | 100% |
74+ | json_processing | 96% | 92% | 91% | 89% |
75+ | pipelines | 100% | 100% | 100% | 80% |
76+ | scripting | 95% | 95% | 95% | 53% |
77+ | system_info | 100% | 100% | 100% | 100% |
78+ | text_processing | 92% | 92% | 69% | 69% |
79+
80+ #### Cross-Model Failure Analysis
81+
82+ | Task | Sonnet 4 | Haiku 4.5 | Opus 4.6 | GPT-5.2 | Root Cause |
83+ | ------| ----------| -----------| ----------| ---------| ------------|
84+ | text_csv_revenue | FAIL | FAIL | PASS | PASS | bashkit ` awk ` arithmetic bug |
85+ | script_function_lib | FAIL | FAIL | FAIL | FAIL | bashkit ` tr ` character class bug |
86+ | complex_markdown_toc | FAIL | FAIL | FAIL | FAIL | bashkit pipe-to-while-loop + turn budget |
87+ | json_to_csv_export | FAIL | FAIL | PASS | FAIL | jq ` @csv ` quoting vs eval expectations |
88+ | complex_release_notes | FAIL | PASS | FAIL | FAIL | bashkit ` grep ` /` sed ` /` awk ` regex bugs |
89+ | data_column_transform | PASS | FAIL | FAIL | PASS | model-specific |
90+ | data_csv_to_json | PASS | PASS | FAIL | PASS | model-specific |
91+ | complex_todo_app | PASS | PASS | FAIL | PASS | model-specific |
92+ | json_config_merge | PASS | PASS | FAIL | PASS | model-specific |
93+ | text_multifile_replace | PASS | PASS | FAIL | FAIL | model-specific |
94+
95+ Two tasks (`script_function_lib`, `complex_markdown_toc`) fail across **all four models**; these
96+ are bashkit interpreter limitations, not model weaknesses. Two more fail on 3/4 models
97+ (`json_to_csv_export`, `complex_release_notes`) and `text_csv_revenue` fails on 2/4. Per the Root Cause column, these trace to bashkit bugs or, for `json_to_csv_export`, to a mismatch between jq `@csv` quoting and the eval's expectations.
98+
99+ #### Bashkit Interpreter Bugs Surfaced
100+
101+ | Bug | Affected Tasks | Impact |
102+ | -----| ---------------| --------|
103+ | ` tr '[:lower:]' '[:upper:]' ` produces empty output from pipe | script_function_lib | Blocks all models |
104+ | Variables empty inside ` while read ` in pipe subshell | complex_markdown_toc | Blocks all models |
105+ | ` awk ` ` $2 * $3 ` accumulation returns wrong result | text_csv_revenue | Wrong math (204 vs 329) |
106+ | ` grep ` treats ` ( ` as ERE metachar in default BRE mode | complex_release_notes | Requires ` \( ` escaping |
107+ | ` sed ` capture group substitution ` \1 ` /` \2 ` has no effect | complex_release_notes | Silent no-op |
108+ | ` awk match() ` with capture array unsupported | complex_release_notes, complex_markdown_toc | Error on valid GNU awk |
109+ | ` tail -n +N ` returns wrong content | complex_markdown_toc | Returns only last section |
110+ | Script execution via ` chmod +x ` + path fails | complex_release_notes | "command not found" |
111+
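
For comparison, the sketch below shows what several of the constructs in the table produce under bash with POSIX `tr`, `awk`, `grep`, `sed`, and `tail`. The inputs are made-up illustrations, not the eval fixtures, and the snippet does not reproduce bashkit's behavior; it only records the reference output a fixed interpreter would be expected to match.

```shell
# Reference behavior in a standard shell. Inputs are hypothetical examples.

# tr character classes reading from a pipe
upper=$(printf 'hello\n' | tr '[:lower:]' '[:upper:]')          # HELLO

# Variables populated by `read` inside a piped while loop
lines=$(printf 'a\nb\n' | while read -r line; do printf '[%s]' "$line"; done)   # [a][b]

# awk accumulating the product of two fields: 2*3 + 4*5
total=$(printf 'x,2,3\ny,4,5\n' | awk -F, '{ sum += $2 * $3 } END { print sum }')   # 26

# In basic regular expressions (grep's default), an unescaped ( is a literal
paren=$(printf 'feat(api): add\nfix: typo\n' | grep -c '(api)')  # 1

# sed back-references \1 and \2 swap the two captured groups
swapped=$(printf 'key=value\n' | sed 's/\(.*\)=\(.*\)/\2=\1/')   # value=key

# tail -n +N prints from line N through the end of input
rest=$(printf '1\n2\n3\n4\n' | tail -n +3 | paste -sd, -)        # 3,4

echo "$upper $lines $total $paren $swapped $rest"
```

Running the same commands under bashkit and diffing against these values is one way to turn each table row into a regression check.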
112+ #### Model Behavior
113+
114+ - **Sonnet 4** has the highest tool call success rate (89%) and uses fewer tokens than Haiku or Opus (248K in / 30K out); it shares Haiku's failure
115+ profile on bashkit-bug tasks and struggles on `complex_release_notes` due to cascading interpreter bugs
116+ - **Haiku 4.5** has the best score/cost ratio (95% score, fastest Claude model at 6.4 min); it adapts to bashkit quirks and retries with simpler constructs
117+ - **Opus 4.6** struggles on multi-step complex_tasks (54%) but is strong on JSON processing; slowest (25.2 min) due to longer reasoning
118+ - **GPT-5.2** tends to repeat failing patterns and often omits writing output to files
119+
120+ ### Previous Results
121+
122+ <details >
123+ <summary >2026-02-09 — Expanded Dataset (37 tasks)</summary >
124+
125+ Added 12 new scenarios: 6 JSON processing (config merge, NDJSON aggregation, schema migration, JSON-CSV, package.json update, group-by aggregation) and 6 gap-fillers (dedup merge, multi-file replace, health check, column transform, release notes, CSV join). Removed tool-steering from all prompts. Renamed ` jq_mastery ` to ` json_processing ` .
52 126
53 127 | Metric | Haiku 4.5 | Opus 4.6 | GPT-5.2 |
54 128 | --------| -----------| ----------| ---------|
@@ -59,50 +133,10 @@ Added 12 new scenarios: 6 JSON processing (config merge, NDJSON aggregation, sch
59 133 | Tokens | 286K in / 35K out | 315K in / 31K out | 119K in / 17K out |
60 134 | Duration | 6.4 min | 25.2 min | 4.8 min |
61 135
62- #### Per-Category Comparison
63-
64- | Category | Haiku 4.5 | Opus 4.6 | GPT-5.2 |
65- | ----------| -----------| ----------| ---------|
66- | archive_operations | 100% | 100% | 17% |
67- | complex_tasks | 92% | 54% | 67% |
68- | data_transformation | 93% | 90% | 90% |
69- | error_recovery | 100% | 100% | 100% |
70- | file_operations | 100% | 100% | 100% |
71- | json_processing | 92% | 91% | 89% |
72- | pipelines | 100% | 100% | 80% |
73- | scripting | 95% | 95% | 53% |
74- | system_info | 100% | 100% | 100% |
75- | text_processing | 92% | 69% | 69% |
76-
77- #### New Scenario Performance
78-
79- | Task | Haiku 4.5 | Opus 4.6 | GPT-5.2 |
80- | ------| -----------| ----------| ---------|
81- | json_config_merge | PASS | FAIL | PASS |
82- | json_ndjson_error_aggregate | PASS | PASS | PASS |
83- | json_api_schema_migration | PASS | PASS | PASS |
84- | json_to_csv_export | FAIL | PASS | FAIL |
85- | json_package_update | PASS | PASS | FAIL |
86- | json_order_totals | PASS | PASS | PASS |
87- | pipe_dedup_merge | PASS | PASS | FAIL |
88- | text_multifile_replace | PASS | FAIL | FAIL |
89- | script_health_check | PASS | PASS | PASS |
90- | data_column_transform | FAIL | FAIL | PASS |
91- | complex_release_notes | PASS | FAIL | FAIL |
92- | data_csv_join | PASS | PASS | PASS |
93-
94- No single new scenario fails across all three models — failures are model-specific, not bashkit limitations. ` data_column_transform ` and ` text_multifile_replace ` trip up two of three models each.
95-
96- #### Model Behavior
97-
98- - ** Haiku 4.5** remains the best score/cost ratio — adapts to bashkit quirks, retries with simpler constructs
99- - ** Opus 4.6** struggles on multi-step complex_tasks (54%) but strong on JSON processing; slowest due to longer reasoning
100- - ** GPT-5.2** tends to repeat failing patterns and often omits writing output to files
101-
102- ### Previous Results (25 tasks)
136+ </details >
103 137
104 138 <details >
105- <summary >2026-02-08 — Multi-Model Comparison</summary >
139+ <summary >2026-02-08 — Multi-Model Comparison (25 tasks) </summary >
106 140
107 141 | Metric | Haiku 4.5 | Opus 4.6 | GPT-5.2 |
108 142 | --------| -----------| ----------| ---------|