Commit 7b43a40

chore: merge main, resolve eval-tasks.jsonl conflict
Keep both our new tasks (code_search, environment, etc.) and main's new tasks (database_operations, config_management, build_simulation). Dataset now has 58 tasks across 15 categories. https://claude.ai/code/session_0158eammVU6hRXeg9VJZuCJz
2 parents 7a302c5 + e350a99 commit 7b43a40

6 files changed: +139 -137 lines changed

crates/bashkit-eval/data/eval-tasks.jsonl

Lines changed: 6 additions & 0 deletions
@@ -50,3 +50,9 @@
5050
{"id":"complex_test_output","category":"complex_tasks","description":"Parse test results to extract failures and generate summary report","prompt":"Read /data/test-results.txt which contains test output in a standard format. Parse it to: 1) Count total tests, passed, and failed. 2) Extract the names of all failing tests. 3) Generate a summary report at /reports/test-summary.md with a header '# Test Summary', a line 'Total: N | Passed: N | Failed: N', and a '## Failures' section listing each failed test. Print the summary.","files":{"/data/test-results.txt":"PASS test_login_valid\nPASS test_login_invalid_password\nFAIL test_login_expired_token\nPASS test_signup_new_user\nFAIL test_signup_duplicate_email\nPASS test_logout\nPASS test_password_reset\nFAIL test_session_timeout\nPASS test_profile_update\nPASS test_profile_delete\nPASS test_api_rate_limit\nPASS test_api_auth_header\n"},"expectations":[{"check":"file_exists:/reports/test-summary.md"},{"check":"file_contains:/reports/test-summary.md:# Test Summary"},{"check":"file_contains:/reports/test-summary.md:Total: 12"},{"check":"file_contains:/reports/test-summary.md:Passed: 9"},{"check":"file_contains:/reports/test-summary.md:Failed: 3"},{"check":"file_contains:/reports/test-summary.md:test_login_expired_token"},{"check":"file_contains:/reports/test-summary.md:test_signup_duplicate_email"},{"check":"file_contains:/reports/test-summary.md:test_session_timeout"},{"check":"stdout_contains:Failed: 3"},{"check":"exit_code:0"}]}
5151
{"id":"complex_debug_script","category":"complex_tasks","description":"Debug and fix a broken script using bash debugging features","prompt":"The script /scripts/broken.sh has bugs. Run it first to see the errors. Then examine the script, identify the bugs, fix them, and run the fixed version. The script should compute the factorial of 5 and print 'factorial(5) = 120'. The script should exit 0. Write the fixed version back to /scripts/broken.sh.","files":{"/scripts/broken.sh":"#!/bin/bash\n\nfactorial() {\n local n=$1\n if [ $n -le 1 ]; then\n echo 1\n else\n local sub=$(factorial $((n-1)))\n echo $((n * sub))\n fi\n}\n\nresult=$(factorial 5\necho \"factorial(5) = $result\"\nexit 0\n"},"expectations":[{"check":"stdout_contains:factorial(5) = 120"},{"check":"file_exists:/scripts/broken.sh"},{"check":"exit_code:0"},{"check":"tool_calls_min:2"}]}
5252
{"id":"data_regex_extract","category":"data_transformation","description":"Extract structured data from log entries using regex and BASH_REMATCH","prompt":"Read /data/access.log where each line has format '[TIMESTAMP] STATUS_CODE METHOD URL DURATION_MS'. Use bash regex matching ([[ =~ ]]) with BASH_REMATCH to extract each field. Find all requests that took longer than 500ms. Print them as 'SLOW: METHOD URL took DURATIONms (STATUS)'. At the end, print 'Slow requests: N of M total'.","files":{"/data/access.log":"[2024-01-15T10:00:01] 200 GET /api/users 150\n[2024-01-15T10:00:02] 200 POST /api/orders 850\n[2024-01-15T10:00:03] 404 GET /api/missing 50\n[2024-01-15T10:00:04] 200 GET /api/reports 1200\n[2024-01-15T10:00:05] 500 POST /api/payments 2000\n[2024-01-15T10:00:06] 200 GET /api/health 30\n[2024-01-15T10:00:07] 200 GET /api/products 450\n[2024-01-15T10:00:08] 200 PUT /api/users 620\n"},"expectations":[{"check":"stdout_contains:/api/orders"},{"check":"stdout_contains:/api/reports"},{"check":"stdout_contains:/api/payments"},{"check":"stdout_contains:620"},{"check":"stdout_regex:4.*8|4 of 8|4 slow"},{"check":"exit_code:0"}]}
53+
{"id":"db_csv_group_by","category":"database_operations","description":"GROUP BY with aggregation on CSV data","prompt":"Read /data/sales.csv with columns: region, product, amount. Compute total amount per region (like SQL GROUP BY region, SUM(amount)). Print results as 'region: total' sorted by total descending.","files":{"/data/sales.csv":"region,product,amount\nnorth,widgets,500\nsouth,gadgets,300\nnorth,bolts,200\neast,widgets,400\nsouth,widgets,350\nnorth,gadgets,150\neast,bolts,250\nsouth,bolts,100\n"},"expectations":[{"check":"stdout_contains:north"},{"check":"stdout_contains:850"},{"check":"stdout_contains:south"},{"check":"stdout_contains:750"},{"check":"stdout_contains:east"},{"check":"stdout_contains:650"},{"check":"exit_code:0"}]}
54+
{"id":"db_csv_join_aggregate","category":"database_operations","description":"Join two CSVs and compute per-group statistics","prompt":"Join /data/orders.csv and /data/products.csv on product_id. For each category, compute the total revenue (quantity * price). Print 'category: total_revenue' sorted by revenue descending.","files":{"/data/orders.csv":"order_id,product_id,quantity\n1,101,3\n2,102,5\n3,101,2\n4,103,1\n5,102,4\n6,103,3\n","/data/products.csv":"product_id,name,category,price\n101,Widget,hardware,25\n102,Gadget,electronics,50\n103,Bolt,hardware,10\n"},"expectations":[{"check":"stdout_contains:electronics"},{"check":"stdout_contains:450"},{"check":"stdout_contains:hardware"},{"check":"stdout_contains:165"},{"check":"exit_code:0"}]}
55+
{"id":"config_env_template","category":"config_management","description":"Generate .env file from template with defaults","prompt":"Read /config/template.env which has lines like 'KEY=${VALUE:-default}'. For each line, check if the key exists in /config/overrides.txt (format: KEY=value). If it does, use the override value; otherwise use the default from the template. Write the final KEY=value pairs to /app/.env and print the result.","files":{"/config/template.env":"DB_HOST=${DB_HOST:-localhost}\nDB_PORT=${DB_PORT:-5432}\nDB_NAME=${DB_NAME:-myapp}\nREDIS_URL=${REDIS_URL:-redis://localhost:6379}\nLOG_LEVEL=${LOG_LEVEL:-info}\n","/config/overrides.txt":"DB_HOST=db.prod.internal\nLOG_LEVEL=warn\n"},"expectations":[{"check":"file_exists:/app/.env"},{"check":"file_contains:/app/.env:DB_HOST=db.prod.internal"},{"check":"file_contains:/app/.env:DB_PORT=5432"},{"check":"file_contains:/app/.env:DB_NAME=myapp"},{"check":"file_contains:/app/.env:LOG_LEVEL=warn"},{"check":"stdout_contains:db.prod.internal"},{"check":"exit_code:0"}]}
56+
{"id":"config_ini_merge","category":"config_management","description":"Merge INI config files with section-aware override","prompt":"Merge /config/defaults.ini and /config/custom.ini. Custom values should override defaults within the same section. Keys only in defaults should be preserved. Write the merged result to /config/merged.ini and print it. Sections are denoted by [section_name] headers.","files":{"/config/defaults.ini":"[server]\nhost=0.0.0.0\nport=8080\nworkers=4\n\n[database]\nhost=localhost\nport=5432\npool_size=5\n\n[logging]\nlevel=info\nformat=text\n","/config/custom.ini":"[server]\nport=9090\nworkers=8\n\n[logging]\nlevel=debug\nformat=json\n"},"expectations":[{"check":"file_exists:/config/merged.ini"},{"check":"file_contains:/config/merged.ini:port=9090"},{"check":"file_contains:/config/merged.ini:workers=8"},{"check":"file_contains:/config/merged.ini:host=0.0.0.0"},{"check":"file_contains:/config/merged.ini:pool_size=5"},{"check":"file_contains:/config/merged.ini:level=debug"},{"check":"exit_code:0"}]}
57+
{"id":"build_multi_stage","category":"build_simulation","description":"Multi-stage build pipeline with dependency checking","prompt":"Simulate a build pipeline: 1) Check that /src/main.c and /src/utils.c exist, 2) 'Compile' each .c file by copying it to /build/ with a .o extension and prepending '// compiled', 3) 'Link' by concatenating all .o files into /build/program with a '// linked' header, 4) 'Package' by creating a tar.gz of /build/ at /dist/release.tar.gz. Print status for each stage. If any stage fails, stop and report the error.","files":{"/src/main.c":"int main() { return helper(); }\n","/src/utils.c":"int helper() { return 0; }\n"},"expectations":[{"check":"file_exists:/build/main.o"},{"check":"file_exists:/build/utils.o"},{"check":"file_exists:/build/program"},{"check":"file_contains:/build/program:compiled"},{"check":"file_exists:/dist/release.tar.gz"},{"check":"exit_code:0"}]}
58+
{"id":"build_script_generator","category":"build_simulation","description":"Generate a Makefile-like build script from dependency spec","prompt":"Read /project/deps.txt which lists build targets and their dependencies (format: 'target: dep1 dep2'). Generate /project/build.sh that builds targets in correct dependency order (dependencies before dependents). Then run the build script, which should create each target as a file in /project/out/ containing 'built: <target>'. Print the build order.","files":{"/project/deps.txt":"app: lib utils\nlib: core\nutils: core\ncore:\n","/project/src/core":"core source\n","/project/src/lib":"lib source\n","/project/src/utils":"utils source\n","/project/src/app":"app source\n"},"expectations":[{"check":"file_exists:/project/build.sh"},{"check":"file_exists:/project/out/core"},{"check":"file_exists:/project/out/lib"},{"check":"file_exists:/project/out/app"},{"check":"exit_code:0"}]}
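The expected totals in the `db_csv_group_by` task (north 850, south 750, east 650) can be cross-checked with a small sketch. This is not part of the commit: `totals_by_region` is a hypothetical helper, and it assumes the three-column `region,product,amount` layout shown in the task's `/data/sales.csv`.

```rust
use std::collections::HashMap;

/// Hypothetical helper (not from the commit): sums `amount` per `region`
/// for CSV text with a `region,product,amount` header, then sorts by
/// total descending (ties broken by region name).
fn totals_by_region(csv: &str) -> Vec<(String, u64)> {
    let mut totals: HashMap<String, u64> = HashMap::new();

    // Skip the header row, then accumulate one row at a time.
    for line in csv.lines().skip(1) {
        let fields: Vec<&str> = line.split(',').collect();
        if fields.len() != 3 {
            continue; // ignore malformed or empty rows
        }
        if let Ok(amount) = fields[2].trim().parse::<u64>() {
            *totals.entry(fields[0].to_string()).or_insert(0) += amount;
        }
    }

    let mut rows: Vec<(String, u64)> = totals.into_iter().collect();
    rows.sort_by(|a, b| b.1.cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    rows
}
```

Feeding it the task's sales data yields the three regions in the order the `stdout_contains` checks imply, which suggests the expectations are internally consistent.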

crates/bashkit/src/builtins/wc.rs

Lines changed: 51 additions & 15 deletions
@@ -78,6 +78,20 @@ impl WcFlags {
             max_line_length,
         }
     }
+
+    /// Number of active count fields
+    fn active_count(&self) -> usize {
+        [
+            self.lines,
+            self.words,
+            self.bytes,
+            self.chars,
+            self.max_line_length,
+        ]
+        .iter()
+        .filter(|&&b| b)
+        .count()
+    }
 }

 #[async_trait]
@@ -102,7 +116,9 @@ impl Builtin for Wc {
             // Read from stdin
             if let Some(stdin) = ctx.stdin {
                 let counts = count_text(stdin);
-                output.push_str(&format_counts(&counts, &flags, None));
+                // Real bash: no padding for single-value stdin, padded for multiple values
+                let padded = flags.active_count() > 1;
+                output.push_str(&format_counts(&counts, &flags, None, padded));
                 output.push('\n');
             }
         } else {
@@ -127,7 +143,7 @@ impl Builtin for Wc {
                             total_max_line = counts.max_line_length;
                         }

-                        output.push_str(&format_counts(&counts, &flags, Some(file)));
+                        output.push_str(&format_counts(&counts, &flags, Some(file), true));
                         output.push('\n');
                     }
                     Err(e) => {
@@ -145,7 +161,12 @@ impl Builtin for Wc {
                     chars: total_chars,
                     max_line_length: total_max_line,
                 };
-                output.push_str(&format_counts(&totals, &flags, Some(&"total".to_string())));
+                output.push_str(&format_counts(
+                    &totals,
+                    &flags,
+                    Some(&"total".to_string()),
+                    true,
+                ));
                 output.push('\n');
             }
         }
@@ -178,32 +199,47 @@ fn count_text(text: &str) -> TextCounts {
     }
 }

-/// Format counts for output
-fn format_counts(counts: &TextCounts, flags: &WcFlags, filename: Option<&String>) -> String {
-    let mut parts = Vec::new();
+/// Format counts for output.
+/// When `padded` is true, right-align numbers in 8-char fields (used for file output).
+/// When `padded` is false, use minimal formatting like real bash stdin output.
+fn format_counts(
+    counts: &TextCounts,
+    flags: &WcFlags,
+    filename: Option<&String>,
+    padded: bool,
+) -> String {
+    let mut values: Vec<usize> = Vec::new();

     if flags.lines {
-        parts.push(format!("{:>8}", counts.lines));
+        values.push(counts.lines);
     }
     if flags.words {
-        parts.push(format!("{:>8}", counts.words));
+        values.push(counts.words);
     }
     if flags.bytes {
-        parts.push(format!("{:>8}", counts.bytes));
+        values.push(counts.bytes);
     }
     if flags.chars {
-        parts.push(format!("{:>8}", counts.chars));
+        values.push(counts.chars);
     }
     if flags.max_line_length {
-        parts.push(format!("{:>8}", counts.max_line_length));
+        values.push(counts.max_line_length);
     }

-    let mut result = parts.join("");
+    let result = if padded {
+        // Real bash uses 7-char wide fields separated by a space
+        let parts: Vec<String> = values.iter().map(|v| format!("{:>7}", v)).collect();
+        parts.join(" ")
+    } else {
+        let parts: Vec<String> = values.iter().map(|v| v.to_string()).collect();
+        parts.join(" ")
+    };
+
     if let Some(name) = filename {
-        result.push(' ');
-        result.push_str(name);
+        format!("{} {}", result, name)
+    } else {
+        result
     }
-    result
 }

 #[cfg(test)]
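
The padding rule introduced in `format_counts` can be sketched in isolation. The function below is a hypothetical stand-in, not the commit's code: it takes the already-selected counts as a slice instead of `TextCounts` and `WcFlags`. Note that the new doc comment mentions 8-char fields while the code formats with `{:>7}` joined by a space; the sketch follows the code. "Real bash" in the comments presumably means `wc` as invoked from a bash session, since `wc` is an external utility rather than a bash builtin.

```rust
/// Hypothetical stand-in for the `format_counts` logic in this diff.
/// `values` holds the counts selected by the flags, in output order.
fn format_values(values: &[usize], filename: Option<&str>, padded: bool) -> String {
    let parts: Vec<String> = values
        .iter()
        .map(|v| {
            if padded {
                // Right-align each count in a 7-character field.
                format!("{:>7}", v)
            } else {
                // Minimal formatting: just the number.
                v.to_string()
            }
        })
        .collect();

    // Fields are separated by a single space in both modes.
    let joined = parts.join(" ");

    match filename {
        Some(name) => format!("{} {}", joined, name),
        None => joined,
    }
}
```

Under this sketch, a single stdin count such as `wc -l` is emitted bare, while the default three-count output keeps its column alignment, which matches the `flags.active_count() > 1` check at the stdin call site.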

crates/bashkit/src/error.rs

Lines changed: 4 additions & 0 deletions
@@ -49,6 +49,10 @@ pub enum Error {
     #[error("network error: {0}")]
     Network(String),

+    /// Regex compilation or matching error.
+    #[error("regex error: {0}")]
+    Regex(#[from] regex::Error),
+
     /// Internal error for unexpected failures.
     ///
     /// THREAT[TM-INT-002]: Unexpected internal failures should not crash the interpreter.
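
The `#[from]` attribute on the new `Regex` variant asks `thiserror` to generate a `From<regex::Error>` impl, which is what lets call sites use `?` on regex results. A minimal std-only sketch of that mechanism follows. Everything in it is illustrative: `SketchError`, `parse_count`, and this `error_kind` are hypothetical names, and `std::num::ParseIntError` stands in for `regex::Error` so the example needs no external crates.

```rust
use std::fmt;
use std::num::ParseIntError;

/// Hypothetical error enum; `ParseIntError` stands in for `regex::Error`.
#[derive(Debug)]
enum SketchError {
    Parse(ParseIntError),
}

// Roughly the conversion that `#[from]` generates for a variant.
impl From<ParseIntError> for SketchError {
    fn from(e: ParseIntError) -> Self {
        SketchError::Parse(e)
    }
}

// Roughly the Display impl that an `#[error("...: {0}")]` attribute generates.
impl fmt::Display for SketchError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            SketchError::Parse(e) => write!(f, "parse error: {}", e),
        }
    }
}

/// Mirrors the shape of the `error_kind` mapping for the new variant.
fn error_kind(e: &SketchError) -> String {
    match e {
        SketchError::Parse(_) => "parse_error".to_string(),
    }
}

/// `?` converts the `ParseIntError` into `SketchError` via the `From` impl.
fn parse_count(s: &str) -> Result<usize, SketchError> {
    Ok(s.parse::<usize>()?)
}
```

Because the enum gains a variant, any exhaustive `match` over it must gain an arm too, which is why `error_kind` in `tool.rs` changes in the same commit.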

crates/bashkit/src/tool.rs

Lines changed: 32 additions & 6 deletions
@@ -42,8 +42,21 @@ use std::sync::{Arc, Mutex};
 /// Library version from Cargo.toml
 pub const VERSION: &str = env!("CARGO_PKG_VERSION");

-/// List of built-in commands
-const BUILTINS: &str = "echo cat grep sed awk jq curl head tail sort uniq cut tr wc date sleep mkdir rm cp mv touch chmod printf test [ true false exit cd pwd ls find xargs basename dirname env export read";
+/// List of built-in commands (organized by category)
+const BUILTINS: &str = "\
+    echo printf cat read \
+    grep sed awk jq head tail sort uniq cut tr wc nl paste column comm diff strings tac rev \
+    cd pwd ls find mkdir mktemp rm rmdir cp mv touch chmod chown ln \
+    file stat less tar gzip gunzip du df \
+    test [ true false exit return break continue \
+    export set unset local shift source eval declare typeset readonly shopt getopts \
+    sleep date seq expr yes wait timeout xargs tee watch \
+    basename dirname realpath \
+    pushd popd dirs \
+    whoami hostname uname id env printenv history \
+    curl wget \
+    od xxd hexdump base64 \
+    kill";

 /// Base help documentation template (generic help format)
 const BASE_HELP: &str = r#"BASH(1) User Commands BASH(1)
@@ -62,10 +75,22 @@ DESCRIPTION
 loops, conditionals, functions, and arrays.

 BUILTINS
-echo, cat, grep, sed, awk, jq, curl, head, tail, sort, uniq, cut, tr,
-wc, date, sleep, mkdir, rm, cp, mv, touch, chmod, printf, test, [,
-true, false, exit, cd, pwd, ls, find, xargs, basename, dirname, env,
-export, read
+Core I/O: echo, printf, cat, read
+Text Processing: grep, sed, awk, jq, head, tail, sort, uniq, cut, tr, wc,
+nl, paste, column, comm, diff, strings, tac, rev
+File Operations: cd, pwd, ls, find, mkdir, mktemp, rm, rmdir, cp, mv,
+touch, chmod, chown, ln
+File Inspection: file, stat, less, tar, gzip, gunzip, du, df
+Flow Control: test, [, true, false, exit, return, break, continue
+Shell/Variables: export, set, unset, local, shift, source, eval, declare,
+typeset, readonly, shopt, getopts
+Utilities: sleep, date, seq, expr, yes, wait, timeout, xargs, tee,
+watch, basename, dirname, realpath
+Dir Stack: pushd, popd, dirs
+System Info: whoami, hostname, uname, id, env, printenv, history
+Network: curl, wget
+Binary/Hex: od, xxd, hexdump, base64
+Signals: kill

 INPUT
 commands Bash commands to execute (like bash -c "commands")
@@ -671,6 +696,7 @@ fn error_kind(e: &Error) -> String {
         Error::CommandNotFound(_) => "command_not_found".to_string(),
         Error::ResourceLimit(_) => "resource_limit".to_string(),
         Error::Network(_) => "network_error".to_string(),
+        Error::Regex(_) => "regex_error".to_string(),
         Error::Internal(_) => "internal_error".to_string(),
     }
 }
Lines changed: 22 additions & 41 deletions
@@ -1,158 +1,139 @@
 ### wc_lines_only
-### bash_diff: Bashkit wc uses fixed-width padding for stdin, real bash uses no padding
 # Count lines with -l
 printf 'a\nb\nc\n' | wc -l
 ### expect
-       3
+3
 ### end

 ### wc_words_only
-### bash_diff: Bashkit wc uses fixed-width padding for stdin, real bash uses no padding
 # Count words with -w
 printf 'one two three four five' | wc -w
 ### expect
-       5
+5
 ### end

 ### wc_bytes_only
-### bash_diff: Bashkit wc uses fixed-width padding for stdin, real bash uses no padding
 # Count bytes with -c
 printf 'hello' | wc -c
 ### expect
-       5
+5
 ### end

 ### wc_empty
-### bash_diff: Bashkit wc uses fixed-width padding for stdin, real bash uses no padding
 # Empty input
 printf '' | wc -l
 ### expect
-       0
+0
 ### end

 ### wc_all_flags
-### bash_diff: Bashkit wc uses fixed-width padding for stdin, real bash uses no padding
 # All counts (default)
 printf 'hello world\n' | wc
 ### expect
-       1       2      12
+      1       2      12
 ### end

 ### wc_multiple_lines
-### bash_diff: Bashkit wc uses fixed-width padding for stdin, real bash uses no padding
 # Multiple lines
 printf 'one\ntwo\nthree\n' | wc -l
 ### expect
-       3
+3
 ### end

 ### wc_chars_m_flag
-### bash_diff: Bashkit wc uses fixed-width padding for stdin, real bash uses no padding
 # Count characters with -m
 printf 'hello' | wc -m
 ### expect
-       5
+5
 ### end

 ### wc_lines_words
-### bash_diff: Bashkit wc uses fixed-width padding for stdin, real bash uses no padding
 # Lines and words combined
 printf 'one two\nthree four\n' | wc -lw
 ### expect
-       2       4
+      2       4
 ### end

 ### wc_no_newline_at_end
-### bash_diff: Bashkit wc uses fixed-width padding for stdin, real bash uses no padding
 # Input without trailing newline
 printf 'hello world' | wc -w
 ### expect
-       2
+2
 ### end

 ### wc_multiple_spaces
-### bash_diff: Bashkit wc uses fixed-width padding for stdin, real bash uses no padding
 # Multiple spaces between words
 printf 'hello world' | wc -w
 ### expect
-       2
+2
 ### end

 ### wc_tabs_count
-### bash_diff: Bashkit wc uses fixed-width padding for stdin, real bash uses no padding
 # Tabs in input
 printf 'a\tb\tc' | wc -w
 ### expect
-       3
+3
 ### end

 ### wc_single_word
-### bash_diff: Bashkit wc uses fixed-width padding for stdin, real bash uses no padding
 # Single word
 printf 'word' | wc -w
 ### expect
-       1
+1
 ### end

 ### wc_only_whitespace
-### bash_diff: Bashkit wc uses fixed-width padding for stdin, real bash uses no padding
 # Only whitespace
 printf ' \t ' | wc -w
 ### expect
-       0
+0
 ### end

 ### wc_max_line_length
-### bash_diff: Bashkit wc uses fixed-width padding for stdin, real bash uses no padding
 printf 'short\nlongerline\n' | wc -L
 ### expect
-      10
+10
 ### end

 ### wc_long_flags
-### bash_diff: Bashkit wc uses fixed-width padding for stdin, real bash uses no padding
 # Long flag --lines
 printf 'a\nb\n' | wc --lines
 ### expect
-       2
+2
 ### end

 ### wc_long_words
-### bash_diff: Bashkit wc uses fixed-width padding for stdin, real bash uses no padding
 # Long flag --words
 printf 'one two three' | wc --words
 ### expect
-       3
+3
 ### end

 ### wc_long_bytes
-### bash_diff: Bashkit wc uses fixed-width padding for stdin, real bash uses no padding
 # Long flag --bytes
 printf 'hello' | wc --bytes
 ### expect
-       5
+5
 ### end

 ### wc_bytes_vs_chars
-### bash_diff: Bashkit wc uses fixed-width padding for stdin, real bash uses no padding
 # Bytes vs chars for ASCII
 printf 'hello' | wc -c && printf 'hello' | wc -m
 ### expect
-       5
-       5
+5
+5
 ### end

 ### wc_unicode_chars
-### bash_diff: Bashkit wc uses fixed-width padding for stdin, real bash uses no padding
+### bash_diff: locale-dependent; real bash wc -m may count bytes in C locale
 printf 'héllo' | wc -m
 ### expect
-       5
+5
 ### end

 ### wc_unicode_bytes
-### bash_diff: Bashkit wc uses fixed-width padding for stdin, real bash uses no padding
 # Unicode byte count
 printf 'héllo' | wc -c
 ### expect
-       6
+6
 ### end