Commit be0b68b

fix(eval): add CLI context to judge prompt for auto-detect and flags
The judge was too strict — it read the compact command reference literally and rejected plans for omitting <org/project> args (which are optional via auto-detection) and using standard flags like --json, --query, --limit that aren't listed in the compact reference. Also add a warning when the Command Reference section is missing from SKILL.md, per Bugbot feedback.
1 parent 593ce56 commit be0b68b


2 files changed: +14 -1 lines changed


script/eval-skill.ts

Lines changed: 5 additions & 0 deletions
@@ -66,6 +66,11 @@ async function evalModel(
   testCases: TestCase[]
 ): Promise<ModelResult> {
   const commandReference = extractCommandReference(skillContent);
+  if (!commandReference) {
+    console.error(
+      'Warning: "## Command Reference" section not found in SKILL.md — judge will lack command context'
+    );
+  }
   console.log(`\nEvaluating: ${model}`);
   console.log("─".repeat(40));
 
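
The new warning depends on `extractCommandReference` returning a falsy value when the heading is absent. That helper is not part of this diff, so the following is only a minimal sketch of how such an extraction could behave. The function name `extractCommandReferenceSketch` and the rule "stop at the next `## ` heading" are assumptions, not the repository's implementation.

```typescript
// Hypothetical sketch: the real extractCommandReference is not shown in this commit.
// Returns the text under "## Command Reference", or "" when the heading is missing.
function extractCommandReferenceSketch(skillContent: string): string {
  const lines = skillContent.split("\n");
  const start = lines.findIndex(
    (line) => line.trim() === "## Command Reference"
  );
  if (start === -1) {
    // An empty string is falsy, so a caller's `if (!commandReference)` check fires.
    return "";
  }
  const rest = lines.slice(start + 1);
  // Assumption: the section ends at the next level-2 heading; "###" subsections stay.
  const end = rest.findIndex((line) => /^##\s/.test(line));
  const body = end === -1 ? rest : rest.slice(0, end);
  return body.join("\n").trim();
}
```

With a helper shaped like this, a SKILL.md that lacks the heading yields an empty string, which is the case the added `console.error` reports.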

test/skill-eval/helpers/judge.ts

Lines changed: 9 additions & 1 deletion
@@ -94,6 +94,11 @@ Here are the valid commands from that guide:
 
 ${commandReference}
 
+Important context about how this CLI works:
+- Positional args like \`<org/project>\` are OPTIONAL — the CLI auto-detects org and project from the local directory context (DSN detection). Omitting them is correct and expected.
+- Each command supports additional flags (e.g., --json, --query, --limit, --period, --fields) documented in separate reference files. The compact listing above only shows command signatures, not all flags.
+- --json is a global flag available on all list/view commands.
+
 The user asked: "${prompt}"
 
 The agent's plan:
@@ -108,7 +113,10 @@ Evaluate the plan on overall quality. A good plan:
 - Is efficient (no unnecessary commands)
 - Directly addresses what the user asked for
 
-Do NOT penalize commands that appear in the reference above. This is a real CLI tool.
+Do NOT penalize:
+- Commands that appear in the reference above — this is a real CLI tool
+- Omitting org/project args — auto-detection is a core feature
+- Using flags like --json, --query, --limit, --fields, --period — they are real flags
 
 Return ONLY valid JSON:
 {"pass": true, "reason": "Brief explanation"}
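
The prompt asks the judge to return only a JSON object with `pass` and `reason`. The code that consumes that reply is not in this diff, so the sketch below is just one way a caller might validate it. The `JudgeVerdict` type and `parseJudgeVerdict` function are hypothetical names, and treating a malformed reply as a failure is an assumption rather than the project's documented behavior.

```typescript
// Hypothetical shape of the judge's reply, mirroring the JSON in the prompt.
interface JudgeVerdict {
  pass: boolean;
  reason: string;
}

// Sketch only: parse the raw model reply and fail closed on anything unexpected.
function parseJudgeVerdict(raw: string): JudgeVerdict {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw.trim());
  } catch {
    return { pass: false, reason: "Judge reply was not valid JSON" };
  }
  if (
    typeof parsed === "object" &&
    parsed !== null &&
    typeof (parsed as Record<string, unknown>).pass === "boolean" &&
    typeof (parsed as Record<string, unknown>).reason === "string"
  ) {
    const { pass, reason } = parsed as JudgeVerdict;
    return { pass, reason };
  }
  return { pass: false, reason: "Judge reply did not match the expected shape" };
}
```

Failing closed means a garbled reply counts against the plan instead of slipping through as a pass. A caller that prefers to retry the judge could throw instead.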
