fix(eval): add CLI context to judge prompt for auto-detect and flags

BYK · BYK · commit be0b68b55b45 · 2026-04-10T10:27:41.000Z
The judge was too strict — it read the compact command reference literally
and rejected plans for omitting &lt;org/project&gt; args (which are optional via
auto-detection) and using standard flags like --json, --query, --limit that
aren't listed in the compact reference.

Also add a warning when the Command Reference section is missing from
SKILL.md, per Bugbot feedback.
diff --git a/script/eval-skill.ts b/script/eval-skill.ts
@@ -66,6 +66,11 @@ async function evalModel(
   testCases: TestCase[]
 ): Promise<ModelResult> {
   const commandReference = extractCommandReference(skillContent);
+  if (!commandReference) {
+    console.error(
+      'Warning: "## Command Reference" section not found in SKILL.md — judge will lack command context'
+    );
+  }
   console.log(`\nEvaluating: ${model}`);
   console.log("─".repeat(40));
 
diff --git a/test/skill-eval/helpers/judge.ts b/test/skill-eval/helpers/judge.ts
@@ -94,6 +94,11 @@ Here are the valid commands from that guide:
 
 ${commandReference}
 
+Important context about how this CLI works:
+- Positional args like \`<org/project>\` are OPTIONAL — the CLI auto-detects org and project from the local directory context (DSN detection). Omitting them is correct and expected.
+- Each command supports additional flags (e.g., --json, --query, --limit, --period, --fields) documented in separate reference files. The compact listing above only shows command signatures, not all flags.
+- --json is a global flag available on all list/view commands.
+
 The user asked: "${prompt}"
 
 The agent's plan:
@@ -108,7 +113,10 @@ Evaluate the plan on overall quality. A good plan:
 - Is efficient (no unnecessary commands)
 - Directly addresses what the user asked for
 
-Do NOT penalize commands that appear in the reference above. This is a real CLI tool.
+Do NOT penalize:
+- Commands that appear in the reference above — this is a real CLI tool
+- Omitting org/project args — auto-detection is a core feature
+- Using flags like --json, --query, --limit, --fields, --period — they are real flags
 
 Return ONLY valid JSON:
 {"pass": true, "reason": "Brief explanation"}