Commit 430083d

fix(eval): ground LLM judge with command reference to prevent false negatives (#712)
## Summary

- The skill eval LLM judge (Haiku 4.5) had zero context about the `sentry` CLI, causing it to hallucinate that valid commands don't exist (confusing it with the legacy `sentry-cli`).
- This caused Opus 4.6 to fail 3/8 eval cases (a 62.5% pass rate, below the 75% threshold), all on the `overall-quality` LLM judge criterion, not on deterministic checks.
- Fix: extract the Command Reference section from SKILL.md and inject it into the judge prompt as grounding context.

Failing CI run: https://github.com/getsentry/cli/actions/runs/24207303049/job/70666509005
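The fix described in the summary can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: `extractSection`, `buildJudgePrompt`, the prompt wording, and the `sentry example` entry are all hypothetical, and the heading matcher assumes ATX-style (`#`) headings with no `#` lines inside fenced code blocks.

```typescript
// Hypothetical helper: return the body of the first section whose heading
// text equals `title`, up to the next heading of the same or higher rank.
function extractSection(markdown: string, title: string): string | null {
  let level = 0; // 0 means the target heading has not been found yet
  const body: string[] = [];
  for (const line of markdown.split("\n")) {
    const heading = /^(#{1,6})\s+(.*?)\s*$/.exec(line);
    if (level === 0) {
      // Still searching for the target heading.
      if (heading && heading[2] === title) level = heading[1].length;
      continue;
    }
    // Inside the section: stop at the next heading of equal or higher rank.
    if (heading && heading[1].length <= level) break;
    body.push(line);
  }
  return level === 0 ? null : body.join("\n").trim();
}

// Hypothetical prompt builder: prepend the reference so the judge can check
// commands against it instead of relying on what it remembers.
function buildJudgePrompt(reference: string | null, answer: string): string {
  const grounding = reference
    ? `Valid commands for the \`sentry\` CLI:\n${reference}\n\n`
    : "";
  return `${grounding}Evaluate the following answer:\n${answer}`;
}

// Placeholder SKILL.md content; the command entry is made up for illustration.
const skill = [
  "# Skill",
  "intro",
  "## Command Reference",
  "### sentry example",
  "Placeholder entry.",
  "## Other",
  "ignored",
].join("\n");

const reference = extractSection(skill, "Command Reference");
console.log(buildJudgePrompt(reference, "Run `sentry example`."));
```

Grounding the judge this way keeps the reference in sync with the generated SKILL.md, which is presumably why the workflows regenerate docs before running the eval.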
1 parent 3ed0f1d commit 430083d

7 files changed, +296 -72 lines changed

.github/workflows/ci.yml

Lines changed: 6 additions & 63 deletions
@@ -142,61 +142,6 @@ jobs:
           echo "::error::Skill files are out of date. Run 'bun run generate:docs' locally and commit the result."
           exit 1
 
-  eval-skill:
-    name: Eval SKILL.md
-    needs: [changes]
-    if: needs.changes.outputs.skill == 'true'
-    runs-on: ubuntu-latest
-    steps:
-      # For fork PRs: check if eval has already passed via commit status
-      - name: Detect fork
-        id: detect-fork
-        run: |
-          if [[ "${{ github.event_name }}" == "pull_request" && "${{ github.event.pull_request.head.repo.full_name }}" != "${{ github.repository }}" ]]; then
-            echo "is_fork=true" >> "$GITHUB_OUTPUT"
-          fi
-      - name: Check fork eval status
-        if: steps.detect-fork.outputs.is_fork == 'true'
-        env:
-          GH_TOKEN: ${{ github.token }}
-        run: |
-          SHA="${{ github.event.pull_request.head.sha }}"
-          STATUS=$(gh api "repos/${{ github.repository }}/commits/$SHA/statuses" \
-            --jq '[.[] | select(.context == "eval-skill/fork")] | first | .state // "none"')
-          if [[ "$STATUS" != "success" ]]; then
-            echo "::error::Fork PR modifies skill files but eval has not passed for commit $SHA."
-            echo "::error::A maintainer must review the code and add the 'eval-skill' label."
-            exit 1
-          fi
-          echo "Fork eval passed for $SHA"
-      # For internal PRs: run the eval directly
-      - uses: actions/checkout@v6
-        if: steps.detect-fork.outputs.is_fork != 'true'
-      - uses: oven-sh/setup-bun@v2
-        if: steps.detect-fork.outputs.is_fork != 'true'
-      - uses: actions/cache@v5
-        if: steps.detect-fork.outputs.is_fork != 'true'
-        id: cache
-        with:
-          path: node_modules
-          key: node-modules-${{ hashFiles('bun.lock', 'patches/**') }}
-      - if: steps.detect-fork.outputs.is_fork != 'true' && steps.cache.outputs.cache-hit != 'true'
-        run: bun install --frozen-lockfile
-      - name: Generate docs and skill files
-        if: steps.detect-fork.outputs.is_fork != 'true'
-        run: bun run generate:schema && bun run generate:docs
-      - name: Eval SKILL.md
-        if: steps.detect-fork.outputs.is_fork != 'true'
-        run: bun run eval:skill
-        env:
-          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
-      - name: Upload eval results
-        if: always() && steps.detect-fork.outputs.is_fork != 'true'
-        uses: actions/upload-artifact@v7
-        with:
-          name: skill-eval-results
-          path: test/skill-eval/results.json
-
   lint:
     name: Lint & Typecheck
     needs: [changes]
@@ -597,7 +542,7 @@ jobs:
 
   test-e2e:
     name: E2E Tests
-    needs: [build-binary]
+    needs: [build-binary, changes]
     runs-on: ubuntu-latest
     steps:
       - uses: actions/checkout@v6
@@ -621,6 +566,9 @@ jobs:
       - name: E2E Tests
         env:
           SENTRY_CLI_BINARY: ${{ github.workspace }}/dist-bin/sentry-linux-x64
+          # Pass API key only when skill files changed — the skill-eval e2e test
+          # auto-skips when the key is absent, so non-skill PRs aren't affected.
+          ANTHROPIC_API_KEY: ${{ needs.changes.outputs.skill == 'true' && secrets.ANTHROPIC_API_KEY || '' }}
         run: bun run test:e2e
 
   build-npm:
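The conditional key in the hunk above relies on the `a && b || c` idiom in GitHub Actions expressions, which behaves like its JavaScript counterpart: the result is `c` whenever `a` or `b` is falsy. A minimal sketch of that behavior and of the auto-skip it enables is below. The function names are hypothetical, and the real skill-eval e2e test may detect the missing key differently.

```typescript
// Mirrors the workflow expression
//   needs.changes.outputs.skill == 'true' && secrets.ANTHROPIC_API_KEY || ''
// A missing or empty secret also resolves to the empty string.
function resolveApiKey(skillChanged: string, secret: string | undefined): string {
  return (skillChanged === "true" && secret) || "";
}

// Hypothetical guard for the skill-eval e2e test: run only when a key is present.
function shouldRunSkillEval(env: Record<string, string | undefined>): boolean {
  return Boolean(env.ANTHROPIC_API_KEY);
}

// "sk-placeholder" is a dummy value, not a real key format.
const skillPr = { ANTHROPIC_API_KEY: resolveApiKey("true", "sk-placeholder") };
const otherPr = { ANTHROPIC_API_KEY: resolveApiKey("false", "sk-placeholder") };

console.log(shouldRunSkillEval(skillPr)); // true
console.log(shouldRunSkillEval(otherPr)); // false
```

One consequence of this design is that a skill-changing PR without access to the secret (for example, from a fork) would also see an empty key and skip the test, which is presumably why the separate fork workflow exists.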
@@ -726,15 +674,15 @@ jobs:
   ci-status:
     name: CI Status
     if: always()
-    needs: [changes, check-generated, eval-skill, build-binary, build-npm, build-docs, test-e2e, generate-patches, publish-nightly]
+    needs: [changes, check-generated, build-binary, build-npm, build-docs, test-e2e, generate-patches, publish-nightly]
     runs-on: ubuntu-latest
     permissions: {}
     steps:
       - name: Check CI status
         run: |
           # Check for explicit failures or cancellations in all jobs
           # generate-patches and publish-nightly are skipped on PRs — that's expected
-          results="${{ needs.check-generated.result }} ${{ needs.eval-skill.result }} ${{ needs.build-binary.result }} ${{ needs.build-npm.result }} ${{ needs.build-docs.result }} ${{ needs.test-e2e.result }} ${{ needs.generate-patches.result }} ${{ needs.publish-nightly.result }}"
+          results="${{ needs.check-generated.result }} ${{ needs.build-binary.result }} ${{ needs.build-npm.result }} ${{ needs.build-docs.result }} ${{ needs.test-e2e.result }} ${{ needs.generate-patches.result }} ${{ needs.publish-nightly.result }}"
           for result in $results; do
             if [[ "$result" == "failure" || "$result" == "cancelled" ]]; then
               echo "::error::CI failed"
@@ -752,9 +700,4 @@ jobs:
             echo "::error::CI failed - upstream job failed causing check-generated to be skipped"
             exit 1
           fi
-          if [[ "${{ needs.changes.outputs.skill }}" == "true" && "${{ needs.eval-skill.result }}" == "skipped" ]]; then
-            echo "::error::CI failed - upstream job failed causing eval-skill to be skipped"
-            exit 1
-          fi
-
           echo "CI passed"
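The aggregation rule in the `ci-status` job can be restated as a small predicate. This is a sketch under two assumptions: that the loop exits non-zero right after printing `CI failed` (the hunk cuts off before that line), and that the separate check for a skipped `check-generated` job is handled elsewhere. `ciPassed` and `JobResult` are illustrative names.

```typescript
// The four values a `needs.<job>.result` context can take in GitHub Actions.
type JobResult = "success" | "failure" | "cancelled" | "skipped";

// Returns true when no job failed or was cancelled. "skipped" is tolerated,
// matching the note that generate-patches and publish-nightly are skipped on PRs.
function ciPassed(results: JobResult[]): boolean {
  return !results.some((r) => r === "failure" || r === "cancelled");
}

console.log(ciPassed(["success", "success", "skipped", "skipped"])); // true
console.log(ciPassed(["success", "failure", "skipped"])); // false
```

Because `skipped` counts as passing here, dropping the `eval-skill` job from `needs` also removes the need for its dedicated "skipped" guard, which is what the last hunk deletes.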

.github/workflows/eval-skill-fork.yml

Lines changed: 3 additions & 0 deletions
@@ -46,6 +46,9 @@ jobs:
       - if: steps.cache.outputs.cache-hit != 'true'
         run: bun install --frozen-lockfile
 
+      - name: Generate docs and skill files
+        run: bun run generate:schema && bun run generate:docs
+
       - name: Eval SKILL.md
         id: eval
         run: bun run eval:skill
