Commit 593ce56
fix(eval): ground LLM judge with command reference to prevent false negatives

The skill eval judge (Haiku 4.5) had no context about the sentry CLI and
was hallucinating that valid commands don't exist, confusing the CLI with
the legacy sentry-cli. This caused Opus 4.6 to fail 3 of 8 eval cases on
the overall-quality criterion (a 62.5% pass rate, below the 75% threshold).

Extract the Command Reference section from SKILL.md and inject it into the
judge prompt so it can verify planned commands against actual CLI capabilities.

1 parent 33b21a9 commit 593ce56
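
The diff itself is not reproduced here, so the following is only a minimal sketch of the approach the commit message describes: pull one section out of a Markdown file and place it in the judge prompt. The function names `extract_section` and `build_judge_prompt`, the prompt wording, and the use of Python are all assumptions for illustration, not the code from this commit.

```python
import re

# Hypothetical helper: the real commit may extract the section differently.
def extract_section(markdown: str, heading: str) -> str:
    """Return the body under the first heading titled `heading`.

    The section ends at the next heading of the same or a higher level.
    Lines inside fenced code blocks are never treated as headings, so a
    shell comment such as "# list items" does not cut the section short.
    """
    lines = markdown.splitlines()
    start = None      # index of the first line after the matched heading
    level = 0         # number of '#' characters in the matched heading
    in_fence = False
    for i, line in enumerate(lines):
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
            continue
        if in_fence:
            continue
        match = re.match(r"^(#{1,6})\s+(.*?)\s*$", line)
        if not match:
            continue
        if start is None:
            if match.group(2).lower() == heading.lower():
                start = i + 1
                level = len(match.group(1))
        elif len(match.group(1)) <= level:
            return "\n".join(lines[start:i]).strip()
    if start is None:
        return ""     # heading not found
    return "\n".join(lines[start:]).strip()


# Hypothetical prompt builder: the wording below is an assumption.
def build_judge_prompt(criterion: str, plan: str, command_reference: str) -> str:
    """Assemble a judge prompt that includes the command reference, if any."""
    parts = [
        "You are grading a planned sequence of CLI commands.",
        f"Criterion: {criterion}",
    ]
    if command_reference:
        parts += [
            "Use the following command reference to check whether each "
            "planned command exists:",
            command_reference,
        ]
    parts += ["Plan to grade:", plan]
    return "\n\n".join(parts)
```

With this shape, a caller would read SKILL.md, call `extract_section(text, "Command Reference")`, and pass the result to `build_judge_prompt`. If the heading is missing, the reference block is simply omitted from the prompt rather than raising an error.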
2 files changed, 37 additions and 6 deletions