Only score based on surfaced violations #69
- Introduce RawCheckResult type: violations + word_count, no score fields
- Evaluators (base, accuracy) return RawCheckResult; scoring deferred
- Orchestrator calls calculateCheckScore after gate filtering, so only violations surfaced to the user contribute to the density penalty
- Add regression test: 1 filtered violation must not move score from 9.0 to 8.0
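The ordering change in this commit can be sketched as follows. This is only an illustration under stated assumptions: the `Violation` shape, the confidence threshold, and the `sketchCheckScore` formula (one point per violation per 100 words) are hypothetical stand-ins chosen so the numbers match the regression case above; the project's real `calculateCheckScore` and gate logic may differ.

```typescript
// Hypothetical shapes; the real types live in src/prompts/schema.ts.
interface Violation {
  message: string;
  confidence: number; // assumed 0..1 confidence reported by the LLM
}

interface RawCheckResult {
  violations: Violation[];
  word_count: number;
}

// Placeholder density penalty: one point per violation per 100 words.
// This is NOT the project's calculateCheckScore formula, only a stand-in.
function sketchCheckScore(violations: Violation[], wordCount: number): number {
  const density = (violations.length / Math.max(wordCount, 1)) * 100;
  return Math.max(0, Math.round((10 - density) * 10) / 10);
}

// Placeholder gate: keep only violations at or above a confidence threshold.
function surfaced(raw: RawCheckResult, minConfidence: number): Violation[] {
  return raw.violations.filter((v) => v.confidence >= minConfidence);
}

const raw: RawCheckResult = {
  violations: [
    { message: "passive voice", confidence: 0.9 },
    { message: "possible jargon", confidence: 0.2 }, // dropped by the gate
  ],
  word_count: 100,
};

// Scoring before filtering counts both candidates; scoring after counts one.
const scoreBeforeFiltering = sketchCheckScore(raw.violations, raw.word_count);
const scoreAfterFiltering = sketchCheckScore(surfaced(raw, 0.5), raw.word_count);
```

Under these assumptions, scoring before the gate penalizes the file for a candidate the user never sees, while scoring after the gate reflects only the one surfaced violation.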
No actionable comments were generated in the recent review. 🎉
ℹ️ Recent review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
📝 Walkthrough
Evaluators now return RawCheckResult objects containing violations and word_count instead of computed score fields. Scoring responsibility (calculateCheckScore) moved to the orchestrator, which computes severity, message, and final_score from RawCheckResult and attaches scored details to reporting outputs. Types and tests updated accordingly.
Sequence Diagram(s)
sequenceDiagram
actor Client
participant Orchestrator
participant Evaluator
participant Scorer
Client->>Orchestrator: Request evaluation
Orchestrator->>Evaluator: runCheckEvaluation(content)
Evaluator->>Evaluator: analyze content -> produce violations, word_count
Evaluator-->>Orchestrator: RawCheckResult(violations, word_count)
alt violations present
Orchestrator->>Scorer: calculateCheckScore(violations, word_count, options)
Scorer-->>Orchestrator: scoredResult(severity, message, final_score)
else no violations
Orchestrator->>Scorer: calculateCheckScore([], word_count, options)
Scorer-->>Orchestrator: scoredResult(default severity/message/final_score)
end
Orchestrator->>Orchestrator: attach scoredResult to Quality Scores output
Orchestrator-->>Client: Final evaluation (Judge or scored Check result)
Estimated code review effort
🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
🧹 Nitpick comments (2)
src/prompts/schema.ts (1)
370-380: Redundant type guards with misleading naming.
`isCheckResult` and `isRawCheckResult` are identical implementations. Additionally, `isCheckResult` narrowing to `RawCheckResult` is semantically confusing since `CheckResult` is a distinct type (used by `calculateCheckScore`).
Consider either:
- Remove `isCheckResult` and keep only `isRawCheckResult` (update call sites), or
- Rename `isCheckResult` to `isRawCheckResult` and remove the duplicate
♻️ Proposed fix: remove duplicate and clarify naming
```diff
-export function isCheckResult(
-  result: PromptEvaluationResult
-): result is RawCheckResult {
-  return result.type === EvaluationType.CHECK;
-}
-
 export function isRawCheckResult(
   result: PromptEvaluationResult
 ): result is RawCheckResult {
   return result.type === EvaluationType.CHECK;
 }
```
Then update the single call site in `orchestrator.ts` to use `isRawCheckResult` or simply `!isJudgeResult(result)`.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/prompts/schema.ts` around lines 370 - 380, The two identical type guards isCheckResult and isRawCheckResult cause confusion; remove isCheckResult and keep a single clearly named guard isRawCheckResult (which narrows PromptEvaluationResult to RawCheckResult), then update all call sites (e.g., in orchestrator.ts) to use isRawCheckResult or !isJudgeResult(result) where appropriate; ensure references to calculateCheckScore and the distinct CheckResult type remain unchanged and that no usages expect isCheckResult to exist.
src/evaluators/accuracy-evaluator.ts (1)
63-73: Outdated comment references removed fields.
The comment mentions "empty items array, perfect score" but `RawCheckResult` no longer has `items` or a precomputed score. The code is correct; only the comment needs updating.
📝 Proposed fix
```diff
-  // If no claims found, return success (empty items array, perfect score)
-  // Use the scoring module to calculate result
+  // If no claims found, return success (empty violations)
   if (claims.length === 0) {
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/evaluators/accuracy-evaluator.ts` around lines 63 - 73, Update the outdated comment above the early-return branch that checks if (claims.length === 0): replace "empty items array, perfect score" with a short accurate description of current behavior (e.g., "no claims found — return RawCheckResult with no violations and word_count set"), and ensure references to countWords, RawCheckResult, EvaluationType.CHECK and optional claimUsage remain correct so the comment matches the code’s actual returned shape.
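For reference, the early-return branch this comment describes might look roughly like the sketch below. The enum values, the `countWords` implementation, and the `evaluateClaims` wrapper are assumptions made for illustration; only the names `RawCheckResult`, `EvaluationType.CHECK`, `countWords`, `violations`, and `word_count` come from the review text.

```typescript
// Hypothetical stand-ins for the names mentioned in the review comment.
enum EvaluationType {
  CHECK = "check",
  JUDGE = "judge",
}

interface RawCheckResult {
  type: EvaluationType.CHECK;
  violations: { message: string }[];
  word_count: number;
}

// Assumed whitespace-based counter; the project's countWords may differ.
function countWords(content: string): number {
  const trimmed = content.trim();
  return trimmed === "" ? 0 : trimmed.split(/\s+/).length;
}

// Hypothetical wrapper around the branch discussed above.
function evaluateClaims(content: string, claims: string[]): RawCheckResult {
  // If no claims found, return success (empty violations, no score fields)
  if (claims.length === 0) {
    return {
      type: EvaluationType.CHECK,
      violations: [],
      word_count: countWords(content),
    };
  }
  // Placeholder for claim verification, which is out of scope for this sketch:
  // every claim is simply reported back as an unverified violation.
  return {
    type: EvaluationType.CHECK,
    violations: claims.map((c) => ({ message: "unverified claim: " + c })),
    word_count: countWords(content),
  };
}
```

The point of the sketch is that the result carries no score at all; the orchestrator is expected to compute one later from `violations` and `word_count`.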
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: f9a53b24-b057-4261-98d8-be53dacbb838
📒 Files selected for processing (6)
src/cli/orchestrator.ts
src/evaluators/accuracy-evaluator.ts
src/evaluators/base-evaluator.ts
src/prompts/schema.ts
tests/orchestrator-filtering.test.ts
tests/scoring-types.test.ts
- promptSeverity !== undefined now wins unconditionally; previously only Severity.ERROR was treated as authoritative, so an explicit 'severity: warning' in rule frontmatter was silently overridden by DefaultSeverity=error from config
- Remove unused isRawCheckResult type guard (duplicate of isCheckResult)
- Remove | string from CheckScoringOptions.promptSeverity; callers pass meta.severity which is Zod-validated as nativeEnum(Severity) at the config boundary, so the string widening was unnecessary
- Drop the as-cast that was required to compensate
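The severity precedence change and the tightened option type from these two commits can be illustrated with a small sketch. The enum values, the `defaultSeverity` field, and both resolver functions are hypothetical reconstructions from the commit messages, not the project's actual code.

```typescript
// Assumed enum values; the project defines its own Severity enum.
enum Severity {
  WARNING = "warning",
  ERROR = "error",
}

// Tightened option type: promptSeverity is Severity only, with no `| string`.
// defaultSeverity is a hypothetical field standing in for DefaultSeverity from config.
interface CheckScoringOptions {
  promptSeverity?: Severity;
  defaultSeverity: Severity;
}

// Reconstruction of the old behavior as described: only an explicit ERROR
// from the rule was authoritative; anything else fell back to the config default.
function resolveSeverityBefore(options: CheckScoringOptions): Severity {
  return options.promptSeverity === Severity.ERROR
    ? Severity.ERROR
    : options.defaultSeverity;
}

// New behavior as described: any explicitly set rule severity wins.
function resolveSeverityAfter(options: CheckScoringOptions): Severity {
  return options.promptSeverity !== undefined
    ? options.promptSeverity
    : options.defaultSeverity;
}

// A rule that declares `severity: warning` under DefaultSeverity=error.
const ruleWithWarning: CheckScoringOptions = {
  promptSeverity: Severity.WARNING,
  defaultSeverity: Severity.ERROR,
};

const severityBefore = resolveSeverityBefore(ruleWithWarning);
const severityAfter = resolveSeverityAfter(ruleWithWarning);
```

Because `promptSeverity` is typed as `Severity` rather than `Severity | string`, the resolver can return it directly without a cast, which is the compensation the second commit removes.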
For `check` type rules, the quality score was calculated from all violations the LLM returned, before confidence and gate filtering ran. A file could show 0 displayed issues even though it received a penalized score, because filtered-out candidates still contributed to the density calculation.
What
- `check` rule scores now reflect only violations that are surfaced to the user
- `RawCheckResult` type introduced: evaluators return violations + word count without a pre-baked score
- `calculateCheckScore` moved to the orchestrator, called once after `getViolationFilterResults`
- `severity: warning` in rule frontmatter now wins over `DefaultSeverity=error` from config (previously only `severity: error` was treated as authoritative)
- `promptSeverity` type tightened in `CheckScoringOptions`: removed `| string` widening that was unnecessary given Zod validation at the config boundary
Scope
In scope
- `check` type rules
- `BaseEvaluator`, `TechnicalAccuracyEvaluator`, and orchestrator wiring
Out of scope
- `judge` type rules (unaffected: score is a direct LLM rating, not density-based)
Behavior impact
- A file with no surfaced violations now scores `10.0/10` instead of a penalized score
- Rules with `severity: warning` in frontmatter now correctly override `DefaultSeverity=error` from config
Risk
- `calculateCheckScore` is a pure function; only its call site and inputs changed. The `judge` path is untouched.
- `CheckResult` remains in `schema.ts` for any external consumers; `PromptEvaluationResult` now uses `RawCheckResult` internally.
- With `DefaultSeverity=error` in config and `severity: warning` in rule frontmatter, violations from those rules will now surface as warnings rather than errors.
How to test/verify
Checks run
Manual verification
Run a `check` type rule against a file where you expect the LLM to return some violations that fail the confidence gate. Before this change the score would be lower than what the displayed issue count implies. After this change, the score should match only what's shown.
Rollback
Revert commits `53fcdc5`, `45e6905`, `a287fb8` in order. No schema migrations or config changes involved.