Skip to content

Only score based on surfaced violations only#69

Merged
oshorefueled merged 3 commits intomainfrom
fix/penalize-score
Mar 4, 2026
Merged

Only score based on surfaced violations only#69
oshorefueled merged 3 commits intomainfrom
fix/penalize-score

Conversation

@oshorefueled
Copy link
Contributor

@oshorefueled oshorefueled commented Mar 4, 2026

For check type rules, the quality score was calculated from all violations the LLM returned, before confidence and gate filtering ran. A file could show 0 displayed issues even though it received a penalized score, because filtered-out candidates still contributed to the density calculation.

What

  • check rule scores now reflect only violations that are surfaced to the user
  • RawCheckResult type introduced: evaluators return violations + word count without a pre-baked score
  • calculateCheckScore moved to the orchestrator, called once after getViolationFilterResults.
  • Scorer priority fixed: explicit severity: warning in rule frontmatter now wins over DefaultSeverity=error from config (previously only severity: error was treated as authoritative)
  • promptSeverity type tightened in CheckScoringOptions — removed | string widening that was unnecessary given Zod validation at the config boundary

Scope

In scope

  • Score calculation for check type rules
  • BaseEvaluator, TechnicalAccuracyEvaluator, and orchestrator wiring
  • Scorer severity priority logic

Out of scope

  • judge type rules (unaffected — score is a direct LLM rating, not density-based)
  • Filtering logic itself (unchanged)

Behavior impact

  • Files with all violations filtered will now show 10.0/10 instead of a penalized score
  • Files with a mix of surfaced and filtered violations will show a higher score than before
  • Rules with explicit severity: warning in frontmatter now correctly override DefaultSeverity=error from config
  • No change to which violations are displayed — only the score value and severity assignment

Risk

  • Low. calculateCheckScore is a pure function; only its call site and inputs changed. The judge path is untouched.
  • CheckResult remains in schema.ts for any external consumers; PromptEvaluationResult now uses RawCheckResult internally.
  • The severity priority fix changes behavior for users with both DefaultSeverity=error in config and severity: warning in rule frontmatter — violations from those rules will now surface as warnings rather than errors.

How to test/verify

Checks run

npm run test:ci   # 220/220 passed
npm run build     # clean
npm run lint      # clean

Manual verification

Run a check type rule against a file where you expect the LLM to return some violations that fail the confidence gate. Before this change the score would be lower than what the displayed issue count implies. After this change, the score should match only what's shown.

Rollback

Revert commits 53fcdc5, 45e6905, a287fb8 in order. No schema migrations or config changes involved.

Summary by CodeRabbit

  • Refactor
    • Moved scoring computation to a central step so check results now deliver raw violations and word count; scores and severity are derived later for consistent reporting and messaging.
  • Tests
    • Updated and added tests to reflect the new result shape and to ensure displayed scores reflect only surfaced (unfiltered) violations.

- Introduce RawCheckResult type: violations + word_count, no score fields
- Evaluators (base, accuracy) return RawCheckResult; scoring deferred
- Orchestrator calls calculateCheckScore after gate filtering, so only
  violations surfaced to the user contribute to the density penalty
- Add regression test: 1 filtered violation must not move score from
  9.0 to 8.0
@coderabbitai
Copy link

coderabbitai bot commented Mar 4, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 55774477-1924-4c78-9c0b-6698b2621934

📥 Commits

Reviewing files that changed from the base of the PR and between 53fcdc5 and a287fb8.

📒 Files selected for processing (2)
  • src/prompts/schema.ts
  • src/scoring/scorer.ts

📝 Walkthrough

Walkthrough

Evaluators now return RawCheckResult objects containing violations and word_count instead of computed score fields. Scoring responsibility (calculateCheckScore) moved to the orchestrator, which computes severity, message, and final_score from RawCheckResult and attaches scored details to reporting outputs. Types and tests updated accordingly.

Changes

Cohort / File(s) Summary
Orchestrator Scoring Integration
src/cli/orchestrator.ts
Import and use calculateCheckScore to score RawCheckResult for non-judge evaluations; replace direct uses of result.severity, result.message, and result.final_score with scored equivalents; update score-entry construction and Quality Scores output.
Evaluator Implementations
src/evaluators/base-evaluator.ts, src/evaluators/accuracy-evaluator.ts
Evaluators return RawCheckResult (type, violations, word_count, optional reasoning/usage/raw_model_output) instead of computing or returning score fields; removed evaluator-side scoring calls.
Prompt Schema / Types
src/prompts/schema.ts
Added RawCheckResult type; updated PromptEvaluationResult union to `JudgeResult
Scoring Logic
src/scoring/scorer.ts
Narrowed CheckScoringOptions.promptSeverity type to only allow `Severity.WARNING
Tests & Fixtures
tests/orchestrator-filtering.test.ts, tests/scoring-types.test.ts
Updated test helpers and assertions to produce/expect RawCheckResult shape (violations, word_count) instead of score fields; added test ensuring scoring uses only surfaced violations.

Sequence Diagram(s)

sequenceDiagram
    actor Client
    participant Orchestrator
    participant Evaluator
    participant Scorer

    Client->>Orchestrator: Request evaluation
    Orchestrator->>Evaluator: runCheckEvaluation(content)
    Evaluator->>Evaluator: analyze content -> produce violations, word_count
    Evaluator-->>Orchestrator: RawCheckResult(violations, word_count)

    alt violations present
        Orchestrator->>Scorer: calculateCheckScore(violations, word_count, options)
        Scorer-->>Orchestrator: scoredResult(severity, message, final_score)
    else no violations
        Orchestrator->>Scorer: calculateCheckScore([], word_count, options)
        Scorer-->>Orchestrator: scoredResult(default severity/message/final_score)
    end

    Orchestrator->>Orchestrator: attach scoredResult to Quality Scores output
    Orchestrator-->>Client: Final evaluation (Judge or scored Check result)
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Poem

🐰 Hops of bytes and scores in flight,

Violations whispered, raw and bright.
Orchestrator tallies, then sings the score,
Evaluators leave the counting for more.
A rabbit cheers — cleaner paths galore! 🥕✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title directly describes the main change: scoring should reflect only surfaced violations, not pre-filtered ones, which is the core intent of the PR.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
  • 📝 Generate docstrings (stacked PR)
  • 📝 Generate docstrings (commit on current branch)
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch fix/penalize-score

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (2)
src/prompts/schema.ts (1)

370-380: Redundant type guards with misleading naming.

isCheckResult and isRawCheckResult are identical implementations. Additionally, isCheckResult narrowing to RawCheckResult is semantically confusing since CheckResult is a distinct type (used by calculateCheckScore).

Consider either:

  1. Remove isCheckResult and keep only isRawCheckResult (update call sites), or
  2. Rename isCheckResult to isRawCheckResult and remove the duplicate
♻️ Proposed fix: remove duplicate and clarify naming
-export function isCheckResult(
-  result: PromptEvaluationResult
-): result is RawCheckResult {
-  return result.type === EvaluationType.CHECK;
-}
-
 export function isRawCheckResult(
   result: PromptEvaluationResult
 ): result is RawCheckResult {
   return result.type === EvaluationType.CHECK;
 }

Then update the single call site in orchestrator.ts to use isRawCheckResult or simply !isJudgeResult(result).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/prompts/schema.ts` around lines 370 - 380, The two identical type guards
isCheckResult and isRawCheckResult cause confusion; remove isCheckResult and
keep a single clearly named guard isRawCheckResult (which narrows
PromptEvaluationResult to RawCheckResult), then update all call sites (e.g., in
orchestrator.ts) to use isRawCheckResult or !isJudgeResult(result) where
appropriate; ensure references to calculateCheckScore and the distinct
CheckResult type remain unchanged and that no usages expect isCheckResult to
exist.
src/evaluators/accuracy-evaluator.ts (1)

63-73: Outdated comment references removed fields.

The comment mentions "empty items array, perfect score" but RawCheckResult no longer has items or a precomputed score. The code is correct; only the comment needs updating.

📝 Proposed fix
-    // If no claims found, return success (empty items array, perfect score)
-    // Use the scoring module to calculate result
+    // If no claims found, return success (empty violations)
     if (claims.length === 0) {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/evaluators/accuracy-evaluator.ts` around lines 63 - 73, Update the
outdated comment above the early-return branch that checks if (claims.length ===
0): replace "empty items array, perfect score" with a short accurate description
of current behavior (e.g., "no claims found — return RawCheckResult with no
violations and word_count set"), and ensure references to countWords,
RawCheckResult, EvaluationType.CHECK and optional claimUsage remain correct so
the comment matches the code’s actual returned shape.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@src/evaluators/accuracy-evaluator.ts`:
- Around line 63-73: Update the outdated comment above the early-return branch
that checks if (claims.length === 0): replace "empty items array, perfect score"
with a short accurate description of current behavior (e.g., "no claims found —
return RawCheckResult with no violations and word_count set"), and ensure
references to countWords, RawCheckResult, EvaluationType.CHECK and optional
claimUsage remain correct so the comment matches the code’s actual returned
shape.

In `@src/prompts/schema.ts`:
- Around line 370-380: The two identical type guards isCheckResult and
isRawCheckResult cause confusion; remove isCheckResult and keep a single clearly
named guard isRawCheckResult (which narrows PromptEvaluationResult to
RawCheckResult), then update all call sites (e.g., in orchestrator.ts) to use
isRawCheckResult or !isJudgeResult(result) where appropriate; ensure references
to calculateCheckScore and the distinct CheckResult type remain unchanged and
that no usages expect isCheckResult to exist.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f9a53b24-b057-4261-98d8-be53dacbb838

📥 Commits

Reviewing files that changed from the base of the PR and between bb566b3 and 53fcdc5.

📒 Files selected for processing (6)
  • src/cli/orchestrator.ts
  • src/evaluators/accuracy-evaluator.ts
  • src/evaluators/base-evaluator.ts
  • src/prompts/schema.ts
  • tests/orchestrator-filtering.test.ts
  • tests/scoring-types.test.ts

- promptSeverity !== undefined now wins unconditionally; previously only
  Severity.ERROR was treated as authoritative, so an explicit
  'severity: warning' in rule frontmatter was silently overridden by
  DefaultSeverity=error from config
- Remove unused isRawCheckResult type guard (duplicate of isCheckResult)
- Remove | string from CheckScoringOptions.promptSeverity; callers pass
  meta.severity which is Zod-validated as nativeEnum(Severity) at the
  config boundary, so the string widening was unnecessary
- Drop the as-cast that was required to compensate
@oshorefueled oshorefueled merged commit 581751f into main Mar 4, 2026
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant