Only score based on surfaced violations only by oshorefueled · Pull Request #69 · TRocket-Labs/vectorlint

oshorefueled · 2026-03-04T13:25:20Z

For check type rules, the quality score was calculated from all violations the LLM returned, before confidence and gate filtering ran. A file could show 0 displayed issues even though it received a penalized score, because filtered-out candidates still contributed to the density calculation.

What

check rule scores now reflect only violations that are surfaced to the user
RawCheckResult type introduced: evaluators return violations + word count without a pre-baked score
calculateCheckScore moved to the orchestrator, called once after getViolationFilterResults.
Scorer priority fixed: explicit severity: warning in rule frontmatter now wins over DefaultSeverity=error from config (previously only severity: error was treated as authoritative)
promptSeverity type tightened in CheckScoringOptions — removed | string widening that was unnecessary given Zod validation at the config boundary

Scope

In scope

Score calculation for check type rules
BaseEvaluator, TechnicalAccuracyEvaluator, and orchestrator wiring
Scorer severity priority logic

Out of scope

judge type rules (unaffected — score is a direct LLM rating, not density-based)
Filtering logic itself (unchanged)

Behavior impact

Files with all violations filtered will now show 10.0/10 instead of a penalized score
Files with a mix of surfaced and filtered violations will show a higher score than before
Rules with explicit severity: warning in frontmatter now correctly override DefaultSeverity=error from config
No change to which violations are displayed — only the score value and severity assignment

Risk

Low. calculateCheckScore is a pure function; only its call site and inputs changed. The judge path is untouched.
CheckResult remains in schema.ts for any external consumers; PromptEvaluationResult now uses RawCheckResult internally.
The severity priority fix changes behavior for users with both DefaultSeverity=error in config and severity: warning in rule frontmatter — violations from those rules will now surface as warnings rather than errors.

How to test/verify

Checks run

npm run test:ci   # 220/220 passed
npm run build     # clean
npm run lint      # clean

Manual verification

Run a check type rule against a file where you expect the LLM to return some violations that fail the confidence gate. Before this change the score would be lower than what the displayed issue count implies. After this change, the score should match only what's shown.

Rollback

Revert commits 53fcdc5, 45e6905, a287fb8 in order. No schema migrations or config changes involved.

Summary by CodeRabbit

Refactor
- Moved scoring computation to a central step so check results now deliver raw violations and word count; scores and severity are derived later for consistent reporting and messaging.
Tests
- Updated and added tests to reflect the new result shape and to ensure displayed scores reflect only surfaced (unfiltered) violations.

- Introduce RawCheckResult type: violations + word_count, no score fields - Evaluators (base, accuracy) return RawCheckResult; scoring deferred - Orchestrator calls calculateCheckScore after gate filtering, so only violations surfaced to the user contribute to the density penalty - Add regression test: 1 filtered violation must not move score from 9.0 to 8.0

coderabbitai · 2026-03-04T13:25:40Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 55774477-1924-4c78-9c0b-6698b2621934

📥 Commits

Reviewing files that changed from the base of the PR and between 53fcdc5 and a287fb8.

📒 Files selected for processing (2)

src/prompts/schema.ts
src/scoring/scorer.ts

📝 Walkthrough

Walkthrough

Evaluators now return RawCheckResult objects containing violations and word_count instead of computed score fields. Scoring responsibility (calculateCheckScore) moved to the orchestrator, which computes severity, message, and final_score from RawCheckResult and attaches scored details to reporting outputs. Types and tests updated accordingly.

Changes

Cohort / File(s)	Summary
Orchestrator Scoring Integration `src/cli/orchestrator.ts`	Import and use `calculateCheckScore` to score RawCheckResult for non-judge evaluations; replace direct uses of `result.severity`, `result.message`, and `result.final_score` with scored equivalents; update score-entry construction and Quality Scores output.
Evaluator Implementations `src/evaluators/base-evaluator.ts`, `src/evaluators/accuracy-evaluator.ts`	Evaluators return `RawCheckResult` (type, violations, word_count, optional reasoning/usage/raw_model_output) instead of computing or returning score fields; removed evaluator-side scoring calls.
Prompt Schema / Types `src/prompts/schema.ts`	Added `RawCheckResult` type; updated `PromptEvaluationResult` union to `JudgeResult
Scoring Logic `src/scoring/scorer.ts`	Narrowed `CheckScoringOptions.promptSeverity` type to only allow `Severity.WARNING
Tests & Fixtures `tests/orchestrator-filtering.test.ts`, `tests/scoring-types.test.ts`	Updated test helpers and assertions to produce/expect `RawCheckResult` shape (`violations`, `word_count`) instead of score fields; added test ensuring scoring uses only surfaced violations.

Sequence Diagram(s)

sequenceDiagram
    actor Client
    participant Orchestrator
    participant Evaluator
    participant Scorer

    Client->>Orchestrator: Request evaluation
    Orchestrator->>Evaluator: runCheckEvaluation(content)
    Evaluator->>Evaluator: analyze content -> produce violations, word_count
    Evaluator-->>Orchestrator: RawCheckResult(violations, word_count)

    alt violations present
        Orchestrator->>Scorer: calculateCheckScore(violations, word_count, options)
        Scorer-->>Orchestrator: scoredResult(severity, message, final_score)
    else no violations
        Orchestrator->>Scorer: calculateCheckScore([], word_count, options)
        Scorer-->>Orchestrator: scoredResult(default severity/message/final_score)
    end

    Orchestrator->>Orchestrator: attach scoredResult to Quality Scores output
    Orchestrator-->>Client: Final evaluation (Judge or scored Check result)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

refactor: standardize evaluation terminology to Check and Judge #52: Overlaps orchestrator/evaluator scoring integration and use of calculateCheckScore.
Rename rule types from semi-objective/subjective to check/judge #49: Related schema and evaluator changes (evaluation-type handling and result shape adjustments).
feat: implement content chunking and dedicated scoring logic for evaluators #39: Related refactor of evaluator scoring responsibilities and base-evaluator changes.

Poem

🐰 Hops of bytes and scores in flight,

Violations whispered, raw and bright.
Orchestrator tallies, then sings the score,
Evaluators leave the counting for more.
A rabbit cheers — cleaner paths galore! 🥕✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title directly describes the main change: scoring should reflect only surfaced violations, not pre-filtered ones, which is the core intent of the PR.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings (stacked PR)
📝 Generate docstrings (commit on current branch)

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch fix/penalize-score

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

🧹 Nitpick comments (2)

src/prompts/schema.ts (1)
370-380: Redundant type guards with misleading naming.

isCheckResult and isRawCheckResult are identical implementations. Additionally, isCheckResult narrowing to RawCheckResult is semantically confusing since CheckResult is a distinct type (used by calculateCheckScore).

Consider either:

Remove isCheckResult and keep only isRawCheckResult (update call sites), or

Rename isCheckResult to isRawCheckResult and remove the duplicate
♻️ Proposed fix: remove duplicate and clarify naming
-export function isCheckResult(
-  result: PromptEvaluationResult
-): result is RawCheckResult {
-  return result.type === EvaluationType.CHECK;
-}
-
 export function isRawCheckResult(
   result: PromptEvaluationResult
 ): result is RawCheckResult {
   return result.type === EvaluationType.CHECK;
 }
Then update the single call site in orchestrator.ts to use isRawCheckResult or simply !isJudgeResult(result).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/prompts/schema.ts` around lines 370 - 380, The two identical type guards
isCheckResult and isRawCheckResult cause confusion; remove isCheckResult and
keep a single clearly named guard isRawCheckResult (which narrows
PromptEvaluationResult to RawCheckResult), then update all call sites (e.g., in
orchestrator.ts) to use isRawCheckResult or !isJudgeResult(result) where
appropriate; ensure references to calculateCheckScore and the distinct
CheckResult type remain unchanged and that no usages expect isCheckResult to
exist.
src/evaluators/accuracy-evaluator.ts (1)
63-73: Outdated comment references removed fields.

The comment mentions "empty items array, perfect score" but RawCheckResult no longer has items or a precomputed score. The code is correct; only the comment needs updating.
📝 Proposed fix
-    // If no claims found, return success (empty items array, perfect score)
-    // Use the scoring module to calculate result
+    // If no claims found, return success (empty violations)
     if (claims.length === 0) {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/evaluators/accuracy-evaluator.ts` around lines 63 - 73, Update the
outdated comment above the early-return branch that checks if (claims.length ===
0): replace "empty items array, perfect score" with a short accurate description
of current behavior (e.g., "no claims found — return RawCheckResult with no
violations and word_count set"), and ensure references to countWords,
RawCheckResult, EvaluationType.CHECK and optional claimUsage remain correct so
the comment matches the code’s actual returned shape.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@src/evaluators/accuracy-evaluator.ts`:
- Around line 63-73: Update the outdated comment above the early-return branch
that checks if (claims.length === 0): replace "empty items array, perfect score"
with a short accurate description of current behavior (e.g., "no claims found —
return RawCheckResult with no violations and word_count set"), and ensure
references to countWords, RawCheckResult, EvaluationType.CHECK and optional
claimUsage remain correct so the comment matches the code’s actual returned
shape.

In `@src/prompts/schema.ts`:
- Around line 370-380: The two identical type guards isCheckResult and
isRawCheckResult cause confusion; remove isCheckResult and keep a single clearly
named guard isRawCheckResult (which narrows PromptEvaluationResult to
RawCheckResult), then update all call sites (e.g., in orchestrator.ts) to use
isRawCheckResult or !isJudgeResult(result) where appropriate; ensure references
to calculateCheckScore and the distinct CheckResult type remain unchanged and
that no usages expect isCheckResult to exist.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f9a53b24-b057-4261-98d8-be53dacbb838

📥 Commits

Reviewing files that changed from the base of the PR and between bb566b3 and 53fcdc5.

📒 Files selected for processing (6)

src/cli/orchestrator.ts
src/evaluators/accuracy-evaluator.ts
src/evaluators/base-evaluator.ts
src/prompts/schema.ts
tests/orchestrator-filtering.test.ts
tests/scoring-types.test.ts

- promptSeverity !== undefined now wins unconditionally; previously only Severity.ERROR was treated as authoritative, so an explicit 'severity: warning' in rule frontmatter was silently overridden by DefaultSeverity=error from config - Remove unused isRawCheckResult type guard (duplicate of isCheckResult)

- Remove | string from CheckScoringOptions.promptSeverity; callers pass meta.severity which is Zod-validated as nativeEnum(Severity) at the config boundary, so the string widening was unnecessary - Drop the as-cast that was required to compensate

coderabbitai bot reviewed Mar 4, 2026

View reviewed changes

oshorefueled added 2 commits March 4, 2026 14:41

oshorefueled merged commit 581751f into main Mar 4, 2026
3 checks passed

coderabbitai bot mentioned this pull request Mar 4, 2026

Remove zero-violation JSON noise and restore concise LLM guidance #70

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Only score based on surfaced violations only#69

Only score based on surfaced violations only#69
oshorefueled merged 3 commits intomainfrom
fix/penalize-score

oshorefueled commented Mar 4, 2026 •

edited by coderabbitai bot

Loading

Uh oh!

coderabbitai bot commented Mar 4, 2026 •

edited

Loading

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai bot left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

oshorefueled commented Mar 4, 2026 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What

Scope

Behavior impact

Risk

How to test/verify

Rollback

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Mar 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

oshorefueled commented Mar 4, 2026 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Mar 4, 2026 •

edited

Loading