Conversation
Implements withRetry function for handling transient LLM API failures with configurable retry attempts and detailed logging.
Changes:
- Add src/evaluators/retry.ts with withRetry function
- Accepts operation, maxRetries (default 3), and context string
- Logs each retry attempt with [vectorlint] prefix for debugging
- Returns RetryResult<T> with data and attempt count
- Export from src/evaluators/index.ts for use by phase runners
- Add comprehensive unit tests (9 tests, all passing)
- Include Property 5 test for retry behavior verification
Purpose: Provides foundational retry logic for detection and suggestion phases in the two-phase evaluation architecture.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
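The commit above describes the helper's contract: an operation, a maxRetries default of 3, a context string, a `RetryResult<T>` return, and `[vectorlint]`-prefixed logging. A minimal sketch of that contract follows; the log wording and the default `lastError` are assumptions, and the actual src/evaluators/retry.ts may differ:

```typescript
// Sketch only: RetryResult shape and messages are assumed from the commit text.
interface RetryResult<T> {
  data: T;
  attempts: number;
}

async function withRetry<T>(
  operation: () => Promise<T>,
  maxRetries = 3,
  context = "operation"
): Promise<RetryResult<T>> {
  // Default the error so a maxRetries of 0 still throws something descriptive
  let lastError: unknown = new Error(`[vectorlint] ${context}: no attempts made`);
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      return { data: await operation(), attempts: attempt };
    } catch (e: unknown) {
      lastError = e;
      console.log(`[vectorlint] ${context}: attempt ${attempt}/${maxRetries} failed`);
    }
  }
  throw lastError;
}
```

A caller would wrap an LLM request, e.g. `withRetry(() => provider.runPromptUnstructured(content, prompt), 3, "detection")`, and read `.data` and `.attempts` off the result.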
- Add runPromptUnstructured method to LLMProvider interface
- Implement runPromptUnstructured in OpenAI, Anthropic, Azure OpenAI, and Gemini providers
- Returns raw text response instead of structured JSON
- Includes debug logging, error handling, and token usage tracking
- Add unit tests for unstructured provider methods
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
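The interface extension described above can be sketched as follows. The parameter lists and the `LLMResult` shape are assumptions based on the commit description and the review's later references to `LLMResult<string>`; the `FakeProvider` mirrors the test-double pattern the repo's tests already use, but its body is invented:

```typescript
// Assumed shapes; only the method and type names come from the commit text.
interface LLMResult<T> {
  data: T;
  usage: { inputTokens: number; outputTokens: number };
}

interface LLMProvider {
  runPromptStructured<T>(content: string, promptText: string, schema: object): Promise<LLMResult<T>>;
  // New method: returns the raw text response instead of structured JSON
  runPromptUnstructured(content: string, promptText: string): Promise<LLMResult<string>>;
}

// Minimal fake provider for unit tests that must not hit the network
class FakeProvider implements LLMProvider {
  async runPromptStructured<T>(_content: string, _promptText: string, _schema: object): Promise<LLMResult<T>> {
    return { data: {} as T, usage: { inputTokens: 0, outputTokens: 0 } };
  }
  async runPromptUnstructured(content: string, _promptText: string): Promise<LLMResult<string>> {
    return { data: `echo: ${content}`, usage: { inputTokens: 1, outputTokens: 1 } };
  }
}
```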
Add detection-phase key to src/evaluators/prompts.json with:
- Guided format instructions for output with markdown example
- Issue N output format (Issue 1, Issue 2, etc.) as section headers
- Required fields: quotedText, contextBefore, contextAfter, line, criterionName, analysis
- {criteria} placeholder for dynamic criteria insertion
- Guidelines for issue ordering, context requirements, and "No issues found" handling
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add DetectionPhaseRunner class as the first phase of the two-phase detection/suggestion architecture. The detection phase identifies issues in content based on evaluation criteria using an unstructured LLM prompt.
Changes:
- Create src/evaluators/detection-phase.ts with DetectionPhaseRunner
- Implement run(content, criteria, options): Promise<DetectionResult>
- Build detection prompt with criteria from PromptFile via getPrompt()
- Use runPromptUnstructured for LLM call to get free-form text response
- Integrate retry logic from retry.ts for transient failure handling
- Include basic markdown parser for Issue N sections
- Define types: RawDetectionIssue, DetectionResult, DetectionPhaseOptions
- Export from src/evaluators/index.ts
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…tests
- Fix analysis regex bug in DetectionPhaseRunner.parseIssueSection()
- Add comprehensive test suite with 17 tests including Property 2
- Parser correctly extracts all required fields from markdown responses
- Gracefully handles malformed sections by skipping them
- All tests pass
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
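The parser these two commits describe splits the LLM's free-form response on `## Issue N` headers and extracts labeled fields, skipping malformed sections. A self-contained sketch of that idea, with regex details and field handling that are assumptions rather than the actual `parseIssueSection()` implementation:

```typescript
// Sketch of an "## Issue N" section parser; regexes are illustrative, not the real ones.
interface RawDetectionIssue {
  quotedText: string;
  line: number;
  criterionName: string;
  analysis: string;
}

function parseDetectionResponse(text: string): RawDetectionIssue[] {
  const sections = text.split(/^## Issue \d+$/m).slice(1);
  const issues: RawDetectionIssue[] = [];
  for (const section of sections) {
    const quoted = /\*\*quotedText:\*\*\s*\n> (.+)/.exec(section);
    const line = /\*\*line:\*\* (\d+)/.exec(section);
    const criterion = /\*\*criterionName:\*\* (.+)/.exec(section);
    const analysis = /\*\*analysis:\*\*\s*\n([\s\S]+?)(?=\n\*\*|$)/.exec(section);
    // Skip malformed sections instead of throwing (matches the commit's behavior)
    if (!quoted || !line || !criterion || !analysis) continue;
    issues.push({
      quotedText: quoted[1],
      line: Number(line[1]),
      criterionName: criterion[1].trim(),
      analysis: analysis[1].trim(),
    });
  }
  return issues;
}
```

A "No issues found." response produces no `## Issue N` sections, so the function returns an empty array without special-casing.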
Add suggestion-phase prompt template to src/evaluators/prompts.json:
- Universal template for all rule types (check, judge, style guide)
- Placeholders for {content}, {issues}, and {criteria}
- Output format with ## Issue N sections matching detection phase
- Each suggestion includes actionable fix and explanation
- Instructs exactly one suggestion per issue
- Guidelines for preserving voice/tone and avoiding over-editing
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add buildSuggestionLLMSchema() to src/prompts/schema.ts
- Schema defines array of {issueIndex, suggestion, explanation}
- Strict mode enabled for structured output validation
- Add SuggestionLLMResult TypeScript type
- Schema name: vectorlint_suggestion_result
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
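From the field list and schema name in this commit, the builder can be sketched roughly as below. The wrapper shape around `schema` (the `name`/`strict`/`schema` keys) is an assumption modeled on common structured-output schema envelopes, not confirmed by the source:

```typescript
// Illustrative sketch; only the schema name, strict mode, and field names come from the commit.
function buildSuggestionLLMSchema() {
  return {
    name: "vectorlint_suggestion_result",
    strict: true,
    schema: {
      type: "object",
      properties: {
        suggestions: {
          type: "array",
          items: {
            type: "object",
            properties: {
              issueIndex: { type: "number" },
              suggestion: { type: "string" },
              explanation: { type: "string" },
            },
            required: ["issueIndex", "suggestion", "explanation"],
            additionalProperties: false,
          },
        },
      },
      required: ["suggestions"],
      additionalProperties: false,
    },
  };
}
```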
- Create SuggestionPhaseRunner class in src/evaluators/suggestion-phase.ts
- Implement run(content, issues, criteria) method using runPromptStructured
- Add buildPrompt method to format content, issues, and criteria
- Add formatIssues method to convert RawDetectionIssue to markdown
- Integrate withRetry logic for transient failure handling
- Export SuggestionPhaseRunner, Suggestion, SuggestionResult, SuggestionPhaseOptions types
- Add comprehensive test suite with 12 tests in tests/suggestion-phase.test.ts
- Property 4 tests verify suggestion-to-issue matching by index
- Update PRD to mark feature as complete
- Update progress.txt with implementation details
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Implemented ResultAssembler class that combines detection and suggestion phase results into final CheckResult/JudgeResult formats.
- assembleCheckResult: Merges detection issues with suggestions for check-style evaluation results
- assembleJudgeResult: Groups violations by criterion for judge-style evaluation results
- aggregateTokenUsage: Correctly combines token usage from both phases
- Property 6: Result schema conformance - all tests verify output matches CheckResult and JudgeResult schema requirements
- Property 7: Token usage aggregation - tests verify correct summation of input/output tokens from both phases
All 22 tests pass.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
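The token-usage aggregation behavior described above (Property 7) can be sketched as follows; the optional-parameter handling for a missing phase is an assumption drawn from the later test descriptions ("Returns only suggestion usage when detection is undefined"):

```typescript
// Sketch of aggregateTokenUsage; the TokenUsage field names match the review's type references.
interface TokenUsage {
  inputTokens: number;
  outputTokens: number;
}

function aggregateTokenUsage(detection?: TokenUsage, suggestion?: TokenUsage): TokenUsage {
  return {
    inputTokens: (detection?.inputTokens ?? 0) + (suggestion?.inputTokens ?? 0),
    outputTokens: (detection?.outputTokens ?? 0) + (suggestion?.outputTokens ?? 0),
  };
}
```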
- Add DetectionPhaseRunner, SuggestionPhaseRunner, and ResultAssembler to BaseEvaluator
- Rewrite runCheckEvaluation to use detection→suggestion→assembly flow
- Rewrite runJudgeEvaluation to use detection→suggestion→assembly flow
- Pass full document to suggestion phase even when chunking (Property 3)
- Add buildCriteriaString helper for prompt criteria formatting
- Update ResultAssembler to handle string strictness and PromptCriterionSpec
- Add Property 1 and Property 3 tests in base-evaluator-two-phase.test.ts
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Fix azure-openai-provider.test.ts mock syntax and error expectations
- Fix anthropic-provider.test.ts mock variable scope and instanceof checks
- Fix openai-provider.test.ts mock variable scope
- Update scoring-types.test.ts for two-phase detection/suggestion flow:
  - Mock both runPromptUnstructured (detection) and runPromptStructured (suggestion)
  - Fix detection response format to match parser expectations
  - Update score expectations for new density-based scoring formula
  - Add beforeEach hook to clear mocks between tests
All 394 tests pass.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Added "Two-Phase Detection/Suggestion Architecture" section explaining all three phases
- Documented key components table with file locations and purposes
- Listed all 7 property tests that validate the architecture
- Updated LLM Provider Methods subsection with both structured and unstructured calls
- Updated project structure to include new evaluator files
- Marked final PRD task as complete
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Remove unused imports from test files (DetectionResult, ResultAssemblerOptions, RetryResult, DefaultRequestBuilder, JudgeLLMResult, CheckLLMResult)
- Replace unsafe 'any' type assertions with proper type assertions using 'unknown' intermediate type
- Fix variable naming conventions (camelCase to UPPER_CASE for constants)
- Fix async arrow function warnings by removing async keyword where await is not used
- Add eslint-disable comments for unbound-method warnings on vi.mocked() calls
- Add eslint-disable comments for unsafe type assignments that are necessary for test mocking
- Remove unused imports from src/evaluators/base-evaluator.ts (unused schema and scoring functions)
- Fix variable naming in src/evaluators/result-assembler.ts (final_score to finalScore)
All 394 tests still pass. Lint is now clean.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add runtime validation using Zod to ensure LLM responses in the suggestion
phase conform to the expected schema, providing an additional layer of safety
beyond the structured output schema.
Changes:
- Add SUGGESTION_LLM_RESULT_SCHEMA Zod schema in src/prompts/schema.ts
- Validates suggestions array with issueIndex (positive int), suggestion
(non-empty string), and explanation (non-empty string)
- Update src/evaluators/suggestion-phase.ts to use Zod validation
- Import SUGGESTION_LLM_RESULT_SCHEMA
- Parse llmResult.data before using it
- Add 4 test cases in tests/suggestion-phase.test.ts for validation:
- Missing required field
- Wrong type for issueIndex
- Invalid value (zero issueIndex)
- Empty string for suggestion
All 398 tests pass (increased from 394).
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Update catch block from 'catch {' to 'catch (_e: unknown)' with
explanatory comment as per PRD task requirements.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…ng RetryResult wrapper
- Inline buildCheckMessage(), buildCriterionSummary(), buildCriterionReasoning(), normalizeStrictness(), and calculateCriterionScore() in result-assembler.ts
- Remove RetryResult<T> wrapper from retry.ts; withRetry() now returns T directly
- Update detection-phase.ts and suggestion-phase.ts to use new withRetry signature
- Remove @example JSDoc blocks from exported classes (ResultAssembler, DetectionPhaseRunner, SuggestionPhaseRunner)
- Update tests/retry.test.ts to use direct return value instead of RetryResult
Line count reduced from ~955 to 822 lines (-133 lines, ~14% reduction)
All 398 tests pass, build compiles, lint clean.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
📝 Walkthrough

This PR implements a two-phase evaluation architecture that replaces single-pass evaluation. It introduces detection (unstructured LLM calls to identify issues), suggestion (structured LLM calls to propose fixes), result assembly, a retry utility with logging, and extends LLM providers with unstructured prompt support. Comprehensive tests validate the new workflow.

Changes
Sequence Diagram(s)

sequenceDiagram
participant Client as Evaluator (BaseEvaluator)
participant DetectionRunner as DetectionPhaseRunner
participant LLM as LLM Provider
participant SuggestionRunner as SuggestionPhaseRunner
participant Assembler as ResultAssembler
Client->>DetectionRunner: run(content chunk, criteria)
loop For each content chunk
DetectionRunner->>LLM: runPromptUnstructured(detection prompt)
LLM-->>DetectionRunner: raw text response + usage
end
DetectionRunner-->>Client: RawDetectionIssue[], TokenUsage
alt Issues detected
Client->>SuggestionRunner: run(full document, issues, criteria)
SuggestionRunner->>LLM: runPromptStructured(suggestion prompt)
LLM-->>SuggestionRunner: JSON suggestions + usage
SuggestionRunner-->>Client: Suggestion[], TokenUsage
Client->>Assembler: assembleCheckResult/assembleJudgeResult(issues, suggestions)
Assembler->>Assembler: mergeIssuesWithSuggestions()
Assembler->>Assembler: aggregateTokenUsage()
Assembler-->>Client: CheckResult/JudgeResult with final_score
else No issues
Client->>Assembler: assembleCheckResult/assembleJudgeResult(no issues, [])
Assembler-->>Client: CheckResult/JudgeResult (perfect score)
end
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
Actionable comments posted: 6
🤖 Fix all issues with AI agents
In @src/evaluators/result-assembler.ts:
- Around line 206-239: The comment claiming normalized_score maps 1–4 to 1–10 is
incorrect: normalized_score = score * 2.5 yields 2.5–10; update the code in the
result assembly to either (A) correct the comment to state the actual 2.5–10
range for normalized_score, or (B) change the normalization formula to map 1→1
and 4→10 by replacing score * 2.5 with the linear mapping (score - 1) * 3 + 1
(apply this change where normalized_score is computed and keep the comment
consistent).
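The two normalizations this comment contrasts can be compared directly; both functions below are illustrative only:

```typescript
// Current formula: maps the 1–4 raw score to 2.5–10, not 1–10 as the comment claims
const scale = (score: number): number => score * 2.5;

// Proposed option B: a linear map sending 1 → 1 and 4 → 10
const linear = (score: number): number => (score - 1) * 3 + 1;
```

Checking the endpoints makes the discrepancy concrete: `scale(1)` is 2.5 while `linear(1)` is 1, and both map 4 to 10.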
In @src/evaluators/retry.ts:
- Around line 30-31: Initialize and/or properly handle the case when maxRetries
can be 0 so lastError is never left undefined; either validate maxRetries to be
>= 1 up front or set lastError to a meaningful default Error before the retry
loop (or after the loop check) and then throw that descriptive error instead of
undefined. Update the logic around maxRetries, lastError, and the retry loop in
retry.ts (symbols: maxRetries, lastError, the retry loop) so callers passing
maxRetries: 0 get a clear error message rather than an undefined throw.
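The up-front validation option suggested here is small enough to sketch; the message wording is assumed:

```typescript
// Sketch of validating maxRetries before the retry loop, so callers passing 0
// get a descriptive error instead of an undefined throw.
function assertValidMaxRetries(maxRetries: number, context: string): void {
  if (!Number.isInteger(maxRetries) || maxRetries < 1) {
    throw new Error(
      `[vectorlint] ${context}: maxRetries must be a positive integer, got ${maxRetries}`
    );
  }
}
```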
In @src/providers/gemini-provider.ts:
- Around line 96-138: The model instance created in the constructor was set with
responseMimeType: "application/json" causing runPromptUnstructured to still
request JSON; update runPromptUnstructured to create or obtain a temporary
unstructured model (e.g., clone or instantiate a new Model/Client without
responseMimeType or with responseMimeType: "text/markdown" or "text/plain") and
call generateContent on that unstructured model instead of this.model; ensure
you reuse the same client/config values (model name, temperature) and clean up
or reuse the temporary model reference as appropriate to avoid changing the
constructor-initialized JSON model used elsewhere.
In @tests/azure-openai-provider.test.ts:
- Around line 362-419: The Gemini provider is logging debug output with
console.error while other providers use console.log; locate the Gemini provider
class (e.g., GeminiProvider) and replace any debug console.error calls in its
debug/logging code paths (methods like runPromptUnstructured /
runPromptStructured or any place using console.error for "[vectorlint]" debug
messages) with console.log, preserving the exact log message and payload shape
so tests and consumers remain consistent.
🧹 Nitpick comments (9)
src/providers/openai-provider.ts (1)
181-182: Consider extracting shared logic to reduce duplication. The `runPromptUnstructured` method shares ~80% of its code with `runPromptStructured` (params building, debug logging, error handling, response validation). Additionally, `buildPromptBodyForStructured` is used in an unstructured context, which is slightly misleading. Consider extracting common logic into private helper methods:

  private buildBaseParams(content: string, promptText: string): OpenAI.Chat.Completions.ChatCompletionCreateParams { ... }
  private handleApiError(e: unknown): never { ... }
  private logDebugInfo(params: ..., response?: ...): void { ... }

This would keep the provider thin per coding guidelines while reducing maintenance burden.
src/prompts/schema.ts (1)
97-131: JSON schema and Zod schema have validation inconsistencies. The JSON schema allows:

- `issueIndex` as any number (including floats and zero)
- Empty strings for `suggestion` and `explanation`

But the Zod schema requires:

- `issueIndex` to be a positive integer
- Non-empty strings (`.min(1)`)

If the LLM returns `{"issueIndex": 0, "suggestion": ""}`, it passes the JSON schema but fails Zod validation at runtime.

♻️ Align JSON schema with Zod constraints

  properties: {
    issueIndex: {
-     type: "number",
+     type: "integer",
+     minimum: 1,
      description: "The index of the issue this suggestion addresses (1-based, matching Issue 1, Issue 2, etc.)",
    },
    suggestion: {
      type: "string",
+     minLength: 1,
      description: "Specific, actionable text to replace the problematic content",
    },
    explanation: {
      type: "string",
+     minLength: 1,
      description: "Brief explanation of how this suggestion addresses the issue",
    },
  },

Also applies to: 175-183
AGENTS.md (1)
165-165: Capitalize "Markdown" as a proper noun.

-- LLM returns free-form markdown text with `## Issue N` sections
+- LLM returns free-form Markdown text with `## Issue N` sections

tests/suggestion-phase.test.ts (1)
130-154: Consider testing `formatIssues` through the public interface instead of type assertion. While testing private methods can be useful, the type assertion pattern (`runner as unknown as { formatIssues: ... }`) is fragile. If the method name or signature changes, these tests will fail at runtime rather than compile time. Consider either:

- Testing this behavior through `run()` by verifying the prompt content
- Making `formatIssues` protected if it's part of the class's testable contract

src/evaluators/suggestion-phase.ts (1)
149-152: Potential issue: Template replacement may be fragile if content contains placeholder strings. The simple `replace()` calls could have unintended effects if the content, issues text, or criteria string contains literal `{content}`, `{issues}`, or `{criteria}` substrings. While unlikely in practice, this could cause partial replacements or double-replacements. Consider using a more robust templating approach or replacing placeholders sequentially in a way that prevents re-matching.
♻️ Optional: Use more unique placeholders or sequential replacement
  private buildPrompt(
    content: string,
    issues: RawDetectionIssue[],
    criteria: string
  ): string {
    const template = getPrompt("suggestion-phase");

    // Format issues for inclusion in the prompt
    const issuesText = this.formatIssues(issues);

-   return template
-     .replace("{content}", content)
-     .replace("{issues}", issuesText)
-     .replace("{criteria}", criteria);
+   // Replace placeholders in sequence to avoid re-matching
+   let result = template.replace("{criteria}", criteria);
+   result = result.replace("{issues}", issuesText);
+   result = result.replace("{content}", content);
+   return result;
  }

tests/base-evaluator-two-phase.test.ts (2)
92-95: Weak assertion: `toBeGreaterThanOrEqual(0)` is always true for array length. The assertion on line 94 doesn't verify meaningful behavior, since `structuredCalls.length` will always be >= 0. Consider removing this assertion or replacing it with a more specific expectation based on whether issues were detected.

♻️ Consider removing or strengthening the assertion

  // The mock LLM's runPromptUnstructured should have been called for detection
  const unstructuredCalls = (mockLLM.runPromptUnstructured as ReturnType<typeof vi.fn>).mock.calls;
  expect(unstructuredCalls.length).toBeGreaterThan(0);

- // Verify structured call for suggestion was also made if issues were found
- const structuredCalls = (mockLLM.runPromptStructured as ReturnType<typeof vi.fn>).mock.calls;
- expect(structuredCalls.length).toBeGreaterThanOrEqual(0);
+ // When no issues are detected (default mock returns "No issues found."),
+ // suggestion phase should not be called
+ const structuredCalls = (mockLLM.runPromptStructured as ReturnType<typeof vi.fn>).mock.calls;
+ expect(structuredCalls.length).toBe(0);
460-468: Incomplete mock detection response may not parse correctly. The detection response on line 464 is missing `contextBefore` and `contextAfter` fields that other tests include. If the detection parser requires these fields, this test may not accurately simulate real behavior.

♻️ Consider using full detection response format for consistency

  (mockLLM.runPromptUnstructured as ReturnType<typeof vi.fn>).mockImplementation(
    () => {
      detectionCallCount++;
      return {
-       data: "## Issue 1\n\n**quotedText:**\n> problem\n\n**line:** 42\n\n**criterionName:** Criterion 1\n\n**analysis:**\nIssue",
+       data: `## Issue 1
+
+**quotedText:**
+> problem
+
+**contextBefore:**
+before
+
+**contextAfter:**
+after
+
+**line:** 42
+
+**criterionName:** Criterion 1
+
+**analysis:**
+This is an issue`,
        usage: { inputTokens: 100, outputTokens: 50 },
      };
    }
  );

src/evaluators/result-assembler.ts (2)
57-61: Consider the default value for `totalWordCount`. Defaulting `totalWordCount` to `1` avoids division by zero but will produce a heavily penalized score if the caller forgets to pass the actual word count. A single violation would result in a score calculation of `10 - ((1/1)*100*strictness)*2`, which is clamped to `1`. Consider logging a warning or using a more representative default (e.g., `100`) to make misconfiguration more obvious during testing.
305-311: Clarify the 1-based indexing convention. The mapping uses `index + 1` to look up suggestions, implying `Suggestion.issueIndex` is 1-based while the issues array is 0-based. Consider adding a brief comment to make this convention explicit, as it's a key coupling between detection and suggestion phases.

  return issues.map((issue, index) => {
+   // Suggestions use 1-based issueIndex; issues array is 0-based
    const matchingSuggestion = suggestionMap.get(index + 1);
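The 0-based array / 1-based `issueIndex` coupling flagged above can be shown with a tiny self-contained example; all data here is invented for illustration:

```typescript
// Invented example data; only the index convention comes from the review comment.
const issues = ["passive voice", "undefined jargon"];
const suggestionMap = new Map<number, string>([
  [1, "use active voice"],
  [2, "define the term on first use"],
]);

const merged = issues.map((issue, index) => ({
  issue,
  // Suggestions use 1-based issueIndex; the issues array is 0-based
  suggestion: suggestionMap.get(index + 1),
}));
```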
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (27)
- .gitignore
- AGENTS.md
- ralph/prd.json
- ralph/progress.txt
- src/evaluators/base-evaluator.ts
- src/evaluators/detection-phase.ts
- src/evaluators/index.ts
- src/evaluators/prompts.json
- src/evaluators/result-assembler.ts
- src/evaluators/retry.ts
- src/evaluators/suggestion-phase.ts
- src/prompts/schema.ts
- src/providers/anthropic-provider.ts
- src/providers/azure-openai-provider.ts
- src/providers/gemini-provider.ts
- src/providers/llm-provider.ts
- src/providers/openai-provider.ts
- tests/anthropic-provider.test.ts
- tests/azure-openai-provider.test.ts
- tests/base-evaluator-two-phase.test.ts
- tests/detection-phase.test.ts
- tests/gemini-provider.test.ts
- tests/openai-provider.test.ts
- tests/result-assembler.test.ts
- tests/retry.test.ts
- tests/scoring-types.test.ts
- tests/suggestion-phase.test.ts
🧰 Additional context used
📓 Path-based instructions (3)
src/**/*.ts
📄 CodeRabbit inference engine (AGENTS.md)
src/**/*.ts: Use TypeScript ESM with explicit imports and narrow types
Use 2-space indentation; avoid trailing whitespace
Maintain strict TypeScript with no `any`; use `unknown` + schema validation for external data
Use custom error types with proper inheritance; catch blocks use `unknown` type
Files:
src/evaluators/retry.ts
src/evaluators/index.ts
src/providers/gemini-provider.ts
src/evaluators/suggestion-phase.ts
src/providers/azure-openai-provider.ts
src/providers/llm-provider.ts
src/prompts/schema.ts
src/evaluators/base-evaluator.ts
src/evaluators/result-assembler.ts
src/providers/openai-provider.ts
src/evaluators/detection-phase.ts
src/providers/anthropic-provider.ts
src/providers/**/*.ts
📄 CodeRabbit inference engine (AGENTS.md)
src/providers/**/*.ts: Depend on `LLMProvider` and `SearchProvider` interfaces; keep providers thin (transport only)
Inject `RequestBuilder` via provider constructor to avoid coupling
Files:
src/providers/gemini-provider.ts
src/providers/azure-openai-provider.ts
src/providers/llm-provider.ts
src/providers/openai-provider.ts
src/providers/anthropic-provider.ts
tests/**/*.test.ts
📄 CodeRabbit inference engine (AGENTS.md)
tests/**/*.test.ts: Write tests using Vitest framework with focus on config parsing, file discovery, schema/structured output, and locator
Use dependency injection in tests: mock providers; do not hit network in unit tests
Files:
tests/result-assembler.test.ts
tests/suggestion-phase.test.ts
tests/detection-phase.test.ts
tests/scoring-types.test.ts
tests/retry.test.ts
tests/base-evaluator-two-phase.test.ts
tests/openai-provider.test.ts
tests/azure-openai-provider.test.ts
tests/gemini-provider.test.ts
tests/anthropic-provider.test.ts
🧠 Learnings (5)
📚 Learning: 2025-12-28T19:43:51.189Z
Learnt from: CR
Repo: TRocket-Labs/vectorlint PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-28T19:43:51.189Z
Learning: Applies to tests/**/*.test.ts : Write tests using Vitest framework with focus on config parsing, file discovery, schema/structured output, and locator
Applied to files:
tests/result-assembler.test.ts
tests/suggestion-phase.test.ts
tests/detection-phase.test.ts
tests/scoring-types.test.ts
tests/retry.test.ts
tests/base-evaluator-two-phase.test.ts
tests/azure-openai-provider.test.ts
tests/gemini-provider.test.ts
📚 Learning: 2025-12-28T19:43:51.189Z
Learnt from: CR
Repo: TRocket-Labs/vectorlint PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-28T19:43:51.189Z
Learning: Applies to tests/**/*.test.ts : Use dependency injection in tests: mock providers; do not hit network in unit tests
Applied to files:
tests/detection-phase.test.ts
tests/azure-openai-provider.test.ts
tests/gemini-provider.test.ts
tests/anthropic-provider.test.ts
📚 Learning: 2025-12-28T19:43:51.189Z
Learnt from: CR
Repo: TRocket-Labs/vectorlint PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-28T19:43:51.189Z
Learning: Applies to src/providers/**/*.ts : Depend on `LLMProvider` and `SearchProvider` interfaces; keep providers thin (transport only)
Applied to files:
tests/scoring-types.test.ts
src/providers/llm-provider.ts
src/evaluators/base-evaluator.ts
AGENTS.md
📚 Learning: 2025-12-28T19:43:51.189Z
Learnt from: CR
Repo: TRocket-Labs/vectorlint PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-28T19:43:51.189Z
Learning: Applies to src/boundaries/**/*.ts : Use Zod schemas for boundary validation of all external data (files, CLI, env, APIs) at system boundaries
Applied to files:
src/prompts/schema.ts
📚 Learning: 2025-12-28T19:43:51.189Z
Learnt from: CR
Repo: TRocket-Labs/vectorlint PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-28T19:43:51.189Z
Learning: Applies to evals/**/*.md : Evals must include YAML frontmatter; the tool appends evidence instructions automatically
Applied to files:
AGENTS.md
🧬 Code graph analysis (20)
src/evaluators/retry.ts (1)
  src/evaluators/index.ts (2)
    RetryOptions (30-30)
    withRetry (30-30)
src/evaluators/index.ts (4)
  src/evaluators/evaluator-registry.ts (1)
    EvaluatorRegistry (27-59)
  src/output/reporter.ts (2)
    printEvaluationSummaries (138-175)
    EvaluationSummary (7-11)
  src/cli/orchestrator.ts (1)
    extractAndReportCriterion (257-463)
  src/evaluators/accuracy-evaluator.ts (1)
    TechnicalAccuracyEvaluator (48-239)
src/providers/gemini-provider.ts (3)
  src/providers/llm-provider.ts (1)
    LLMResult (3-6)
  src/errors/index.ts (1)
    handleUnknownError (46-51)
  tests/evaluator.test.ts (1)
    runPromptStructured (10-14)
src/evaluators/suggestion-phase.ts (4)
  src/providers/token-usage.ts (1)
    TokenUsage (1-4)
  src/prompts/schema.ts (2)
    buildSuggestionLLMSchema (97-131)
    SUGGESTION_LLM_RESULT_SCHEMA (175-183)
  src/evaluators/detection-phase.ts (1)
    RawDetectionIssue (21-34)
  src/evaluators/retry.ts (1)
    withRetry (26-57)
src/providers/azure-openai-provider.ts (3)
  src/providers/llm-provider.ts (1)
    LLMResult (3-6)
  src/errors/index.ts (1)
    handleUnknownError (46-51)
  src/boundaries/api-client.ts (1)
    validateApiResponse (9-20)
tests/detection-phase.test.ts (1)
  src/evaluators/detection-phase.ts (2)
    DetectionPhaseRunner (69-227)
    RawDetectionIssue (21-34)
tests/scoring-types.test.ts (4)
  src/providers/llm-provider.ts (1)
    LLMProvider (8-11)
  src/evaluators/base-evaluator.ts (1)
    BaseEvaluator (42-297)
  tests/evaluator.test.ts (2)
    runPromptStructured (10-14)
    FakeProvider (8-15)
  tests/provider-factory.test.ts (1)
    it (227-292)
src/providers/llm-provider.ts (1)
  tests/evaluator.test.ts (1)
    runPromptStructured (10-14)
tests/retry.test.ts (1)
  src/evaluators/retry.ts (1)
    withRetry (26-57)
src/prompts/schema.ts (1)
  src/output/rdjson-formatter.ts (1)
    RdJsonSuggestion (35-47)
src/evaluators/base-evaluator.ts (5)
  src/evaluators/evaluator.ts (1)
    Evaluator (7-9)
  src/evaluators/result-assembler.ts (1)
    ResultAssembler (43-347)
  src/providers/token-usage.ts (1)
    TokenUsage (1-4)
  src/chunking/index.ts (1)
    countWords (4-4)
  src/cli/orchestrator.ts (3)
    evaluateFile (757-907)
    extractAndReportCriterion (257-463)
    routePromptResult (544-705)
src/providers/openai-provider.ts (2)
  src/providers/llm-provider.ts (1)
    LLMResult (3-6)
  src/errors/index.ts (1)
    handleUnknownError (46-51)
tests/openai-provider.test.ts (2)
  src/schemas/openai-responses.ts (1)
    OpenAIResponse (43-43)
  src/providers/openai-provider.ts (1)
    OpenAIProvider (23-270)
tests/azure-openai-provider.test.ts (1)
  src/providers/azure-openai-provider.ts (1)
    AzureOpenAIProvider (23-223)
AGENTS.md (4)
  src/evaluators/accuracy-evaluator.ts (1)
    TechnicalAccuracyEvaluator (48-239)
  tests/evaluator.test.ts (3)
    FakeProvider (8-15)
    runPromptStructured (10-14)
    it (39-53)
  src/cli/orchestrator.ts (2)
    evaluateFile (757-907)
    runPromptEvaluation (713-752)
  src/evaluators/evaluator.ts (1)
    Evaluator (7-9)
tests/gemini-provider.test.ts (1)
  src/providers/gemini-provider.ts (1)
    GeminiProvider (20-139)
src/evaluators/detection-phase.ts (4)
  src/providers/token-usage.ts (1)
    TokenUsage (1-4)
  src/evaluators/retry.ts (1)
    withRetry (26-57)
  tests/evaluator.test.ts (2)
    runPromptStructured (10-14)
    FakeProvider (8-15)
  src/cli/types.ts (1)
    RunPromptEvaluationParams (113-119)
tests/anthropic-provider.test.ts (3)
  tests/schemas/mock-schemas.ts (1)
    MockAnthropicClient (66-70)
  src/providers/anthropic-provider.ts (1)
    AnthropicProvider (31-342)
  tests/anthropic-e2e.test.ts (3)
    vi (62-71)
    default (18-52)
    MockAnthropicClient (20-24)
ralph/prd.json (1)
  tests/evaluator.test.ts (2)
    runPromptStructured (10-14)
    FakeProvider (8-15)
src/providers/anthropic-provider.ts (3)
  src/providers/llm-provider.ts (1)
    LLMResult (3-6)
  src/errors/index.ts (1)
    handleUnknownError (46-51)
  src/schemas/anthropic-responses.ts (1)
    isTextBlock (44-46)
🪛 LanguageTool
AGENTS.md
[uncategorized] ~165-~165: Did you mean the formatting language “Markdown” (= proper noun)?
Context: ...tUnstructured) - LLM returns free-form markdown text with ## Issue N` sections - Conte...
(MARKDOWN_NNP)
ralph/progress.txt
[uncategorized] ~54-~54: Did you mean the formatting language “Markdown” (= proper noun)?
Context: ...ded format instructions for output with markdown code block example - Output template sp...
(MARKDOWN_NNP)
[uncategorized] ~61-~61: Did you mean the formatting language “Markdown” (= proper noun)?
Context: ...ontent. The unstructured output format (markdown with Issue N sections) is designed to b...
(MARKDOWN_NNP)
[uncategorized] ~75-~75: Did you mean the formatting language “Markdown” (= proper noun)?
Context: ...sient failure handling - Includes basic markdown parser to extract issues from LLM respo...
(MARKDOWN_NNP)
[uncategorized] ~168-~168: Did you mean the formatting language “Markdown” (= proper noun)?
Context: ...hod converts RawDetectionIssue array to markdown format for prompt inclusion - Uses `run...
(MARKDOWN_NNP)
[style] ~234-~234: This adverb was used twice in the sentence. Consider removing one of them or replacing them with a synonym.
Context: ...tly typed - Optional fields handled correctly - CheckResult and JudgeResult outpu...
(ADVERB_REPETITION_PREMIUM)
[style] ~240-~240: This adverb was used twice in the sentence. Consider removing one of them or replacing them with a synonym.
Context: ...n suggestion is undefined - Returns only suggestion usage when detection is unde...
(ADVERB_REPETITION_PREMIUM)
[style] ~347-~347: The words ‘explanations’ and ‘explaining’ are quite similar. Consider replacing ‘explaining’ with a different word.
Context: ...haseRunnerhas detailed JSDoc comments explaining purpose -SuggestionPhaseRunner` has...
(VERB_NOUN_SENT_LEVEL_REP)
[style] ~430-~430: This phrase might be redundant. Consider either removing or replacing the adjective ‘additional’.
Context: ...rs Purpose: Zod runtime validation adds an additional layer of safety for the suggestion phase LLM ...
(ADD_AN_ADDITIONAL)
🔇 Additional comments (73)
.gitignore (1)
11-11: Verify intent: `.kiro/` directory is no longer ignored. Commenting out this entry means the `.kiro/` directory could now be tracked by Git if present. This change appears unrelated to the two-phase evaluation architecture. Was this intentional, or should this remain ignored to avoid committing tool-specific files?

src/providers/llm-provider.ts (1)
8-11: LGTM! Clean interface extension adding unstructured prompt support. The signature mirrors `runPromptStructured` appropriately (minus the schema parameter), and `LLMResult<string>` is the correct return type for raw text responses. This keeps providers thin as transport-only components per coding guidelines.

src/evaluators/prompts.json (1)
3-4: LGTM! The new `detection-phase` and `suggestion-phase` prompts are well-structured and align with the two-phase evaluation architecture. The detection prompt specifies a clear markdown output format for parsing, while the suggestion prompt includes appropriate template placeholders (`{content}`, `{issues}`, `{criteria}`) for runtime substitution.

src/providers/openai-provider.ts (1)
218-268: LGTM on error handling and token usage tracking. The error handling properly distinguishes between rate limits, authentication, and general API errors. Token usage mapping correctly extracts `prompt_tokens` → `inputTokens` and `completion_tokens` → `outputTokens`, maintaining consistency with the structured path.

src/providers/anthropic-provider.ts (1)
242-341: Implementation is correct and consistent. The unstructured method properly:
- Omits tool schema and tool_choice from the request
- Reuses the validated response schema which handles both structured and unstructured content blocks
- Correctly filters for text blocks and concatenates them
- Handles all Anthropic-specific error types consistently with the structured method
src/providers/azure-openai-provider.ts (1)
142-222: LGTM - Implementation correctly omits JSON schema for unstructured responses. The key difference from `runPromptStructured` (no `response_format` parameter) is correct for unstructured text output. Response validation, error handling, and usage tracking are consistent with the structured method.
ralph/prd.json (1)
1-207: PRD tracking file documents implementation tasks correctly. This is project management metadata that accurately tracks the two-phase architecture implementation. All tasks are marked complete and the JSON structure is valid.
ralph/progress.txt (1)
1-470: Progress log provides comprehensive implementation documentation. This development log accurately documents the incremental implementation of the two-phase architecture. The static analysis hints about "Markdown" capitalization are minor style nits that don't affect the documentation's utility.
src/prompts/schema.ts (1)
160-183: Well-structured type and runtime validation. The `SuggestionLLMResult` type and `SUGGESTION_LLM_RESULT_SCHEMA` are well-documented and follow the coding guidelines for using Zod schemas for boundary validation of external data.
tests/openai-provider.test.ts (3)
1021-1067: Comprehensive unstructured response handling tests. The test suite properly validates the new `runPromptUnstructured` method including raw text extraction and token usage tracking. The tests follow the Vitest framework guidelines and use dependency injection with mocked providers.
1069-1117: Verify intentional behavioral difference for empty/null content. The unstructured path returns an empty string for empty/null content (lines 1091, 1116), while the structured path throws "Empty response from OpenAI API (no content)." (lines 596-598). This asymmetry may be intentional since unstructured responses don't require content, but worth confirming this design decision.
1336-1359: Good verification that unstructured calls omit `response_format`. This test correctly ensures the unstructured path doesn't include `response_format`, which would force JSON mode. This is essential for the two-phase architecture where detection uses free-form markdown responses.
tests/retry.test.ts (3)
1-11: Solid test foundation for retry utility. The initial tests correctly validate the happy path (first attempt success) and verify the operation is called exactly once. The mock setup is clean and follows Vitest best practices.
28-49: Good coverage of retry exhaustion and default behavior. The tests properly validate that:
- All retries are exhausted before throwing
- The last error is propagated
- Default `maxRetries` of 3 is applied when not specified
This aligns with the implementation in `src/evaluators/retry.ts`.
112-136: Property test validates retry success within limit. The test effectively demonstrates that the retry mechanism correctly returns successful results when the operation succeeds before exhausting retries. The manual counter approach clearly shows the attempt progression.
tests/detection-phase.test.ts (4)
7-14: Clean mock provider factory for detection phase tests. The mock provider correctly implements only `runPromptUnstructured` since `DetectionPhaseRunner` exclusively uses unstructured LLM calls. The factory pattern enables easy customization of mock responses per test.
19-52: Thorough single issue parsing test. The test validates all required fields are correctly extracted from the markdown format. The expected output structure matches the `RawDetectionIssue` interface from the implementation.
271-334: Good integration tests for run() method. The tests properly verify:
- LLM provider receives content and built prompt
- Detection results include parsed issues, `hasIssues` flag, raw response, and usage data
- Retry logic is appropriately deferred to `retry.test.ts`
This follows the testing guideline of focusing on component integration rather than duplicating unit tests.
430-529: Excellent robustness testing for malformed sections. The test at lines 430-529 validates that the parser:
- Skips issues missing required fields (`quotedText`, `line`)
- Preserves valid issues even when surrounded by invalid ones
- Correctly identifies 3 valid issues out of 5
This ensures the detection phase gracefully handles LLM output variability.
AGENTS.md (3)
14-17: Good documentation of new evaluator components. The directory structure documentation now accurately reflects the new two-phase architecture files with clear purpose descriptions.
158-198: Excellent two-phase architecture documentation. The documentation clearly explains:
- The purpose and flow of each phase
- Key components with file locations
- Property tests that validate the architecture
This will help contributors understand the evaluation flow quickly.
209-212: Clear provider method documentation. The documentation succinctly describes both LLM provider methods and their use cases, helping developers choose the appropriate method for their phase implementation.
tests/gemini-provider.test.ts (5)
1-20: LGTM! Well-structured mock setup. The SDK mock correctly isolates the provider from network calls while exposing a shared mock function for test control. The hoisted mock pattern with `SHARED_GENERATE_CONTENT` enables flexible per-test behavior configuration.
22-45: LGTM! Constructor tests cover essential configuration scenarios. Tests validate both minimal and full configuration options without hitting the network.
47-216: LGTM! Comprehensive unstructured response handling tests. Excellent coverage of edge cases including:
- Whitespace trimming behavior
- Markdown passthrough without modification
- No JSON parsing for unstructured responses
- Graceful handling of null/undefined usage metadata
- Proper error wrapping and propagation
218-258: LGTM! Debug logging tests properly validate conditional behavior. Good use of console spy with proper cleanup in `afterEach`. The tests verify that debug information is logged only when enabled and includes expected metadata.
323-416: LGTM! Structured response handling tests provide good coverage. Tests validate JSON parsing success, parsing failures, and error propagation with appropriate error messages.
src/evaluators/index.ts (1)
29-51: LGTM! Clean barrel file organization for the two-phase architecture. The exports are well-organized with clear comments explaining each group. Type exports correctly use the `type` keyword, and the ordering maintains the self-registration import at the end.
tests/suggestion-phase.test.ts (4)
1-47: LGTM! Well-structured test setup with comprehensive sample data. The mock provider factory and sample issues array provide a solid foundation for testing the suggestion phase. The sample issues cover diverse criterion types (passive voice, vocabulary, wordy phrases).
48-128: LGTM! Run method tests validate core functionality. Tests properly verify the LLM provider interaction, suggestion mapping, and result structure including the `hasSuggestions` convenience flag.
156-367: LGTM! Thorough property-based tests for suggestion-to-issue matching. Excellent coverage of the matching contract including:
- 1-based indexing semantics
- Partial suggestion sets (LLM may not suggest for all issues)
- Order preservation from LLM response
- Field type validation (positive integer, non-empty strings)
369-465: LGTM! Zod validation tests ensure runtime type safety. Good coverage of validation edge cases ensuring malformed LLM responses are rejected. This provides a safety net against LLM hallucinations producing invalid data structures.
tests/scoring-types.test.ts (4)
1-21: LGTM! Proper mock setup for two-phase evaluation testing. The mock provider correctly implements both `runPromptStructured` and `runPromptUnstructured` methods. The `beforeEach` hook ensures clean state between tests by clearing mock call history.
40-141: LGTM! Judge evaluation tests validate two-phase integration. The test properly mocks the detection phase with markdown-formatted issues and the suggestion phase with structured suggestions. The assertions verify that:
- Scores are calculated correctly based on violations
- Both criteria are represented in results
- The suggestion phase is appropriately skipped when no issues are detected
183-288: LGTM! Check evaluation tests validate scoring mechanics. The tests properly verify:
- Score calculation based on violation density (`10 - (2 * 2) = 6`)
- Percentage mapping (`60%`)
- Perfect score for zero violations
- Suggestion phase optimization (skipped when no issues)
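The check-score arithmetic these tests exercise can be made concrete with a small sketch. The formula is inferred from the test values (`10 - (2 * 2) = 6`, `60%`); the floor at 1 and the helper name `checkScore` are assumptions for illustration, not the project's code.

```typescript
// Assumed scoring model inferred from the tests: each violation costs 2
// points off a base of 10 (floored at 1 here as an assumption), and the
// percentage is the score times 10.
function checkScore(violations: number): { score: number; percent: number } {
  const score = Math.max(1, 10 - 2 * violations);
  return { score, percent: score * 10 };
}
```

With this model, two violations yield a score of 6 and a 60% mapping, matching the values quoted in the review.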
291-344: LGTM! Technical accuracy evaluator test validates claim extraction flow. Good use of `vi.resetModules()` and `vi.doMock()` to isolate the test from other tests' mock configurations. The test properly validates the perfect score path when no claims are extracted.
tests/azure-openai-provider.test.ts (4)
1-68: LGTM! Well-structured test setup with hoisted error classes. The `vi.hoisted` pattern correctly lifts the error class definitions for use in the mock factory, enabling proper `instanceof` checks in tests. The mock structure matches the openai SDK's expected interface.
69-109: LGTM! Clean test setup with dynamic mock wiring. The `beforeEach` hook properly wires the mocked `validateApiResponse` function, enabling per-test control over validation behavior. Constructor tests cover both minimal and full configuration scenarios.
111-360: LGTM! Comprehensive unstructured response handling tests. Excellent coverage including:
- Raw text extraction and trimming
- Empty and null content error handling
- Markdown passthrough
- Temperature configuration propagation
- Response validation errors
- Unknown error wrapping
421-478: LGTM! Structured response handling test validates JSON parsing and usage mapping. The test properly verifies that structured JSON responses are parsed correctly and usage metadata is mapped to the expected format (`inputTokens`, `outputTokens`).
tests/result-assembler.test.ts (5)
1-45: LGTM! Well-structured test helpers and clear test documentation. The test file header clearly documents which properties are being tested (schema conformance and token aggregation), and the helper functions `createDetectionIssues()` and `createSuggestions()` provide clean, reusable test fixtures.
60-198: LGTM! Comprehensive schema conformance and edge case coverage. The tests thoroughly verify the `CheckResult` structure including all required fields, proper typing, score ranges, and correct handling of missing or partial suggestions.
230-393: LGTM! Thorough testing of `JudgeResult` assembly and criterion grouping. The tests properly verify criterion-level structure, violation grouping logic, score calculation based on violation count, and fallback behavior for missing suggestions.
396-460: LGTM! Complete coverage of token usage aggregation scenarios. All edge cases are covered: both inputs undefined, single input defined, both inputs defined, zero values, and large token counts.
462-555: LGTM! Integration tests validate end-to-end result assembly. The integration scenarios effectively combine detection issues, suggestions, and token aggregation to verify the complete workflow for both check and judge evaluation types.
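The aggregation behavior these tests describe (both inputs undefined, one defined, both defined) can be sketched as follows. The names `TokenUsage` and `aggregateUsage` are illustrative, not the project's actual exports.

```typescript
// Minimal sketch of token-usage aggregation across the two phases.
interface TokenUsage {
  inputTokens: number;
  outputTokens: number;
}

function aggregateUsage(
  a?: TokenUsage,
  b?: TokenUsage
): TokenUsage | undefined {
  // Both inputs undefined: nothing to report.
  if (!a && !b) return undefined;
  // Otherwise treat a missing side as zero and sum field-wise.
  return {
    inputTokens: (a?.inputTokens ?? 0) + (b?.inputTokens ?? 0),
    outputTokens: (a?.outputTokens ?? 0) + (b?.outputTokens ?? 0),
  };
}
```

Returning `undefined` rather than a zeroed object keeps "no usage data" distinguishable from "zero tokens consumed".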
tests/anthropic-provider.test.ts (6)
49-69: LGTM! Clean mock constructor pattern with error class attachment. The mock constructor pattern properly attaches error classes as static properties, enabling `instanceof` checks in production code while keeping tests isolated from the network. This aligns with the coding guidelines for dependency injection in tests.
859-915: LGTM! Proper test setup for unstructured response handling. The test suite follows the established pattern with proper mock setup in `beforeEach` and conditional cleanup in `afterEach`. The first test correctly verifies basic text extraction and token usage.
917-991: LGTM! Tests verify text block concatenation and markdown preservation. These tests ensure multiple text blocks are properly combined and that markdown formatting is preserved as-is, which is essential for the detection phase that returns free-form markdown output.
993-1086: LGTM! Edge case handling for whitespace and missing content. Tests properly verify whitespace trimming and appropriate error throwing when content blocks are empty or contain only non-text blocks.
1088-1141: LGTM! Comprehensive API error handling tests for unstructured calls. The tests verify that different Anthropic error types (APIError, RateLimitError, AuthenticationError) are properly caught and re-thrown with appropriate error messages specific to unstructured calls.
1196-1228: LGTM! Critical test verifying tools are excluded from unstructured calls. This test at lines 1196-1228 is essential for the two-phase architecture: it ensures that `runPromptUnstructured` does not include `tools` or `tool_choice` parameters, which would interfere with free-form text generation in the detection phase.
src/evaluators/base-evaluator.ts (4)
42-55: LGTM! Clean initialization of two-phase architecture components. The constructor properly initializes all three components needed for the two-phase flow: `DetectionPhaseRunner`, `SuggestionPhaseRunner`, and `ResultAssembler`. The design follows good separation of concerns.
126-191: LGTM! Correct two-phase evaluation flow for judge results. The implementation correctly:
- Runs detection per chunk with criteria context
- Only invokes suggestion phase when issues are detected
- Passes full document (not chunks) to suggestion phase for coherent suggestions
- Properly aggregates token usage from both phases
202-276: LGTM! Check evaluation follows same two-phase pattern with proper severity handling. The severity resolution at line 252 correctly prioritizes `defaultSeverity` (constructor param) over `prompt.meta.severity`, allowing runtime override of prompt-specified severity. The strictness is passed through to the assembler for normalization.
278-296: LGTM! Clean criteria string builder with graceful empty handling. The method produces a readable format for LLM prompts and correctly handles the edge case of empty or undefined criteria.
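The control flow praised in these comments (detect per chunk, suggest only when issues exist, full document to the suggestion phase) can be sketched end to end. All names and signatures below are illustrative stand-ins, not the real `BaseEvaluator` API; the sketch only demonstrates the chunked-detection / full-document-suggestion contract.

```typescript
// Illustrative two-phase control flow. `evaluateTwoPhase`, `detect`, and
// `suggest` are hypothetical names; the actual runner classes differ.
interface Issue {
  quotedText: string;
  line: number;
}

async function evaluateTwoPhase(
  fullDocument: string,
  chunks: string[],
  detect: (chunk: string) => Promise<Issue[]>,
  suggest: (doc: string, issues: Issue[]) => Promise<string[]>
): Promise<{ issues: Issue[]; suggestions: string[] }> {
  // Phase 1: detection runs once per chunk.
  const issues: Issue[] = [];
  for (const chunk of chunks) {
    issues.push(...(await detect(chunk)));
  }
  // Phase 2: suggestions are only requested when issues exist, and always
  // receive the full document so suggestions stay coherent with context.
  const suggestions =
    issues.length > 0 ? await suggest(fullDocument, issues) : [];
  return { issues, suggestions };
}
```

Skipping phase 2 when detection finds nothing saves an LLM call per clean chunk, which is the optimization the tests above verify.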
src/evaluators/suggestion-phase.ts (3)
24-56: LGTM! Well-designed interfaces for suggestion phase. The `Suggestion`, `SuggestionResult`, and `SuggestionPhaseOptions` interfaces are clearly documented and provide a clean API surface. The `hasSuggestions` boolean in `SuggestionResult` is a helpful convenience property.
164-190: LGTM! Well-formatted issue sections for LLM consumption. The markdown format is clear and structured, with 1-based indexing that correctly corresponds to the `issueIndex` expected in suggestion responses. The `"(none)"` fallback for empty context is a nice touch.
97-105: Verify: `content` is passed to both `runPromptStructured` and embedded in the prompt template. The `content` is embedded in the prompt template via `buildPrompt()` (line 91), which replaces the `{content}` placeholder in the suggestion-phase template. This same `content` is also passed separately to `runPromptStructured()` (line 100) as the first argument. All LLM providers then receive `content` twice: once in the system prompt (via `buildPromptBodyForStructured(promptText)`, which preserves the embedded content) and again in the user message (via `Input:\n\n${content}`). This redundancy is consistent across all LLM providers (OpenAI, Gemini, Azure OpenAI, Anthropic). Consider whether this duplication is intentional or should be refactored.
tests/base-evaluator-two-phase.test.ts (4)
20-68: LGTM! Well-designed mock fixtures for testing. The `MOCK_PROMPT_FILE` and `MOCK_PROMPT_FILE_JUDGE` fixtures provide reusable test data with realistic structure. The `CREATE_MOCK_LLM_PROVIDER` factory function creates fresh mocks per test, following the dependency injection pattern recommended in the coding guidelines.
320-374: LGTM! Critical test verifying full document context in suggestion phase. This test is essential for the two-phase architecture: it verifies that even when content is chunked for detection, the suggestion phase receives the complete document. The assertion checking `suggestionPhaseContent!.length > 1000` provides reasonable validation that the full content was passed.
499-552: LGTM! Tests verify criteria string building through public interface. Good approach testing the private `buildCriteriaString` method through the public `evaluate` interface, verifying both populated criteria (with weights) and the graceful fallback for missing criteria.
667-848: LGTM! Comprehensive severity and strictness handling tests. The tests cover all severity precedence levels:
- Default severity when none specified
- Prompt-specified severity
- Constructor parameter override
The strictness normalization test ensures string values like `"strict"` are handled correctly.
src/evaluators/result-assembler.ts (4)
1-20: LGTM! Clean imports using `import type` for type-only imports, and the module-level documentation clearly explains the two-phase architecture responsibility.
21-33: LGTM! The options interface is well-documented and uses narrow types appropriately. The union type for `strictness` provides good flexibility while maintaining type safety.
271-288: LGTM! Clean functional approach that correctly handles all combinations of undefined inputs and produces a properly typed result.
333-346: LGTM! The weighted score calculation is correct, with proper handling for the edge case of no criteria. The rounding logic correctly produces one decimal place precision.
src/evaluators/detection-phase.ts (7)
1-16: LGTM! Clear module documentation and proper use of type-only imports following TypeScript best practices.
17-34: LGTM! Well-documented interface with clear field definitions matching the expected LLM response structure.
36-56: LGTM! Both interfaces are well-documented with appropriate optional fields.
69-113: LGTM! The `run` method follows a clean flow: prompt construction → LLM call with retry → response parsing → result assembly. The conditional assignment of `usage` is a good pattern for optional fields.
121-124: LGTM! Simple template replacement. If the template contains multiple `{criteria}` placeholders, consider using `replaceAll()` instead, but this is fine if only one occurrence is expected.
135-155: LGTM! The parsing logic correctly handles the "no issues found" case and robustly splits the response into individual issue sections. The use of `slice(1)` to skip intro text is appropriate.
163-226: LGTM! The parsing logic correctly extracts fields using well-crafted regex patterns with the `/s` flag for multiline content. Good graceful degradation with try-catch that returns `null` for malformed sections, and proper handling of the `unknown` error type per coding guidelines.
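The skip-on-malformed behavior described above can be illustrated with a small parser sketch. The field labels (`quotedText:`, `line:`) and regex patterns below are assumptions chosen for illustration, not the actual patterns in `detection-phase.ts`.

```typescript
// Hypothetical "Issue N" section parser demonstrating graceful degradation:
// a section missing a required field yields null instead of throwing.
interface ParsedIssue {
  quotedText: string;
  line: number;
}

function parseIssueSection(section: string): ParsedIssue | null {
  try {
    const quoted = /quotedText:\s*(.+?)\s*$/m.exec(section);
    const lineMatch = /line:\s*(\d+)/.exec(section);
    // Missing either required field: skip this section, keep the rest.
    if (!quoted || !lineMatch) return null;
    return { quotedText: quoted[1], line: Number(lineMatch[1]) };
  } catch {
    return null;
  }
}
```

Callers can then filter out `null` results, which is how a batch of five sections with two malformed entries still yields three valid issues.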
```ts
const score: 1 | 2 | 3 | 4 =
  violations.length === 0
    ? 4
    : violations.length === 1
      ? 3
      : violations.length <= 3
        ? 2
        : 1;

// Build summary
const summary =
  violations.length === 0
    ? `Pass: ${criterionName}`
    : `Issue${violations.length > 1 ? "s" : ""} found with ${criterionName}`;

// Build reasoning
let reasoning: string;
if (violations.length === 0) {
  reasoning = `Content meets the ${criterionName} criterion.`;
} else {
  const parts: string[] = [
    `${violations.length} violation${violations.length > 1 ? "s" : ""} of ${criterionName} found.`,
  ];
  for (const v of violations) {
    parts.push(`- "${v.quotedText}": ${v.analysis}`);
  }
  reasoning = parts.join("\n");
}

return {
  name: criterionName,
  weight,
  score,
  normalized_score: score * 2.5, // 1-4 to 1-10 scale
```
Misleading comment: `normalized_score` range is 2.5–10, not 1–10.
The score ranges from 1 to 4, so `score * 2.5` produces values in the range 2.5–10, not 1–10 as stated in the comment. This may confuse maintainers or cause issues if downstream code expects a true 1–10 scale.
If a 1–10 scale is desired, consider adjusting the formula (e.g., (score - 1) * 3 + 1 would map 1→1, 4→10).
Suggested comment fix:

```diff
- normalized_score: score * 2.5, // 1-4 to 1-10 scale
+ normalized_score: score * 2.5, // 1-4 to 2.5-10 scale
```

🤖 Prompt for AI Agents
+ normalized_score: score * 2.5, // 1-4 to 2.5-10 scale🤖 Prompt for AI Agents
In @src/evaluators/result-assembler.ts around lines 206 - 239, The comment
claiming normalized_score maps 1–4 to 1–10 is incorrect: normalized_score =
score * 2.5 yields 2.5–10; update the code in the result assembly to either (A)
correct the comment to state the actual 2.5–10 range for normalized_score, or
(B) change the normalization formula to map 1→1 and 4→10 by replacing score *
2.5 with the linear mapping (score - 1) * 3 + 1 (apply this change where
normalized_score is computed and keep the comment consistent).
```ts
/**
 * Retry utility with logging for LLM operations.
 *
 * Provides exponential backoff retry logic with detailed logging for debugging
 * transient failures in LLM API calls.
 */
```
Documentation claims exponential backoff, but implementation has none.
The JSDoc states "exponential backoff retry logic" but the implementation immediately retries without any delay between attempts. This could overwhelm APIs during transient failures and defeat the purpose of retrying.
🔧 Proposed fix to add exponential backoff:

```diff
+const sleep = (ms: number) => new Promise(resolve => setTimeout(resolve, ms));
+
 export async function withRetry<T>(
   operation: () => Promise<T>,
   options: RetryOptions
 ): Promise<T> {
-  const { maxRetries = 3, context } = options;
+  const { maxRetries = 3, context, baseDelayMs = 1000 } = options;
   let lastError: unknown;
   for (let attempt = 1; attempt <= maxRetries; attempt++) {
     try {
       const data = await operation();
       if (attempt > 1) {
         console.log(
           `[vectorlint] ${context}: Success on attempt ${attempt}/${maxRetries}`
         );
       }
       return data;
     } catch (error) {
       lastError = error;
       if (attempt < maxRetries) {
+        const delay = baseDelayMs * Math.pow(2, attempt - 1);
         console.log(
-          `[vectorlint] ${context}: Attempt ${attempt}/${maxRetries} failed, retrying...`
+          `[vectorlint] ${context}: Attempt ${attempt}/${maxRetries} failed, retrying in ${delay}ms...`
         );
+        await sleep(delay);
       } else {
         console.log(
           `[vectorlint] ${context}: All ${maxRetries} attempts exhausted`
         );
       }
     }
   }
   throw lastError;
 }
```

Update `RetryOptions` interface:

```diff
 export interface RetryOptions {
   /** Maximum number of retry attempts (default: 3) */
   maxRetries?: number;
   /** Context string for logging (e.g., "detection phase", "suggestion phase") */
   context: string;
+  /** Base delay in milliseconds for exponential backoff (default: 1000) */
+  baseDelayMs?: number;
 }
```

```ts
const { maxRetries = 3, context } = options;
let lastError: unknown;
```
Edge case: `maxRetries: 0` throws `undefined`.
If a caller passes `maxRetries: 0`, the loop never executes and `lastError` remains `undefined`, which would throw a confusing `undefined` error.
🔧 Proposed fix:

```diff
 export async function withRetry<T>(
   operation: () => Promise<T>,
   options: RetryOptions
 ): Promise<T> {
   const { maxRetries = 3, context } = options;
+
+  if (maxRetries < 1) {
+    throw new Error(`${context}: maxRetries must be at least 1`);
+  }
+
   let lastError: unknown;
```
🤖 Prompt for AI Agents
In @src/evaluators/retry.ts around lines 30 - 31, Initialize and/or properly
handle the case when maxRetries can be 0 so lastError is never left undefined;
either validate maxRetries to be >= 1 up front or set lastError to a meaningful
default Error before the retry loop (or after the loop check) and then throw
that descriptive error instead of undefined. Update the logic around maxRetries,
lastError, and the retry loop in retry.ts (symbols: maxRetries, lastError, the
retry loop) so callers passing maxRetries: 0 get a clear error message rather
than an undefined throw.
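Putting both proposed fixes together, a runnable sketch of `withRetry` with the guard and backoff might look like the following. The `baseDelayMs` option and the omission of logging are assumptions for this sketch; the project's actual `RetryOptions` and logging behavior may differ.

```typescript
// Sketch combining both proposed fixes: a maxRetries >= 1 guard and
// exponential backoff between attempts. Logging omitted for brevity.
interface RetryOptions {
  maxRetries?: number;       // default: 3
  context: string;           // used in error messages
  baseDelayMs?: number;      // assumed option; default: 1000
}

const sleep = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

async function withRetry<T>(
  operation: () => Promise<T>,
  options: RetryOptions
): Promise<T> {
  const { maxRetries = 3, context, baseDelayMs = 1000 } = options;
  if (maxRetries < 1) {
    throw new Error(`${context}: maxRetries must be at least 1`);
  }
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      return await operation();
    } catch (error) {
      lastError = error;
      if (attempt < maxRetries) {
        // Exponential backoff: 1x, 2x, 4x the base delay, ...
        await sleep(baseDelayMs * 2 ** (attempt - 1));
      }
    }
  }
  throw lastError;
}
```

With the guard in place, `maxRetries: 0` fails fast with a descriptive message instead of throwing `undefined`.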
```ts
async runPromptUnstructured(content: string, promptText: string): Promise<LLMResult<string>> {
  const systemPrompt = this.builder.buildPromptBodyForStructured(promptText);

  const fullPrompt = `${systemPrompt}

Input:
${content}
`;

  if (this.config.debug) {
    console.error('[vectorlint] Sending unstructured request to Gemini:', {
      model: this.config.model,
      temperature: this.config.temperature,
    });
    if (this.config.showPrompt) {
      console.error('[vectorlint] Full prompt:');
      console.error(fullPrompt);
    } else if (this.config.showPromptTrunc) {
      console.error('[vectorlint] Prompt preview (first 500 chars):');
      console.error(fullPrompt.slice(0, 500));
      if (fullPrompt.length > 500) console.error('... [truncated]');
    }
  }

  try {
    const result = await this.model.generateContent(fullPrompt);
    const response = result.response;
    const text = response.text();
    const usageMetadata = response.usageMetadata;

    const llmResult: LLMResult<string> = { data: text.trim() };
    if (usageMetadata) {
      llmResult.usage = {
        inputTokens: usageMetadata.promptTokenCount ?? 0,
        outputTokens: usageMetadata.candidatesTokenCount ?? 0,
      };
    }
    return llmResult;
  } catch (e: unknown) {
    const err = handleUnknownError(e, 'Gemini API call');
    throw new Error(`Gemini API call failed: ${err.message}`);
  }
}
```
Model configured for JSON output is reused for unstructured calls.
The `this.model` instance is initialized in the constructor with `responseMimeType: "application/json"` (line 37), which instructs Gemini to always return JSON-formatted output. When `runPromptUnstructured` calls `this.model.generateContent(fullPrompt)`, the API will still attempt to format the response as JSON, which conflicts with the detection phase's expectation of free-form markdown text.
🔧 Suggested fix: create an unstructured model on demand

```diff
   try {
+    // Create a model without JSON response type for unstructured text output
+    const unstructuredModel = this.client.getGenerativeModel({
+      model: this.config.model!,
+      generationConfig: {
+        ...(this.config.temperature !== undefined && { temperature: this.config.temperature }),
+      },
+    });
-    const result = await this.model.generateContent(fullPrompt);
+    const result = await unstructuredModel.generateContent(fullPrompt);
     const response = result.response;
     const text = response.text();
```

🤖 Prompt for AI Agents
In @src/providers/gemini-provider.ts around lines 96 - 138, The model instance
created in the constructor was set with responseMimeType: "application/json"
causing runPromptUnstructured to still request JSON; update
runPromptUnstructured to create or obtain a temporary unstructured model (e.g.,
clone or instantiate a new Model/Client without responseMimeType or with
responseMimeType: "text/markdown" or "text/plain") and call generateContent on
that unstructured model instead of this.model; ensure you reuse the same
client/config values (model name, temperature) and clean up or reuse the
temporary model reference as appropriate to avoid changing the
constructor-initialized JSON model used elsewhere.
```ts
  throw new Error('Empty response from OpenAI API (no choices).');
}

const responseText = firstChoice.message.content?.trim() ?? '';
```
Inconsistent empty content handling vs. structured path.
The structured path (lines 155-157) throws an error when `responseText` is empty, but the unstructured path falls back to an empty string with `?? ''`. This inconsistency could mask issues where the LLM returns no content.
🔧 Suggested fix for consistency:

```diff
-    const responseText = firstChoice.message.content?.trim() ?? '';
+    const responseText = firstChoice.message.content?.trim();
+    if (!responseText) {
+      throw new Error('Empty response from OpenAI API (no content).');
+    }
```

Alternatively, if empty responses are valid for unstructured calls, document this intentional difference.
```ts
describe('Debugging and Logging', () => {
  let consoleSpy: ReturnType<typeof vi.spyOn>;

  beforeEach(() => {
    consoleSpy = vi.spyOn(console, 'log').mockImplementation(() => { });
  });

  afterEach(() => {
    consoleSpy.mockRestore();
  });

  it('logs debug information for unstructured calls', async () => {
    const config = {
      apiKey: 'sk-test-key',
      endpoint: 'https://test.openai.azure.com',
      deploymentName: 'gpt-4o',
      debug: true,
    };

    const mockResponse: OpenAIResponse = {
      choices: [
        {
          message: {
            content: 'Response text',
          },
          finish_reason: 'stop',
        },
      ],
      usage: {
        prompt_tokens: 50,
        completion_tokens: 20,
        total_tokens: 70,
      },
    };

    SHARED_CREATE.mockResolvedValue(mockResponse);

    const provider = new AzureOpenAIProvider(config);
    await provider.runPromptUnstructured('Test content', 'Test prompt');

    expect(consoleSpy).toHaveBeenCalledWith(
      '[vectorlint] Sending unstructured request to Azure OpenAI:',
      expect.objectContaining({
        model: 'gpt-4o',
        temperature: undefined,
      })
    );

    expect(consoleSpy).toHaveBeenCalledWith(
      '[vectorlint] LLM response meta:',
      expect.objectContaining({
        // eslint-disable-next-line @typescript-eslint/no-unsafe-assignment
        usage: expect.anything(),
        finish_reason: 'stop',
      })
    );
  });
});
```
🧩 Analysis chain
🏁 Script executed:

```bash
#!/bin/bash
echo "=== Checking console logging in provider implementations ==="
rg -n "console\.(log|error|warn)" src/providers/ --type ts -C 2
```

Repository: TRocket-Labs/vectorlint
Length of output: 16050
Standardize debug logging across all providers.
The Azure test correctly spies on `console.log`, which matches the Azure OpenAI implementation. However, the Gemini provider is the only one using `console.error` for debug logging, while Azure, OpenAI, Anthropic, and Perplexity all use `console.log`. Align the Gemini provider to use `console.log` for consistency across all providers.
🤖 Prompt for AI Agents
In @tests/azure-openai-provider.test.ts around lines 362 - 419, The Gemini
provider is logging debug output with console.error while other providers use
console.log; locate the Gemini provider class (e.g., GeminiProvider) and replace
any debug console.error calls in its debug/logging code paths (methods like
runPromptUnstructured / runPromptStructured or any place using console.error for
"[vectorlint]" debug messages) with console.log, preserving the exact log
message and payload shape so tests and consumers remain consistent.
Two-Phase Evaluation Architecture
Summary
Splits LLM evaluation into two distinct phases—detection and suggestion—to improve output quality and reduce hallucinations.
Problem
The previous single-pass approach asked the LLM to simultaneously detect issues AND generate suggestions in one structured call. This led to:
Solution
Phase 1: Detection — Uses an unstructured LLM call with free-form markdown output. The model focuses solely on finding issues with detailed reasoning.
Phase 2: Suggestion — Uses a structured JSON schema call. The model receives the full document context and detected issues, then generates targeted suggestions for each.
Changes
- `runPromptUnstructured()` to all providers
- `DetectionPhaseRunner` with markdown parser
- `SuggestionPhaseRunner` with Zod validation
- `ResultAssembler` merges phases + aggregates tokens
- `withRetry()` utility for transient failures

Stats
Testing
Breaking Changes
None. Output format remains backward compatible.