feat: implement 8 HumanLayer-informed audit improvements#17
WellDunDun wants to merge 4 commits into master from
Conversation
Apply extracted wisdom from HumanLayer's harness engineering analysis to expand the Reins scoring model and CLI capabilities. Back-pressure and hooks promoted from findings to scored points (max 22→24), AGENTS.md uses tiered evaluation, MCP tool proliferation detected, language-agnostic back-pressure covers JS/Python/Rust/Make, plugin system via `.reins/custom-checks.json`, and new `compare` command for tracking audit progression over time. All docs updated to reflect new thresholds and scoring dimensions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
📝 Walkthrough

This PR expands the audit scoring from 0–22 to 0–24, adds runtime detection signals (hooks, back-pressure, MCP counts, custom checks), adds new scoring hooks and MATURITY_THRESHOLDS, implements a

Changes
Sequence Diagram(s): mermaid

Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
docs/design-docs/symphony-integration.md (1)
14-20: ⚠️ Potential issue | 🟡 Minor

Scoring deltas in this section are internally inconsistent with the final model.

Please align this block to the final thresholds/maxima used in this PR (`agent_workflow` max 6 and total max shift 22→24).

🧩 Suggested consistency patch
```diff
-- **Scoring**: `agent_workflow` dimension expanded from max 4 to max 5, adding orchestration readiness (3+ of: workflow config, skills directory, isolation policy, concurrency limits, merge protection)
+- **Scoring**: `agent_workflow` dimension expanded from max 5 to max 6, adding orchestration readiness (3+ of: workflow config, skills directory, isolation policy, concurrency limits, merge protection)
...
-- **Total max score**: 24 (was 21)
+- **Total max score**: 24 (was 22)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/design-docs/symphony-integration.md` around lines 14 - 20, The scoring summary is inconsistent with the PR's final thresholds: update the "Scoring" line to show agent_workflow expanded from max 4 to max 6 (not 5), and update the "Total max score" line to reflect the PR's final shift from 22→24 (total max 24); ensure the rest of the block still lists the new detection signals in context.ts (hasWorkflowConfig, hasSkillsDirectory, hasIsolationPolicy, hasConcurrencyLimits, hasMergeProtection, hasSpecDocument, frameworksDetected), the three Doctor checks, the Evolve steps, Init scaffolds (WORKFLOW.md, SPEC.md), and the new frameworks_detected audit field so the section matches the final model.

skill/Reins/HarnessMethodology.md (1)
13-16: ⚠️ Potential issue | 🟡 Minor

Audit scale reference is stale in this document.

The command mapping still says `reins audit` scores (0-22), which is now outdated for this PR.

📝 Suggested fix
```diff
-- `reins audit` — score maturity across six dimensions (0-22)
+- `reins audit` — score maturity across six dimensions (0-24)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@skill/Reins/HarnessMethodology.md` around lines 13 - 16, Update the stale audit-scale text for the `reins audit` mapping in HarnessMethodology.md to match the real, current scoring range used by the implementation; find the authoritative value in the `reins audit` command implementation or its scoring constant (search for symbols like AUDIT_MAX_SCORE, maxScore, computeAuditScore, or scoreAudit) and replace “(0-22)” with the actual range (e.g., “(0-<actual_max>)”) so the doc and code are consistent.

cli/reins/package.json (1)
40-44: ⚠️ Potential issue | 🟠 Major

`typecheck` script fails without `typescript` in devDependencies.

The npm script `"typecheck": "tsc --noEmit"` invokes `tsc` directly, but `typescript` is not declared in this package's dependencies. Developers running `npm run typecheck` or `bun run typecheck` will fail with "command not found: tsc". CI avoids this by explicitly calling `bunx tsc --noEmit` instead of using the npm script.

Use `bunx tsc --noEmit` in the script to ensure `tsc` is available:

Suggested fix

```diff
 "scripts": {
   "start": "bun src/index.ts",
   "test": "bun test",
-  "typecheck": "tsc --noEmit"
+  "typecheck": "bunx tsc --noEmit"
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cli/reins/package.json` around lines 40 - 44, Update the package.json "typecheck" npm script so it does not assume a global tsc binary is installed; replace the current command "tsc --noEmit" with "bunx tsc --noEmit" (i.e., edit the "typecheck" script entry) so running npm run typecheck or bun run typecheck works without adding typescript to devDependencies.
🧹 Nitpick comments (1)
skill/Reins/SKILL.md (1)
68-70: Clarify `compare` arguments to match baseline workflow semantics.

Using `<path1> <path2>` is ambiguous here; naming the second argument as a baseline audit artifact is clearer for users and agents.

✏️ Suggested doc tweak

```diff
-# Compare audits across repos or over time
-reins compare <path1> <path2>
+# Compare current audit against a baseline
+reins compare <path> <baseline.json>
```

Based on learnings: `cli/reins` is the product engine and the only source of truth for readiness scoring and JSON outputs.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@skill/Reins/SKILL.md` around lines 68 - 70, Update the documentation for the reins compare command to explicitly name and describe the arguments (e.g., `reins compare <current-audit-path> <baseline-audit-path>`), clarifying that the second argument is a baseline audit artifact used for diffing against the current audit; update the example and any surrounding text referencing `reins compare` in SKILL.md so it reflects these baseline workflow semantics and that cli/reins is the source of truth for readiness scoring and JSON outputs.
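For illustration, a per-dimension diff along these lines would satisfy the baseline semantics the comment asks for. This is only a sketch: the `AuditReport` shape and the `compareAudits` helper are assumptions for this example, not the actual cli/reins schema or implementation.

```typescript
// Hypothetical shape of an audit artifact; the real cli/reins JSON may differ.
interface AuditReport {
  total: number;
  max_score: number;
  scores: Record<string, { points: number; max: number }>;
}

// Produce human-readable delta lines: overall total first, then one line
// per dimension showing baseline → current and the signed change.
function compareAudits(current: AuditReport, baseline: AuditReport): string[] {
  const lines: string[] = [
    `total: ${baseline.total}/${baseline.max_score} → ${current.total}/${current.max_score}`,
  ];
  for (const [dim, cur] of Object.entries(current.scores)) {
    const prev = baseline.scores[dim]?.points ?? 0;
    const delta = cur.points - prev;
    const sign = delta > 0 ? "+" : "";
    lines.push(`${dim}: ${prev} → ${cur.points} (${sign}${delta})`);
  }
  return lines;
}
```

Naming the second argument `<baseline.json>` makes the asymmetry explicit: the first path is the audit under inspection, the second is the fixed reference being diffed against.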
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cli/reins/src/lib/audit/scoring.ts`:
- Around line 537-545: The code constructs RegExp from user input (new
RegExp(check.pattern, "i")) which can cause ReDoS; update scoring.ts to validate
and/or sandbox regex before using it: refuse or warn on overly-complex patterns
by running check.pattern through a safe-regex/regexp-tree complexity check (or
length/quantifier heuristics) and if unsafe push a finding like the existing
invalid-regex message; alternatively, execute the actual match within a
time-limited sandbox (worker thread or AbortController-based timeout) and catch
timeouts to record a finding instead of allowing the CLI to hang. Ensure you
reference and update the creation/usage points around the regex variable and the
matching logic that uses check.pattern so unsafe patterns are rejected or
timeboxed.
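The "time-limited sandbox" option mentioned above could look roughly like the following: evaluate the user-supplied regex in a worker thread and terminate it if it exceeds a deadline, so a catastrophic-backtracking pattern cannot hang the CLI. `matchWithTimeout` and the 200 ms default are illustrative assumptions, not the cli/reins implementation.

```typescript
import { Worker } from "node:worker_threads";

function matchWithTimeout(
  pattern: string,
  input: string,
  timeoutMs = 200,
): Promise<boolean> {
  // The worker evaluates the regex; the parent thread enforces the deadline.
  const workerSource = `
    const { parentPort, workerData } = require("node:worker_threads");
    parentPort.postMessage(
      new RegExp(workerData.pattern, "i").test(workerData.input),
    );
  `;
  return new Promise((resolve, reject) => {
    const worker = new Worker(workerSource, {
      eval: true,
      workerData: { pattern, input },
    });
    const timer = setTimeout(() => {
      // Deadline hit: kill the worker so the caller can record a finding
      // instead of hanging on a pathological pattern.
      void worker.terminate();
      reject(new Error(`regex timed out after ${timeoutMs}ms`));
    }, timeoutMs);
    worker.once("message", (matched: boolean) => {
      clearTimeout(timer);
      void worker.terminate();
      resolve(matched);
    });
    worker.once("error", (err) => {
      clearTimeout(timer);
      reject(err);
    });
  });
}
```

Spawning a worker per check has real overhead, which is why the cheaper heuristic rejection (refusing suspicious patterns up front) is usually tried first and the sandbox kept as a fallback.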
---
Outside diff comments:
In `@cli/reins/package.json`:
- Around line 40-44: Update the package.json "typecheck" npm script so it does
not assume a global tsc binary is installed; replace the current command "tsc
--noEmit" with "bunx tsc --noEmit" (i.e., edit the "typecheck" script entry) so
running npm run typecheck or bun run typecheck works without adding typescript
to devDependencies.
In `@docs/design-docs/symphony-integration.md`:
- Around line 14-20: The scoring summary is inconsistent with the PR's final
thresholds: update the "Scoring" line to show agent_workflow expanded from max 4
to max 6 (not 5), and update the "Total max score" line to reflect the PR's
final shift from 22→24 (total max 24); ensure the rest of the block still lists
the new detection signals in context.ts (hasWorkflowConfig, hasSkillsDirectory,
hasIsolationPolicy, hasConcurrencyLimits, hasMergeProtection, hasSpecDocument,
frameworksDetected), the three Doctor checks, the Evolve steps, Init scaffolds
(WORKFLOW.md, SPEC.md), and the new frameworks_detected audit field so the
section matches the final model.
In `@skill/Reins/HarnessMethodology.md`:
- Around line 13-16: Update the stale audit-scale text for the `reins audit`
mapping in HarnessMethodology.md to match the real, current scoring range used
by the implementation; find the authoritative value in the `reins audit` command
implementation or its scoring constant (search for symbols like AUDIT_MAX_SCORE,
maxScore, computeAuditScore, or scoreAudit) and replace “(0-22)” with the actual
range (e.g., “(0-<actual_max>)”) so the doc and code are consistent.
---
Nitpick comments:
In `@skill/Reins/SKILL.md`:
- Around line 68-70: Update the documentation for the reins compare command to
explicitly name and describe the arguments (e.g., `reins compare
<current-audit-path> <baseline-audit-path>`), clarifying that the second
argument is a baseline audit artifact used for diffing against the current
audit; update the example and any surrounding text referencing `reins compare`
in SKILL.md so it reflects these baseline workflow semantics and that cli/reins
is the source of truth for readiness scoring and JSON outputs.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 825f348b-7508-4a1f-b7b3-2d83ee48122a
📒 Files selected for processing (17)
README.md
cli/reins/README.md
cli/reins/package.json
cli/reins/src/index.test.ts
cli/reins/src/index.ts
cli/reins/src/lib/audit/context.ts
cli/reins/src/lib/audit/scoring.ts
cli/reins/src/lib/commands/compare.ts
cli/reins/src/lib/commands/evolve.test.ts
cli/reins/src/lib/commands/evolve.ts
cli/reins/src/lib/types.ts
docs/design-docs/core-beliefs.md
docs/design-docs/symphony-integration.md
package.json
skill/Reins/HarnessMethodology.md
skill/Reins/SKILL.md
skill/Reins/Workflows/Audit.md
Reject patterns with nested quantifiers (e.g. (a+)+$) before constructing RegExp to prevent catastrophic backtracking. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cli/reins/src/lib/audit/scoring.ts`:
- Around line 522-564: The scoreCustomChecks function is over-complex; split the
logic for each check.type into small helpers to reduce nesting and cognitive
complexity. Extract and call at least handleFileExistsCheck(ctx, check, result)
to handle the "file-exists" branch and handleFileContainsCheck(ctx, check,
result) to encapsulate path resolution, existsSync, file read, ReDoS pattern
guard, regex construction (separately catching invalid RegExp) and test/result
push; optionally add validatePattern(check.pattern) to centralize the
nested-quantifier check. Ensure helpers reference the same symbols
(ctx.customChecks, result.scores, check.dimension) and preserve the exact
result.scores[dimension].findings messages for missing files, invalid regex,
rejected patterns, read errors, pattern found/not found.
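The helper split the prompt describes could be sketched like this. Everything here is illustrative: the `CustomCheck` type and helper signatures are assumptions, and findings are returned as strings rather than pushed into the real `result.scores[dimension].findings` structure.

```typescript
import { existsSync, readFileSync } from "node:fs";
import { join } from "node:path";

// Hypothetical shape of one entry in .reins/custom-checks.json.
interface CustomCheck {
  name: string;
  type: "file-exists" | "file-contains";
  path: string;
  pattern?: string;
  dimension: string;
}

// Each check type gets its own small helper, keeping the caller flat.
function handleFileExistsCheck(root: string, check: CustomCheck): string {
  return existsSync(join(root, check.path))
    ? `[custom] ${check.name}: file found`
    : `[custom] ${check.name}: file missing`;
}

function handleFileContainsCheck(root: string, check: CustomCheck): string {
  const file = join(root, check.path);
  if (!existsSync(file)) return `[custom] ${check.name}: file missing`;
  // Reject missing patterns and ones with nested quantifiers like (a+)+.
  if (!check.pattern || /\([^)]*[+*]\)[+*]/.test(check.pattern)) {
    return `[custom] ${check.name}: regex pattern rejected`;
  }
  let regex: RegExp;
  try {
    regex = new RegExp(check.pattern, "i");
  } catch {
    return `[custom] ${check.name}: invalid regex`;
  }
  const content = readFileSync(file, "utf8");
  return regex.test(content)
    ? `[custom] ${check.name}: pattern found`
    : `[custom] ${check.name}: pattern not found`;
}
```

With each branch isolated, `scoreCustomChecks` reduces to a dispatch loop, which is exactly the shape that gets it under Biome's cognitive-complexity limit.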
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: d28a4cb7-104a-4e5e-b54d-ab50ff98f6e2
📒 Files selected for processing (1)
cli/reins/src/lib/audit/scoring.ts
Split scoreCustomChecks into handleFileExistsCheck and handleFileContainsCheck helpers. Resolves Biome lint error (cognitive complexity 25 > 15 limit). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
cli/reins/src/lib/audit/scoring.ts (1)
1-2: ⚠️ Potential issue | 🟡 Minor

CI failure: Run formatter to fix style issues.

The pipeline reports formatting differences. Run the project's formatter (e.g., `bun format` or `biome format --write`) to resolve before merge.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cli/reins/src/lib/audit/scoring.ts` around lines 1 - 2, The CI failure is due to formatting differences in cli/reins/src/lib/audit/scoring.ts; run the project's formatter (e.g., bun format or biome format --write) on the repository so imports like existsSync, readFileSync, readdirSync and join and the rest of scoring.ts are reformatted to match the project's style and resolve the pipeline error.
🧹 Nitpick comments (1)
cli/reins/src/lib/audit/scoring.ts (1)
544-550: ReDoS guard may produce false positives for non-greedy quantifiers.

The pattern `/([+*?]\)?[+*?]|(\.\*){3,})/` catches dangerous nested quantifiers, but it may also reject legitimate non-greedy patterns like `a+?` or `.*?`, since `+?` matches the `[+*?]\)?[+*?]` portion.

Since this is repo-owner-authored config, false positives are low risk — users can simply adjust their patterns. However, if you want to be more precise:
🔧 Optional: More precise ReDoS detection
```diff
 // Guard against ReDoS: reject nested quantifiers like (a+)+
-if (/([+*?]\)?[+*?]|(\.\*){3,})/.test(check.pattern)) {
+// Detect: (X+)+, (X*)+, (X+)*, etc. but allow non-greedy like a+?
+if (/\([^)]*[+*]\)[+*]|([+*])\1/.test(check.pattern)) {
```
Verify each finding against the current code and only fix it if needed. In `@cli/reins/src/lib/audit/scoring.ts` around lines 544 - 550, The current ReDoS guard in the block that checks check.pattern is too broad and flags non-greedy quantifiers like +? and .*?; update the regex used there so it does not match a quantifier immediately followed by the non-greedy "?" (i.e., only detect nested/grotesque quantifiers when the first quantifier is not followed by ?), then keep the existing behavior of pushing `[custom] ${check.name}: regex pattern rejected (nested quantifiers may cause ReDoS)` into result.scores[dimension].findings when a true nested-quantifier case is detected.
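The trade-off is easy to see by running both guards (both regexes are taken from the review thread; the sample patterns are illustrative) against a few inputs:

```typescript
// Original broad guard vs. the suggested narrower one.
const broadGuard = /([+*?]\)?[+*?]|(\.\*){3,})/;
const preciseGuard = /\([^)]*[+*]\)[+*]|([+*])\1/;

for (const p of ["(a+)+$", "(x*)*y", "a+?", ".*?"]) {
  console.log(
    `${p}  broad: ${broadGuard.test(p)}  precise: ${preciseGuard.test(p)}`,
  );
}
// The broad guard flags all four patterns; the precise guard flags only the
// first two, letting legitimate non-greedy quantifiers like a+? and .*? pass.
```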
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@cli/reins/src/lib/audit/scoring.ts`:
- Around line 1-2: The CI failure is due to formatting differences in
cli/reins/src/lib/audit/scoring.ts; run the project's formatter (e.g., bun
format or biome format --write) on the repository so imports like existsSync,
readFileSync, readdirSync and join and the rest of scoring.ts are reformatted to
match the project's style and resolve the pipeline error.
---
Nitpick comments:
In `@cli/reins/src/lib/audit/scoring.ts`:
- Around line 544-550: The current ReDoS guard in the block that checks
check.pattern is too broad and flags non-greedy quantifiers like +? and .*?;
update the regex used there so it does not match a quantifier immediately
followed by the non-greedy "?" (i.e., only detect nested/grotesque quantifiers
when the first quantifier is not followed by ?), then keep the existing behavior
of pushing `[custom] ${check.name}: regex pattern rejected (nested quantifiers
may cause ReDoS)` into result.scores[dimension].findings when a true
nested-quantifier case is detected.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 21e489f8-40c7-4850-adb9-0f5619c4f593
📒 Files selected for processing (1)
cli/reins/src/lib/audit/scoring.ts
Summary
Implement all 8 proposed Reins CLI improvements derived from HumanLayer's harness engineering wisdom. Expands scoring model (22→24), promotes back-pressure and hooks to scored dimensions, adds compare command, language-agnostic back-pressure detection, and custom check plugins.
Changes
- `reins compare <path> <baseline.json>` command for tracking audit progression
- `.reins/custom-checks.json` (file-exists, file-contains checks)

Testing
- `bun test` passes (80/80)

Merge Readiness
Audit Impact
Before: max_score 22 with L3 threshold at 19. After: max_score 24 with L3 threshold at 21. Back-pressure and hooks now directly contribute to scores rather than just findings.
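As a concrete illustration of the plugin system, a `.reins/custom-checks.json` file could look like the sketch below. The `file-exists`/`file-contains` types and the `name`/`pattern`/`dimension` fields appear in the review above; the specific check values and the `path` field name are assumptions for this example.

```json
[
  {
    "name": "workflow-doc-exists",
    "type": "file-exists",
    "path": "WORKFLOW.md",
    "dimension": "agent_workflow"
  },
  {
    "name": "readme-mentions-back-pressure",
    "type": "file-contains",
    "path": "README.md",
    "pattern": "back-?pressure",
    "dimension": "agent_workflow"
  }
]
```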
Summary by CodeRabbit
New Features
Updates
Tests
Chores