Schema redesign: SARIF-aligned file findings, ruleId foreign keys, local-only scope #5
Description
Problem
The current output schema uses flat category + source fields on findings, separate configured_categories / unconfigured_categories / os_discovered top-level fields, and no schema versioning. The schema needs to formalize the rules-to-findings relationship, support attribution-based test assertions, and align with SARIF where the data naturally fits.
Design
The full proposed schema is documented in docs/agents/privacy-guard/SCHEMA-PROPOSAL.md. Key changes:
SARIF-aligned file findings
File-based findings (working tree, staged, HEAD, historical commit diffs, gitignored files) all have a file path and line number. These are natively SARIF-compliant. The output uses SARIF's physicalLocation model for these findings. Historical commit diffs include a commit metadata field — the file + line is the SARIF part, the commit context is additional metadata.
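As a rough illustration of the physicalLocation shape described above, the sketch below builds one working-tree finding and one historical-commit finding. The nesting (locations, physicalLocation, artifactLocation.uri, region.startLine) follows SARIF 2.1.0; the helper name file_finding, the example path, and the commit key with its sha sub-field are assumptions made for illustration, not names taken from SCHEMA-PROPOSAL.md.

```python
from typing import Any, Optional


def file_finding(rule_id: str, path: str, line: int,
                 commit: Optional[str] = None) -> dict[str, Any]:
    """Build a SARIF-style result for a file-based finding."""
    finding: dict[str, Any] = {
        "ruleId": rule_id,
        "locations": [{
            "physicalLocation": {
                "artifactLocation": {"uri": path},
                "region": {"startLine": line},
            }
        }],
    }
    if commit is not None:
        # Hypothetical key: the issue only says that historical findings
        # carry "a commit metadata field" next to the SARIF part.
        finding["commit"] = {"sha": commit}
    return finding


# Working tree: file + line only.
working_tree = file_finding("emails:person_md_frontmatter", "src/config.py", 12)

# Historical commit diff: same SARIF location plus commit context.
historical = file_finding("emails:person_md_frontmatter", "src/config.py", 12,
                          commit="abc1234")
```

Both findings share the same location structure, so a test can assert on file and line in one way regardless of where the finding came from.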
Native format for non-file findings
Commit messages, branch names, tag names, and stash descriptions are not files. SARIF's logicalLocation with custom kind values can technically represent them, but standard SARIF viewers won't render them. These findings use our native flat format with direct location, location_type, matched_value fields — no nesting, no properties bags.
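A minimal sketch of what such a flat finding might look like, for contrast with the nested SARIF shape. The field names location, location_type, matched_value, and ruleId come from this issue; the class name NativeFinding, the value "commit_message", and the use of a commit SHA as the location are assumptions.

```python
from dataclasses import dataclass, asdict


@dataclass
class NativeFinding:
    """Flat finding for non-file locations: every field is a plain string."""
    ruleId: str
    location: str       # assumed here: commit SHA, branch name, tag name, or stash ref
    location_type: str  # assumed values, e.g. "commit_message", "branch_name"
    matched_value: str


finding = asdict(NativeFinding(
    ruleId="emails:person_md_frontmatter",
    location="abc1234",
    location_type="commit_message",
    matched_value="jane@example.com",
))
```

Because nothing is nested, a test can compare the finding against a literal dict without walking into locations or a properties bag.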
ruleId foreign keys
Findings reference rules via a string ruleId (format: category:source, e.g., "emails:person_md_frontmatter"). Rules are defined once in tool.rules[] with id, category, source, and count. A category can have rules from multiple sources.
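The foreign-key relationship can be sketched as a lookup from each finding's ruleId into tool.rules[]. The rule fields (id, category, source, count) and the category:source format are from this issue; the helper names, the top-level findings key, the os_discovered source value, and the sample counts are assumptions used only to make the example runnable.

```python
from typing import Any


def make_rule_id(category: str, source: str) -> str:
    """Compose a ruleId in the category:source format."""
    return f"{category}:{source}"


def index_rules(rules: list[dict[str, Any]]) -> dict[str, dict[str, Any]]:
    """Index tool.rules[] by id so findings can be resolved in O(1)."""
    return {rule["id"]: rule for rule in rules}


def dangling_rule_ids(output: dict[str, Any]) -> list[str]:
    """Return ruleIds used by findings that no rule in tool.rules[] defines."""
    known = index_rules(output["tool"]["rules"])
    return sorted({f["ruleId"] for f in output["findings"]
                   if f["ruleId"] not in known})


output = {
    "tool": {"rules": [
        # One category ("emails") with rules from two sources.
        {"id": make_rule_id("emails", "person_md_frontmatter"),
         "category": "emails", "source": "person_md_frontmatter", "count": 2},
        {"id": make_rule_id("emails", "os_discovered"),  # hypothetical source
         "category": "emails", "source": "os_discovered", "count": 1},
    ]},
    "findings": [
        {"ruleId": "emails:person_md_frontmatter"},
        {"ruleId": "emails:missing_rule"},  # deliberately broken reference
    ],
}

broken = dangling_rule_ids(output)
```

A check like dangling_rule_ids is the kind of assertion the new tests could make: every finding's ruleId should resolve to exactly one entry in tool.rules[].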
Schema versioning
version field at the top level so consumers know which shape to expect.
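One way a consumer might use the field is to refuse output whose shape it does not know. This is only a sketch: the issue does not say whether version is a string or a number, so the string value "2" and the names SUPPORTED_VERSIONS and check_version are assumptions.

```python
from typing import Any

# Hypothetical: the set of schema versions this consumer understands.
SUPPORTED_VERSIONS = {"2"}


def check_version(output: dict[str, Any]) -> str:
    """Return the schema version, or raise if it is missing or unsupported."""
    version = output.get("version")
    if version is None:
        # Output produced before versioning existed has no such field.
        raise ValueError("output has no version field")
    if version not in SUPPORTED_VERSIONS:
        raise ValueError(f"unsupported schema version: {version!r}")
    return version
```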
Scope narrowing to local-only
Remove GitHub issues and PRs from the default scan scope. The agent's core question is "can I commit and push safely right now?" — that's local. Remote content scanning is a separate concern. This also simplifies the schema by removing issue and pr location types.
Work breakdown
- Add version field to agent output
- Replace configured_categories / unconfigured_categories / os_discovered with tool.rules[] array
- Replace category + source on findings with ruleId string referencing tool.rules[]
- Remove issue/PR scanning from agent definition and scan scope
- Update scan_scope to remove issues_checked / prs_checked
- Update existing tests for new schema shape
- Add tests asserting on ruleId and tool.rules[] structure
- Update SCHEMA.md to reflect the new schema (move current to SCHEMA.md, archive or remove SCHEMA-PROPOSAL.md gap analysis)
- Update README.md and CONTRIBUTING.md as needed
Open questions
- Should category remain as a convenience field on findings alongside ruleId? Avoids parsing the ruleId string for simple filtering.
- Rule lifecycle across runs: if PERSON.md changes between scans, rules change. Multi-repo scanning (issue #1, "Tracking: open work from initial privacy-guard build session", item 3) will need run-scoping. Not blocking for initial implementation.
- Suppressions / false positive tracking: SARIF has suppressions on results. Consider whether findings should carry a suppression state for user-acknowledged false positives. Not blocking for initial implementation.
Related
- docs/agents/privacy-guard/SCHEMA-PROPOSAL.md: full design with SARIF gap analysis and test ergonomics rationale
- docs/agents/privacy-guard/SCHEMA.md: current schema
- #2, "Agent should own PII categories and reason from any input, not just structured YAML": agent should own PII categories (attribution, test matrix)
- #3, "Update debug-agent-tests skill: log review as verification on every run": validate-privacy-guard skill update
- #4, "Agent interface contract: parent-facing skill, schema enforcement, and discovery": agent interface contract (parent skill, --json-schema)