Schema redesign: SARIF-aligned file findings, ruleId foreign keys, local-only scope #5

@krisrowe


Problem

The current output schema puts flat category and source fields on each finding, spreads rule information across three separate top-level fields (configured_categories, unconfigured_categories, os_discovered), and has no schema versioning. The new schema needs to formalize the rules-to-findings relationship, support attribution-based test assertions, and align with SARIF where the data naturally fits.

Design

The full proposed schema is documented in docs/agents/privacy-guard/SCHEMA-PROPOSAL.md. Key changes:

SARIF-aligned file findings

File-based findings (working tree, staged, HEAD, historical commit diffs, gitignored files) all have a file path and line number, so they map directly onto SARIF's physicalLocation model, which the output uses for them. Findings from historical commit diffs also carry a commit metadata field: the file and line are the SARIF part, and the commit context is additional metadata.
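As a rough illustration of the shape described above, here is a minimal Python sketch that builds a SARIF-style result for a file-based finding. The ruleId, message, locations, physicalLocation, artifactLocation, and region keys are standard SARIF 2.1.0 names. Everything else is an assumption: the helper name file_finding, the message text, the sample path, and putting the commit context in the result's properties bag (SCHEMA-PROPOSAL.md may place it elsewhere).

```python
def file_finding(rule_id, path, line, commit=None):
    """Build a SARIF-style result dict for a file-based finding.

    Hypothetical helper: the authoritative shape is in SCHEMA-PROPOSAL.md.
    """
    result = {
        "ruleId": rule_id,
        "message": {"text": "Private value matched"},  # placeholder text
        "locations": [
            {
                "physicalLocation": {
                    "artifactLocation": {"uri": path},
                    "region": {"startLine": line},
                }
            }
        ],
    }
    if commit is not None:
        # Assumption: commit context rides in the SARIF properties bag,
        # which is where SARIF allows tool-specific data on a result.
        result["properties"] = {"commit": commit}
    return result


# Working-tree finding: file + line only.
current = file_finding("emails:person_md_frontmatter", "docs/notes.md", 12)

# Historical finding: same file + line part, plus commit metadata.
historical = file_finding(
    "emails:person_md_frontmatter",
    "docs/notes.md",
    12,
    commit={"sha": "abc1234"},  # hypothetical metadata fields
)
```

Keeping the file and line part identical for both cases means a SARIF viewer can render historical findings the same way as current ones and simply ignore the extra metadata.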

Native format for non-file findings

Commit messages, branch names, tag names, and stash descriptions are not files. SARIF's logicalLocation with custom kind values can technically represent them, but standard SARIF viewers won't render them. These findings use our native flat format with direct location, location_type, matched_value fields — no nesting, no properties bags.
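For contrast, a non-file finding in the flat native format might be built as in the sketch below. The field names location, location_type, matched_value, and ruleId come from this issue. The helper name native_finding and the specific location_type strings are assumptions made for illustration; the real vocabulary is whatever SCHEMA-PROPOSAL.md defines.

```python
# Assumption: these location_type strings are illustrative only.
NON_FILE_LOCATION_TYPES = {
    "commit_message",
    "branch_name",
    "tag_name",
    "stash_description",
}


def native_finding(rule_id, location_type, location, matched_value):
    """Build a flat finding dict for a location that is not a file."""
    if location_type not in NON_FILE_LOCATION_TYPES:
        raise ValueError(f"not a non-file location type: {location_type!r}")
    # Flat by design: every field sits at the top level of the finding.
    return {
        "ruleId": rule_id,
        "location_type": location_type,
        "location": location,
        "matched_value": matched_value,
    }


branch_hit = native_finding(
    "emails:person_md_frontmatter",
    "branch_name",
    "feature/notify-jane",       # hypothetical branch name
    "jane@example.com",          # hypothetical matched value
)
```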

ruleId foreign keys

Findings reference rules via a string ruleId (format: category:source, e.g., "emails:person_md_frontmatter"). Rules are defined once in tool.rules[] with id, category, source, and count. A category can have rules from multiple sources.
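The foreign-key relationship can be sketched in Python as follows. The rule fields (id, category, source, count) and the category:source id format are from this issue. The helper names, the second source name, and the count values are placeholders, and the sketch does not assume any particular meaning for count.

```python
def make_rule(category, source, count):
    """Build one tool.rules[] entry; the id is derived as category:source."""
    return {
        "id": f"{category}:{source}",
        "category": category,
        "source": source,
        "count": count,  # semantics are defined by the schema proposal
    }


def index_rules(rules):
    """Index rules by id, rejecting duplicates so ids behave like keys."""
    by_id = {}
    for rule in rules:
        if rule["id"] in by_id:
            raise ValueError(f"duplicate rule id: {rule['id']}")
        by_id[rule["id"]] = rule
    return by_id


def resolve_rule(finding, rules_by_id):
    """Follow a finding's ruleId to its rule definition."""
    rule_id = finding["ruleId"]
    if rule_id not in rules_by_id:
        raise KeyError(f"finding references unknown rule: {rule_id}")
    return rules_by_id[rule_id]


# One category with rules from two sources.
rules = [
    make_rule("emails", "person_md_frontmatter", 2),
    make_rule("emails", "example_other_source", 1),  # hypothetical source
]
rules_by_id = index_rules(rules)
rule = resolve_rule({"ruleId": "emails:person_md_frontmatter"}, rules_by_id)
```

Because the id is a plain string, a consumer that only needs the category can either look up the rule or split the id once on the first colon.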

Schema versioning

Add a version field at the top level so consumers know which shape to expect.
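A consumer could gate on the version field before reading anything else, as in the sketch below. The issue does not specify the type or format of the version value, so the string "2", the SUPPORTED_VERSIONS set, and the check_version helper are all assumptions.

```python
import json

# Assumption: version is a string and "2" names the redesigned schema.
SUPPORTED_VERSIONS = {"2"}


def check_version(output):
    """Return the schema version of parsed agent output, or raise."""
    version = output.get("version")
    if version is None:
        # Output produced before versioning existed has no version field.
        raise ValueError("no version field: output predates schema versioning")
    if version not in SUPPORTED_VERSIONS:
        raise ValueError(f"unsupported schema version: {version!r}")
    return version


parsed = json.loads('{"version": "2", "tool": {"rules": []}}')
schema_version = check_version(parsed)
```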

Scope narrowing to local-only

Remove GitHub issues and PRs from the default scan scope. The agent's core question is "can I commit and push safely right now?" — that's local. Remote content scanning is a separate concern. This also simplifies the schema by removing issue and pr location types.

Work breakdown

  • Add version field to agent output
  • Replace configured_categories / unconfigured_categories / os_discovered with tool.rules[] array
  • Replace category + source on findings with ruleId string referencing tool.rules[]
  • Remove issue/PR scanning from agent definition and scan scope
  • Update scan_scope to remove issues_checked / prs_checked
  • Update existing tests for new schema shape
  • Add tests asserting on ruleId and tool.rules[] structure
  • Update SCHEMA.md to reflect the new schema (move the proposed schema into SCHEMA.md; archive or remove the SCHEMA-PROPOSAL.md gap analysis)
  • Update README.md and CONTRIBUTING.md as needed
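The test items above could start from checks like the following Python sketch. It assumes the rules and findings have already been pulled out of the agent output as lists of dicts; where findings live in the output, and the helper names, are assumptions rather than part of the proposal.

```python
def dangling_rule_ids(rules, findings):
    """Return ruleIds used by findings that no rule in tool.rules[] defines."""
    known = {rule["id"] for rule in rules}
    used = {finding["ruleId"] for finding in findings}
    return sorted(used - known)


def malformed_rule_ids(rules):
    """Return rule ids that do not equal their own category:source."""
    return [
        rule["id"]
        for rule in rules
        if rule["id"] != f"{rule['category']}:{rule['source']}"
    ]


# Hypothetical sample data for the checks.
sample_rules = [
    {
        "id": "emails:person_md_frontmatter",
        "category": "emails",
        "source": "person_md_frontmatter",
        "count": 2,
    },
]
sample_findings = [
    {"ruleId": "emails:person_md_frontmatter"},
    {"ruleId": "emails:missing_source"},  # deliberately dangling
]

dangling = dangling_rule_ids(sample_rules, sample_findings)
malformed = malformed_rule_ids(sample_rules)
```

In a real test the expectation would be that both lists are empty; the dangling entry here only shows what a failure looks like.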

Open questions

  • Should category remain as a convenience field on findings alongside ruleId? Avoids parsing the ruleId string for simple filtering.
  • Rule lifecycle across runs: if PERSON.md changes between scans, rules change. Multi-repo scanning (issue #1, "Tracking: open work from initial privacy-guard build session", item 3) will need run-scoping. Not blocking for initial implementation.
  • Suppressions / false positive tracking: SARIF has suppressions on results. Consider whether findings should carry a suppression state for user-acknowledged false positives. Not blocking for initial implementation.
