Provenance Composition Model

Schema version: v1.1.8

The PCM records how files changed (where, by whom, and by what kind of action) without storing your source code.

Why PCM exists

Teams want trustworthy insight into file composition (e.g., human vs. AI effort, paste vs. edit) without risking source code exposure. PCM captures events about edits and produces per-file snapshots with counts and ranges—never raw text.

What PCM tracks (at a glance)

Events: Each edit is an event (insert, replace, delete, paste, AI apply, format, tooling).
Where: Byte ranges for "before" and "after" (precise positions inside a file).
How much: Lines and character counts, plus a content hash (no text).
Who: An actor (e.g., user/bot/system), using opaque identifiers.
Origin: The immediate source of the edit:
- ai: AI suggestion/application
- human: typing, paste, manual edits
- observed: observed tool output
- untracked: when origin can't be determined
- external: external edits or unknown attribution
Category (PCM 1.1.8): Broader classification: human, automation, preexisting, or out_of_band.
Subtype (PCM 1.1.8): Specific type when determinable: ai, ai_assisted, tooling, format, generator, bootstrap, codemod.
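
As a rough illustration of the "How much" fields, the sketch below reduces a piece of inserted text to line and character counts plus a digest, so the text itself never needs to be stored. The function name `size_fields` is hypothetical, and plain SHA-256 is an assumption: the `ws-sha256` algorithm named in the examples on this page is not specified here.

```python
import hashlib


def size_fields(text: str) -> dict:
    """Summarize text as counts plus a digest; the text itself is not kept."""
    data = text.encode("utf-8")
    return {
        "lines": len(text.splitlines()),
        "chars_total": len(text),
        # Assumption: plain SHA-256 stands in for PCM's "ws-sha256",
        # whose exact definition is not given on this page.
        "hash": {"algo": "sha256", "value": hashlib.sha256(data).hexdigest()},
    }


fields = size_fields("hello\nworld\n")
print(fields["lines"], fields["chars_total"])
```

Only the returned dictionary would be recorded; the original string can be discarded once the counts and digest are computed.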

Key terms (plain English)

Event — One edit operation to a file.
Actor — Who performed the edit (e.g., a user or a tool); use non-sensitive IDs.
Origin — Immediate source classification: human, ai, untracked, external, or observed (ingest-only).
Category (PCM 1.1.8) — Broader classification: human (deliberate work), automation (AI/tooling), preexisting (baseline), out_of_band (external).
Subtype (PCM 1.1.8) — Specific type when determinable: ai, ai_assisted, tooling, format, generator, bootstrap, codemod.
Snapshot — A per-file JSON summary under .coderoot/v1/snapshots/ that shows spans and totals by origin, category, and subtype.
Span — A region of a file (by byte range) with an origin, category, and timestamps.

Operations (cheat sheet)

Operation	Typical meaning	Size fields present
`insert`	New content added	`introduced`
`replace`	Old content replaced by new	`deleted` + `introduced`
`delete`	Content removed	`deleted`
`paste`	Pasted content (treated as a human action)	`introduced`
`ai_apply`	AI suggestion applied	`introduced` (and sometimes `deleted`)
`format`	Automated formatting	Usually size-neutral
`tooling`	Tool-driven change (e.g., refactor)	Varies
`rename`	File renamed	N/A
`move`	File moved	N/A
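
The cheat sheet can be read as a small lookup table. The sketch below is one way to check that an event carries the size fields its operation normally implies. It covers only the rows where the table is unambiguous; `EXPECTED_SIZE_FIELDS` and `missing_size_fields` are hypothetical names, not part of PCM.

```python
# Size fields each operation is expected to carry, taken from the cheat sheet.
# Operations whose size fields vary (ai_apply, format, tooling) or do not
# apply (rename, move) are deliberately left out.
EXPECTED_SIZE_FIELDS = {
    "insert": {"introduced"},
    "replace": {"deleted", "introduced"},
    "delete": {"deleted"},
    "paste": {"introduced"},
}


def missing_size_fields(event: dict) -> set:
    """Return the expected size fields that are absent from the event."""
    required = EXPECTED_SIZE_FIELDS.get(event.get("op"), set())
    return {name for name in required if name not in event}


event = {"op": "replace", "introduced": {"lines": 1, "chars_total": 4}}
print(missing_size_fields(event))
```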

Actors, origin, category & subtype (practical guidance)

Actor: keep it simple and private—opaque ID, optional display name (no emails/tokens).
Origin: Immediate source classification
- ai — AI suggestion/application
- human — typing, paste, manual edits
- observed — observed tool output
- untracked — when origin can't be determined
- external — external edits or unknown attribution
Category (PCM 1.1.8): Broader classification system
- human — Deliberate human work (typing, paste, manual edits)
- automation — AI-assisted and trusted automation (AI applies, formatters, tooling)
- preexisting — Content present at workspace initialization
- out_of_band — External edits introduced outside IDE attribution
Subtype (PCM 1.1.8): Specific type when determinable
- ai — Direct AI-generated code
- ai_assisted — AI suggestions requiring human confirmation
- tooling — Automated tool operations (linters, refactors)
- format — Formatting-only changes
- generator — Code generation tools
- bootstrap — Initial project setup
- codemod — Automated code transformations
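
To make the guidance above concrete, here is a minimal sketch of how an operation and origin might map to a category and subtype. The rules are an assumption inferred from the descriptions on this page (AI applies, formatters, and tooling count as automation; human origin counts as human; external edits count as out_of_band). The actual classifier may weigh more evidence, and `classify` is a hypothetical name.

```python
def classify(op: str, origin: str) -> dict:
    """Return a provenance dict with a category and, when determinable, a subtype.

    Assumed rules only; this is not the PCM implementation.
    """
    if op == "ai_apply":
        return {"category": "automation", "subtype": "ai"}
    if op in ("format", "tooling"):
        # The operation name doubles as the subtype for these two cases.
        return {"category": "automation", "subtype": op}
    if origin == "human":
        return {"category": "human"}
    if origin == "external":
        return {"category": "out_of_band"}
    # Anything else is left unclassified rather than guessed.
    return {}


print(classify("ai_apply", "ai"))
print(classify("paste", "human"))
```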

Snapshots (what readers actually use)

Every tracked file can have a snapshot at: .coderoot/v1/snapshots/<relative-path>.pcm.json

A snapshot includes:

Spans: regions with an origin, category, and timestamps
Summary: Multiple aggregation views:
- lines_by_origin / chars_by_origin — By immediate source (human, ai, untracked)
- lines_by_category / chars_by_category — By broader classification (human, automation, preexisting, out_of_band)
- lines_by_subtype / chars_by_subtype — By specific type (ai, tooling, format, etc.)
- lines_by_bucket / chars_by_bucket (PCM 1.1.8) — Category-based rollups
Metadata: File information, encoding, replay checkpoints for incremental processing

These power reports (e.g., "% human vs. AI", "% automation", breakdowns by subtype) without exposing any code.
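
A report such as "% human vs. AI" needs nothing beyond the summary counts. The sketch below turns any of the `*_by_*` maps into percentages; `percent_by` is a hypothetical helper, and the rounding to one decimal place is an arbitrary choice.

```python
def percent_by(summary: dict, key: str = "lines_by_origin") -> dict:
    """Convert one aggregation view of a snapshot summary into percentages."""
    counts = summary.get(key, {})
    total = sum(counts.values())
    if total == 0:
        # Avoid dividing by zero for empty files or missing views.
        return {name: 0.0 for name in counts}
    return {name: round(100.0 * n / total, 1) for name, n in counts.items()}


summary = {"lines_by_origin": {"human": 2, "ai": 5, "untracked": 0}}
print(percent_by(summary))  # {'human': 28.6, 'ai': 71.4, 'untracked': 0.0}
```

The same function works for `lines_by_category`, `chars_by_subtype`, and the other views by passing a different `key`.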

What PCM deliberately does not store

❌ No source code text
❌ No clipboard contents
❌ No credentials or personal emails
❌ No tool internals or proprietary CI details

Minimal redacted examples

Event (illustrative):

{
  "schema_version": "1.1.8",
  "record_type": "pcm_event",
  "event_id": "e-123",
  "file_path": "src/example.txt",
  "op": "insert",
  "origin": "human",
  "actor": { "id": "u-abc" },
  "after": { "range": { "startByte": 0, "endByte": 12 } },
  "introduced": {
    "lines": 2,
    "chars_total": 12,
    "hash": { "algo": "ws-sha256", "value": "…" }
  },
  "provenance": {
    "category": "human",
    "reason_code": "op:typing",
    "evidence": ["op:typing"]
  }
}

AI Apply Event (illustrative):

{
  "schema_version": "1.1.8",
  "record_type": "pcm_event",
  "event_id": "e-456",
  "file_path": "src/example.txt",
  "op": "ai_apply",
  "origin": "ai",
  "actor": { "id": "u-abc" },
  "after": { "range": { "startByte": 12, "endByte": 50 } },
  "introduced": {
    "lines": 5,
    "chars_total": 38,
    "hash": { "algo": "ws-sha256", "value": "…" }
  },
  "provenance": {
    "category": "automation",
    "subtype": "ai",
    "reason_code": "op:ai_apply",
    "evidence": ["op:ai_apply", "suggestion_accepted"]
  }
}

Snapshot (illustrative):

{
  "schema_version": "1.1.8",
  "file_path": "src/example.txt",
  "file_id": "file-abc123",
  "updated_at": "2025-10-07T00:00:00Z",
  "spans": [
    { "span_id": "s-1",
      "range": {"startByte": 0, "endByte": 12},
      "origin": "human",
      "category": "human",
      "introduced_at": "2025-10-07T00:00:00Z",
      "last_modified_at": "2025-10-07T00:00:00Z" },
    { "span_id": "s-2",
      "range": {"startByte": 12, "endByte": 50},
      "origin": "ai",
      "category": "automation",
      "introduced_at": "2025-10-07T00:05:00Z",
      "last_modified_at": "2025-10-07T00:05:00Z" }
  ],
  "summary": {
    "lines_total": 7,
    "lines_by_origin": { "human": 2, "ai": 5, "untracked": 0, "observed": 0, "external": 0 },
    "chars_by_origin": { "human": 12, "ai": 38, "untracked": 0, "observed": 0, "external": 0 },
    "lines_by_category": { "human": 2, "automation": 5, "preexisting": 0, "out_of_band": 0 },
    "chars_by_category": { "human": 12, "automation": 38, "preexisting": 0, "out_of_band": 0 },
    "lines_by_subtype": { "ai": 5, "ai_assisted": 0, "tooling": 0, "format": 0, "generator": 0, "bootstrap": 0, "codemod": 0 },
    "chars_by_subtype": { "ai": 38, "ai_assisted": 0, "tooling": 0, "format": 0, "generator": 0, "bootstrap": 0, "codemod": 0 },
    "lines_by_bucket": { "human": 2, "automation": 5, "preexisting": 0, "out_of_band": 0 },
    "chars_by_bucket": { "human": 12, "automation": 38, "preexisting": 0, "out_of_band": 0 },
    "touched": 2,
    "last_modified_at": "2025-10-07T00:05:00Z"
  },
  "meta": {
    "replay_checkpoint": {
      "schema_applied": "1.1.8",
      "processed_through_event_id": "e-456",
      "processed_through_ts": "2025-10-07T00:05:00Z"
    }
  }
}

Note: Hashes are shown as … on purpose. PCM uses hashes to verify content without storing it.
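
The idea of verifying content without storing it can be sketched as follows: recompute the digest of the text you hold locally and compare it with the recorded value. This assumes plain SHA-256 over UTF-8 bytes, which may differ from PCM's `ws-sha256`; `matches_hash` is a hypothetical name.

```python
import hashlib
import hmac


def matches_hash(text: str, recorded_hex: str) -> bool:
    """Check local text against a recorded digest (assumed plain SHA-256)."""
    actual = hashlib.sha256(text.encode("utf-8")).hexdigest()
    # compare_digest performs a constant-time comparison of the two strings.
    return hmac.compare_digest(actual, recorded_hex.lower())


recorded = hashlib.sha256(b"hello\n").hexdigest()
print(matches_hash("hello\n", recorded))
print(matches_hash("hello", recorded))
```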

Quick-start: verify PCM locally (2 minutes)

Make a tiny change in any file (add one short line).
Generate snapshots with your editor integration or CLI.
Open the snapshot at .coderoot/v1/snapshots/<relative-path>.pcm.json.
Confirm:
- summary.lines_total increased as expected
- lines_by_origin.human (or ai) reflects your change
- lines_by_category and lines_by_bucket show category breakdowns
- lines_by_subtype shows specific types when determinable
- No raw text—only counts, ranges, hashes
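
The checks above can be partly automated. The sketch below builds the snapshot path for a file and runs two sanity checks on a loaded snapshot. It assumes `.pcm.json` is appended to the file's full relative path and that `lines_total` equals the sum of `lines_by_origin`, as in the illustrative snapshot on this page. All function names are hypothetical.

```python
import json
from pathlib import Path


def snapshot_path(repo_root: str, relative_path: str) -> Path:
    """Build the snapshot location for a tracked file (assumed naming rule)."""
    return Path(repo_root) / ".coderoot" / "v1" / "snapshots" / f"{relative_path}.pcm.json"


def load_snapshot(repo_root: str, relative_path: str) -> dict:
    """Read and parse the snapshot JSON for a tracked file."""
    return json.loads(snapshot_path(repo_root, relative_path).read_text(encoding="utf-8"))


def check_snapshot(snapshot: dict) -> list:
    """Return a list of human-readable problems; an empty list means no issue found."""
    problems = []
    summary = snapshot.get("summary", {})
    total = summary.get("lines_total")
    by_origin = sum(summary.get("lines_by_origin", {}).values())
    if total != by_origin:
        problems.append(f"lines_total {total} != sum of lines_by_origin {by_origin}")
    for span in snapshot.get("spans", []):
        rng = span["range"]
        if rng["startByte"] > rng["endByte"]:
            problems.append(f"span {span.get('span_id')} has an inverted range")
    return problems


print(snapshot_path(".", "src/example.txt").as_posix())
```

After generating snapshots, `check_snapshot(load_snapshot(".", "src/example.txt"))` would report any mismatch between the totals.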

Privacy & safe-use tips

Prefer hash-only handling for clipboard-related data.
Keep actor identifiers opaque and local to the repo.
Review snapshots locally; publish only when numbers match expectations.

Origins vs Categories vs Buckets

Origins represent the immediate source:

human: Direct human input
ai: AI-generated content
observed: Observed tool output (ingest-only, coerced to untracked in snapshots)
untracked: When origin can't be determined
external: External edits

Categories represent the broader classification:

human: Deliberate human work
automation: AI-assisted and trusted automation
preexisting: Baseline content from workspace initialization
out_of_band: External edits needing review

Buckets (PCM 1.1.8) are category-based rollups that sum to the same totals as origin-based fields. They provide an alternative view for reporting and analysis.
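
Since buckets are described as summing to the same totals as the origin-based fields, a reader can use that as a consistency check. A minimal sketch, with `rollups_agree` as a hypothetical name:

```python
def rollups_agree(summary: dict, unit: str = "lines") -> bool:
    """Check that the bucket view and the origin view sum to the same total.

    `unit` is "lines" or "chars", matching the summary field prefixes.
    """
    origin_total = sum(summary.get(f"{unit}_by_origin", {}).values())
    bucket_total = sum(summary.get(f"{unit}_by_bucket", {}).values())
    return origin_total == bucket_total


summary = {
    "lines_by_origin": {"human": 2, "ai": 5, "untracked": 0},
    "lines_by_bucket": {"human": 2, "automation": 5, "preexisting": 0, "out_of_band": 0},
}
print(rollups_agree(summary))
```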

Version & compatibility

Schema: v1.1.8
Readers are tolerant of earlier 1.1.x data (1.1.3–1.1.7) and will normalize older field names where reasonable.
PCM 1.1.8 introduces category/subtype classification and bucket rollups while maintaining backward compatibility with origin-based fields.
This page is a public summary. Implementation details live in the (separate) spec and schema files included in this repo.
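
A consumer of PCM data might gate on the schema version before reading a record. The sketch below accepts the range this page describes (1.1.3 through 1.1.8) and rejects everything else. The function name and the strict rejection policy are assumptions; a real reader may be more lenient.

```python
def is_readable(schema_version: str) -> bool:
    """Return True for schema versions 1.1.3 through 1.1.8 (optional 'v' prefix)."""
    parts = schema_version.lstrip("v").split(".")
    try:
        major, minor, patch = (int(p) for p in parts)
    except ValueError:
        # Covers non-numeric parts as well as the wrong number of parts.
        return False
    return (major, minor) == (1, 1) and 3 <= patch <= 8


print(is_readable("1.1.8"), is_readable("v1.1.3"), is_readable("1.1.2"))
```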

FAQ (short)

Does PCM store my code? No. PCM stores ranges, counts, and hashes—never raw text.

What if I paste content? It's recorded as a paste operation, and the origin is human.

What if AI applies a change? It's recorded as ai_apply with origin ai.

Why byte ranges? They're precise and resilient to line ending differences. Lines/columns may appear as hints, but byte ranges are the source of truth.

Where do I find the data? Per-file snapshots live under .coderoot/v1/snapshots/.
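
On the byte-range question: byte offsets and character offsets diverge as soon as a file contains multi-byte characters, which is one reason to treat one of them as authoritative. The sketch below converts a character range to a byte range, assuming UTF-8 encoding; `byte_range` is a hypothetical helper.

```python
def byte_range(text: str, start_char: int, end_char: int, encoding: str = "utf-8") -> dict:
    """Convert a character range into a byte range for the given encoding."""
    start = len(text[:start_char].encode(encoding))
    end = len(text[:end_char].encode(encoding))
    return {"startByte": start, "endByte": end}


# "é" occupies two bytes in UTF-8, so offsets after it shift by one.
print(byte_range("héllo", 2, 5))  # {'startByte': 3, 'endByte': 6}
```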
