Commit d061394

docs: add design document
Assisted-by: Claude.ai (Claude Opus 4.6)
1 parent 146dbfb commit d061394

1 file changed

Lines changed: 263 additions & 0 deletions

docs/design/git-data.md

@@ -0,0 +1,263 @@
+++
title = "git-data"
subtitle = "Design Specification"
version = "0.1.0"
date = 2026-03-19
status = "Draft"
+++

# git-data

## Overview

git-data is a workspace of three crates that provide structured data primitives over Git refs.

Git has refs, commits, trees, and blobs.
It has no built-in concept of structured annotations on objects, versioned records outside of branches, or bidirectional relationships between refs.
These three patterns recur in any system that uses Git as a data store.

git-data provides them as independent, composable libraries.

## Crates

### git-metadata

Structured annotations on existing Git objects.

A metadata entry is keyed by the OID of the object it annotates.
The annotated object exists independently; the metadata describes it.
This extends Git's notes (which map OIDs to blobs) to map OIDs to trees, allowing multiple tools to attach named entries under the same OID without conflict.

**Ref structure:**

```text
refs/metadata/<namespace> → commit → tree
  <oid-prefix>/
    <oid-suffix>/
      <entry-name>    # blob: arbitrary content
```

The OID is split into a prefix directory and a suffix directory.
This fanout prevents pathological tree sizes, following the pattern Git uses for loose objects, which are stored under a directory named by the first two hex characters of the OID.
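
As a rough illustration of the fanout, the helper below builds an entry path from a hex OID. The function name `entry_path` and the `prefix_len` parameter are assumptions of this sketch; the design does not fix the prefix length.

```rust
/// Build the tree path for a metadata entry by splitting the hex OID into
/// a prefix directory and a suffix directory.
///
/// `prefix_len` is an assumption of this sketch, not part of the design.
/// Returns `None` if the OID is not plain hex, is too short to split,
/// or the entry name is empty.
pub fn entry_path(oid_hex: &str, entry_name: &str, prefix_len: usize) -> Option<String> {
    let is_hex = oid_hex.bytes().all(|b| b.is_ascii_hexdigit());
    if !is_hex || prefix_len == 0 || oid_hex.len() <= prefix_len || entry_name.is_empty() {
        return None;
    }
    // Safe to split by byte index: every byte is an ASCII hex digit.
    let (prefix, suffix) = oid_hex.split_at(prefix_len);
    Some(format!("{}/{}/{}", prefix, suffix, entry_name))
}
```

With a prefix length of 2, the OID `0123456789abcdef0123456789abcdef01234567` and the entry name `review` give the path `01/23456789abcdef0123456789abcdef01234567/review`.
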

**Operations:**

- `attach(oid, path, content)`: write a blob at `<oid>/<path>` in the metadata tree, where `<oid>` stands for the fanned-out `<oid-prefix>/<oid-suffix>` directory.
- `read(oid, path)`: read a single entry.
- `read_all(oid)`: list all entries for an object.
- `remove(oid, path)`: delete an entry.

Every write is a new commit on the metadata ref.
The commit history is the audit log of all annotation changes.
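
The relationship between writes and the audit log can be modelled in memory. The sketch below is not the crate's API: `MetadataRef`, its `commits` vector, and `history_len` are hypothetical names, and each stored snapshot stands in for one commit's tree.

```rust
use std::collections::BTreeMap;

/// One snapshot of the metadata tree: (oid, entry path) -> blob content.
type Tree = BTreeMap<(String, String), Vec<u8>>;

/// In-memory stand-in for a metadata ref (hypothetical, for illustration).
/// Each element of `commits` plays the role of one commit's tree;
/// the last element is the current state.
#[derive(Default)]
pub struct MetadataRef {
    commits: Vec<Tree>,
}

impl MetadataRef {
    /// Copy of the current tree, or an empty tree before the first write.
    fn head(&self) -> Tree {
        self.commits.last().cloned().unwrap_or_default()
    }

    /// Write a blob for `oid` at `path`, recording a new snapshot.
    pub fn attach(&mut self, oid: &str, path: &str, content: &[u8]) {
        let mut tree = self.head();
        tree.insert((oid.to_string(), path.to_string()), content.to_vec());
        self.commits.push(tree);
    }

    /// Read a single entry from the current snapshot.
    pub fn read(&self, oid: &str, path: &str) -> Option<Vec<u8>> {
        let key = (oid.to_string(), path.to_string());
        self.commits.last()?.get(&key).cloned()
    }

    /// List all entries for an object as (path, content), sorted by path.
    pub fn read_all(&self, oid: &str) -> Vec<(String, Vec<u8>)> {
        self.head()
            .into_iter()
            .filter(|((o, _), _)| o.as_str() == oid)
            .map(|((_, p), c)| (p, c))
            .collect()
    }

    /// Delete an entry. A snapshot is recorded only if something was removed.
    pub fn remove(&mut self, oid: &str, path: &str) -> bool {
        let mut tree = self.head();
        let removed = tree.remove(&(oid.to_string(), path.to_string())).is_some();
        if removed {
            self.commits.push(tree);
        }
        removed
    }

    /// Number of snapshots, i.e. the length of the audit log.
    pub fn history_len(&self) -> usize {
        self.commits.len()
    }
}
```

Two `attach` calls followed by one successful `remove` leave three snapshots in this model, one per write.
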

**Concurrency:**

Two writers annotating different OIDs touch disjoint tree paths.
A three-way tree merge resolves these automatically.
Two writers annotating the same OID at different paths also merge cleanly.
A conflict occurs only when two writers modify the same entry on the same OID concurrently; the correct resolution is to reject the later write and let that writer retry.
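
The merge rule above can be sketched over flattened trees. This is a simplified model, assuming each tree is a map from full entry path to blob content; `merge_trees` and the `Tree` alias are illustrative names, not part of the crate.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// A flattened tree: full entry path -> blob content.
pub type Tree = BTreeMap<String, Vec<u8>>;

/// Three-way merge of two trees against their common base.
/// Returns the merged tree, or the paths that both sides changed in
/// different ways (the caller would reject the write and retry).
pub fn merge_trees(base: &Tree, ours: &Tree, theirs: &Tree) -> Result<Tree, Vec<String>> {
    let paths: BTreeSet<&String> = base
        .keys()
        .chain(ours.keys())
        .chain(theirs.keys())
        .collect();
    let mut merged = Tree::new();
    let mut conflicts = Vec::new();

    for path in paths {
        let (b, o, t) = (base.get(path), ours.get(path), theirs.get(path));
        let winner = if o == t {
            o // both sides agree, including both deleting the entry
        } else if o == b {
            t // only theirs changed this path
        } else if t == b {
            o // only ours changed this path
        } else {
            // Both sides changed the same entry differently.
            conflicts.push(path.clone());
            continue;
        };
        if let Some(content) = winner {
            merged.insert(path.clone(), content.clone());
        }
    }

    if conflicts.is_empty() {
        Ok(merged)
    } else {
        Err(conflicts)
    }
}
```

Writes to disjoint paths merge into a tree containing both entries, while two different writes to the same path return that path as a conflict.
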
### git-ledger

Versioned records stored as refs.

A ledger entry is a standalone ref with its own lifecycle.
It is not metadata on any object; it is an independent record with a sequential ID, commit history as an audit log, and tree-structured state.

**Ref structure:**

```text
refs/<namespace>/<id> → commit → tree
  <field>             # blob: field value
  <field>
  <subdir>/
    <field>
```

Each record is its own ref.
Two writers modifying different records never conflict.

**ID assignment:**

Sequential IDs are assigned by scanning `refs/<namespace>/` to find the highest existing ID and incrementing.
The ref creation itself is the compare-and-swap: if another writer has already created the same ID, the creation fails (for example, the push is rejected), and the creator rescans and retries.

No counter ref is required.
The source of truth for "what IDs exist" is the refs themselves.

At large scale (thousands of records), scanning all refs to find the max becomes expensive.
An optional counter ref can serve as an acceleration structure, like any other derived index: a performance optimization, not a correctness requirement.
If the counter is lost or stale, a rescan rebuilds it.
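
The scan-then-create loop can be sketched against an in-memory ref store. Everything here is hypothetical: `RefStore`, `create_if_absent`, `max_id`, and `create_record` are illustrative names, starting IDs at 1 is an assumption, and the optional counter ref is left out. In a single-threaded model the retry branch never runs; it is shown only to mark where a lost race would be handled.

```rust
use std::collections::BTreeMap;

/// In-memory stand-in for a ref store: ref name -> commit id.
#[derive(Default)]
pub struct RefStore {
    refs: BTreeMap<String, String>,
}

impl RefStore {
    /// Create a ref only if it does not exist yet. This is the
    /// compare-and-swap: the expected old value is "absent".
    pub fn create_if_absent(&mut self, name: &str, commit: &str) -> bool {
        if self.refs.contains_key(name) {
            return false;
        }
        self.refs.insert(name.to_string(), commit.to_string());
        true
    }

    /// Scan `refs/<namespace>/` and return the highest numeric ID, if any.
    /// Names that are not plain numbers directly under the prefix are skipped.
    pub fn max_id(&self, namespace: &str) -> Option<u64> {
        let prefix = format!("refs/{}/", namespace);
        self.refs
            .keys()
            .filter_map(|name| name.strip_prefix(prefix.as_str()))
            .filter_map(|rest| rest.parse::<u64>().ok())
            .max()
    }
}

/// Allocate the next sequential ID by scanning, then creating the ref.
/// If the creation loses a race, rescan and try again, up to `max_attempts`.
pub fn create_record(
    store: &mut RefStore,
    namespace: &str,
    commit: &str,
    max_attempts: u32,
) -> Option<u64> {
    for _ in 0..max_attempts {
        // Assumption of this sketch: the first ID in a namespace is 1.
        let id = store.max_id(namespace).map_or(1, |max| max + 1);
        let name = format!("refs/{}/{}", namespace, id);
        if store.create_if_absent(&name, commit) {
            return Some(id);
        }
        // Another writer took this ID between the scan and the create.
    }
    None
}
```
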

**Operations:**

- `create(namespace, fields)`: scan for the next available ID, then create a new ref with an initial commit containing the given tree. Retry on conflict.
- `read(namespace, id)`: read the current tree at a record's ref.
- `update(namespace, id, mutations)`: commit a new tree to the record's ref. The previous state is preserved in history.
- `list(namespace)`: prefix scan over `refs/<namespace>/` to enumerate records.
- `history(namespace, id)`: walk the commit chain on a record's ref.

**Namespace scoping:**

A namespace can be partitioned further into scopes, independent groups of records that each have their own ref subtree:

```text
refs/<namespace>/<scope>/<id>
```

Scopes are fully independent: no cross-scope contention on ID assignment or record writes.

### git-links

Bidirectional relationships between refs.

A link connects two keys.
It does not belong to either of them.
Both directions are written in a single commit to a single ref, guaranteeing consistency without multi-ref atomicity.

**Ref structure:**

```text
refs/<namespace> → commit → tree
  <key-a>/
    <key-b>           # blob: empty or optional metadata
  <key-b>/
    <key-a>           # blob: empty or optional metadata
```

Keys are opaque path segments.
The library does not interpret them.
Consumers assign meaning.

When metadata is absent, the tree entry points to the empty blob (`e69de29...`).
Every metadata-free link shares this single object.

**Operations:**

- `link(a, b, metadata?)`: write both directions in one commit.
- `unlink(a, b)`: remove both directions in one commit.
- `linked(key)`: list all keys linked to this key (single tree read).
- `is_linked(a, b)`: check existence (single tree entry lookup).
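
The four operations can be modelled in memory to show why a single update keeps both directions consistent. This is a sketch under stated assumptions, not the crate's implementation: `LinkTree` is a hypothetical type, per-link metadata is omitted, and each method call stands in for one commit.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// In-memory stand-in for the link tree: key -> set of linked keys.
#[derive(Default)]
pub struct LinkTree {
    entries: BTreeMap<String, BTreeSet<String>>,
}

impl LinkTree {
    /// Record the link in both directions as one update.
    pub fn link(&mut self, a: &str, b: &str) {
        self.entries.entry(a.to_string()).or_default().insert(b.to_string());
        self.entries.entry(b.to_string()).or_default().insert(a.to_string());
    }

    /// Remove both directions, dropping subtrees that become empty.
    pub fn unlink(&mut self, a: &str, b: &str) {
        for (from, to) in [(a, b), (b, a)] {
            let now_empty = match self.entries.get_mut(from) {
                Some(set) => {
                    set.remove(to);
                    set.is_empty()
                }
                None => false,
            };
            if now_empty {
                self.entries.remove(from);
            }
        }
    }

    /// List all keys linked to `key`, in sorted order.
    pub fn linked(&self, key: &str) -> Vec<String> {
        self.entries
            .get(key)
            .map(|set| set.iter().cloned().collect())
            .unwrap_or_default()
    }

    /// Check whether the link from `a` to `b` exists.
    pub fn is_linked(&self, a: &str, b: &str) -> bool {
        self.entries.get(a).map_or(false, |set| set.contains(b))
    }
}
```

Because `link` and `unlink` always touch both directions together, `is_linked(a, b)` and `is_linked(b, a)` agree after every call in this model. The sorted result of `linked` comes from `BTreeSet` iteration order.
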

**Concurrency:**

Two writers linking disjoint key pairs touch disjoint tree paths.
A three-way tree merge resolves these automatically.
A conflict occurs only when two writers modify the same link concurrently.

**Ref ownership:**

The `namespace` is caller-provided.
The library owns no ref namespace.
A consumer passes the full ref name, such as `"refs/links"` or `"refs/my-tool/links"`; the library does not care which.

**Example: forge issue linking.**

Forge uses `git-links` with `refs/forge/links` as the namespace.
Keys are type/ID strings that forge constructs; the library stores them verbatim.

Linking issue 42 to review 7 and commit `abc123`:

```rust
let links = LinkStore::new(&repo, "refs/forge/links");

links.link("issue/42", "review/7", None, &sig)?;
links.link("issue/42", "commit/abc123", None, &sig)?;
```

This produces:

```text
refs/forge/links → commit → tree
  issue/42/
    review/7          # empty blob
    commit/abc123     # empty blob
  review/7/
    issue/42          # empty blob
  commit/abc123/
    issue/42          # empty blob
```

Querying everything linked to issue 42:

```rust
let related = links.linked("issue/42")?;
// → ["review/7", "commit/abc123"]
```

Querying the reverse, all issues referencing commit `abc123`:

```rust
let related = links.linked("commit/abc123")?;
// → ["issue/42"]
```

Both directions are tree reads.
Forge parses the key strings to recover type and ID.
The library never does.
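
On the consumer side, recovering type and ID from a key might look like the following. `parse_key` is a hypothetical helper, and the `<type>/<id>` shape is forge's convention as described in this example, not something the library enforces.

```rust
/// Split a key such as "issue/42" into its type and ID parts.
/// Splits at the first '/', so any later '/' stays in the ID part.
/// Returns `None` when there is no '/' or either part is empty.
pub fn parse_key(key: &str) -> Option<(&str, &str)> {
    let (kind, id) = key.split_once('/')?;
    if kind.is_empty() || id.is_empty() {
        return None;
    }
    Some((kind, id))
}
```

For the keys used above, `parse_key("issue/42")` yields `("issue", "42")` and `parse_key("commit/abc123")` yields `("commit", "abc123")`.
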

## Layering

```text
git (objects, refs, transport)
├── git-metadata    (annotations on objects)
├── git-ledger      (versioned records as refs)
└── git-links       (bidirectional relationships)
```

The three crates are independent.
None depends on another.
A consumer may use any combination.

The shared machinery (ref → commit → tree reads and writes, tree merging, commit signing) is either inlined in each crate or, if duplication warrants it, extracted to a shared internal crate.
This is a code organization decision, not an architectural one.

## What git-data Is Not

git-data is not a framework.
It imposes no schema, no workflow, no naming convention beyond ref structure.

git-data does not run hooks or enforce policy.
Consumers (forge, kiln, other tools) own domain logic.

git-data does not handle transport.
Push, fetch, and ref advertisement filtering are the consumer's responsibility.

git-data does not handle merge strategy selection.
It provides the primitives (tree reads, tree writes, atomic commits) that make auto-merge possible.
The consumer decides when and how to merge.

## Workspace Layout

```text
git-data/
├── Cargo.toml              # workspace root
├── crates/
│   ├── git-metadata/
│   │   ├── Cargo.toml
│   │   └── src/
│   ├── git-ledger/
│   │   ├── Cargo.toml
│   │   └── src/
│   └── git-links/
│       ├── Cargo.toml
│       └── src/
```

Each crate publishes independently to crates.io.
The workspace shares test infrastructure, CI, and release tooling.

## CLI

git-metadata ships a CLI as `git-metadata` (invoked as `git metadata`).
It is the only crate with a CLI at this time.

git-ledger and git-links are library-only.
They may gain CLIs if direct human use outside of a consumer tool proves valuable.
This is unlikely: the operations are meaningful only in the context of a specific schema, which the consumer defines.
