From 743f5e783c772d465d63058ada21331671868d93 Mon Sep 17 00:00:00 2001 From: Sergei Zharinov Date: Fri, 12 Dec 2025 16:00:21 -0300 Subject: [PATCH 1/2] feat: Add ADR for type metadata format --- AGENTS.md | 1 + docs/adr/ADR-0004-query-ir-binary-format.md | 66 ++++--- docs/adr/ADR-0005-transition-graph-format.md | 4 +- docs/adr/ADR-0006-dynamic-query-execution.md | 5 +- docs/adr/ADR-0007-type-metadata-format.md | 189 +++++++++++++++++++ 5 files changed, 240 insertions(+), 25 deletions(-) create mode 100644 docs/adr/ADR-0007-type-metadata-format.md diff --git a/AGENTS.md b/AGENTS.md index 7f04014e..218ab289 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -20,6 +20,7 @@ - [ADR-0004: Query IR Binary Format](docs/adr/ADR-0004-query-ir-binary-format.md) - [ADR-0005: Transition Graph Format](docs/adr/ADR-0005-transition-graph-format.md) - [ADR-0006: Dynamic Query Execution](docs/adr/ADR-0006-dynamic-query-execution.md) + - [ADR-0007: Type Metadata Format](docs/adr/ADR-0007-type-metadata-format.md) - **Template**: ```markdown diff --git a/docs/adr/ADR-0004-query-ir-binary-format.md b/docs/adr/ADR-0004-query-ir-binary-format.md index d6e3da95..b64123a0 100644 --- a/docs/adr/ADR-0004-query-ir-binary-format.md +++ b/docs/adr/ADR-0004-query-ir-binary-format.md @@ -6,7 +6,7 @@ ## Context -The Query IR lives in a single contiguous allocation—cache-friendly, zero fragmentation, portable to WASM. This ADR defines the binary layout. Graph structures are in [ADR-0005](ADR-0005-transition-graph-format.md). +The Query IR lives in a single contiguous allocation—cache-friendly, zero fragmentation, portable to WASM. This ADR defines the binary layout. Graph structures are in [ADR-0005](ADR-0005-transition-graph-format.md). Type metadata is in [ADR-0007](ADR-0007-type-metadata-format.md). 
## Decision @@ -20,7 +20,8 @@ struct QueryIR { negated_fields_offset: u32, string_refs_offset: u32, string_bytes_offset: u32, - type_info_offset: u32, + type_defs_offset: u32, + type_members_offset: u32, entrypoints_offset: u32, } ``` @@ -50,49 +51,68 @@ Allocated via `Layout::from_size_align(len, BUFFER_ALIGN)`. Standard `Box<[u8]>` | Negated Fields | `[NodeFieldId; Q]` | `negated_fields_offset` | 2 | | String Refs | `[StringRef; R]` | `string_refs_offset` | 4 | | String Bytes | `[u8; S]` | `string_bytes_offset` | 1 | -| Type Info | `[TypeInfo; U]` | `type_info_offset` | 4 | -| Entrypoints | `[Entrypoint; T]` | `entrypoints_offset` | 4 | +| Type Defs | `[TypeDef; T]` | `type_defs_offset` | 4 | +| Type Members | `[TypeMember; U]` | `type_members_offset` | 2 | +| Entrypoints | `[Entrypoint; V]` | `entrypoints_offset` | 4 | Each offset is aligned: `(offset + align - 1) & !(align - 1)`. -### Stringsi +For `Transition`, `EffectOp` see [ADR-0005](ADR-0005-transition-graph-format.md). For `TypeDef`, `TypeMember` see [ADR-0007](ADR-0007-type-metadata-format.md). -Single pool for all strings (field names, variant tags, entrypoint names): +### Strings + +Single pool for all strings (field names, variant tags, entrypoint names, type names): ```rust +type StringId = u16; + #[repr(C)] struct StringRef { offset: u32, // into string_bytes len: u16, _pad: u16, } +// 8 bytes, align 4 -#[repr(C)] -struct Entrypoint { - name_id: u16, // into string_refs - _pad: u16, - target: TransitionId, -} +type DataFieldId = StringId; // field names in effects +type VariantTagId = StringId; // variant tags in effects + +type TypeId = u16; // see ADR-0007 for semantics ``` -`DataFieldId(u16)` and `VariantTagId(u16)` index into `string_refs`. Distinct types, same table. +`StringId` indexes into `string_refs`. `DataFieldId` and `VariantTagId` are aliases for type safety. `TypeId` indexes into type_defs (with reserved primitives 0-2). 
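A minimal sketch of how a `StringId` might be resolved to text, assuming the `string_refs` and `string_bytes` segments have already been exposed as slices. The `resolve_string` helper and its bounds handling are illustrative assumptions, not part of the format:

```rust
type StringId = u16;

#[repr(C)]
#[derive(Clone, Copy)]
struct StringRef {
    offset: u32, // into string_bytes
    len: u16,
    _pad: u16,
}

/// Hypothetical helper: look up the `StringRef` for `id`, then borrow the
/// matching byte range from the shared pool. Returns `None` when the id or
/// the byte range is out of bounds, or when the bytes are not valid UTF-8.
fn resolve_string<'a>(refs: &[StringRef], bytes: &'a [u8], id: StringId) -> Option<&'a str> {
    let r = refs.get(id as usize)?;
    let start = r.offset as usize;
    let end = start.checked_add(r.len as usize)?;
    std::str::from_utf8(bytes.get(start..end)?).ok()
}
```

Because `DataFieldId` and `VariantTagId` alias `StringId`, the same lookup would serve field names, variant tags, entrypoint names, and type names.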
Strings are interned during construction—identical strings share storage and ID. +### Entrypoints + +```rust +#[repr(C)] +struct Entrypoint { + name_id: StringId, // 2 + _pad: u16, // 2 + target: TransitionId, // 4 + result_type: TypeId, // 2 - see ADR-0007 + _pad2: u16, // 2 +} +// 12 bytes, align 4 +``` + ### Serialization ``` -Header (44 bytes): - magic: [u8; 4] b"PLNK" - version: u32 format version + ABI hash - checksum: u32 CRC32(offsets || buffer_data) +Header (48 bytes): + magic: [u8; 4] b"PLNK" + version: u32 format version + ABI hash + checksum: u32 CRC32(offsets || buffer_data) buffer_len: u32 successors_offset: u32 effects_offset: u32 negated_fields_offset: u32 string_refs_offset: u32 string_bytes_offset: u32 - type_info_offset: u32 + type_defs_offset: u32 + type_members_offset: u32 entrypoints_offset: u32 Buffer Data (buffer_len bytes) @@ -104,7 +124,7 @@ Little-endian always. UTF-8 strings. Version mismatch or checksum failure → re Three passes: -1. **Analysis**: Count elements, intern strings +1. **Analysis**: Count elements, intern strings, infer types 2. **Layout**: Compute aligned offsets, allocate once 3. **Emission**: Write via `ptr::write` @@ -128,15 +148,16 @@ Buffer layout: 0x0280 Negated Fields [] 0x0280 String Refs [{0,4}, {4,5}, {9,5}, ...] 0x02C0 String Bytes "namevalueIdentNumFuncExpr" -0x0300 Type Info [...] -0x0340 Entrypoints [{4, T0}, {5, T3}] +0x0300 Type Defs [Record{...}, Enum{...}, ...] +0x0340 Type Members [{name,Str}, {Ident,Ty5}, ...] +0x0380 Entrypoints [{name=Func, target=Tr0, type=Ty3}, ...] ``` `"name"` stored once, used by both `@name` captures. ## Consequences -**Positive**: Cache-efficient, O(1) string lookup, zero-copy access, simple validation. +**Positive**: Cache-efficient, O(1) string lookup, zero-copy access, simple validation. Self-contained binaries enable query caching by input hash. **Negative**: Format changes require rebuild. No version migration. 
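The layout pass and the alignment rule `(offset + align - 1) & !(align - 1)` from this ADR can be sketched as follows. `align_up` and `layout_segments` are hypothetical names, and segments are modeled as `(byte_len, align)` pairs purely for illustration:

```rust
/// Round `offset` up to the next multiple of `align` (must be a power of two).
fn align_up(offset: u32, align: u32) -> u32 {
    debug_assert!(align.is_power_of_two());
    (offset + align - 1) & !(align - 1)
}

/// Hypothetical layout pass: given each segment's byte length and alignment,
/// in buffer order, return every segment's start offset and the total length.
fn layout_segments(segments: &[(u32, u32)]) -> (Vec<u32>, u32) {
    let mut cursor = 0u32;
    let mut offsets = Vec::with_capacity(segments.len());
    for &(byte_len, align) in segments {
        cursor = align_up(cursor, align); // pad up to the segment's alignment
        offsets.push(cursor);
        cursor += byte_len;
    }
    (offsets, cursor)
}
```

Computing all offsets before writing anything is what would allow the buffer to be allocated exactly once.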
@@ -146,3 +167,4 @@ Buffer layout: - [ADR-0005: Transition Graph Format](ADR-0005-transition-graph-format.md) - [ADR-0006: Dynamic Query Execution](ADR-0006-dynamic-query-execution.md) +- [ADR-0007: Type Metadata Format](ADR-0007-type-metadata-format.md) diff --git a/docs/adr/ADR-0005-transition-graph-format.md b/docs/adr/ADR-0005-transition-graph-format.md index 8da2807b..f4255c77 100644 --- a/docs/adr/ADR-0005-transition-graph-format.md +++ b/docs/adr/ADR-0005-transition-graph-format.md @@ -16,9 +16,8 @@ Edge-centric IR: transitions carry all semantics (matching, effects, successors) type TransitionId = u32; type NodeTypeId = u16; // from tree-sitter, do not change type NodeFieldId = NonZeroU16; // from tree-sitter, Option uses 0 for None -type DataFieldId = u16; -type VariantTagId = u16; type RefId = u16; +// StringId, DataFieldId, VariantTagId: see ADR-0004 ``` ### Slice @@ -308,3 +307,4 @@ Incoming epsilon effects → `pre_effects`. Outgoing → `post_effects`. - [ADR-0004: Query IR Binary Format](ADR-0004-query-ir-binary-format.md) - [ADR-0006: Dynamic Query Execution](ADR-0006-dynamic-query-execution.md) +- [ADR-0007: Type Metadata Format](ADR-0007-type-metadata-format.md) diff --git a/docs/adr/ADR-0006-dynamic-query-execution.md b/docs/adr/ADR-0006-dynamic-query-execution.md index 160deaca..90103879 100644 --- a/docs/adr/ADR-0006-dynamic-query-execution.md +++ b/docs/adr/ADR-0006-dynamic-query-execution.md @@ -99,10 +99,12 @@ struct BacktrackPoint { cursor_checkpoint: u32, // tree-sitter descendant_index effect_watermark: u32, recursion_frame: Option, // saved frame index - alternatives: Slice, + alternatives: Slice, // view into IR successors, not owned } ``` +`alternatives` references the IR's successor data (inline or spilled)—no runtime allocation per backtrack point. 
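As a rough illustration of why no allocation is needed, the sketch below borrows a transition's successors directly from IR memory. It assumes the small-size layout from ADR-0005 (up to 5 successors stored inline, larger counts stored as an index into the `successors` segment). The reduced `Transition` struct, its field types, and the `successors` helper are assumptions made for this example:

```rust
type TransitionId = u32;

// Reduced, hypothetical view of a transition: only the successor fields.
struct Transition {
    successor_count: u32,
    successor_data: [u32; 5],
}

/// Borrow the successor list without copying it: inline slots for counts
/// of 0..=5, otherwise a range of the shared `successors` segment.
fn successors<'ir>(t: &'ir Transition, segment: &'ir [TransitionId]) -> &'ir [TransitionId] {
    let count = t.successor_count as usize;
    if count <= 5 {
        &t.successor_data[..count]
    } else {
        // In the spilled case, slot 0 holds the start index into the segment.
        let start = t.successor_data[0] as usize;
        &segment[start..start + count]
    }
}
```

Either branch yields a slice tied to the IR's lifetime, which is the property a backtrack point would rely on.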
+ | Operation | Action | | --------- | ------------------------------------------------------ | | Save | `cursor_checkpoint = cursor.descendant_index()` — O(1) | @@ -196,3 +198,4 @@ Details deferred. - [ADR-0004: Query IR Binary Format](ADR-0004-query-ir-binary-format.md) - [ADR-0005: Transition Graph Format](ADR-0005-transition-graph-format.md) +- [ADR-0007: Type Metadata Format](ADR-0007-type-metadata-format.md) diff --git a/docs/adr/ADR-0007-type-metadata-format.md b/docs/adr/ADR-0007-type-metadata-format.md new file mode 100644 index 00000000..9bc56b8a --- /dev/null +++ b/docs/adr/ADR-0007-type-metadata-format.md @@ -0,0 +1,189 @@ +# ADR-0007: Type Metadata Format + +- **Status**: Accepted +- **Date**: 2025-01-13 + +## Context + +Query execution produces structured values via the effect stream ([ADR-0006](ADR-0006-dynamic-query-execution.md)). Type metadata enables: + +- **Code generation**: Emit Rust structs, TypeScript interfaces, Python dataclasses +- **Validation**: Verify effect stream output matches expected shape (debug/test builds) +- **Tooling**: IDE completions, documentation generation + +Type metadata is descriptive, not prescriptive. Transitions define execution semantics; types describe what transitions produce. + +**Cache efficiency goal**: Proc macro compilation inlines query logic as native instructions (I-cache), leaving D-cache exclusively for tree-sitter cursor traversal. Type metadata is consumed at compile time, not runtime. + +## Decision + +### TypeId + +```rust +type TypeId = u16; + +const TYPE_VOID: TypeId = 0; // definition captures nothing +const TYPE_NODE: TypeId = 1; // AST node reference (see "Node Semantics" below) +const TYPE_STR: TypeId = 2; // extracted source text (:: string) +// 3..0xFFFE: composite types (index into type_defs + 3) +const TYPE_INVALID: TypeId = 0xFFFF; // error sentinel during inference +``` + +Type alias declared in [ADR-0004](ADR-0004-query-ir-binary-format.md); constants and semantics here. 
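A small sketch of how a consumer might interpret a `TypeId`, assuming the constants above. The `TypeClass` enum, the `FIRST_COMPOSITE` constant, and the `classify` function are hypothetical names introduced only for this example:

```rust
type TypeId = u16;

const TYPE_VOID: TypeId = 0;
const TYPE_NODE: TypeId = 1;
const TYPE_STR: TypeId = 2;
const TYPE_INVALID: TypeId = 0xFFFF;
const FIRST_COMPOSITE: TypeId = 3; // hypothetical name for the first composite id

#[derive(Debug, PartialEq)]
enum TypeClass {
    Void,
    Node,
    Str,
    Composite(usize), // index into type_defs
    Invalid,
}

/// Map a TypeId to a reserved primitive or to an index into `type_defs`.
fn classify(id: TypeId) -> TypeClass {
    match id {
        TYPE_VOID => TypeClass::Void,
        TYPE_NODE => TypeClass::Node,
        TYPE_STR => TypeClass::Str,
        TYPE_INVALID => TypeClass::Invalid,
        // Composite ids are offset by the three reserved primitives.
        n => TypeClass::Composite((n - FIRST_COMPOSITE) as usize),
    }
}
```

Under this reading, `T3` in a type graph would correspond to `type_defs[0]`.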
+ +Primitives exist only as TypeId values—no TypeDef entries. Composite types start at ID 3. + +### Node Semantics + +`TYPE_NODE` represents a platform-dependent handle to a tree-sitter AST node: + +| Context | Representation | +| ---------- | ---------------------------------------------------------- | +| Rust | `tree_sitter::Node<'tree>` (lifetime-bound reference) | +| TypeScript | Binding-provided object with `startPosition`, `text`, etc. | +| Text/JSON | Unique node identifier (e.g., `"node:42"` or path-based) | + +The handle provides access to node metadata (kind, span, text) without copying the source. Lifetime management is platform-specific—Rust enforces it statically, bindings may use reference counting or arena allocation. + +### TypeDef + +```rust +#[repr(C)] +struct TypeDef { + kind: TypeKind, // 1 + _pad: u8, // 1 + name: StringId, // 2 - synthetic or explicit, 0xFFFF for wrappers + data: u32, // 4 - TypeId for wrappers, slice offset for composites + data_len: u16, // 2 - 0 for wrappers, member count for composites + _pad2: u16, // 2 +} +// 12 bytes, align 4 +``` + +Uses `u16` for `data_len` instead of `Slice`'s `u32` — no type has 65k members. Saves 2 bytes per TypeDef. + +### TypeKind + +```rust +#[repr(C, u8)] +enum TypeKind { + Optional = 0, // T? — data: inner TypeId + ArrayStar = 1, // T* — data: element TypeId + ArrayPlus = 2, // T+ — data: element TypeId + Record = 3, // struct — data/data_len: slice into type_members + Enum = 4, // tagged union — data/data_len: slice into type_members +} +``` + +| Kind | Query Syntax | Semantics | +| --------- | ------------------- | -------------------------------- | +| Optional | `expr?` | Nullable wrapper | +| ArrayStar | `expr*` | Zero or more elements | +| ArrayPlus | `expr+` | One or more elements (non-empty) | +| Record | `{ ... } @name` | Named fields | +| Enum | `[ A: ... B: ... 
]` | Tagged union (discriminated) | + +### TypeMember + +Shared structure for Record fields and Enum variants: + +```rust +#[repr(C)] +struct TypeMember { + name: StringId, // 2 - field name or variant tag + ty: TypeId, // 2 - field type or variant payload (TYPE_VOID for unit) +} +// 4 bytes, align 2 +``` + +### Synthetic Naming + +When no explicit `:: TypeName` annotation exists, names are synthesized: + +| Context | Pattern | Example | +| -------------------- | --------------- | ---------------------------------------- | +| Definition | Definition name | `Func` | +| Captured sequence | `{Def}{Field}` | `FuncParams` for `@params` in `Func` | +| Captured alternation | `{Def}{Field}` | `FuncBody` for `@body` in `Func` | +| Variant payload | `{Parent}{Tag}` | `FuncBodyStmt` for `Stmt:` in `FuncBody` | + +Collisions resolved by numeric suffix: `FuncBody`, `FuncBody2`, etc. + +### Example + +Query: + +``` +Func = (function_declaration + name: (identifier) @name :: string + body: [ + Stmt: (statement) @stmt + Expr: (expression) @expr + ] @body +) +``` + +Type graph: + +``` +T3: Record "Func" → [name: Str, body: T4] +T4: Enum "FuncBody" → [Stmt: T5, Expr: T6] +T5: Record "FuncBodyStmt" → [stmt: Node] +T6: Record "FuncBodyExpr" → [expr: Node] + +Entrypoint: Func → result_type: T3 +``` + +Generated TypeScript: + +```typescript +interface Func { + name: string; + body: + | { $tag: "Stmt"; $data: { stmt: Node } } + | { $tag: "Expr"; $data: { expr: Node } }; +} +``` + +Generated Rust: + +```rust +struct Func { + name: String, + body: FuncBody, +} + +enum FuncBody { + Stmt { stmt: Node }, + Expr { expr: Node }, +} +``` + +### Validation + +Optional runtime check for debugging: + +```rust +fn validate(value: &Value, expected: TypeId, ir: &QueryIR) -> Result<(), TypeError>; +``` + +Walk the `Value` tree, verify shape matches `TypeId`. Mismatch indicates IR construction bug—panic in debug, skip in release. 
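One possible shape for this walk, as a sketch only. The ADR does not define `Value`, so the `Value` enum here is hypothetical; `TypeDef` is modeled as an unpacked Rust enum rather than the packed 12-byte struct, a plain `&[TypeDef]` stands in for `&QueryIR`, and record fields are assumed to appear in declaration order:

```rust
type TypeId = u16;
const TYPE_VOID: TypeId = 0;
const TYPE_NODE: TypeId = 1;
const TYPE_STR: TypeId = 2;

// Hypothetical unpacked stand-in for the TypeDef/TypeMember segments.
enum TypeDef {
    Optional(TypeId),
    ArrayStar(TypeId),
    ArrayPlus(TypeId),
    Record(Vec<(String, TypeId)>),
    Enum(Vec<(String, TypeId)>),
}

// Hypothetical runtime value produced from the effect stream.
enum Value {
    Null,
    Node(u32),
    Str(String),
    Array(Vec<Value>),
    Object(Vec<(String, Value)>),
    Variant(String, Box<Value>),
}

#[derive(Debug, PartialEq)]
struct TypeError(String);

fn mismatch(expected: TypeId, what: &str) -> Result<(), TypeError> {
    Err(TypeError(format!("type {expected}: {what}")))
}

fn validate(value: &Value, expected: TypeId, defs: &[TypeDef]) -> Result<(), TypeError> {
    // Primitives have no TypeDef entry, so ids 0..=2 are handled first.
    match (expected, value) {
        (TYPE_VOID, Value::Null) | (TYPE_NODE, Value::Node(_)) | (TYPE_STR, Value::Str(_)) => {
            return Ok(());
        }
        (TYPE_VOID | TYPE_NODE | TYPE_STR, _) => return mismatch(expected, "primitive mismatch"),
        _ => {}
    }
    // Composite ids start at 3; TYPE_INVALID falls out of range here.
    let Some(def) = defs.get(expected as usize - 3) else {
        return mismatch(expected, "unknown TypeId");
    };
    match (def, value) {
        (TypeDef::Optional(_), Value::Null) => Ok(()),
        (TypeDef::Optional(inner), v) => validate(v, *inner, defs),
        (TypeDef::ArrayStar(elem), Value::Array(items)) => {
            items.iter().try_for_each(|v| validate(v, *elem, defs))
        }
        (TypeDef::ArrayPlus(elem), Value::Array(items)) if !items.is_empty() => {
            items.iter().try_for_each(|v| validate(v, *elem, defs))
        }
        (TypeDef::Record(fields), Value::Object(entries)) if fields.len() == entries.len() => {
            fields.iter().zip(entries).try_for_each(|((name, ty), (key, v))| {
                if name == key {
                    validate(v, *ty, defs)
                } else {
                    mismatch(expected, "unexpected field name")
                }
            })
        }
        (TypeDef::Enum(variants), Value::Variant(tag, payload)) => {
            match variants.iter().find(|(name, _)| name == tag) {
                Some((_, ty)) => validate(payload, *ty, defs),
                None => mismatch(expected, "unknown variant tag"),
            }
        }
        _ => mismatch(expected, "shape mismatch"),
    }
}
```

A caller wanting the "panic in debug, skip in release" behavior could wrap the call in `debug_assert!(validate(..).is_ok())`, which compiles to nothing in release builds.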
+ +## Consequences + +**Positive**: + +- Single IR serves interpreter, proc macro codegen, and external tooling +- Language-agnostic: same metadata generates Rust, TypeScript, Python, etc. +- Self-contained queries enable caching by input hash (`~/.cache/plotnik/`) + +**Negative**: + +- Synthetic names can be verbose for deeply nested structures +- KB-scale overhead for complex queries (acceptable) + +## References + +- [ADR-0004: Query IR Binary Format](ADR-0004-query-ir-binary-format.md) +- [ADR-0005: Transition Graph Format](ADR-0005-transition-graph-format.md) +- [ADR-0006: Dynamic Query Execution](ADR-0006-dynamic-query-execution.md) From 311788d150684ba94a6faa7bac5642b21e178ea2 Mon Sep 17 00:00:00 2001 From: Sergei Zharinov Date: Fri, 12 Dec 2025 16:12:28 -0300 Subject: [PATCH 2/2] Small additions --- AGENTS.md | 2 +- docs/adr/ADR-0004-query-ir-binary-format.md | 2 ++ docs/adr/ADR-0005-transition-graph-format.md | 16 ++++++++++++---- docs/adr/ADR-0006-dynamic-query-execution.md | 11 ++++++++++- 4 files changed, 25 insertions(+), 6 deletions(-) diff --git a/AGENTS.md b/AGENTS.md index 218ab289..c1e91de5 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -49,7 +49,7 @@ ADRs must be succint and straight to the point. They must contain examples with high information density and pedagogical value. These are docs people usually don't want to read, but when they do, they find it quite fascinating. -Avoid imperative code, describe structure definitions, their purpose and how to use them properly. +Don't write imperative code; describe structure definitions, their purpose, and how to use them properly (and how NOT to use them). # Plotnik Query Language diff --git a/docs/adr/ADR-0004-query-ir-binary-format.md b/docs/adr/ADR-0004-query-ir-binary-format.md index b64123a0..12aa8f1a 100644 --- a/docs/adr/ADR-0004-query-ir-binary-format.md +++ b/docs/adr/ADR-0004-query-ir-binary-format.md @@ -41,6 +41,8 @@ struct QueryIRBuffer { Allocated via `Layout::from_size_align(len, BUFFER_ALIGN)`. 
Standard `Box<[u8]>` won't work—it assumes 1-byte alignment and corrupts `dealloc`. The 64-byte alignment ensures transitions never straddle cache lines. +**Deallocation**: `QueryIRBuffer` must implement `Drop` to reconstruct the exact `Layout` (size + 64-byte alignment) and call `std::alloc::dealloc`. Using `Box::from_raw` or similar would assume align=1 and cause undefined behavior. + ### Segments | Segment | Type | Offset | Align | diff --git a/docs/adr/ADR-0005-transition-graph-format.md b/docs/adr/ADR-0005-transition-graph-format.md index f4255c77..acc717d0 100644 --- a/docs/adr/ADR-0005-transition-graph-format.md +++ b/docs/adr/ADR-0005-transition-graph-format.md @@ -60,10 +60,10 @@ Single `ref_marker` slot—sequences like `Enter(A) → Enter(B)` remain as epsi Successors use a small-size optimization to avoid indirection for the common case: -| `successor_count` | Layout | -| ----------------- | ------------------------------------------------------------------------------------ | -| 0–5 | `successor_data[0..count]` contains `TransitionId` values directly | -| > 5 | `successor_data[0]` is offset into `successors` segment, `successor_count` is length | +| `successor_count` | Layout | +| ----------------- | ----------------------------------------------------------------------------------- | +| 0–5 | `successor_data[0..count]` contains `TransitionId` values directly | +| > 5 | `successor_data[0]` is index into `successors` segment, `successor_count` is length | Why 5 slots: 24 available bytes / 4 bytes per `TransitionId` = 6 slots, minus 1 for the count field leaves 5. @@ -280,6 +280,14 @@ T6: ε + Field("val") + EndVariant → [T7] Partial—full elimination impossible due to single `ref_marker`. +**Execution order** (all transitions, including epsilon): + +1. Emit `pre_effects` +2. Execute matcher (epsilon always succeeds) +3. 
On success: emit implicit `CaptureNode`, emit `post_effects` + +An epsilon transition with `pre: [StartObject]` and `post: [EndObject]` legitimately creates an empty object. To avoid accidental empty structures in graph rewrites, move effects to the destination's `pre` or source's `post` as appropriate. + Why pre/post split matters: ``` diff --git a/docs/adr/ADR-0006-dynamic-query-execution.md b/docs/adr/ADR-0006-dynamic-query-execution.md index 90103879..f70f5fad 100644 --- a/docs/adr/ADR-0006-dynamic-query-execution.md +++ b/docs/adr/ADR-0006-dynamic-query-execution.md @@ -131,7 +131,16 @@ struct CallFrame { } ``` -**Append-only invariant**: Frames are never removed. On `Exit`, set `current` to parent index. Backtracking restores `current`; the original frame is still accessible via its index. +**Append-only invariant**: Frames persist for backtracking correctness. On `Exit`, set `current` to parent index. Backtracking restores `current`; the original frame is still accessible via its index. + +**Frame pruning**: After `Exit`, frames at the stack top may be reclaimed if: + +1. Not the current frame (already exited) +2. Not referenced by any live backtrack point + +This bounds memory by `max(recursion_depth, backtrack_depth)` rather than total call count. Without pruning, `(Rule)*` over N items allocates N frames; with pruning, it remains O(1) for non-backtracking iteration. + +The `BacktrackPoint.recursion_frame` field establishes a "high-water mark"—the minimum frame index that must be preserved. Frames above this mark with no active reference can be popped. | Operation | Action | | ----------------- | ------------------------------------------------------------------------------ |