From 3ede5b308e7dd5ca794c4ec0ead01cfbee093b12 Mon Sep 17 00:00:00 2001 From: Sergei Zharinov Date: Fri, 12 Dec 2025 11:50:31 -0300 Subject: [PATCH 1/6] refactor: Split ADR-0003 --- AGENTS.md | 5 +- ...-0003-query-intermediate-representation.md | 2 +- docs/adr/ADR-0004-query-ir-binary-format.md | 141 ++++++++++ docs/adr/ADR-0005-transition-graph-format.md | 252 ++++++++++++++++++ docs/adr/ADR-0006-dynamic-query-execution.md | 175 ++++++++++++ 5 files changed, 573 insertions(+), 2 deletions(-) create mode 100644 docs/adr/ADR-0004-query-ir-binary-format.md create mode 100644 docs/adr/ADR-0005-transition-graph-format.md create mode 100644 docs/adr/ADR-0006-dynamic-query-execution.md diff --git a/AGENTS.md b/AGENTS.md index 053e77b4..519f13de 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -16,7 +16,10 @@ - **Index**: - [ADR-0001: Query Parser](docs/adr/ADR-0001-query-parser.md) - [ADR-0002: Diagnostics System](docs/adr/ADR-0002-diagnostics-system.md) - - [ADR-0003: Query Intermediate Representation](docs/adr/ADR-0003-query-intermediate-representation.md) + - [ADR-0003: Query Intermediate Representation](docs/adr/ADR-0003-query-intermediate-representation.md) (superseded by ADR-0004, ADR-0005, ADR-0006) + - [ADR-0004: Query IR Binary Format](docs/adr/ADR-0004-query-ir-binary-format.md) + - [ADR-0005: Transition Graph Format](docs/adr/ADR-0005-transition-graph-format.md) + - [ADR-0006: Dynamic Query Execution](docs/adr/ADR-0006-dynamic-query-execution.md) - **Template**: ```markdown diff --git a/docs/adr/ADR-0003-query-intermediate-representation.md b/docs/adr/ADR-0003-query-intermediate-representation.md index f16dfb74..59dc57a2 100644 --- a/docs/adr/ADR-0003-query-intermediate-representation.md +++ b/docs/adr/ADR-0003-query-intermediate-representation.md @@ -1,6 +1,6 @@ # ADR-0003: Query Intermediate Representation -- **Status**: Accepted +- **Status**: Superseded by [ADR-0004](ADR-0004-query-ir-binary-format.md), [ADR-0005](ADR-0005-transition-graph-format.md), 
[ADR-0006](ADR-0006-dynamic-query-execution.md) - **Date**: 2025-12-10 ## Context diff --git a/docs/adr/ADR-0004-query-ir-binary-format.md b/docs/adr/ADR-0004-query-ir-binary-format.md new file mode 100644 index 00000000..df2d8b67 --- /dev/null +++ b/docs/adr/ADR-0004-query-ir-binary-format.md @@ -0,0 +1,141 @@ +# ADR-0004: Query IR Binary Format + +- **Status**: Accepted +- **Date**: 2025-12-12 +- **Supersedes**: Parts of [ADR-0003](ADR-0003-query-intermediate-representation.md) + +## Context + +The Query IR lives in a single contiguous allocation—cache-friendly, zero fragmentation, portable to WASM. This ADR defines the binary layout. Graph structures are in [ADR-0005](ADR-0005-transition-graph-format.md). + +## Decision + +### Container + +```rust +struct TransitionGraph { + data: Arena, + successors_offset: u32, + effects_offset: u32, + negated_fields_offset: u32, + string_refs_offset: u32, + string_bytes_offset: u32, + entrypoints_offset: u32, + default_entrypoint: TransitionId, +} +``` + +Transitions start at offset 0 (implicit). + +### Arena + +```rust +const ARENA_ALIGN: usize = 4; + +struct Arena { + ptr: *mut u8, + len: usize, +} +``` + +Allocated via `Layout::from_size_align(len, ARENA_ALIGN)`. Standard `Box<[u8]>` won't work—it assumes 1-byte alignment and corrupts `dealloc`. + +### Segments + +| Segment | Type | Offset | Align | +| -------------- | ------------------- | ----------------------- | ----- | +| Transitions | `[Transition; N]` | 0 | 4 | +| Successors | `[TransitionId; M]` | `successors_offset` | 4 | +| Effects | `[EffectOp; P]` | `effects_offset` | 2 | +| Negated Fields | `[NodeFieldId; Q]` | `negated_fields_offset` | 2 | +| String Refs | `[StringRef; R]` | `string_refs_offset` | 4 | +| String Bytes | `[u8; S]` | `string_bytes_offset` | 1 | +| Entrypoints | `[Entrypoint; T]` | `entrypoints_offset` | 4 | + +Each offset is aligned: `(offset + align - 1) & !(align - 1)`. 
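The alignment formula above rounds an offset up to the next multiple of a power-of-two alignment. A minimal sketch of how the Layout pass might apply it, assuming hypothetical helpers `align_up` and `layout_segments` (neither name appears in the ADR) and power-of-two alignments only:

```rust
/// Rounds `offset` up to the next multiple of `align`.
/// Hypothetical helper; `align` must be a power of two.
fn align_up(offset: u32, align: u32) -> u32 {
    debug_assert!(align.is_power_of_two());
    (offset + align - 1) & !(align - 1)
}

/// Lays out consecutive segments given as (byte length, alignment) pairs.
/// Returns the start offset of each segment and the total buffer length.
fn layout_segments(segments: &[(u32, u32)]) -> (Vec<u32>, u32) {
    let mut offsets = Vec::with_capacity(segments.len());
    let mut cursor = 0u32;
    for &(len, align) in segments {
        cursor = align_up(cursor, align); // pad up to the segment's alignment
        offsets.push(cursor);
        cursor += len;
    }
    (offsets, cursor)
}
```

Padding only appears where a segment with a stricter alignment follows one that ended on an odd boundary, for example the 4-aligned String Refs after 2-aligned data.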
+ +### Strings + +Single pool for all strings (field names, variant tags, entrypoint names): + +```rust +#[repr(C)] +struct StringRef { + offset: u32, // into string_bytes + len: u16, + _pad: u16, +} + +#[repr(C)] +struct Entrypoint { + name_id: u16, // into string_refs + _pad: u16, + target: TransitionId, +} +``` + +`DataFieldId(u16)` and `VariantTagId(u16)` index into `string_refs`. Distinct types, same table. + +Strings are interned during construction—identical strings share storage and ID. + +### Serialization + +``` +Header (20 bytes): + magic: [u8; 4] b"PLNK" + version: u32 format version + ABI hash + checksum: u32 CRC32(segment_offsets || arena_data) + arena_len: u32 + segment_count: u32 + +Segment Offsets (segment_count × 4 bytes) +Arena Data (arena_len bytes) +``` + +Little-endian always. UTF-8 strings. Version mismatch or checksum failure → recompile. + +### Construction + +Three passes: + +1. **Analysis**: Count elements, intern strings +2. **Layout**: Compute aligned offsets, allocate once +3. **Emission**: Write via `ptr::write` + +No `realloc`. + +### Example + +Query: + +``` +Func = (function_declaration name: (identifier) @name) +Expr = [ Ident: (identifier) @name Num: (number) @value ] +``` + +Arena layout: + +``` +0x0000 Transitions [T0, T1, T2, ...] +0x0180 Successors [1, 2, 3, ...] +0x0200 Effects [StartObject, Field(0), ...] +0x0280 Negated Fields [] +0x0280 String Refs [{0,4}, {4,5}, {9,5}, ...] +0x02C0 String Bytes "namevalueIdentNum FuncExpr" +0x0300 Entrypoints [{4, T0}, {5, T3}] +``` + +`"name"` stored once, used by both `@name` captures. + +## Consequences + +**Positive**: Cache-efficient, O(1) string lookup, zero-copy access, simple validation. + +**Negative**: Format changes require rebuild. No version migration. + +**WASM**: Explicit alignment prevents traps. `u32` offsets fit WASM32. 
+
+## References
+
+- [ADR-0005: Transition Graph Format](ADR-0005-transition-graph-format.md)
+- [ADR-0006: Dynamic Query Execution](ADR-0006-dynamic-query-execution.md)
diff --git a/docs/adr/ADR-0005-transition-graph-format.md b/docs/adr/ADR-0005-transition-graph-format.md
new file mode 100644
index 00000000..d11c63dd
--- /dev/null
+++ b/docs/adr/ADR-0005-transition-graph-format.md
@@ -0,0 +1,252 @@
+# ADR-0005: Transition Graph Format
+
+- **Status**: Accepted
+- **Date**: 2025-12-12
+- **Supersedes**: Parts of [ADR-0003](ADR-0003-query-intermediate-representation.md)
+
+## Context
+
+Edge-centric IR: transitions carry all semantics (matching, effects, successors). States are implicit junction points. The result is a recursive transition network—NFA with call/return for definition references.
+
+## Decision
+
+### Types
+
+```rust
+type TransitionId = u32;
+type DataFieldId = u16;
+type VariantTagId = u16;
+type RefId = u16;
+```
+
+### Slice
+
+Relative range within a segment:
+
+```rust
+#[repr(C)]
+struct Slice<T> {
+    start: u32,
+    len: u32,
+    _phantom: PhantomData<T>,
+}
+```
+
+### Transition
+
+```rust
+#[repr(C)]
+struct Transition {
+    matcher: Matcher,                // 16 bytes
+    pre_anchored: bool,              // 1
+    post_anchored: bool,             // 1
+    _pad1: [u8; 2],                  // 2
+    pre_effects: Slice<EffectOp>,    // 8
+    post_effects: Slice<EffectOp>,   // 8
+    ref_marker: RefTransition,       // 4
+    next: Slice<TransitionId>,       // 8
+}
+// 48 bytes, align 4
+```
+
+Single `ref_marker` slot—sequences like `Enter(A) → Enter(B)` remain as epsilon chains.
+
+### Matcher
+
+```rust
+#[repr(C, u32)]
+enum Matcher {
+    Epsilon,
+    Node {
+        kind: NodeTypeId,                   // 2
+        field: Option<NodeFieldId>,         // 2
+        negated_fields: Slice<NodeFieldId>, // 8
+    },
+    Anonymous {
+        kind: NodeTypeId,                   // 2
+        field: Option<NodeFieldId>,         // 2
+    },
+    Wildcard,
+    Down, // cursor to first child
+    Up,   // cursor to parent
+}
+// 16 bytes, align 4
+```
+
+`NodeFieldId` is `NonZeroU16`—`Option<NodeFieldId>` uses 0 for `None`.
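The niche claim above can be checked directly: wrapping a `NonZeroU16` in `Option` costs no extra bytes, and zero is the encoding of `None`. A small sketch, where `encode_field`, `decode_field`, and `option_field_layout` are illustrative helpers rather than names from the ADR:

```rust
use std::mem::{align_of, size_of};
use std::num::NonZeroU16;

// Mirrors the ADR's alias; the real ID values come from tree-sitter.
type NodeFieldId = NonZeroU16;

/// Encodes an optional field ID as the raw `u16` stored in the buffer.
fn encode_field(field: Option<NodeFieldId>) -> u16 {
    field.map_or(0, NonZeroU16::get)
}

/// Decodes a raw `u16`; zero maps back to `None`.
fn decode_field(raw: u16) -> Option<NodeFieldId> {
    NonZeroU16::new(raw)
}

/// Size and alignment of `Option<NodeFieldId>`, both equal to those of `u16`.
fn option_field_layout() -> (usize, usize) {
    (size_of::<Option<NodeFieldId>>(), align_of::<Option<NodeFieldId>>())
}
```

This is why the `field` slot in `Matcher` can be budgeted at 2 bytes without a separate discriminant.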
+ +### RefTransition + +```rust +#[repr(C, u8)] +enum RefTransition { + None, + Enter(RefId), // push return stack + Exit(RefId), // pop, must match +} +// 4 bytes, align 2 +``` + +Explicit `None` ensures stable binary layout (`Option` niche is unspecified). + +### EffectOp + +```rust +#[repr(C)] +enum EffectOp { + StartArray, + PushElement, + EndArray, + StartObject, + EndObject, + Field(DataFieldId), + StartVariant(VariantTagId), + EndVariant, + ToString, +} +// 4 bytes, align 2 +``` + +No `CaptureNode`—implicit on successful match. + +### Effect Placement + +| Effect | Placement | Why | +| -------------- | --------- | -------------------------- | +| `StartArray` | Pre | Container before elements | +| `StartObject` | Pre | Container before fields | +| `StartVariant` | Pre | Tag before payload | +| `PushElement` | Post | Consumes matched node | +| `Field` | Post | Consumes matched node | +| `End*` | Post | Finalizes after last match | +| `ToString` | Post | Converts matched node | + +### View Types + +```rust +struct TransitionView<'a> { + graph: &'a TransitionGraph, + raw: &'a Transition, +} + +struct MatcherView<'a> { + graph: &'a TransitionGraph, + raw: &'a Matcher, +} + +enum MatcherKind { Epsilon, Node, Anonymous, Wildcard, Down, Up } +``` + +Views resolve `Slice` to `&[T]`. Engine code never touches offsets directly. + +### Quantifiers + +**Greedy `*`**: + +``` + ┌─────────────────┐ + ↓ │ +Entry ─ε→ Branch ─ε→ Match ─┘ + │ + └─ε→ Exit + +Branch.next = [match, exit] +``` + +**Greedy `+`**: + +``` + ┌─────────────────┐ + ↓ │ +Entry ─→ Match ─ε→ Branch ─┘ + │ + └─ε→ Exit + +Branch.next = [match, exit] +``` + +**Non-greedy `*?`/`+?`**: Same, but `Branch.next = [exit, match]`. + +### Example: Array + +Query: `(parameters (identifier)* @params)` + +Before elimination: + +``` +T0: ε + StartArray → [T1] +T1: ε (branch) → [T2, T4] +T2: Match(identifier) → [T3] +T3: ε + PushElement → [T1] +T4: ε + EndArray → [T5] +T5: ε + Field("params") → [...] 
+``` + +After: + +``` +T2': pre:[StartArray] Match(identifier) post:[PushElement] → [T2', T4'] +T4': post:[EndArray, Field("params")] → [...] +``` + +First iteration gets `StartArray` from T0's path. Loop iterations skip it. + +### Example: Object + +Query: `{ (identifier) @name (number) @value } @pair` + +``` +T0: ε + StartObject → [T1] +T1: Match(identifier) → [T2] +T2: ε + Field("name") → [T3] +T3: Match(number) → [T4] +T4: ε + Field("value") → [T5] +T5: ε + EndObject → [T6] +T6: ε + Field("pair") → [...] +``` + +### Example: Tagged Alternation + +Query: `[ A: (true) @val B: (false) @val ]` + +``` +T0: ε (branch) → [T1, T4] +T1: ε + StartVariant("A") → [T2] +T2: Match(true) → [T3] +T3: ε + Field("val") + EndVariant → [T7] +T4: ε + StartVariant("B") → [T5] +T5: Match(false) → [T6] +T6: ε + Field("val") + EndVariant → [T7] +``` + +### Epsilon Elimination + +Partial—full elimination impossible due to single `ref_marker`. + +Why pre/post split matters: + +``` +Before: +T1: Match(A) → [T2] // current = A +T2: ε + PushElement → [T3] // push A ✓ +T3: Match(B) → [...] // current = B + +After (correct): +T3': pre:[PushElement] Match(B) // push A, then match B ✓ + +Wrong (no split): +T3': Match(B) post:[PushElement] // match B, push B ✗ +``` + +Incoming epsilon effects → `pre_effects`. Outgoing → `post_effects`. + +## Consequences + +**Positive**: No state objects. Compact 48-byte transitions. Views hide offset arithmetic. + +**Negative**: Single `ref_marker` leaves some epsilon chains. Large queries may pressure cache. 
+
+## References
+
+- [ADR-0004: Query IR Binary Format](ADR-0004-query-ir-binary-format.md)
+- [ADR-0006: Dynamic Query Execution](ADR-0006-dynamic-query-execution.md)
diff --git a/docs/adr/ADR-0006-dynamic-query-execution.md b/docs/adr/ADR-0006-dynamic-query-execution.md
new file mode 100644
index 00000000..a6f8adc6
--- /dev/null
+++ b/docs/adr/ADR-0006-dynamic-query-execution.md
@@ -0,0 +1,175 @@
+# ADR-0006: Dynamic Query Execution
+
+- **Status**: Accepted
+- **Date**: 2025-12-12
+- **Supersedes**: Parts of [ADR-0003](ADR-0003-query-intermediate-representation.md)
+
+## Context
+
+Runtime interpretation of the transition graph ([ADR-0005](ADR-0005-transition-graph-format.md)). Proc-macro compilation is a future ADR.
+
+## Decision
+
+### Execution Order
+
+For each transition:
+
+1. Emit `pre_effects`
+2. Match (epsilon always succeeds)
+3. On success: emit `CaptureNode`, emit `post_effects`
+4. Process `next` with backtracking
+
+### Effect Stream
+
+```rust
+enum RuntimeEffect<'a> {
+    Op(EffectOp),
+    CaptureNode(Node<'a>), // implicit on match, never in graph
+}
+
+struct EffectStream<'a> {
+    effects: Vec<RuntimeEffect<'a>>,
+}
+```
+
+Append-only. Backtrack via `truncate(watermark)`.
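The watermark mechanism above amounts to remembering a length and truncating back to it. A minimal sketch, assuming a simplified `Effect` enum (node payloads replaced by a plain id so there is no tree-sitter dependency) and hypothetical method names `watermark`, `emit`, and `backtrack`:

```rust
/// Simplified stand-in for the ADR's `RuntimeEffect`.
#[derive(Debug, Clone, PartialEq)]
enum Effect {
    StartArray,
    PushElement,
    EndArray,
    CaptureNode(u32), // node id instead of a real tree-sitter node
}

#[derive(Default)]
struct EffectStream {
    effects: Vec<Effect>,
}

impl EffectStream {
    /// Records the current length; everything emitted later is speculative.
    fn watermark(&self) -> usize {
        self.effects.len()
    }

    fn emit(&mut self, effect: Effect) {
        self.effects.push(effect);
    }

    /// Discards every effect emitted since `watermark` was taken.
    fn backtrack(&mut self, watermark: usize) {
        self.effects.truncate(watermark);
    }
}

/// Emits a speculative branch, rolls it back, then emits the surviving path.
fn speculative_branch_demo() -> Vec<Effect> {
    let mut stream = EffectStream::default();
    stream.emit(Effect::StartArray);
    let mark = stream.watermark();
    stream.emit(Effect::CaptureNode(1)); // speculative
    stream.emit(Effect::PushElement);    // speculative
    stream.backtrack(mark);              // branch failed, drop both
    stream.emit(Effect::EndArray);
    stream.effects
}
```

Because nothing before the watermark is ever mutated, a rollback never has to undo partial state; it only shortens the vector.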
+ +### Executor + +```rust +struct Executor<'a> { + current: Option>, + stack: Vec>, +} + +enum Value<'a> { + Node(Node<'a>), + String(String), + Array(Vec>), + Object(BTreeMap>), + Variant(VariantTagId, Box>), +} + +enum Container<'a> { + Array(Vec>), + Object(BTreeMap>), + Variant(VariantTagId), +} +``` + +| Effect | Action | +| ------------------- | ------------------------------------ | +| `CaptureNode(n)` | `current = Node(n)` | +| `StartArray` | push `Array([])` onto stack | +| `PushElement` | move `current` into top array | +| `EndArray` | pop array into `current` | +| `StartObject` | push `Object({})` onto stack | +| `Field(id)` | move `current` into top object field | +| `EndObject` | pop object into `current` | +| `StartVariant(tag)` | push `Variant(tag)` onto stack | +| `EndVariant` | pop, wrap `current`, set as current | +| `ToString` | replace `current` Node with text | + +Invalid state = IR bug → panic. + +### Backtracking + +Two checkpoints, saved together: + +- `cursor.descendant_index()` → restore via `goto_descendant(pos)` +- `effect_stream.len()` → restore via `truncate(watermark)` + +### Recursion + +```rust +struct Frame { + ref_id: RefId, + cursor_checkpoint: usize, + effect_watermark: usize, +} + +struct Interpreter<'a> { + graph: &'a TransitionGraph, + stack: Vec, + cursor: TreeCursor<'a>, + effects: EffectStream<'a>, +} +``` + +`Enter(ref_id)`: push frame, follow `next` into definition. + +`Exit(ref_id)`: verify match, pop frame, continue unconditionally. + +Entry filtering: only take `Exit(ref_id)` if it matches stack top. 
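The entry-filtering rule above can be read as a guard on the top frame. A rough sketch under that reading, with hypothetical free functions `enter`, `can_exit`, and `try_exit` operating on a plain `Vec<Frame>`:

```rust
type RefId = u16;

#[derive(Debug)]
struct Frame {
    ref_id: RefId,
    cursor_checkpoint: usize,
    effect_watermark: usize,
}

/// `Enter(ref_id)`: push a frame before following `next` into the definition.
fn enter(stack: &mut Vec<Frame>, ref_id: RefId, cursor_checkpoint: usize, effect_watermark: usize) {
    stack.push(Frame { ref_id, cursor_checkpoint, effect_watermark });
}

/// An `Exit(ref_id)` transition is only eligible if it matches the stack top.
fn can_exit(stack: &[Frame], ref_id: RefId) -> bool {
    stack.last().map_or(false, |top| top.ref_id == ref_id)
}

/// `Exit(ref_id)`: pop the frame when eligible, otherwise leave the stack alone.
fn try_exit(stack: &mut Vec<Frame>, ref_id: RefId) -> Option<Frame> {
    if can_exit(stack.as_slice(), ref_id) {
        stack.pop()
    } else {
        None
    }
}
```

An `Exit` whose `ref_id` does not match the innermost call is simply not taken, so a definition shared by several call sites cannot return into the wrong one.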
+ +### Example + +Query: + +``` +Func = (function_declaration + name: (identifier) @name + parameters: (parameters (identifier)* @params :: string)) +``` + +Input: `function foo(a, b) {}` + +**Phase 1: Match → Effect Stream** + +``` +pre: StartObject +match function_declaration → CaptureNode(func) +match identifier "foo" → CaptureNode(foo) +post: Field("name") +pre: StartArray +match identifier "a" → CaptureNode(a), ToString, PushElement +match identifier "b" → CaptureNode(b), ToString, PushElement +post: EndArray, Field("params"), EndObject +``` + +**Phase 2: Execute → Value** + +| Effect | current | stack | +| ---------------- | --------- | ---------------- | +| StartObject | — | [{}] | +| CaptureNode(foo) | Node(foo) | [{}] | +| Field("name") | — | [{name:Node}] | +| StartArray | — | [{…}, []] | +| CaptureNode(a) | Node(a) | [{…}, []] | +| ToString | "a" | [{…}, []] | +| PushElement | — | [{…}, ["a"]] | +| CaptureNode(b) | Node(b) | [{…}, ["a"]] | +| ToString | "b" | [{…}, ["a"]] | +| PushElement | — | [{…}, ["a","b"]] | +| EndArray | ["a","b"] | [{…}] | +| Field("params") | — | [{…,params}] | +| EndObject | {…} | [] | + +Result: `{ name: , params: ["a", "b"] }` + +### Variant Serialization + +```json +{ "$tag": "A", "$data": { "x": 1 } } +{ "$tag": "B", "$data": [1, 2, 3] } +``` + +Uniform structure. `$tag`/`$data` avoid capture collisions. + +### Fuel + +- `transition_fuel`: decremented per transition +- `recursion_fuel`: decremented per `Enter` + +Details deferred. + +## Consequences + +**Positive**: Append-only stream makes backtracking trivial. Two-phase separation is clean. + +**Negative**: Interpretation overhead. Extra pass for effect execution. 
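The effect-to-action table and the Phase 2 trace above describe a small stack machine. The sketch below is one possible reading, not the project's executor: nodes are stood in for by their source text, object keys are plain strings (the ADR's key type is not visible here), and variants are omitted.

```rust
use std::collections::BTreeMap;

#[derive(Debug, Clone, PartialEq)]
enum Value {
    Node(&'static str), // stand-in: a node is represented by its text
    String(String),
    Array(Vec<Value>),
    Object(BTreeMap<String, Value>),
}

enum Container {
    Array(Vec<Value>),
    Object(BTreeMap<String, Value>),
}

enum Effect {
    CaptureNode(&'static str),
    StartArray,
    PushElement,
    EndArray,
    StartObject,
    Field(&'static str),
    EndObject,
    ToString,
}

/// Replays an effect stream; panics on invalid state, which signals an IR bug.
fn execute(effects: &[Effect]) -> Option<Value> {
    let mut current: Option<Value> = None;
    let mut stack: Vec<Container> = Vec::new();
    for effect in effects {
        match effect {
            Effect::CaptureNode(text) => current = Some(Value::Node(*text)),
            Effect::StartArray => stack.push(Container::Array(Vec::new())),
            Effect::StartObject => stack.push(Container::Object(BTreeMap::new())),
            Effect::PushElement => match stack.last_mut() {
                Some(Container::Array(items)) => {
                    items.push(current.take().expect("PushElement without a value"))
                }
                _ => panic!("PushElement outside an array"),
            },
            Effect::Field(name) => match stack.last_mut() {
                Some(Container::Object(fields)) => {
                    let value = current.take().expect("Field without a value");
                    fields.insert(name.to_string(), value);
                }
                _ => panic!("Field outside an object"),
            },
            Effect::EndArray => match stack.pop() {
                Some(Container::Array(items)) => current = Some(Value::Array(items)),
                _ => panic!("EndArray without StartArray"),
            },
            Effect::EndObject => match stack.pop() {
                Some(Container::Object(fields)) => current = Some(Value::Object(fields)),
                _ => panic!("EndObject without StartObject"),
            },
            Effect::ToString => match current.take() {
                Some(Value::Node(text)) => current = Some(Value::String(text.to_string())),
                _ => panic!("ToString without a captured node"),
            },
        }
    }
    current
}
```

Feeding it the effect sequence from the trace (`StartObject`, `CaptureNode(foo)`, `Field("name")`, and so on) yields an object with a node under `name` and an array of two strings under `params`.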
+ +## References + +- [ADR-0004: Query IR Binary Format](ADR-0004-query-ir-binary-format.md) +- [ADR-0005: Transition Graph Format](ADR-0005-transition-graph-format.md) From c543c5bc6a18d6edfdda11ed5f5d42ff1afd4cd2 Mon Sep 17 00:00:00 2001 From: Sergei Zharinov Date: Fri, 12 Dec 2025 12:02:30 -0300 Subject: [PATCH 2/6] Update terminology from "TransitionGraph" to "QueryIR" --- AGENTS.md | 6 ++++++ docs/adr/ADR-0004-query-ir-binary-format.md | 2 +- docs/adr/ADR-0005-transition-graph-format.md | 4 ++-- docs/adr/ADR-0006-dynamic-query-execution.md | 2 +- 4 files changed, 10 insertions(+), 4 deletions(-) diff --git a/AGENTS.md b/AGENTS.md index 519f13de..23a612bd 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -43,6 +43,12 @@ - **Considered Alternatives**: Describe rejected options and why. ``` +## How to write ADRs + +ADRs must be succint and straight to the point. +They must contain examples with high information density and pedagogical value. +These are docs people usually don't want to read, but when they do, they find it quite fascinating. + # Plotnik Query Language Plotnik is a strongly-typed, whitespace-delimited pattern matching language for syntax trees (similar to Tree-sitter but stricter). diff --git a/docs/adr/ADR-0004-query-ir-binary-format.md b/docs/adr/ADR-0004-query-ir-binary-format.md index df2d8b67..b8376db7 100644 --- a/docs/adr/ADR-0004-query-ir-binary-format.md +++ b/docs/adr/ADR-0004-query-ir-binary-format.md @@ -13,7 +13,7 @@ The Query IR lives in a single contiguous allocation—cache-friendly, zero frag ### Container ```rust -struct TransitionGraph { +struct QueryIR { data: Arena, successors_offset: u32, effects_offset: u32, diff --git a/docs/adr/ADR-0005-transition-graph-format.md b/docs/adr/ADR-0005-transition-graph-format.md index d11c63dd..5c9769f0 100644 --- a/docs/adr/ADR-0005-transition-graph-format.md +++ b/docs/adr/ADR-0005-transition-graph-format.md @@ -125,12 +125,12 @@ No `CaptureNode`—implicit on successful match. 
```rust struct TransitionView<'a> { - graph: &'a TransitionGraph, + query_ir: &'a QueryIR, raw: &'a Transition, } struct MatcherView<'a> { - graph: &'a TransitionGraph, + query_ir: &'a QueryIR, raw: &'a Matcher, } diff --git a/docs/adr/ADR-0006-dynamic-query-execution.md b/docs/adr/ADR-0006-dynamic-query-execution.md index a6f8adc6..e83612fe 100644 --- a/docs/adr/ADR-0006-dynamic-query-execution.md +++ b/docs/adr/ADR-0006-dynamic-query-execution.md @@ -89,7 +89,7 @@ struct Frame { } struct Interpreter<'a> { - graph: &'a TransitionGraph, + query_ir: &'a QueryIR, stack: Vec, cursor: TreeCursor<'a>, effects: EffectStream<'a>, From 0621b6c27ad3536178f6fb83ba1184a59b2bcf35 Mon Sep 17 00:00:00 2001 From: Sergei Zharinov Date: Fri, 12 Dec 2025 12:43:02 -0300 Subject: [PATCH 3/6] Update ADR-0004 with refined Query IR binary format --- docs/adr/ADR-0004-query-ir-binary-format.md | 43 ++++++++++++-------- docs/adr/ADR-0005-transition-graph-format.md | 7 +++- 2 files changed, 30 insertions(+), 20 deletions(-) diff --git a/docs/adr/ADR-0004-query-ir-binary-format.md b/docs/adr/ADR-0004-query-ir-binary-format.md index b8376db7..fdcdb3a4 100644 --- a/docs/adr/ADR-0004-query-ir-binary-format.md +++ b/docs/adr/ADR-0004-query-ir-binary-format.md @@ -14,31 +14,31 @@ The Query IR lives in a single contiguous allocation—cache-friendly, zero frag ```rust struct QueryIR { - data: Arena, + ir_buffer: QueryIRBuffer, successors_offset: u32, effects_offset: u32, negated_fields_offset: u32, string_refs_offset: u32, string_bytes_offset: u32, + type_info_offset: u32, entrypoints_offset: u32, - default_entrypoint: TransitionId, } ``` -Transitions start at offset 0 (implicit). +Transitions start at offset 0. Default entrypoint is always at offset 0. -### Arena +### QueryIRBuffer ```rust -const ARENA_ALIGN: usize = 4; +const BUFFER_ALIGN: usize = 4; -struct Arena { +struct QueryIRBuffer { ptr: *mut u8, len: usize, } ``` -Allocated via `Layout::from_size_align(len, ARENA_ALIGN)`. 
Standard `Box<[u8]>` won't work—it assumes 1-byte alignment and corrupts `dealloc`. +Allocated via `Layout::from_size_align(len, BUFFER_ALIGN)`. Standard `Box<[u8]>` won't work—it assumes 1-byte alignment and corrupts `dealloc`. ### Segments @@ -50,11 +50,12 @@ Allocated via `Layout::from_size_align(len, ARENA_ALIGN)`. Standard `Box<[u8]>` | Negated Fields | `[NodeFieldId; Q]` | `negated_fields_offset` | 2 | | String Refs | `[StringRef; R]` | `string_refs_offset` | 4 | | String Bytes | `[u8; S]` | `string_bytes_offset` | 1 | +| Type Info | `[TypeInfo; U]` | `type_info_offset` | 4 | | Entrypoints | `[Entrypoint; T]` | `entrypoints_offset` | 4 | Each offset is aligned: `(offset + align - 1) & !(align - 1)`. -### Strings +### Stringsi Single pool for all strings (field names, variant tags, entrypoint names): @@ -81,15 +82,20 @@ Strings are interned during construction—identical strings share storage and I ### Serialization ``` -Header (20 bytes): +Header (44 bytes): magic: [u8; 4] b"PLNK" version: u32 format version + ABI hash - checksum: u32 CRC32(segment_offsets || arena_data) - arena_len: u32 - segment_count: u32 - -Segment Offsets (segment_count × 4 bytes) -Arena Data (arena_len bytes) + checksum: u32 CRC32(offsets || buffer_data) + buffer_len: u32 + successors_offset: u32 + effects_offset: u32 + negated_fields_offset: u32 + string_refs_offset: u32 + string_bytes_offset: u32 + type_info_offset: u32 + entrypoints_offset: u32 + +Buffer Data (buffer_len bytes) ``` Little-endian always. UTF-8 strings. Version mismatch or checksum failure → recompile. @@ -113,7 +119,7 @@ Func = (function_declaration name: (identifier) @name) Expr = [ Ident: (identifier) @name Num: (number) @value ] ``` -Arena layout: +Buffer layout: ``` 0x0000 Transitions [T0, T1, T2, ...] @@ -121,8 +127,9 @@ Arena layout: 0x0200 Effects [StartObject, Field(0), ...] 0x0280 Negated Fields [] 0x0280 String Refs [{0,4}, {4,5}, {9,5}, ...] 
-0x02C0 String Bytes "namevalueIdentNum FuncExpr" -0x0300 Entrypoints [{4, T0}, {5, T3}] +0x02C0 String Bytes "namevalueIdentNumFuncExpr" +0x0300 Type Info [...] +0x0340 Entrypoints [{4, T0}, {5, T3}] ``` `"name"` stored once, used by both `@name` captures. diff --git a/docs/adr/ADR-0005-transition-graph-format.md b/docs/adr/ADR-0005-transition-graph-format.md index 5c9769f0..54135c15 100644 --- a/docs/adr/ADR-0005-transition-graph-format.md +++ b/docs/adr/ADR-0005-transition-graph-format.md @@ -14,6 +14,8 @@ Edge-centric IR: transitions carry all semantics (matching, effects, successors) ```rust type TransitionId = u32; +type NodeTypeId = u16; // from tree-sitter, do not change +type NodeFieldId = NonZeroU16; // from tree-sitter, Option uses 0 for None type DataFieldId = u16; type VariantTagId = u16; type RefId = u16; @@ -65,6 +67,7 @@ enum Matcher { Anonymous { kind: NodeTypeId, // 2 field: Option, // 2 + negated_fields: Slice, // 8 }, Wildcard, Down, // cursor to first child @@ -73,7 +76,7 @@ enum Matcher { // 16 bytes, align 4 ``` -`NodeFieldId` is `NonZeroU16`—`Option` uses 0 for `None`. +`Option` uses 0 for `None` (niche optimization). 
### RefTransition @@ -92,7 +95,7 @@ Explicit `None` ensures stable binary layout (`Option` niche is unspecifie ### EffectOp ```rust -#[repr(C)] +#[repr(C, u16)] enum EffectOp { StartArray, PushElement, From 599305383cc2954f71fe9897cae1824f016c90aa Mon Sep 17 00:00:00 2001 From: Sergei Zharinov Date: Fri, 12 Dec 2025 14:01:28 -0300 Subject: [PATCH 4/6] Update ADR with recursion and backtracking details --- docs/adr/ADR-0005-transition-graph-format.md | 28 +++- docs/adr/ADR-0006-dynamic-query-execution.md | 151 +++++++++++-------- 2 files changed, 113 insertions(+), 66 deletions(-) diff --git a/docs/adr/ADR-0005-transition-graph-format.md b/docs/adr/ADR-0005-transition-graph-format.md index 54135c15..5c492fed 100644 --- a/docs/adr/ADR-0005-transition-graph-format.md +++ b/docs/adr/ADR-0005-transition-graph-format.md @@ -84,14 +84,38 @@ enum Matcher { #[repr(C, u8)] enum RefTransition { None, - Enter(RefId), // push return stack - Exit(RefId), // pop, must match + Enter(RefId), // push call frame with returns + Exit(RefId), // pop frame, use stored returns } // 4 bytes, align 2 ``` Explicit `None` ensures stable binary layout (`Option` niche is unspecified). +### Enter/Exit Semantics + +**Problem**: A definition can be called from multiple sites. Naively, `Exit.next` would contain all possible return points from all call sites, requiring O(N) filtering at runtime to find which return is valid for the current call. + +**Solution**: Store return transitions at `Enter` time (in the call frame), retrieve at `Exit` time. O(1) exit, no filtering. + +For `Enter(ref_id)` transitions, `next` has special structure: + +- `next[0]`: definition entry point (where to jump) +- `next[1..]`: return transitions (stored in call frame) + +For `Exit(ref_id)` transitions, `next` is **ignored**. Return transitions come from the call frame pushed at `Enter`. See [ADR-0006](ADR-0006-dynamic-query-execution.md) for execution details. 
+ +``` +Call site: +T1: ε + Enter(Func) next=[T10, T2, T3] + │ └─────┴─── return transitions (stored in frame) + └─────────────── definition entry + +Definition: +T10: Match(...) next=[T11] +T11: ε + Exit(Func) next=[] (ignored, returns from frame) +``` + ### EffectOp ```rust diff --git a/docs/adr/ADR-0006-dynamic-query-execution.md b/docs/adr/ADR-0006-dynamic-query-execution.md index e83612fe..385cd9a3 100644 --- a/docs/adr/ADR-0006-dynamic-query-execution.md +++ b/docs/adr/ADR-0006-dynamic-query-execution.md @@ -22,20 +22,20 @@ For each transition: ### Effect Stream ```rust -enum RuntimeEffect<'a> { - Op(EffectOp), - CaptureNode(Node<'a>), // implicit on match, never in graph +struct EffectStream<'a> { + effects: Vec>, // append-only, backtrack via truncate } -struct EffectStream<'a> { - effects: Vec>, +enum RuntimeEffect<'a> { + Op(EffectOp), + CaptureNode(Node<'a>), // implicit on match, never in IR } ``` -Append-only. Backtrack via `truncate(watermark)`. - ### Executor +Converts effect stream to output value. + ```rust struct Executor<'a> { current: Option>, @@ -72,89 +72,112 @@ enum Container<'a> { Invalid state = IR bug → panic. -### Backtracking +### Interpreter + +```rust +struct Interpreter<'a> { + query_ir: &'a QueryIR, + backtrack_stack: BacktrackStack, + recursion_stack: RecursionStack, + cursor: TreeCursor<'a>, // created at tree root, never reset + effects: EffectStream<'a>, +} +``` -Two checkpoints, saved together: +**Cursor constraint**: The cursor must be created once at the tree root and never call `reset()`. This preserves `descendant_index` validity for backtracking checkpoints. -- `cursor.descendant_index()` → restore via `goto_descendant(pos)` -- `effect_stream.len()` → restore via `truncate(watermark)` +Two stacks interact: backtracking can restore to a point inside a previously-exited call, so the recursion stack must preserve frames. 
-### Recursion +### Backtracking ```rust -struct Frame { - ref_id: RefId, - cursor_checkpoint: usize, - effect_watermark: usize, +struct BacktrackStack { + points: Vec, } -struct Interpreter<'a> { - query_ir: &'a QueryIR, - stack: Vec, - cursor: TreeCursor<'a>, - effects: EffectStream<'a>, +struct BacktrackPoint { + cursor_checkpoint: u32, // tree-sitter descendant_index + effect_watermark: u32, + recursion_frame: Option, // saved frame index + alternatives: Slice, } ``` -`Enter(ref_id)`: push frame, follow `next` into definition. +| Operation | Action | +| --------- | ------------------------------------------------------ | +| Save | `cursor_checkpoint = cursor.descendant_index()` — O(1) | +| Restore | `cursor.goto_descendant(cursor_checkpoint)` — O(depth) | -`Exit(ref_id)`: verify match, pop frame, continue unconditionally. +Restore also truncates `effects` to `effect_watermark` and sets `recursion_stack.current` to `recursion_frame`. -Entry filtering: only take `Exit(ref_id)` if it matches stack top. +### Recursion -### Example +**Problem**: A definition can be called from N sites. Naively, `Exit.next` contains all N return points, requiring O(N) filtering. -Query: +**Solution**: Store returns in call frame at `Enter`, retrieve at `Exit`. O(1), no filtering. -``` -Func = (function_declaration - name: (identifier) @name - parameters: (parameters (identifier)* @params :: string)) +```rust +struct RecursionStack { + frames: Vec, // append-only + current: Option, // index into frames, not depth +} + +struct CallFrame { + parent: Option, // index of caller's frame + ref_id: RefId, // verify Exit matches Enter + returns: Slice, // from Enter.next[1..] +} ``` -Input: `function foo(a, b) {}` +**Append-only invariant**: Frames are never removed. On `Exit`, set `current` to parent index. Backtracking restores `current`; the original frame is still accessible via its index. 
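The append-only invariant above can be sketched as follows. This is an illustration under assumptions: frame indices are taken to be `u32`, `returns` is an owned `Vec` rather than a slice into the IR buffer, and the method names `enter` and `exit` are hypothetical.

```rust
type RefId = u16;
type TransitionId = u32;

struct CallFrame {
    parent: Option<u32>,        // index of the caller's frame
    ref_id: RefId,              // checked when the matching Exit runs
    returns: Vec<TransitionId>, // assumption: owned copy instead of a Slice
}

#[derive(Default)]
struct RecursionStack {
    frames: Vec<CallFrame>, // append-only
    current: Option<u32>,   // index into `frames`, not a depth
}

impl RecursionStack {
    /// `Enter`: append a frame whose parent is the current one, then make it current.
    fn enter(&mut self, ref_id: RefId, returns: Vec<TransitionId>) {
        self.frames.push(CallFrame { parent: self.current, ref_id, returns });
        self.current = Some((self.frames.len() - 1) as u32);
    }

    /// `Exit`: step back to the parent and hand out the stored return transitions.
    /// The frame itself stays in `frames` so a later backtrack can point at it again.
    fn exit(&mut self, ref_id: RefId) -> &[TransitionId] {
        let idx = self.current.expect("Exit without a matching Enter") as usize;
        let frame = &self.frames[idx];
        assert_eq!(frame.ref_id, ref_id, "Exit does not match Enter");
        self.current = frame.parent;
        &self.frames[idx].returns
    }
}
```

Saving a backtrack point is then just copying `current`, and restoring is assigning it back; no frame is ever popped.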
-**Phase 1: Match → Effect Stream** +| Operation | Action | +| ----------------- | -------------------------------------------------------------------------- | +| `Enter(ref_id)` | Push frame (parent = `current`), set `current = len-1`, follow `next[0]` | +| `Exit(ref_id)` | Verify ref_id, set `current = frame.parent`, continue with `frame.returns` | +| Save backtrack | Store `current` | +| Restore backtrack | Set `current` to saved value | + +**Why index instead of depth?** Using logical depth breaks on Enter-Exit-Enter sequences: ``` -pre: StartObject -match function_declaration → CaptureNode(func) -match identifier "foo" → CaptureNode(foo) -post: Field("name") -pre: StartArray -match identifier "a" → CaptureNode(a), ToString, PushElement -match identifier "b" → CaptureNode(b), ToString, PushElement -post: EndArray, Field("params"), EndObject +Main = [(A) (B)] +A = (identifier) +B = (number) +Input: boolean + +# Broken (depth-based): +1. Save BP depth=0 +2. Enter(A) push FA, depth=1 +3. Match identifier ✗ +4. Exit(A) depth=0 +5. Restore BP depth=0 +6. Enter(B) push FB, frames=[FA,FB], depth=1 +7. frames[depth-1] = FA, not FB! ← wrong frame + +# Correct (index-based): +1. Save BP current=None +2. Enter(A) push FA{parent=None}, current=0 +3. Match identifier ✗ +4. Exit(A) current=None +5. Restore BP current=None +6. Enter(B) push FB{parent=None}, current=1 +7. 
frames[current] = FB ✓ ``` -**Phase 2: Execute → Value** - -| Effect | current | stack | -| ---------------- | --------- | ---------------- | -| StartObject | — | [{}] | -| CaptureNode(foo) | Node(foo) | [{}] | -| Field("name") | — | [{name:Node}] | -| StartArray | — | [{…}, []] | -| CaptureNode(a) | Node(a) | [{…}, []] | -| ToString | "a" | [{…}, []] | -| PushElement | — | [{…}, ["a"]] | -| CaptureNode(b) | Node(b) | [{…}, ["a"]] | -| ToString | "b" | [{…}, ["a"]] | -| PushElement | — | [{…}, ["a","b"]] | -| EndArray | ["a","b"] | [{…}] | -| Field("params") | — | [{…,params}] | -| EndObject | {…} | [] | - -Result: `{ name: , params: ["a", "b"] }` +Frames form a forest of call chains. Each backtrack point references an exact frame, not a depth. + +### Atomic Groups (Future) + +Cut/commit (discard backtrack points) works correctly: unreachable frames become garbage but cause no issues. ### Variant Serialization ```json -{ "$tag": "A", "$data": { "x": 1 } } -{ "$tag": "B", "$data": [1, 2, 3] } +{ "$tag": "A", "$data": { ... } } ``` -Uniform structure. `$tag`/`$data` avoid capture collisions. +`$tag`/`$data` avoid capture name collisions. ### Fuel @@ -165,9 +188,9 @@ Details deferred. ## Consequences -**Positive**: Append-only stream makes backtracking trivial. Two-phase separation is clean. +**Positive**: Append-only stacks make backtracking trivial. O(1) exit via stored returns. Two-phase separation is clean. -**Negative**: Interpretation overhead. Extra pass for effect execution. +**Negative**: Interpretation overhead. Recursion stack memory grows monotonically (bounded by `recursion_fuel`). 
## References From 90b92263127b23c48770a8dae69e025a64aa8217 Mon Sep 17 00:00:00 2001 From: Sergei Zharinov Date: Fri, 12 Dec 2025 14:21:36 -0300 Subject: [PATCH 5/6] Update transition graph format with cache-line alignment and inline successors --- AGENTS.md | 1 + docs/adr/ADR-0004-query-ir-binary-format.md | 6 +- docs/adr/ADR-0005-transition-graph-format.md | 67 ++++++++++++++------ docs/adr/ADR-0006-dynamic-query-execution.md | 18 +++--- 4 files changed, 62 insertions(+), 30 deletions(-) diff --git a/AGENTS.md b/AGENTS.md index 23a612bd..4de08658 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -48,6 +48,7 @@ ADRs must be succint and straight to the point. They must contain examples with high information density and pedagogical value. These are docs people usually don't want to read, but when they do, they find it quite fascinating. +Avoid imperative code, describe structure definitions, their purpose and how to use them properly. # Plotnik Query Language diff --git a/docs/adr/ADR-0004-query-ir-binary-format.md b/docs/adr/ADR-0004-query-ir-binary-format.md index fdcdb3a4..01f84416 100644 --- a/docs/adr/ADR-0004-query-ir-binary-format.md +++ b/docs/adr/ADR-0004-query-ir-binary-format.md @@ -30,7 +30,7 @@ Transitions start at offset 0. Default entrypoint is always at offset 0. ### QueryIRBuffer ```rust -const BUFFER_ALIGN: usize = 4; +const BUFFER_ALIGN: usize = 64; // cache-line alignment for transitions struct QueryIRBuffer { ptr: *mut u8, @@ -38,13 +38,13 @@ struct QueryIRBuffer { } ``` -Allocated via `Layout::from_size_align(len, BUFFER_ALIGN)`. Standard `Box<[u8]>` won't work—it assumes 1-byte alignment and corrupts `dealloc`. +Allocated via `Layout::from_size_align(len, BUFFER_ALIGN)`. Standard `Box<[u8]>` won't work—it assumes 1-byte alignment and corrupts `dealloc`. The 64-byte alignment ensures transitions never straddle cache lines. 
### Segments | Segment | Type | Offset | Align | | -------------- | ------------------- | ----------------------- | ----- | -| Transitions | `[Transition; N]` | 0 | 4 | +| Transitions | `[Transition; N]` | 0 | 64 | | Successors | `[TransitionId; M]` | `successors_offset` | 4 | | Effects | `[EffectOp; P]` | `effects_offset` | 2 | | Negated Fields | `[NodeFieldId; Q]` | `negated_fields_offset` | 2 | diff --git a/docs/adr/ADR-0005-transition-graph-format.md b/docs/adr/ADR-0005-transition-graph-format.md index 5c492fed..d03f02a0 100644 --- a/docs/adr/ADR-0005-transition-graph-format.md +++ b/docs/adr/ADR-0005-transition-graph-format.md @@ -37,22 +37,51 @@ struct Slice { ### Transition ```rust -#[repr(C)] +#[repr(C, align(64))] struct Transition { - matcher: Matcher, // 16 bytes + // --- 40 bytes metadata --- + matcher: Matcher, // 16 pre_anchored: bool, // 1 post_anchored: bool, // 1 _pad1: [u8; 2], // 2 pre_effects: Slice, // 8 post_effects: Slice, // 8 ref_marker: RefTransition, // 4 - next: Slice, // 8 + + // --- 24 bytes control flow --- + successor_count: u32, // 4 + successor_data: [u32; 5], // 20 } -// 48 bytes, align 4 +// 64 bytes, align 64 (cache-line aligned) ``` Single `ref_marker` slot—sequences like `Enter(A) → Enter(B)` remain as epsilon chains. +### Inline Successors (SSO-style) + +Successors use a small-size optimization to avoid indirection for the common case: + +| `successor_count` | Layout | +| ----------------- | ------------------------------------------------------------------------------------ | +| 0–5 | `successor_data[0..count]` contains `TransitionId` values directly | +| > 5 | `successor_data[0]` is offset into `successors` segment, `successor_count` is length | + +Why 5 slots: 24 available bytes / 4 bytes per `TransitionId` = 6 slots, minus 1 for the count field leaves 5. 
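
The inline/spilled rule in the table above can be sketched as a single accessor. This is an illustration under stated assumptions: only the control-flow tail of `Transition` is modelled, the `spill` parameter stands in for the external `successors` segment, and `successor_data[0]` is treated as an element index into that segment (the ADR says "offset" without specifying units).

```rust
type TransitionId = u32;

/// Inline capacity: 24 control-flow bytes minus the 4-byte count, 4 bytes per id.
const INLINE_SLOTS: usize = 5;

/// Hypothetical stand-in for the last 24 bytes of `Transition`.
struct SuccessorField {
    successor_count: u32,
    successor_data: [u32; INLINE_SLOTS],
}

/// Returns a uniform slice whether the ids are stored inline or spilled.
/// `spill` models the `successors` segment (assumed to be indexed by element).
fn successors<'a>(field: &'a SuccessorField, spill: &'a [TransitionId]) -> &'a [TransitionId] {
    let count = field.successor_count as usize;
    if count <= INLINE_SLOTS {
        // 0..=5 successors: ids live directly in the transition.
        &field.successor_data[..count]
    } else {
        // More than 5: slot 0 holds the start position in the external segment.
        let start = field.successor_data[0] as usize;
        &spill[start..start + count]
    }
}
```

Callers that go through an accessor of this shape never branch on the storage location themselves, which is the property the view layer is meant to provide.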
+ +Coverage: + +- Linear sequences: 1 successor +- Simple branches, quantifiers: 2 successors +- Most alternations: 2–5 branches + +Only massive alternations (6+ branches) spill to the external buffer. + +Cache benefits: + +- 64 bytes = L1 cache line on x86/ARM64 +- No transition straddles cache lines +- No pointer chase for 99%+ of transitions + ### Matcher ```rust @@ -98,23 +127,25 @@ Explicit `None` ensures stable binary layout (`Option` niche is unspecifie **Solution**: Store return transitions at `Enter` time (in the call frame), retrieve at `Exit` time. O(1) exit, no filtering. -For `Enter(ref_id)` transitions, `next` has special structure: +For `Enter(ref_id)` transitions, `successor_data` has special structure: -- `next[0]`: definition entry point (where to jump) -- `next[1..]`: return transitions (stored in call frame) +- `successor_data[0]`: definition entry point (where to jump) +- `successor_data[1..count]`: return transitions (stored in call frame) -For `Exit(ref_id)` transitions, `next` is **ignored**. Return transitions come from the call frame pushed at `Enter`. See [ADR-0006](ADR-0006-dynamic-query-execution.md) for execution details. +For `Exit(ref_id)` transitions, successors are **ignored**. Return transitions come from the call frame pushed at `Enter`. See [ADR-0006](ADR-0006-dynamic-query-execution.md) for execution details. ``` Call site: -T1: ε + Enter(Func) next=[T10, T2, T3] - │ └─────┴─── return transitions (stored in frame) - └─────────────── definition entry +T1: ε + Enter(Func) successors=[T10, T2, T3] + │ └─────┴─── return transitions (stored in frame) + └─────────────── definition entry +``` Definition: -T10: Match(...) next=[T11] -T11: ε + Exit(Func) next=[] (ignored, returns from frame) -``` +T10: Match(...) 
successors=[T11] +T11: ε + Exit(Func) successors=[] (ignored, returns from frame) + +```` ### EffectOp @@ -132,7 +163,7 @@ enum EffectOp { ToString, } // 4 bytes, align 2 -``` +```` No `CaptureNode`—implicit on successful match. @@ -164,7 +195,7 @@ struct MatcherView<'a> { enum MatcherKind { Epsilon, Node, Anonymous, Wildcard, Down, Up } ``` -Views resolve `Slice` to `&[T]`. Engine code never touches offsets directly. +Views resolve `Slice` to `&[T]`. `TransitionView::successors()` returns `&[TransitionId]`, hiding the inline/spilled distinction—callers see a uniform slice regardless of storage location. Engine code never touches offsets or `successor_data` directly. ### Quantifiers @@ -269,9 +300,9 @@ Incoming epsilon effects → `pre_effects`. Outgoing → `post_effects`. ## Consequences -**Positive**: No state objects. Compact 48-byte transitions. Views hide offset arithmetic. +**Positive**: No state objects. Cache-line aligned 64-byte transitions eliminate cache straddling. Inline successors remove pointer chasing for common cases. Views hide offset arithmetic and inline/spilled distinction. -**Negative**: Single `ref_marker` leaves some epsilon chains. Large queries may pressure cache. +**Negative**: Single `ref_marker` leaves some epsilon chains. 33% size increase over minimal layout (acceptable for KB-scale query binaries). ## References diff --git a/docs/adr/ADR-0006-dynamic-query-execution.md b/docs/adr/ADR-0006-dynamic-query-execution.md index 385cd9a3..1686133b 100644 --- a/docs/adr/ADR-0006-dynamic-query-execution.md +++ b/docs/adr/ADR-0006-dynamic-query-execution.md @@ -17,7 +17,7 @@ For each transition: 1. Emit `pre_effects` 2. Match (epsilon always succeeds) 3. On success: emit `CaptureNode`, emit `post_effects` -4. Process `next` with backtracking +4. 
Process successors with backtracking ### Effect Stream @@ -112,7 +112,7 @@ Restore also truncates `effects` to `effect_watermark` and sets `recursion_stack ### Recursion -**Problem**: A definition can be called from N sites. Naively, `Exit.next` contains all N return points, requiring O(N) filtering. +**Problem**: A definition can be called from N sites. Naively, Exit's successors contain all N return points, requiring O(N) filtering. **Solution**: Store returns in call frame at `Enter`, retrieve at `Exit`. O(1), no filtering. @@ -125,18 +125,18 @@ struct RecursionStack { struct CallFrame { parent: Option, // index of caller's frame ref_id: RefId, // verify Exit matches Enter - returns: Slice, // from Enter.next[1..] + returns: Slice, // from Enter.successors()[1..] } ``` **Append-only invariant**: Frames are never removed. On `Exit`, set `current` to parent index. Backtracking restores `current`; the original frame is still accessible via its index. -| Operation | Action | -| ----------------- | -------------------------------------------------------------------------- | -| `Enter(ref_id)` | Push frame (parent = `current`), set `current = len-1`, follow `next[0]` | -| `Exit(ref_id)` | Verify ref_id, set `current = frame.parent`, continue with `frame.returns` | -| Save backtrack | Store `current` | -| Restore backtrack | Set `current` to saved value | +| Operation | Action | +| ----------------- | ------------------------------------------------------------------------------ | +| `Enter(ref_id)` | Push frame (parent = `current`), set `current = len-1`, follow `successors[0]` | +| `Exit(ref_id)` | Verify ref_id, set `current = frame.parent`, continue with `frame.returns` | +| Save backtrack | Store `current` | +| Restore backtrack | Set `current` to saved value | **Why index instead of depth?** Using logical depth breaks on Enter-Exit-Enter sequences: From 843c05b2160e6c7886ea604d3c8fe900f4f30e18 Mon Sep 17 00:00:00 2001 From: Sergei Zharinov Date: Fri, 12 Dec 2025 
14:53:29 -0300 Subject: [PATCH 6/6] Update Index to Remove Direct Link and Note Availability via Git History --- AGENTS.md | 2 +- ...-0003-query-intermediate-representation.md | 924 ------------------ docs/adr/ADR-0004-query-ir-binary-format.md | 2 +- docs/adr/ADR-0005-transition-graph-format.md | 2 +- docs/adr/ADR-0006-dynamic-query-execution.md | 2 +- 5 files changed, 4 insertions(+), 928 deletions(-) delete mode 100644 docs/adr/ADR-0003-query-intermediate-representation.md diff --git a/AGENTS.md b/AGENTS.md index 4de08658..7f04014e 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -16,7 +16,7 @@ - **Index**: - [ADR-0001: Query Parser](docs/adr/ADR-0001-query-parser.md) - [ADR-0002: Diagnostics System](docs/adr/ADR-0002-diagnostics-system.md) - - [ADR-0003: Query Intermediate Representation](docs/adr/ADR-0003-query-intermediate-representation.md) (superseded by ADR-0004, ADR-0005, ADR-0006) + - ADR-0003: Query Intermediate Representation (superseded by ADR-0004, ADR-0005, ADR-0006, available via git history) - [ADR-0004: Query IR Binary Format](docs/adr/ADR-0004-query-ir-binary-format.md) - [ADR-0005: Transition Graph Format](docs/adr/ADR-0005-transition-graph-format.md) - [ADR-0006: Dynamic Query Execution](docs/adr/ADR-0006-dynamic-query-execution.md) diff --git a/docs/adr/ADR-0003-query-intermediate-representation.md b/docs/adr/ADR-0003-query-intermediate-representation.md deleted file mode 100644 index 59dc57a2..00000000 --- a/docs/adr/ADR-0003-query-intermediate-representation.md +++ /dev/null @@ -1,924 +0,0 @@ -# ADR-0003: Query Intermediate Representation - -- **Status**: Superseded by [ADR-0004](ADR-0004-query-ir-binary-format.md), [ADR-0005](ADR-0005-transition-graph-format.md), [ADR-0006](ADR-0006-dynamic-query-execution.md) -- **Date**: 2025-12-10 - -## Context - -Plotnik needs to execute queries against tree-sitter syntax trees. 
The query language supports: - -- Named node and anonymous node matching -- Field constraints and negated fields -- Named definitions with mutual recursion -- Quantifiers (`*`, `+`, `?`) with greedy/non-greedy variants -- Alternations (tagged and untagged) -- Sequences -- Captures with type annotations -- Anchors for strict positional matching - -Plotnik supports two execution modes: - -1. **Proc macro (compile-time)**: Query is compiled to specialized Rust functions. Zero runtime interpretation overhead. Used when query is known at compile time. - -2. **Dynamic (runtime)**: Query is parsed and executed at runtime via graph interpretation. Used when query is provided by user input or loaded from files. - -Both modes share the same intermediate representation (IR). The IR must support efficient execution in both contexts. - -The design evolved through several realizations: - -1. **Thompson-style fragment composition**: We adapted Thompson's technique for composing pattern fragments—alternation, sequence, quantifiers. However, unlike classic NFAs (which handle only regular languages), our representation supports recursion via a return stack. - -2. **Transitions do all the work**: Each transition performs navigation, matching, and effect emission. States are just junction points with no semantics. - -3. **Edge-centric model**: Transitions are primary, states are implicit. The IR is a flat array of transitions, each knowing its successors. The result is a _recursive transition network_—like an NFA but with call/return semantics for definition references. - -## Decision - -We adopt an edge-centric intermediate representation where: - -1. **Transitions are primary**: Each transition carries matching logic, effects, and successor links -2. **States are implicit**: No explicit state objects; transitions point directly to successor transitions -3. **Effects are append-only**: Data construction emits effects to a linear stream. 
Backtracking truncates to a saved watermark—no complex undo logic, just `Vec::truncate` -4. **Shared IR, different executors**: The same `TransitionGraph` serves both proc macro codegen and dynamic interpretation - -### Core Data Structures - -These structures are used by both execution modes. - -#### Transition Graph Container - -The graph is immutable after construction. We use a single contiguous allocation sliced into typed segments with proper alignment handling. - -```rust -struct TransitionGraph { - data: Arena, // custom type, see Memory Layout & Alignment - // segment offsets (aligned for each type) - successors_offset: u32, - effects_offset: u32, - negated_fields_offset: u32, - data_fields_offset: u32, - variant_tags_offset: u32, - entrypoint_names_offset: u32, - entrypoints_offset: u32, - default_entrypoint: TransitionId, -} - -impl TransitionGraph { - fn new() -> Self; - fn get(&self, id: TransitionId) -> TransitionView<'_>; - fn entry(&self, name: &str) -> Option>; - fn default_entry(&self) -> TransitionView<'_>; - fn field_name(&self, id: DataFieldId) -> &str; - fn tag_name(&self, id: VariantTagId) -> &str; - fn entrypoint_name(&self, entry: &Entrypoint) -> &str; -} -``` - -##### Memory Arena Design - -The single contiguous allocation is divided into typed segments. Each segment is properly aligned for its type, ensuring safe access across all architectures (x86, ARM, WASM). 
- -**Segment Layout**: - -| Segment | Type | Offset | Alignment | -| ---------------- | ------------------- | ------------------------- | --------- | -| Transitions | `[Transition; N]` | 0 (implicit) | 4 bytes | -| Successors | `[TransitionId; M]` | `successors_offset` | 4 bytes | -| Effects | `[EffectOp; P]` | `effects_offset` | 2 bytes | -| Negated Fields | `[NodeFieldId; Q]` | `negated_fields_offset` | 2 bytes | -| Data Fields | `[u8; R]` | `data_fields_offset` | 1 byte | -| Variant Tags | `[u8; S]` | `variant_tags_offset` | 1 byte | -| Entrypoint Names | `[u8; U]` | `entrypoint_names_offset` | 1 byte | -| Entrypoints | `[Entrypoint; T]` | `entrypoints_offset` | 4 bytes | - -Transitions always start at offset 0—no explicit offset stored. The arena base address is allocated with 4-byte alignment, satisfying `Transition`'s requirement. - -Note: `entry(&str)` performs linear scan — O(n) where n = definition count (typically <20). - -##### Memory Layout & Alignment - -Casting `&u8` to `&T` when the address is not aligned to `T` causes traps on WASM and faults on strict ARM. We enforce alignment explicitly via a custom arena type. - -**A. Base Allocation Alignment** - -The arena must be allocated with alignment equal to the maximum of all segment types. 
We use a custom `Arena` wrapper that tracks the allocation layout for correct deallocation: - -```rust -const ARENA_ALIGN: usize = 4; // align_of::() - -struct Arena { - ptr: *mut u8, - len: usize, -} - -impl Arena { - fn new(len: usize) -> Self { - let layout = std::alloc::Layout::from_size_align(len, ARENA_ALIGN).unwrap(); - let ptr = unsafe { std::alloc::alloc(layout) }; - if ptr.is_null() { - std::alloc::handle_alloc_error(layout); - } - Self { ptr, len } - } -} - -impl Drop for Arena { - fn drop(&mut self) { - let layout = std::alloc::Layout::from_size_align(self.len, ARENA_ALIGN).unwrap(); - unsafe { std::alloc::dealloc(self.ptr, layout) }; - } -} -``` - -Standard `Box<[u8]>` cannot be used here: it assumes 1-byte alignment and would pass incorrect layout to `dealloc`, causing undefined behavior. - -**B. Segment Offset Calculation** - -Each segment offset is rounded up to its type's alignment: - -```rust -fn align_up(offset: usize, align: usize) -> usize { - (offset + align - 1) & !(align - 1) -} - -// Example: if Transitions end at byte 103, Successors (align 4) start at 104 -let successors_offset = align_up(transitions_end, align_of::()); -``` - -**C. Entrypoints Structure** - -Entrypoints use fixed-size metadata with indirect string storage: - -```rust -#[repr(C)] -struct Entrypoint { - name_offset: u32, // index into entrypoint_names segment - name_len: u32, - target: TransitionId, // Index into transitions segment (offset 0) -} -// Size: 12 bytes, Align: 4 bytes -``` - -The `name_offset` points into the `entrypoint_names` segment (u8 array), where alignment is irrelevant. This avoids the alignment hazards of inline variable-length strings. Note: entrypoint names are stored separately from data field names because they serve different purposes—entrypoint names identify subqueries for lookup, while data field names are used in output object construction. 
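
A minimal sketch of how fixed-size `Entrypoint` records and the separate name bytes could be combined for lookup by name. The helper names `entrypoint_name` and `find_entry` are hypothetical, and the two segments are passed in as plain slices rather than resolved from the arena.

```rust
type TransitionId = u32;

#[repr(C)]
struct Entrypoint {
    name_offset: u32, // byte index into the entrypoint names segment
    name_len: u32,
    target: TransitionId,
}

/// Resolves an entrypoint's name from the byte segment.
/// Returns `None` if the range is out of bounds or not valid UTF-8.
fn entrypoint_name<'a>(names: &'a [u8], entry: &Entrypoint) -> Option<&'a str> {
    let start = entry.name_offset as usize;
    let end = start.checked_add(entry.name_len as usize)?;
    std::str::from_utf8(names.get(start..end)?).ok()
}

/// Linear scan over the entrypoints, O(n) in the number of definitions.
fn find_entry(entries: &[Entrypoint], names: &[u8], name: &str) -> Option<TransitionId> {
    entries
        .iter()
        .find(|entry| entrypoint_name(names, entry) == Some(name))
        .map(|entry| entry.target)
}
```

Because the names live in a `u8` segment, the records themselves stay fixed-size and 4-byte aligned regardless of how long any individual name is.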
- -##### Construction Process - -The graph is built in two passes to ensure a single contiguous allocation: - -1. **Analysis Pass**: Traverse the Query AST to count all elements (transitions, effects, negated fields, strings). -2. **Layout & Allocation**: Compute aligned offsets for all segments and allocate the `Arena` once. -3. **Emission Pass**: Serialize data into the arena. - - **String Tables**: Written sequentially to `u8` segments. - - **Slices**: `Slice` fields are populated with `start` indices relative to their segment base. - - **Structs**: `Transition`, `Entrypoint`, etc., are written using `std::ptr::write` to their calculated offsets. - -This approach eliminates dynamic resizing (`realloc`) and fragmentation. - -##### Slice Resolution - -`Slice` handles are resolved to actual slices by combining: - -1. The segment's base offset (e.g., `effects_offset` for `Slice`) -2. The slice's `start` field (element index within segment) -3. The slice's `len` field - -The `TransitionView` methods (`pre_effects()`, `post_effects()`, `next()`) perform this resolution internally, returning standard `&[T]` slices. Engine code never performs offset arithmetic directly. 
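
The resolution steps above can be sketched in safe Rust. The sketch assumes the segment has already been obtained as a typed slice (the byte-offset and alignment handling of the arena is left out), and the free function `resolve` is a hypothetical name, not part of the described API.

```rust
use std::marker::PhantomData;

/// Compact handle: element index and length within one segment.
#[repr(C)]
struct Slice<T> {
    start: u32,
    len: u32,
    _phantom: PhantomData<T>,
}

/// Turns a handle into a borrowed slice of its segment.
/// `segment` is assumed to be the already-typed view of the right segment.
fn resolve<'a, T>(segment: &'a [T], slice: &Slice<T>) -> &'a [T] {
    let start = slice.start as usize;
    let end = start + slice.len as usize;
    // Panics on an out-of-range handle, which would indicate a compiler bug.
    &segment[start..end]
}
```

The type parameter ties a handle to an element type, but as noted for `Slice`, it does not stop a handle from being resolved against the wrong segment of the same element type.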
- -**Access Pattern**: - -The `TransitionView` and `MatcherView` types provide safe access by: - -- Resolving `Slice` handles to actual slices within the appropriate segment -- Converting relative indices to absolute pointers -- Hiding all offset arithmetic from the query engine - -This design achieves: - -- **Cache efficiency**: All graph data in one contiguous allocation -- **Memory efficiency**: No per-node allocations, minimal overhead -- **Type safety**: Phantom types ensure slices point to correct segments -- **Zero-copy**: Direct references into the arena, no cloning - -#### Transition View - -`TransitionView` bundles a graph reference with a transition, enabling ergonomic access without explicit slice resolution: - -```rust -struct TransitionView<'a> { - graph: &'a TransitionGraph, - raw: &'a Transition, -} - -impl<'a> TransitionView<'a> { - fn matcher(&self) -> MatcherView<'a>; - fn next(&self) -> impl Iterator>; - fn pre_effects(&self) -> &[EffectOp]; - fn post_effects(&self) -> &[EffectOp]; - fn is_pre_anchored(&self) -> bool; - fn is_post_anchored(&self) -> bool; - fn ref_marker(&self) -> Option<&RefTransition>; -} - -struct MatcherView<'a> { - graph: &'a TransitionGraph, - raw: &'a Matcher, -} - -impl<'a> MatcherView<'a> { - fn kind(&self) -> MatcherKind; - fn node_kind(&self) -> Option; - fn field(&self) -> Option; - fn negated_fields(&self) -> &[NodeFieldId]; // resolved from Slice - fn matches(&self, cursor: &mut TreeCursor) -> bool; -} - -enum MatcherKind { Epsilon, Node, Anonymous, Wildcard, Down, Up } -``` - -**Execution Flow**: - -The engine traverses transitions following this pattern: - -1. **Pre-effects** execute unconditionally before any matching attempt -2. **Matching** determines whether to proceed: - - With matcher: Test against current cursor position - - Without matcher (epsilon): Always proceed -3. **On successful match**: Implicitly capture the node, execute post-effects -4. 
**Successors** are processed recursively, with appropriate backtracking - -The `TransitionView` abstraction hides all segment access complexity. The same logical flow applies to both execution modes—dynamic interpretation emits effects while proc-macro generation produces direct construction code. - -#### Slice Handle - -A compact, relative reference to a contiguous range within a segment. Replaces `&[T]` to keep structs self-contained. - -```rust -#[repr(C)] -struct Slice { - start: u32, // Index within segment - len: u32, // Number of items - _phantom: PhantomData, -} - -impl Slice { - const EMPTY: Self = Self { start: 0, len: 0, _phantom: PhantomData }; -} - -// Note: PhantomData does not prevent crossing segment boundaries (e.g. passing -// Slice where Slice is expected). Implementers should -// consider wrapping these in newtypes if stricter compile-time safety is required. -``` - -Size: 8 bytes. Using `u32` for both fields fills the natural alignment with no padding waste, supporting up to 4B items per slice—well beyond any realistic query. - -#### Raw Transition - -Internal storage. Engine code uses `TransitionView` instead of accessing this directly. - -```rust -#[repr(C)] -struct Transition { - matcher: Matcher, // 16 bytes (Epsilon variant for epsilon-transitions) - pre_anchored: bool, // 1 byte - post_anchored: bool, // 1 byte - _pad1: [u8; 2], // 2 bytes padding - pre_effects: Slice, // 8 bytes - post_effects: Slice, // 8 bytes - ref_marker: RefTransition, // 4 bytes (explicit None variant) - next: Slice, // 8 bytes -} -// Size: 48 bytes, Align: 4 bytes -``` - -The `TransitionView` resolves `Slice` by combining the graph's segment offset with the slice's start/len fields. - -**Design Note**: The `ref_marker` field is intentionally a single `RefTransition` rather than a `Slice`. This means a transition can carry at most one Enter or Exit marker. 
While this prevents full epsilon elimination for nested reference sequences (e.g., `Enter(A) → Enter(B)`), we accept this limitation for simplicity. Such sequences remain as chains of epsilon transitions in the final graph. - -```rust -type TransitionId = u32; -type DataFieldId = u16; -type VariantTagId = u16; -type RefId = u16; -``` - -Each named definition has an entry point. The default entry is the last definition. Multiple entry points share the same transition graph. - -#### Matcher - -Note: `NodeTypeId` is `u16`. `NodeFieldId` is `NonZeroU16`, which guarantees `Option` uses value `0` for `None` — a stable layout suitable for raw serialization. Both types are defined in `plotnik-core`. - -```rust -#[repr(C, u32)] -enum Matcher { - Epsilon, // no payload - Node { - kind: NodeTypeId, // 2 bytes - field: Option, // 2 bytes - negated_fields: Slice, // 8 bytes - }, - Anonymous { - kind: NodeTypeId, // 2 bytes - field: Option, // 2 bytes - }, - Wildcard, - Down, - Up, -} -// Size: 16 bytes (4-byte discriminant + 12-byte largest variant). Align: 4 bytes -``` - -Navigation variants `Down`/`Up` move the cursor without matching. They enable nested patterns like `(function_declaration (identifier) @name)` where we must descend into children. - -#### Reference Markers - -```rust -#[repr(C, u8)] // Explicit 1-byte discriminant for stable serialization -enum RefTransition { - None, // no marker (discriminant 0) - Enter(RefId), // push ref_id onto return stack (discriminant 1) - Exit(RefId), // pop from return stack, must match ref_id (discriminant 2) -} -// Size: 4 bytes (1-byte discriminant + 1-byte padding + 2-byte RefId), Align: 2 bytes -``` - -The explicit `None` variant (rather than `Option`) ensures stable binary layout for raw arena serialization. `Option` relies on compiler-specific niche optimization whose bit-pattern is unspecified. - -Thompson construction creates epsilon transitions with `Enter`/`Exit` markers. 
Epsilon elimination propagates these markers to surviving transitions. At runtime, the engine uses markers to filter which `next` transitions are valid based on return stack state. Multiple transitions can share the same `RefId` after epsilon elimination. - -#### Effect Operations - -Instructions stored in the transition graph. These are static, `Copy`, and contain no runtime data. - -```rust -#[derive(Clone, Copy)] -#[repr(C)] -enum EffectOp { - StartArray, // push new [] onto container stack - PushElement, // move current value into top array - EndArray, // pop array from stack, becomes current - StartObject, // push new {} onto container stack - EndObject, // pop object from stack, becomes current - Field(DataFieldId), // move current value into field on top object - StartVariant(VariantTagId), // push variant tag onto container stack - EndVariant, // pop variant from stack, wrap current, becomes current - ToString, // convert current Node value to String (source text) -} -// Size: 4 bytes (1-byte discriminant + 2-byte payload + 1-byte padding), Align: 2 bytes -``` - -Note: There is no `CaptureNode` instruction. Node capture is implicit—a successful match automatically emits `RuntimeEffect::CaptureNode` to the effect stream (see below). - -Effects capture structure only—arrays, objects, variants. Type annotations (`:: str`, `:: Type`) are separate metadata applied during post-processing. 
- -##### Effect Placement Rules - -After epsilon elimination, effects are classified as pre or post based on when they must execute relative to the match: - -| Effect | Placement | Reason | -| -------------- | --------- | ------------------------------------------ | -| `StartArray` | Pre | Container must exist before elements added | -| `StartObject` | Pre | Container must exist before fields added | -| `StartVariant` | Pre | Tag must be set before payload captured | -| `PushElement` | Post | Consumes the just-matched node | -| `Field` | Post | Consumes the just-matched node | -| `EndArray` | Post | Finalizes after last element matched | -| `EndObject` | Post | Finalizes after last field matched | -| `EndVariant` | Post | Wraps payload after it's captured | -| `ToString` | Post | Converts the just-matched node to text | - -Pre-effects from incoming epsilon paths accumulate in order. Post-effects from outgoing epsilon paths accumulate in order. This ordering is deterministic and essential for correct data construction. - -### Data Construction (Dynamic Interpreter) - -This section describes data construction for the dynamic interpreter. Proc-macro codegen uses direct construction instead (see [Direct Construction](#direct-construction-no-effect-stream)). - -The interpreter emits events to a linear stream during matching. After a successful match, the stream is executed to build the output. - -#### Runtime Effects - -Events emitted to the effect stream during interpretation. Unlike `EffectOp`, these carry runtime data. - -```rust -enum RuntimeEffect<'a> { - Op(EffectOp), // forwarded instruction from graph - CaptureNode(Node<'a>), // emitted implicitly on successful match -} -``` - -The `CaptureNode` variant is never stored in the graph—it's generated by the interpreter when a match succeeds. This separation keeps the graph static (no lifetimes) while allowing the runtime stream to carry actual node references. 
- -#### Effect Stream - -```rust -/// Accumulates runtime effects during matching; supports rollback on backtrack -struct EffectStream<'a> { - effects: Vec>, -} -``` - -The effect stream accumulates effects linearly during matching. It provides: - -- **Effect emission**: Appends `EffectOp` instructions and `CaptureNode` events -- **Watermarking**: Records position before attempting branches -- **Rollback**: Truncates to saved position on backtrack - -This append-only design makes backtracking trivial—just truncate the vector. No complex undo logic needed. - -#### Execution Model - -Two separate concepts during effect execution: - -1. **Current value** — the last matched node or just-completed container -2. **Container stack** — objects and arrays being built - -```rust -struct Executor<'a> { - current: Option>, // last matched node or completed container - stack: Vec>, // objects/arrays being built -} - -// Result is the final `current` value after execution completes. -// This allows returning any value type: Object, Array, Node, String, or Variant. 
- -enum Value<'a> { - Node(Node<'a>), // AST node reference - String(String), // Text values (from @capture :: string) - Array(Vec>), // completed array - Object(BTreeMap>), // completed object (BTreeMap for deterministic iteration) - Variant(VariantTagId, Box>), // tagged variant (tag + payload) -} - -enum Container<'a> { - Array(Vec>), // array under construction - Object(BTreeMap>), // object under construction - Variant(VariantTagId), // variant tag; EndVariant wraps current value -} -``` - -Effect semantics on `current`: - -- `CaptureNode(node)` → sets `current` to `Value::Node(node)` -- `Field(id)` → moves `current` into top object, clears to `None` -- `PushElement` → moves `current` into top array, clears to `None` -- `End*` → pops container from stack into `current` -- `ToString` → replaces `current` Node with its source text as String - -**Error Handling** - -The interpreter assumes the effect stream is well-formed (guaranteed by the query compiler). - -- **Panic**: Any operation on invalid state (e.g., `Field` when `current` is `None`, `EndArray` with empty stack, `ToString` on non-Node). These indicate bugs in the IR construction. - -#### Execution Pipeline - -For any given transition, the execution order is strict to ensure data consistency during backtracking: - -1. **Enter**: Push `Frame` with current `effect_stream.watermark()`. -2. **Pre-Effects**: Emit `pre_effects` as `RuntimeEffect::Op(...)`. -3. **Match**: Validate node kind/fields. If fail, rollback to watermark and abort. -4. **Capture**: Emit `RuntimeEffect::CaptureNode(matched_node)` — implicit, not from graph. -5. **Post-Effects**: Emit `post_effects` as `RuntimeEffect::Op(...)`. -6. **Exit**: Pop `Frame` (validate return). - -This order ensures correct behavior during epsilon elimination. Pre-effects run before the match overwrites `current`, allowing effects like `PushElement` to be safely merged from preceding epsilon transitions. 
Post-effects run after, for effects that need the newly matched node. - -The key insight: `CaptureNode` is generated by the interpreter on successful match, not stored as an instruction. The graph only contains structural operations (`EffectOp`); the runtime stream (`RuntimeEffect`) adds the actual node data. - -#### Example - -Query: - -``` -Func = (function_declaration - name: (identifier) @name - parameters: (parameters (identifier)* @params :: string)) -``` - -Input: `function foo(a, b) {}` - -Runtime effect stream (showing `EffectOp` from graph vs implicit `CaptureNode`): - -``` -graph pre: Op(StartObject) -implicit: CaptureNode(foo) ← from successful match -graph post: Op(Field("name")) -graph pre: Op(StartArray) -implicit: CaptureNode(a) ← from successful match -graph post: Op(ToString) -graph post: Op(PushElement) -implicit: CaptureNode(b) ← from successful match -graph post: Op(ToString) -graph post: Op(PushElement) -graph post: Op(EndArray) -graph post: Op(Field("params")) -graph post: Op(EndObject) -``` - -Note: The graph stores only `EffectOp` instructions. `CaptureNode` events are generated by the interpreter on each successful match—they never appear in `Transition.pre_effects` or `Transition.post_effects`. - -In the raw graph, `EffectOp`s live on epsilon transitions between matches. The pre/post classification determines where they land after epsilon elimination. `StartObject` and `StartArray` are pre-effects (setup before matching). `Field`, `PushElement`, `ToString`, and `End*` are post-effects (consume the matched node or finalize containers). 
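
To make the `current`/stack semantics concrete, here is a simplified sketch of an executor for the stream above. It is an illustration only: nodes are represented by their source text instead of tree-sitter nodes, field names are string literals instead of `DataFieldId`, graph ops and `CaptureNode` are merged into one hypothetical `Effect` enum, and variants are omitted.

```rust
use std::collections::BTreeMap;

#[derive(Debug, Clone, PartialEq)]
enum Value {
    Node(String), // stand-in for a node reference; holds its source text
    String(String),
    Array(Vec<Value>),
    Object(BTreeMap<String, Value>),
}

enum Container {
    Array(Vec<Value>),
    Object(BTreeMap<String, Value>),
}

enum Effect {
    StartObject,
    EndObject,
    StartArray,
    EndArray,
    PushElement,
    Field(&'static str),       // assumption: the real IR stores a DataFieldId
    ToString,
    CaptureNode(&'static str), // assumption: node text instead of a Node
}

/// Replays a well-formed effect stream; malformed streams panic,
/// mirroring the "bug in IR construction" policy.
fn execute(effects: &[Effect]) -> Option<Value> {
    let mut current: Option<Value> = None;
    let mut stack: Vec<Container> = Vec::new();
    for effect in effects {
        match effect {
            Effect::StartObject => stack.push(Container::Object(BTreeMap::new())),
            Effect::StartArray => stack.push(Container::Array(Vec::new())),
            Effect::CaptureNode(text) => current = Some(Value::Node(text.to_string())),
            Effect::ToString => {
                current = match current.take() {
                    Some(Value::Node(text)) => Some(Value::String(text)),
                    other => panic!("ToString on non-node: {:?}", other),
                };
            }
            Effect::Field(name) => {
                let value = current.take().expect("Field with no current value");
                match stack.last_mut() {
                    Some(Container::Object(map)) => {
                        map.insert(name.to_string(), value);
                    }
                    _ => panic!("Field outside an object"),
                }
            }
            Effect::PushElement => {
                let value = current.take().expect("PushElement with no current value");
                match stack.last_mut() {
                    Some(Container::Array(items)) => items.push(value),
                    _ => panic!("PushElement outside an array"),
                }
            }
            Effect::EndArray => match stack.pop() {
                Some(Container::Array(items)) => current = Some(Value::Array(items)),
                _ => panic!("EndArray without a matching StartArray"),
            },
            Effect::EndObject => match stack.pop() {
                Some(Container::Object(map)) => current = Some(Value::Object(map)),
                _ => panic!("EndObject without a matching StartObject"),
            },
        }
    }
    current
}
```

Feeding the thirteen effects of the `Func` example through `execute` leaves an object in `current` with `name` holding the captured node and `params` holding the two strings.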
- -Execution trace (key steps, second array element omitted): - -| RuntimeEffect | current | stack | -| ------------------- | ---------- | --------------- | -| Op(StartObject) | - | [{}] | -| CaptureNode(foo) | Node(foo) | [{}] | -| Op(Field("name")) | - | [{name: Node}] | -| Op(StartArray) | - | [{...}, []] | -| CaptureNode(a) | Node(a) | [{...}, []] | -| Op(ToString) | "a" | [{...}, []] | -| Op(PushElement) | - | [{...}, ["a"]] | -| _(repeat for "b")_ | ... | ... | -| Op(EndArray) | ["a", "b"] | [{...}] | -| Op(Field("params")) | - | [{..., params}] | -| Op(EndObject) | {...} | [] | - -Final result: - -```json -{ - "name": "", - "params": ["a", "b"] -} -``` - -### Backtracking - -Two mechanisms work together (same for both execution modes): - -1. **Cursor checkpoint**: `cursor.descendant_index()` returns a `usize` position; `cursor.goto_descendant(pos)` restores it. O(1) save, O(depth) restore, no allocation. - -2. **Effect watermark**: `effect_stream.watermark()` before attempting a branch; `effect_stream.rollback(watermark)` on failure. - -Both execution modes follow the same pattern: save state before attempting a branch; on failure, restore both cursor and effects before trying the next branch. This ensures each alternative starts from the same clean state. - -``` - -### Quantifiers - -Quantifiers compile to epsilon transitions with specific `next` ordering: - -**Greedy `*`** (zero or more): - -``` - -Entry ─ε→ [try match first, then exit] -↓ -Match ─ε→ loop back to Entry - -``` - -**Greedy `+`** (one or more): - -``` - - ┌──────────────────────────┐ - ↓ │ - -Entry ─→ Match ─ε→ Loop ─ε→ [try match first, then exit] - -``` - -The `+` quantifier differs from `*`: it enters directly at `Match`, requiring at least one successful match before the exit path becomes available. After the first match, the `Loop` node behaves like `*` (match-first, exit-second). 
- -**Non-greedy `*?`/`+?`**: - -Same structures as above, but with reversed `next` ordering: exit path has priority over match path. For `+?`, after the mandatory first match, the loop prefers exiting over matching more. - -### Arrays - -Array construction uses epsilon transitions with effects: - -``` - -T0: ε + StartArray next: [T1] // pre-effect: setup array -T1: ε (branch) next: [T2, T4] // try match or exit -T2: Match(expr) next: [T3] -T3: ε + PushElement next: [T1] // post-effect: consume matched node -T4: ε + EndArray next: [T5] // post-effect: finalize array -T5: ε + Field("items") next: [...] // post-effect: assign to field - -``` - -After epsilon elimination, `PushElement` from T3 merges into T2 as a post-effect. `StartArray` from T0 merges into T2 as a pre-effect (first iteration only—loop iterations enter from T3, not T0). - -Backtracking naturally handles partial arrays: truncating the effect stream removes uncommitted `PushElement` effects. - -### Scopes - -Nested objects from `{...} @name` use `StartObject`/`EndObject` effects: - -``` - -T0: ε + StartObject next: [T1] // pre-effect: setup object -T1: ... (sequence contents) next: [T2] -T2: ε + EndObject next: [T3] // post-effect: finalize object -T3: ε + Field("name") next: [...] // post-effect: assign to field - -``` - -`StartObject` is a pre-effect (merges forward). `EndObject` and `Field` are post-effects (merge backward onto preceding match). - -### Tagged Alternations - -Tagged branches use `StartVariant` to create explicit tagged structures. - -``` - -[ A: (true) ] - -``` - -Effect stream: - -``` - -StartVariant("A") -StartObject -... -EndObject -EndVariant - -```` - -The resulting `Value::Variant` preserves the tag distinct from the payload, preventing name collisions. 
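One way to picture why a separate tag avoids collisions is the sketch below. It is an assumption-heavy illustration: the `Variant { tag, payload }` layout, the merged `Effect` enum (with `Capture` standing in for the interpreter's implicit `CaptureNode`), the integer payload, and the separate tag stack are all hypothetical choices, not taken from the ADR.

```rust
use std::collections::BTreeMap;

// Hypothetical value model with an explicit tagged variant.
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Int(i64),
    Object(BTreeMap<String, Value>),
    Variant { tag: String, payload: Box<Value> },
}

// Toy effect set; `Capture` replaces the implicit CaptureNode event.
enum Effect {
    StartVariant(String),
    EndVariant,
    StartObject,
    EndObject,
    Field(String),
    Capture(i64),
}

/// Replays effects; the tag is kept apart from the payload's fields.
fn execute(effects: &[Effect]) -> Option<Value> {
    let mut current: Option<Value> = None;
    let mut objects: Vec<BTreeMap<String, Value>> = Vec::new();
    let mut tags: Vec<String> = Vec::new();
    for effect in effects {
        match effect {
            Effect::StartVariant(tag) => tags.push(tag.clone()),
            Effect::StartObject => objects.push(BTreeMap::new()),
            Effect::Capture(n) => current = Some(Value::Int(*n)),
            Effect::Field(name) => {
                let value = current.take()?;
                objects.last_mut()?.insert(name.clone(), value);
            }
            Effect::EndObject => current = Some(Value::Object(objects.pop()?)),
            Effect::EndVariant => {
                // Wrap whatever was just built under the pending tag.
                let payload = Box::new(current.take()?);
                current = Some(Value::Variant { tag: tags.pop()?, payload });
            }
        }
    }
    current
}
```

In this sketch a payload field that happens to be named `A` sits inside the payload object, so it cannot clash with a variant tagged `A`.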
- -**JSON serialization** always uses `$data` wrapper for uniformity: - -```json -{ "$tag": "A", "$data": { "x": 1, "y": 2 } } -{ "$tag": "B", "$data": [1, 2, 3] } -{ "$tag": "C", "$data": "foo" } -```` - -The `$tag` and `$data` keys avoid collisions with user-defined captures. Uniform structure simplifies parsing (always access `.$data`) and eliminates conditional flatten-vs-wrap logic. - -**Nested variants** (variant containing variant) serialize naturally: - -```json -{ "$tag": "Outer", "$data": { "$tag": "Inner", "$data": 42 } } -``` - -This mirrors Rust's serde adjacently-tagged representation and remains fully readable for LLMs. No query validation restriction—all payload types are valid. - -### Definition References and Recursion - -When a pattern references another definition (e.g., `(Expr)` inside `Binary`), the IR uses `RefId` to identify the call site. Each `Ref` node in the query AST gets a unique `RefId`, which is preserved through epsilon elimination. - -``` -Expr = [ (Num) (Binary) ] -Binary = (binary_expression - left: (Expr) // RefId = 0 - right: (Expr)) // RefId = 1 -``` - -The `RefId` is semantic identity—"which reference in the query pattern"—distinct from `TransitionId` which is structural identity—"which slot in the transition array." - -**Why RefId matters**: Epsilon elimination creates multiple transitions from a single reference. If a reference has 2 input epsilon paths and 3 output epsilon paths, elimination produces 2×3 = 6 transitions. All share the same `RefId` because they represent the same call site. The return stack uses `RefId` so that: - -- Entry can occur via any input path -- Exit can continue via any output path - -**Proc macro**: Each definition becomes a Rust function. References become function calls. Rust's call stack serves as the return stack—`RefId` is implicit in the call site. - -In proc-macro mode, each definition becomes a Rust function. 
References become direct function calls, with the Rust call stack serving as the implicit return stack. The `RefId` exists only in the IR—the generated code relies on Rust's natural call/return mechanism. - -**Dynamic**: The interpreter maintains an explicit return stack. On `Enter(ref_id)`: - -1. Push frame with `ref_id`, cursor checkpoint, effect stream watermark -2. Follow `next` into the definition body - -On `Exit(ref_id)`: - -1. Verify top frame matches `ref_id` (invariant: mismatched ref_id indicates IR bug) -2. Pop frame -3. Continue to `next` successors unconditionally - -**Entry filtering mechanism**: After epsilon elimination, multiple `Exit` transitions with different `RefId`s may be reachable from the same point (merged from different call sites). The interpreter only takes an `Exit(ref_id)` transition if `ref_id` matches the current stack top. This ensures returns go to the correct call site. - -After taking an `Exit` and popping the frame, successors are followed unconditionally—they represent the continuation after the call. If a successor has an `Enter` marker, that's a _new_ call (e.g., `(A) (B)` where returning from A continues to calling B), not a return path. - -```rust -/// Return stack entry for definition calls -struct Frame { - ref_id: RefId, // which call site we're inside - cursor_checkpoint: usize, // cursor position before call - effect_stream_watermark: usize, // effect count before call -} - -/// Runtime query executor -struct Interpreter<'a> { - graph: &'a TransitionGraph, - return_stack: Vec<Frame>, // call stack for definition references - cursor: TreeCursor<'a>, // current position in AST - effect_stream: EffectStream<'a>, // effect accumulator -} -``` - -### Epsilon Elimination (Optimization) - -After initial construction, epsilon transitions can be **partially** eliminated by computing epsilon closures.
Full elimination is not always possible due to the single `ref_marker` limitation—sequences like `Enter(A) → Enter(B)` cannot be merged into one transition. The `pre_effects`/`post_effects` split is essential for correctness here. - -**Why the split matters**: A match transition overwrites `current` with the matched node. Effects from _preceding_ epsilon transitions (like `PushElement`) need the _previous_ `current` value. Without the split, merging them into a single post-match list would use the wrong value. - -``` -Before (raw graph): -T1: Match(A) next: [T2] // current = A -T2: ε + PushElement next: [T3] // pushes A (correct) -T3: Match(B) next: [...] // current = B - -After elimination (with split): -T3': pre: [PushElement], Match(B), post: [] // PushElement runs before Match(B), pushes A ✓ - -Wrong (without split, effects merged as post): -T3': Match(B) + [PushElement] // PushElement runs after Match(B), pushes B ✗ -``` - -**Accumulation rules**: - -- `EffectOp`s from incoming epsilon paths → accumulate into `pre_effects` -- `EffectOp`s from outgoing epsilon paths → accumulate into `post_effects` - -This is why both are `Slice<EffectOp>` rather than `Option<EffectOp>`. - -**Reference expansion**: For definition references, epsilon elimination propagates `Enter`/`Exit` markers to surviving transitions: - -``` -Before: -T0: ε next: [T1] -T1: ε + Enter(0) next: [T2] // enter Expr -T2: ... (Expr body) ... next: [T3] -T3: ε + Exit(0) next: [T4] // exit Expr -T4: ε next: [T5] - -After: -T0': Match(...) + Enter(0) next: [T2'] // marker propagated -T3': Match(...) + Exit(0) next: [T5'] // marker propagated -``` - -All expanded entry transitions share the same `RefId`. All expanded exit transitions share the same `RefId`. The engine filters valid continuations at runtime based on stack state—no explicit continuation storage needed. - -**Limitation**: Complete epsilon elimination is impossible when reference markers chain (e.g., nested calls).
The single `ref_marker` slot prevents merging `Enter(A) → Enter(B)` sequences. These remain as epsilon transition chains in the final graph. - -This optimization benefits both modes: - -- **Proc macro**: Fewer transitions → less generated code (where elimination is possible) -- **Dynamic**: Fewer graph traversals → faster interpretation (but must handle remaining epsilons) - -### Proc Macro Code Generation - -When used as a proc macro, the transition graph is a compile-time artifact: - -1. Parses query source at compile time -2. Builds transition graph (Thompson-style construction) -3. Optionally eliminates epsilons -4. Generates Rust functions for each definition - -Generated code uses: - -- `if`/`else` chains for alternations -- `while` loops for quantifiers -- Direct function calls for definition references -- `TreeCursor` navigation methods -- `descendant_index()`/`goto_descendant()` for backtracking - -At runtime, there is no graph—just plain Rust code. - -#### Direct Construction (No Effect Stream) - -Unlike the dynamic interpreter, proc-macro generated code constructs output values directly—no intermediate effect stream. Output structs are built in a single pass as matching proceeds. - -Backtracking in direct construction means dropping partially-built values and re-allocating. This is acceptable because modern allocators maintain thread-local free lists, making the alloc→drop→alloc pattern for small objects essentially O(1). - -### Dynamic Execution - -When used dynamically, the transition graph is interpreted at runtime: - -1. Parses query source at runtime -2. Builds transition graph -3. Optionally eliminates epsilons (can be skipped for faster startup) -4. 
Interpreter walks the graph, executing transitions - -The interpreter maintains: - -- Current transition pointer -- Explicit return stack for definition calls -- Cursor position -- `RuntimeEffect` stream with watermarks - -Unlike proc-macro codegen, the dynamic interpreter uses the `RuntimeEffect` stream approach. This is necessary because: - -- We don't know the output structure at compile time -- `RuntimeEffect` stream provides a uniform way to build any output shape -- Backtracking via `truncate()` is simple and correct - -Trade-off: More flexible (runtime query construction), but slower than generated code due to interpretation overhead and the extra effect execution pass. - -## Execution Mode Comparison - -| Aspect | Proc Macro | Dynamic | -| ----------------- | -------------------------- | ----------------------------- | -| Query source | Compile-time literal | Runtime string | -| Graph lifetime | Compile-time only | Runtime | -| Data construction | Direct (no effect stream) | `RuntimeEffect` stream + exec | -| Definition calls | Rust function calls | Explicit return stack | -| Return stack | Rust call stack | `Vec<Frame>` | -| Backtracking | Drop + re-alloc | `truncate()` effects | -| Performance | Zero dispatch, single pass | Interpretation + 2 pass | -| Type safety | Compile-time checked | Runtime types | -| Use case | Known queries | User-provided queries | - -## Consequences - -### Positive - -- **Shared IR**: One representation serves both execution modes -- **Proc macro zero-overhead**: Generated code is plain Rust with no dispatch -- **Pre-allocated graph**: Single contiguous allocation -- **Dynamic flexibility**: Queries can be constructed or modified at runtime -- **Optimizable**: Epsilon elimination benefits both modes -- **Multiple entry points**: Same graph supports querying any definition -- **Clean separation**: `EffectOp` (static instructions) vs `RuntimeEffect` (dynamic events) eliminates lifetime issues - -### Negative - -- **Two code paths**: Must
maintain both codegen and interpreter -- **Different data construction**: Proc macro uses direct construction, dynamic uses `RuntimeEffect` stream -- **Proc macro compile cost**: Complex queries generate more code -- **Dynamic runtime cost**: Interpretation overhead + effect execution pass -- **Testing burden**: Must verify both modes produce identical results - -### Runtime Safety - -Both execution modes require fuel mechanisms to prevent runaway execution: - -- **runtime_fuel**: Decremented on each transition, prevents infinite loops -- **recursion_fuel**: Decremented on each `Enter` marker, prevents stack overflow - -These mechanisms deserve their own ADR (fuel budget design, configurable limits, error reporting on exhaustion). The IR itself carries no fuel-related data—fuel checking is purely an interpreter/codegen concern. - -**Note**: Static loop detection (e.g., direct recursion like `A = (A)` or mutual recursion like `A = (B)`, `B = (A)`) is handled at the query parser level before IR construction. The IR assumes well-formed input without infinite loops in the pattern structure itself. - -### WASM Compatibility - -The IR design is WASM-compatible: - -- **Single arena allocation**: No fragmentation concerns in linear memory. Note: WASM linear memory grows in 64KB pages; the arena coexists with other allocations (e.g., tree-sitter's memory) but this is standard for any WASM allocation. -- **Explicit alignment**: Arena allocated with `std::alloc::Layout`, segment offsets computed with `align_up()`. Prevents misaligned access traps on WASM and strict ARM. -- **`u32` offsets**: All segment offsets are `u32`, matching WASM32's pointer size. 4GB arena limit is sufficient for any query. -- **`BTreeMap` for objects**: Deterministic iteration order ensures reproducible output across platforms. -- **Fixed-size Entrypoints**: The `Entrypoint` struct (12 bytes, align 4) avoids variable-length inline strings that would cause alignment hazards. 
-- **No platform-specific primitives**: All types are portable (`u16`, `u32`, byte arrays). -- **Allocator Independence**: Uses `std::alloc::alloc` via `Layout`. On `wasm32-unknown-unknown`, this defaults to the system allocator. Implementers targeting other environments (e.g., Emscripten) must ensure a global allocator is configured. - -#### Serialization Format - -The arena uses a simple binary format for caching compiled queries to disk. The current scope is limited to same-machine, same-version usage (e.g., caching a compiled query between CLI invocations). Cross-architecture portability and version migration are explicitly out of scope for this ADR and will be addressed in future work if needed. - -- **Validation**: The `magic` bytes must be `b"PLNK"`. The `version` field must match the exact compiler version AND platform ABI hash (pointer width + endianness). Any mismatch invalidates the cache. -- **Byte order**: Native (little-endian on x86/ARM/WASM). No byte-swapping is performed. -- **String encoding**: UTF-8 for all string data (entrypoint names, data field names, variant tags). -- **Layout**: Header followed by raw arena bytes: - -``` -Header (16 bytes): - magic: [u8; 4] // "PLNK" - version: u32 // format version (must match exactly) - arena_len: u32 // byte length of arena data - segment_count: u32 // number of segment offset entries - -Segment Offsets (segment_count × 4 bytes): - [u32; segment_count] // successors_offset, effects_offset, ... - -Arena Data (arena_len bytes): - [u8; arena_len] // raw arena bytes, used directly without fixup -``` - -**Loading**: The loader verifies magic, version, and arena length. If any mismatch occurs, the cache is invalidated and the query is recompiled. No byte-swapping or layout fixup is performed—mismatched architectures simply trigger recompilation. - -### Considered Alternatives - -1. **Proc macro only** - - Rejected: Need runtime query support for tooling and user-defined queries - -2. 
**Dynamic only** - - Rejected: Unacceptable performance overhead for known queries - -3. **Separate IRs for each mode** - - Rejected: Duplication; harder to ensure semantic equivalence - -4. **State-centric graph representation** - - Rejected: States carry no semantic weight; edge-centric is simpler - -5. **Vectorized Reference Markers (`Vec`)** - - Rejected: Optimized for alias chains (e.g. `A = B`, `B = C`) to allow full epsilon elimination. However, this bloats the `Transition` struct for all other cases. Standard epsilon elimination is sufficient; traversing a few remaining epsilon transitions for aliases is cheaper than increasing memory pressure on the whole graph. - -6. **Portable binary format** - - Deferred: Cross-architecture serialization would require byte-swapping and layout fixups. Current scope is same-machine caching only; portability can be added later if needed. - -## References - -- Bazaco, D. (2022). "Building a Regex Engine" blog series. https://www.abstractsyntaxseed.com/blog/regex-engine/introduction — NFA construction and modern regex features -- Tree-sitter TreeCursor API: `descendant_index()`, `goto_descendant()` -- [ADR-0001: Query Parser](ADR-0001-query-parser.md) diff --git a/docs/adr/ADR-0004-query-ir-binary-format.md b/docs/adr/ADR-0004-query-ir-binary-format.md index 01f84416..d6e3da95 100644 --- a/docs/adr/ADR-0004-query-ir-binary-format.md +++ b/docs/adr/ADR-0004-query-ir-binary-format.md @@ -2,7 +2,7 @@ - **Status**: Accepted - **Date**: 2025-12-12 -- **Supersedes**: Parts of [ADR-0003](ADR-0003-query-intermediate-representation.md) +- **Supersedes**: Parts of ADR-0003 ## Context diff --git a/docs/adr/ADR-0005-transition-graph-format.md b/docs/adr/ADR-0005-transition-graph-format.md index d03f02a0..8da2807b 100644 --- a/docs/adr/ADR-0005-transition-graph-format.md +++ b/docs/adr/ADR-0005-transition-graph-format.md @@ -2,7 +2,7 @@ - **Status**: Accepted - **Date**: 2025-12-12 -- **Supersedes**: Parts of 
[ADR-0003](ADR-0003-query-intermediate-representation.md) +- **Supersedes**: Parts of ADR-0003 ## Context diff --git a/docs/adr/ADR-0006-dynamic-query-execution.md b/docs/adr/ADR-0006-dynamic-query-execution.md index 1686133b..160deaca 100644 --- a/docs/adr/ADR-0006-dynamic-query-execution.md +++ b/docs/adr/ADR-0006-dynamic-query-execution.md @@ -2,7 +2,7 @@ - **Status**: Accepted - **Date**: 2025-12-12 -- **Supersedes**: Parts of [ADR-0003](ADR-0003-query-intermediate-representation.md) +- **Supersedes**: Parts of ADR-0003 ## Context