From b20c80083007ac4bd96dbe2511c47e2597ad59f2 Mon Sep 17 00:00:00 2001 From: Sergei Zharinov Date: Wed, 10 Dec 2025 21:32:15 -0300 Subject: [PATCH 1/7] docs: Small fixes for ADR-0003 --- AGENTS.md | 4 ++ ...-0003-query-intermediate-representation.md | 47 ++++++++++++------- 2 files changed, 35 insertions(+), 16 deletions(-) diff --git a/AGENTS.md b/AGENTS.md index 92b0b5d9..de3f0bcb 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -13,6 +13,10 @@ - **Location**: `docs/adr/` - **Naming**: `ADR-XXXX-short-title-in-kebab-case.md` (`XXXX` is a sequential number). +- **Index**: + - [ADR-0001: Query Parser](docs/adr/ADR-0001-query-parser.md) + - [ADR-0002: Diagnostics System](docs/adr/ADR-0002-diagnostics-system.md) + - [ADR-0003: Query Intermediate Representation](docs/adr/ADR-0003-query-intermediate-representation.md) - **Template**: ```markdown diff --git a/docs/adr/ADR-0003-query-intermediate-representation.md b/docs/adr/ADR-0003-query-intermediate-representation.md index 284c6b75..cb4cf898 100644 --- a/docs/adr/ADR-0003-query-intermediate-representation.md +++ b/docs/adr/ADR-0003-query-intermediate-representation.md @@ -50,14 +50,16 @@ These structures are used by both execution modes. 
```rust struct TransitionGraph { transitions: Vec, - data_fields: Vec, // DataFieldId → data field + data_fields: Vec, // DataFieldId → field name + variant_tags: Vec, // VariantTagId → tag name entrypoints: Vec<(String, TransitionId)>, default_entrypoint: TransitionId, } -type TransitionId = usize; // position in transitions array (structural) -type DataFieldId = usize; // index into FieldNames -type RefId = usize; // unique per each named subquery reference (Ref node in the query AST) +type TransitionId = usize; // position in transitions array (structural) +type DataFieldId = usize; // index into data_fields +type VariantTagId = usize; // index into variant_tags +type RefId = usize; // unique per each named subquery reference (Ref node in the query AST) ``` Each named definition has an entry point. The default entry is the last definition. Multiple entry points share the same transition graph. @@ -109,15 +111,15 @@ Navigation variants `Down`/`Up` move the cursor without matching. They enable ne ```rust enum Effect { - StartArray, // push new [] onto container stack - PushElement, // move current value into top array - EndArray, // pop array from stack, becomes current - StartObject, // push new {} onto container stack - EndObject, // pop object from stack, becomes current + StartArray, // push new [] onto container stack + PushElement, // move current value into top array + EndArray, // pop array from stack, becomes current + StartObject, // push new {} onto container stack + EndObject, // pop object from stack, becomes current Field(DataFieldId), // move current value into field on top object - StartVariant(DataFieldId), // push new variant (tagged) onto container stack - EndVariant, // pop variant from stack, becomes current - ToString, // convert current Node value to String (source text) + StartVariant(VariantTagId), // push variant tag onto container stack + EndVariant, // pop variant from stack, wrap current, becomes current + ToString, // convert current 
Node value to String (source text) } ``` @@ -173,13 +175,13 @@ enum Value<'a> { String(String), // Text values (from @capture :: string) Array(Vec>), // completed array Object(HashMap>), // completed object - Variant(DataFieldId, Box>), // tagged variant (tag + payload) + Variant(VariantTagId, Box>), // tagged variant (tag + payload) } enum Container<'a> { Array(Vec>), // array under construction - Object(HashMap>), // object under construction - Variant(DataFieldId, Box>), // variant under construction + Object(HashMap>), // object under construction + Variant(VariantTagId), // variant tag; EndVariant wraps current value } ``` @@ -201,7 +203,7 @@ Query: ``` Func = (function_declaration name: (identifier) @name - parameters: (parameters (identifier)* @params)) + parameters: (parameters (identifier)* @params :: string)) ``` Input: `function foo(a, b) {}` @@ -233,6 +235,7 @@ Execution trace: | Field("name") | - | [{name: Node(foo)}] | | StartArray | - | [{name:...}, []] | | (match "a") | Node(a) | [{name:...}, []] | +| ToString | String("a") | [{name:...}, []] | | PushElement | - | [{name:...}, [String("a")]] | | (match "b") | Node(b) | [{name:...}, [String("a")]] | | ToString | String("b") | [{name:...}, [String("a")]] | @@ -342,6 +345,18 @@ EndVariant The resulting `Value::Variant` preserves the tag distinct from the payload, preventing name collisions. When serialized to JSON, it flattens to match the documented data model: `{ tag: "A", ...payload }`. +**Constraint: branches must produce objects.** Top-level quantifiers in tagged branches are disallowed: + +``` +// Invalid: branch A has top-level quantifier, produces array not object +[A: (foo (bar) @x)* B: (baz) @y] + +// Valid: wrap quantifier in a sequence with capture +[A: { (foo (bar) @x)* } @items B: (baz) @y] +``` + +Flattening requires object payloads (`{ tag: "A", ...payload }`). Arrays cannot be spread into objects. 
This constraint is enforced during query validation; the diagnostic suggests wrapping with `{ ... } @name`. + ### Definition References and Recursion When a pattern references another definition (e.g., `(Expr)` inside `Binary`), the IR uses `RefId` to identify the call site. Each `Ref` node in the query AST gets a unique `RefId`, which is preserved through epsilon elimination. From cc1bc89617a401e7d4382a3940970ca8c44e59d6 Mon Sep 17 00:00:00 2001 From: Sergei Zharinov Date: Wed, 10 Dec 2025 21:52:12 -0300 Subject: [PATCH 2/7] docs: Separate pre_effects and post_effects in ADR-0003 --- ...-0003-query-intermediate-representation.md | 97 +++++++++++-------- 1 file changed, 57 insertions(+), 40 deletions(-) diff --git a/docs/adr/ADR-0003-query-intermediate-representation.md b/docs/adr/ADR-0003-query-intermediate-representation.md index cb4cf898..b21abda5 100644 --- a/docs/adr/ADR-0003-query-intermediate-representation.md +++ b/docs/adr/ADR-0003-query-intermediate-representation.md @@ -68,12 +68,13 @@ Each named definition has an entry point. 
The default entry is the last definiti ```rust struct Transition { - matcher: Option, // None = epsilon (no node consumed) - pre_anchored: bool, // must match at current position, no scanning - post_anchored: bool, // after match, cursor must be at last sibling - effects: Vec, // data construction ops emitted on success + matcher: Option, // None = epsilon (no node consumed) + pre_anchored: bool, // must match at current position, no scanning + post_anchored: bool, // after match, cursor must be at last sibling + pre_effects: Vec, // effects before match (consume previous current) + post_effects: Vec, // effects after match (consume new current) ref_marker: Option, // call boundary marker - next: Vec, // successors; order = priority (first = greedy) + next: Vec, // successors; order = priority (first = greedy) } enum RefTransition { @@ -189,12 +190,13 @@ enum Container<'a> { For any given transition, the execution order is strict to ensure data consistency during backtracking: -1. **Match**: Validate node kind/fields. If fail, abort. -2. **Enter**: Push `Frame` with current `builder.watermark()`. -3. **Effects**: Emit new effects (committed tentatively). -4. **Exit**: Pop `Frame` (validate return). +1. **Enter**: Push `Frame` with current `builder.watermark()`. +2. **Pre-Effects**: Emit `pre_effects` (uses previous `current` value). +3. **Match**: Validate node kind/fields. If fail, rollback to watermark and abort. +4. **Post-Effects**: Emit `post_effects` (uses new `current` value). +5. **Exit**: Pop `Frame` (validate return). -This order ensures that if a definition call succeeds, its effects are present. If it fails later, the watermark saved during `Enter` allows rolling back all effects emitted by that definition. +This order ensures correct behavior during epsilon elimination. Pre-effects run before the match overwrites `current`, allowing effects like `PushElement` to be safely merged from preceding epsilon transitions. 
Post-effects run after, for effects that need the newly matched node. #### Example @@ -208,24 +210,26 @@ Func = (function_declaration Input: `function foo(a, b) {}` -Effect stream: +Effect stream (annotated with pre/post classification): ``` -StartObject - (match "foo") - Field("name") - StartArray - (match "a") - ToString - PushElement - (match "b") - ToString - PushElement - EndArray - Field("params") -EndObject +pre: StartObject + (match "foo") +post: Field("name") +pre: StartArray + (match "a") +post: ToString +post: PushElement + (match "b") +post: ToString +post: PushElement +post: EndArray +post: Field("params") +post: EndObject ``` +Note: In the raw graph, effects live on epsilon transitions between matches. The pre/post classification determines where they land after epsilon elimination. `StartObject` and `StartArray` are pre-effects (setup before matching). `Field`, `PushElement`, `ToString`, and `End*` are post-effects (consume the matched node or finalize containers). + Execution trace: | Effect | current | stack | @@ -304,14 +308,16 @@ Same structure, different `next` order. The first successor has priority. Array construction uses epsilon transitions with effects: ``` -T0: ε + StartArray next: [T1] -T1: ε (branch) next: [T2, T5] // try match or exit +T0: ε + StartArray next: [T1] // pre-effect: setup array +T1: ε (branch) next: [T2, T4] // try match or exit T2: Match(expr) next: [T3] -T3: ε + PushElement next: [T1] // loop back -T4: ε + EndArray next: [T5] -T5: ε + Field("items") next: [...] +T3: ε + PushElement next: [T1] // post-effect: consume matched node +T4: ε + EndArray next: [T5] // post-effect: finalize array +T5: ε + Field("items") next: [...] // post-effect: assign to field ``` +After epsilon elimination, `PushElement` from T3 merges into T2 as a post-effect. `StartArray` from T0 merges into T2 as a pre-effect (first iteration only—loop iterations enter from T3, not T0). 
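The ordering described above (pre-effects, then match, then post-effects) can be illustrated with a minimal sketch of two loop iterations after epsilon elimination. This is not the ADR's implementation: `Builder`, the simplified `Transition`, `run`, and `array_loop` are hypothetical names introduced only for illustration, and matching is assumed to always succeed.

```rust
// Illustrative sketch only: these types are hypothetical simplifications,
// not the ADR's actual structures. Matching is assumed to always succeed.

#[derive(Clone, Copy)]
enum Effect {
    StartArray,
    PushElement,
    EndArray,
}

struct Transition {
    pre_effects: Vec<Effect>,
    // Text of the node this transition is assumed to match; None = epsilon.
    matcher: Option<&'static str>,
    post_effects: Vec<Effect>,
}

#[derive(Default)]
struct Builder {
    current: Option<String>,
    stack: Vec<Vec<String>>,
    finished: Vec<Vec<String>>,
}

impl Builder {
    fn apply(&mut self, effect: Effect) {
        match effect {
            Effect::StartArray => self.stack.push(Vec::new()),
            Effect::PushElement => {
                // Moves `current` into the array on top of the stack.
                if let Some(value) = self.current.take() {
                    if let Some(top) = self.stack.last_mut() {
                        top.push(value);
                    }
                }
            }
            Effect::EndArray => {
                if let Some(array) = self.stack.pop() {
                    self.finished.push(array);
                }
            }
        }
    }
}

// Executes each transition in the order: pre-effects, match, post-effects.
fn run(transitions: &[Transition]) -> Vec<Vec<String>> {
    let mut builder = Builder::default();
    for t in transitions {
        for &e in &t.pre_effects {
            builder.apply(e);
        }
        if let Some(text) = t.matcher {
            // A successful match overwrites `current`.
            builder.current = Some(text.to_string());
        }
        for &e in &t.post_effects {
            builder.apply(e);
        }
    }
    builder.finished
}

// Two iterations of the array loop: StartArray lands as a pre-effect on the
// first match only, PushElement as a post-effect on every match.
fn array_loop() -> Vec<Transition> {
    vec![
        Transition {
            pre_effects: vec![Effect::StartArray],
            matcher: Some("a"),
            post_effects: vec![Effect::PushElement],
        },
        Transition {
            pre_effects: vec![],
            matcher: Some("b"),
            post_effects: vec![Effect::PushElement, Effect::EndArray],
        },
    ]
}

fn main() {
    println!("{:?}", run(&array_loop()));
}
```

If the first `PushElement` were instead deferred until after the second match, `current` would already hold `"b"` and `"a"` would be lost, which is the situation the pre/post split is meant to prevent.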
+ Backtracking naturally handles partial arrays: truncating the effect stream removes uncommitted `PushElement` effects. ### Scopes @@ -319,12 +325,14 @@ Backtracking naturally handles partial arrays: truncating the effect stream remo Nested objects from `{...} @name` use `StartObject`/`EndObject` effects: ``` -T0: ε + StartObject next: [T1] +T0: ε + StartObject next: [T1] // pre-effect: setup object T1: ... (sequence contents) next: [T2] -T2: ε + EndObject next: [T3] -T3: ε + Field("name") next: [...] +T2: ε + EndObject next: [T3] // post-effect: finalize object +T3: ε + Field("name") next: [...] // post-effect: assign to field ``` +`StartObject` is a pre-effect (merges forward). `EndObject` and `Field` are post-effects (merge backward onto preceding match). + ### Tagged Alternations Tagged branches use `StartVariant` to create explicit tagged structures. @@ -420,19 +428,28 @@ struct Interpreter<'a> { ### Epsilon Elimination (Optimization) -After initial construction, epsilon transitions can be eliminated by computing epsilon closures: +After initial construction, epsilon transitions can be eliminated by computing epsilon closures. The `pre_effects`/`post_effects` split is essential for correctness here. + +**Why the split matters**: A match transition overwrites `current` with the matched node. Effects from *preceding* epsilon transitions (like `PushElement`) need the *previous* `current` value. Without the split, merging them into a single post-match list would use the wrong value. ``` -Before: -T0: ε + StartArray next: [T1] -T1: ε + Field next: [T2] -T2: Match(kind) next: [T3] +Before (raw graph): +T1: Match(A) next: [T2] // current = A +T2: ε + PushElement next: [T3] // pushes A (correct) +T3: Match(B) next: [...] 
// current = B -After: -T0': Match(kind) + [StartArray, Field] next: [T3'] +After elimination (with split): +T3': pre: [PushElement], Match(B), post: [] // PushElement runs before Match(B), pushes A ✓ + +Wrong (without split, effects merged as post): +T3': Match(B) + [PushElement] // PushElement runs after Match(B), pushes B ✗ ``` -Effects from eliminated epsilons accumulate on the surviving match transition. This is why `effects` is `Vec` rather than `Option`. +**Accumulation rules**: +- Effects from incoming epsilon paths → accumulate into `pre_effects` +- Effects from outgoing epsilon paths → accumulate into `post_effects` + +This is why both are `Vec` rather than `Option`. **Reference expansion**: For definition references, epsilon elimination propagates `Enter`/`Exit` markers to surviving transitions: From 0049291ca02999796a58bccc022f777866e77193 Mon Sep 17 00:00:00 2001 From: Sergei Zharinov Date: Wed, 10 Dec 2025 21:53:56 -0300 Subject: [PATCH 3/7] fix --- docs/adr/ADR-0003-query-intermediate-representation.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/docs/adr/ADR-0003-query-intermediate-representation.md b/docs/adr/ADR-0003-query-intermediate-representation.md index b21abda5..54c4edad 100644 --- a/docs/adr/ADR-0003-query-intermediate-representation.md +++ b/docs/adr/ADR-0003-query-intermediate-representation.md @@ -430,7 +430,7 @@ struct Interpreter<'a> { After initial construction, epsilon transitions can be eliminated by computing epsilon closures. The `pre_effects`/`post_effects` split is essential for correctness here. -**Why the split matters**: A match transition overwrites `current` with the matched node. Effects from *preceding* epsilon transitions (like `PushElement`) need the *previous* `current` value. Without the split, merging them into a single post-match list would use the wrong value. +**Why the split matters**: A match transition overwrites `current` with the matched node. 
Effects from _preceding_ epsilon transitions (like `PushElement`) need the _previous_ `current` value. Without the split, merging them into a single post-match list would use the wrong value. ``` Before (raw graph): @@ -446,6 +446,7 @@ T3': Match(B) + [PushElement] // PushElement runs after Match( ``` **Accumulation rules**: + - Effects from incoming epsilon paths → accumulate into `pre_effects` - Effects from outgoing epsilon paths → accumulate into `post_effects` From 9e87cb3e0774718d373ec80af76b4b0aea510f23 Mon Sep 17 00:00:00 2001 From: Sergei Zharinov Date: Wed, 10 Dec 2025 22:12:38 -0300 Subject: [PATCH 4/7] fixes --- README.md | 10 ++++----- docs/REFERENCE.md | 20 ++++++++--------- ...-0003-query-intermediate-representation.md | 22 +++++++++++-------- 3 files changed, 28 insertions(+), 24 deletions(-) diff --git a/README.md b/README.md index da6a4173..0ccb9106 100644 --- a/README.md +++ b/README.md @@ -161,12 +161,12 @@ This produces: ```typescript type Statement = - | { tag: "Assign"; target: string; value: Expression } - | { tag: "Call"; func: string; args: Expression[] }; + | { $tag: "Assign"; target: string; value: Expression } + | { $tag: "Call"; func: string; args: Expression[] }; type Expression = - | { tag: "Ident"; name: string } - | { tag: "Num"; value: string }; + | { $tag: "Ident"; name: string } + | { $tag: "Num"; value: string }; type TopDefinitions = { statements: [Statement, ...Statement[]]; @@ -177,7 +177,7 @@ Then process the results: ```typescript for (const stmt of result.statements) { - switch (stmt.tag) { + switch (stmt.$tag) { case "Assign": console.log(`Assignment to ${stmt.target}`); break; diff --git a/docs/REFERENCE.md b/docs/REFERENCE.md index da3c05a4..e1bc5ab2 100644 --- a/docs/REFERENCE.md +++ b/docs/REFERENCE.md @@ -589,10 +589,10 @@ Labels create a discriminated union: ] @stmt :: Stmt ``` -Output type (discriminant is always `tag`): +Output type (discriminant is always `$tag`): ```typescript -type Stmt = { tag: "Assign"; 
left: Node } | { tag: "Call"; func: Node }; +type Stmt = { $tag: "Assign"; left: Node } | { $tag: "Call"; func: Node }; ``` In Rust, tagged alternations become enums: @@ -754,8 +754,8 @@ Output type: ```typescript type MemberChain = - | { tag: "Base"; name: Node } - | { tag: "Access"; object: MemberChain; property: Node }; + | { $tag: "Base"; name: Node } + | { $tag: "Access"; object: MemberChain; property: Node }; ``` --- @@ -787,14 +787,14 @@ Output types: ```typescript type Statement = - | { tag: "Assign"; target: string; value: Expression } - | { tag: "Call"; func: string; args: Expression[] } - | { tag: "Return"; value?: Expression }; + | { $tag: "Assign"; target: string; value: Expression } + | { $tag: "Call"; func: string; args: Expression[] } + | { $tag: "Return"; value?: Expression }; type Expression = - | { tag: "Ident"; name: string } - | { tag: "Num"; value: string } - | { tag: "Str"; value: string }; + | { $tag: "Ident"; name: string } + | { $tag: "Num"; value: string } + | { $tag: "Str"; value: string }; type Root = { statements: [Statement, ...Statement[]]; diff --git a/docs/adr/ADR-0003-query-intermediate-representation.md b/docs/adr/ADR-0003-query-intermediate-representation.md index 54c4edad..9a3d3633 100644 --- a/docs/adr/ADR-0003-query-intermediate-representation.md +++ b/docs/adr/ADR-0003-query-intermediate-representation.md @@ -351,19 +351,23 @@ EndObject EndVariant ``` -The resulting `Value::Variant` preserves the tag distinct from the payload, preventing name collisions. When serialized to JSON, it flattens to match the documented data model: `{ tag: "A", ...payload }`. +The resulting `Value::Variant` preserves the tag distinct from the payload, preventing name collisions. 
-**Constraint: branches must produce objects.** Top-level quantifiers in tagged branches are disallowed: +**JSON serialization** depends on payload type: -``` -// Invalid: branch A has top-level quantifier, produces array not object -[A: (foo (bar) @x)* B: (baz) @y] +- **Object payload**: Flatten fields into the tagged object. + ```json + { "$tag": "A", "x": 1, "y": 2 } + ``` +- **Array/Primitive payload**: Wrap in a `content` field. + ```json + { "$tag": "A", "content": [1, 2, 3] } + { "$tag": "B", "content": "foo" } + ``` -// Valid: wrap quantifier in a sequence with capture -[A: { (foo (bar) @x)* } @items B: (baz) @y] -``` +The `$tag` key avoids collisions with user-defined `@tag` captures. -Flattening requires object payloads (`{ tag: "A", ...payload }`). Arrays cannot be spread into objects. This constraint is enforced during query validation; the diagnostic suggests wrapping with `{ ... } @name`. +This mirrors Rust's serde adjacently-tagged representation and remains fully readable for LLMs. No query validation restriction—all payload types are valid. ### Definition References and Recursion From d68691142fca65de3c40db8fc18e047edb0e1080 Mon Sep 17 00:00:00 2001 From: Sergei Zharinov Date: Wed, 10 Dec 2025 22:20:42 -0300 Subject: [PATCH 5/7] Update ADR-0003 with revised tagged variant representation Add `$data` field for array/primitive payloads instead of `content` --- docs/adr/ADR-0003-query-intermediate-representation.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/docs/adr/ADR-0003-query-intermediate-representation.md b/docs/adr/ADR-0003-query-intermediate-representation.md index 9a3d3633..4d96f868 100644 --- a/docs/adr/ADR-0003-query-intermediate-representation.md +++ b/docs/adr/ADR-0003-query-intermediate-representation.md @@ -359,13 +359,13 @@ The resulting `Value::Variant` preserves the tag distinct from the payload, prev ```json { "$tag": "A", "x": 1, "y": 2 } ``` -- **Array/Primitive payload**: Wrap in a `content` field. 
+- **Array/Primitive payload**: Wrap in a `$data` field. ```json - { "$tag": "A", "content": [1, 2, 3] } - { "$tag": "B", "content": "foo" } + { "$tag": "A", "$data": [1, 2, 3] } + { "$tag": "B", "$data": "foo" } ``` -The `$tag` key avoids collisions with user-defined `@tag` captures. +The `$tag` and `$data` keys avoid collisions with user-defined captures. This mirrors Rust's serde adjacently-tagged representation and remains fully readable for LLMs. No query validation restriction—all payload types are valid. From d45307c957549b8880982508027c7003babb9028 Mon Sep 17 00:00:00 2001 From: Sergei Zharinov Date: Wed, 10 Dec 2025 22:28:51 -0300 Subject: [PATCH 6/7] Update documentation for tagged alternation design --- README.md | 14 +++++------ docs/REFERENCE.md | 25 +++++++++++-------- ...-0003-query-intermediate-representation.md | 22 +++++++--------- 3 files changed, 31 insertions(+), 30 deletions(-) diff --git a/README.md b/README.md index 0ccb9106..88443a46 100644 --- a/README.md +++ b/README.md @@ -126,7 +126,7 @@ Plotnik extends Tree-sitter's query syntax with: - **Named expressions** for composition and reuse - **Recursion** for arbitrarily nested structures - **Type annotations** for precise output shapes -- **Tagged alternations** for discriminated unions +- **Alternations**: untagged for simplicity, tagged for precision (discriminated unions) ## Use cases @@ -161,12 +161,12 @@ This produces: ```typescript type Statement = - | { $tag: "Assign"; target: string; value: Expression } - | { $tag: "Call"; func: string; args: Expression[] }; + | { $tag: "Assign"; $data: { target: string; value: Expression } } + | { $tag: "Call"; $data: { func: string; args: Expression[] } }; type Expression = - | { $tag: "Ident"; name: string } - | { $tag: "Num"; value: string }; + | { $tag: "Ident"; $data: { name: string } } + | { $tag: "Num"; $data: { value: string } }; type TopDefinitions = { statements: [Statement, ...Statement[]]; @@ -179,10 +179,10 @@ Then process the results: 
for (const stmt of result.statements) { switch (stmt.$tag) { case "Assign": - console.log(`Assignment to ${stmt.target}`); + console.log(`Assignment to ${stmt.$data.target}`); break; case "Call": - console.log(`Call to ${stmt.func} with ${stmt.args.length} args`); + console.log(`Call to ${stmt.$data.func} with ${stmt.$data.args.length} args`); break; } } diff --git a/docs/REFERENCE.md b/docs/REFERENCE.md index e1bc5ab2..47ad6719 100644 --- a/docs/REFERENCE.md +++ b/docs/REFERENCE.md @@ -492,6 +492,9 @@ interface Section { Match one of several alternatives with `[...]`: +- **Untagged** (no labels): Simpler output, fields merge. Use when you only need the captured data. +- **Tagged** (with labels): Precise discriminated union. Use when you need to know which branch matched. + ``` [ (identifier) @@ -589,10 +592,12 @@ Labels create a discriminated union: ] @stmt :: Stmt ``` -Output type (discriminant is always `$tag`): +Output type (discriminant is always `$tag`, payload in `$data`): ```typescript -type Stmt = { $tag: "Assign"; left: Node } | { $tag: "Call"; func: Node }; +type Stmt = + | { $tag: "Assign"; $data: { left: Node } } + | { $tag: "Call"; $data: { func: Node } }; ``` In Rust, tagged alternations become enums: @@ -754,8 +759,8 @@ Output type: ```typescript type MemberChain = - | { $tag: "Base"; name: Node } - | { $tag: "Access"; object: MemberChain; property: Node }; + | { $tag: "Base"; $data: { name: Node } } + | { $tag: "Access"; $data: { object: MemberChain; property: Node } }; ``` --- @@ -787,14 +792,14 @@ Output types: ```typescript type Statement = - | { $tag: "Assign"; target: string; value: Expression } - | { $tag: "Call"; func: string; args: Expression[] } - | { $tag: "Return"; value?: Expression }; + | { $tag: "Assign"; $data: { target: string; value: Expression } } + | { $tag: "Call"; $data: { func: string; args: Expression[] } } + | { $tag: "Return"; $data: { value?: Expression } }; type Expression = - | { $tag: "Ident"; name: string } - | { $tag: 
"Num"; value: string } - | { $tag: "Str"; value: string }; + | { $tag: "Ident"; $data: { name: string } } + | { $tag: "Num"; $data: { value: string } } + | { $tag: "Str"; $data: { value: string } }; type Root = { statements: [Statement, ...Statement[]]; diff --git a/docs/adr/ADR-0003-query-intermediate-representation.md b/docs/adr/ADR-0003-query-intermediate-representation.md index 4d96f868..5e16db7d 100644 --- a/docs/adr/ADR-0003-query-intermediate-representation.md +++ b/docs/adr/ADR-0003-query-intermediate-representation.md @@ -353,19 +353,15 @@ EndVariant The resulting `Value::Variant` preserves the tag distinct from the payload, preventing name collisions. -**JSON serialization** depends on payload type: - -- **Object payload**: Flatten fields into the tagged object. - ```json - { "$tag": "A", "x": 1, "y": 2 } - ``` -- **Array/Primitive payload**: Wrap in a `$data` field. - ```json - { "$tag": "A", "$data": [1, 2, 3] } - { "$tag": "B", "$data": "foo" } - ``` - -The `$tag` and `$data` keys avoid collisions with user-defined captures. +**JSON serialization** always uses `$data` wrapper for uniformity: + +```json +{ "$tag": "A", "$data": { "x": 1, "y": 2 } } +{ "$tag": "B", "$data": [1, 2, 3] } +{ "$tag": "C", "$data": "foo" } +``` + +The `$tag` and `$data` keys avoid collisions with user-defined captures. Uniform structure simplifies parsing (always access `.$data`) and eliminates conditional flatten-vs-wrap logic. This mirrors Rust's serde adjacently-tagged representation and remains fully readable for LLMs. No query validation restriction—all payload types are valid. 
From 9b8aed514172b461a68e4753085e5bb292ab3894 Mon Sep 17 00:00:00 2001 From: Sergei Zharinov Date: Thu, 11 Dec 2025 10:46:41 -0300 Subject: [PATCH 7/7] docs: Update ADR-0003 --- AGENTS.md | 2 +- README.md | 4 +- ...-0003-query-intermediate-representation.md | 525 ++++++++++++------ 3 files changed, 368 insertions(+), 163 deletions(-) diff --git a/AGENTS.md b/AGENTS.md index de3f0bcb..053e77b4 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -133,7 +133,7 @@ Boolean = [ 5. **Alternations** - Tagged: `[ L1: (a) @x L2: (b) @y ]` - → Discriminated Union: `{ tag: "L1", x: Node } | { tag: "L2", y: Node }`. + → Discriminated Union: `{ "$tag": "L1", "$data": { x: Node } } | { "$tag": "L2", "$data": { y: Node } }`. - Untagged: `[ (a) @x (b) @x ]` → Merged Struct: `{ x: Node }`. Captures must be type-compatible across branches. - Mixed: `[ (a) @x (b) ]` (invalid) - the diagnostics will be reported, but we infer as for untagged diff --git a/README.md b/README.md index 88443a46..23a577a3 100644 --- a/README.md +++ b/README.md @@ -182,7 +182,9 @@ for (const stmt of result.statements) { console.log(`Assignment to ${stmt.$data.target}`); break; case "Call": - console.log(`Call to ${stmt.$data.func} with ${stmt.$data.args.length} args`); + console.log( + `Call to ${stmt.$data.func} with ${stmt.$data.args.length} args`, + ); break; } } diff --git a/docs/adr/ADR-0003-query-intermediate-representation.md b/docs/adr/ADR-0003-query-intermediate-representation.md index 5e16db7d..cd2d2f16 100644 --- a/docs/adr/ADR-0003-query-intermediate-representation.md +++ b/docs/adr/ADR-0003-query-intermediate-representation.md @@ -47,71 +47,206 @@ These structures are used by both execution modes. #### Transition Graph Container +The graph is immutable after construction. We use a single contiguous allocation sliced into typed segments with proper alignment handling. 
+ ```rust struct TransitionGraph { - transitions: Vec, - data_fields: Vec, // DataFieldId → field name - variant_tags: Vec, // VariantTagId → tag name - entrypoints: Vec<(String, TransitionId)>, + data: Box<[u8]>, + // segment offsets (aligned for each type) + successors_offset: usize, + effects_offset: usize, + negated_fields_offset: usize, + data_fields_offset: usize, + variant_tags_offset: usize, + entrypoints_offset: usize, default_entrypoint: TransitionId, } -type TransitionId = usize; // position in transitions array (structural) -type DataFieldId = usize; // index into data_fields -type VariantTagId = usize; // index into variant_tags -type RefId = usize; // unique per each named subquery reference (Ref node in the query AST) +impl TransitionGraph { + fn new() -> Self; + fn get(&self, id: TransitionId) -> TransitionView<'_>; + fn entry(&self, name: &str) -> Option>; + fn default_entry(&self) -> TransitionView<'_>; + fn field_name(&self, id: DataFieldId) -> &str; + fn tag_name(&self, id: VariantTagId) -> &str; +} ``` -Each named definition has an entry point. The default entry is the last definition. Multiple entry points share the same transition graph. +##### Memory Arena Design + +The single `Box<[u8]>` allocation is divided into typed segments. Each segment is properly aligned for its type, ensuring safe access across all architectures (x86, ARM, RISC-V). + +**Segment Layout**: + +- Transitions: `[Transition; N]` at offset 0 +- Successors: `[TransitionId; M]` at `successors_offset` +- Effects: `[EffectOp; P]` at `effects_offset` +- Negated Fields: `[NodeFieldId; Q]` at `negated_fields_offset` +- Data Fields: `[u8; R]` (string data) at `data_fields_offset` +- Variant Tags: `[u8; S]` (string data) at `variant_tags_offset` +- Entrypoints: `[(name, TransitionId); T]` (length-prefixed strings) at `entrypoints_offset` + +Note: `entry(&str)` performs linear scan — O(n) where n = definition count (typically <20). 
+ The offsets are computed during graph construction to ensure: + +1. Each segment starts at its type's natural alignment boundary +2. No padding bytes are wasted between same-typed items +3. String data is stored as length-prefixed UTF-8 bytes -#### Transition +**Access Pattern**: + +The `TransitionView` and `MatcherView` types provide safe access by: + +- Resolving `Slice<T>` handles to actual slices within the appropriate segment +- Converting relative indices to absolute pointers +- Hiding all offset arithmetic from the query engine + +This design achieves: + +- **Cache efficiency**: All graph data in one contiguous allocation +- **Memory efficiency**: No per-node allocations, minimal overhead +- **Type safety**: Phantom types ensure slices point to correct segments +- **Zero-copy**: Direct references into the arena, no cloning + +#### Transition View + +`TransitionView` bundles a graph reference with a transition, enabling ergonomic access without explicit slice resolution: ```rust -struct Transition { - matcher: Option<Matcher>, // None = epsilon (no node consumed) - pre_anchored: bool, // must match at current position, no scanning - post_anchored: bool, // after match, cursor must be at last sibling - pre_effects: Vec<Effect>, // effects before match (consume previous current) - post_effects: Vec<Effect>, // effects after match (consume new current) - ref_marker: Option<RefTransition>, // call boundary marker - next: Vec<TransitionId>, // successors; order = priority (first = greedy) +struct TransitionView<'a> { + graph: &'a TransitionGraph, + raw: &'a Transition, } -enum RefTransition { - Enter(RefId), // push ref_id onto return stack - Exit(RefId), // pop from return stack (must match ref_id) +impl<'a> TransitionView<'a> { + fn matcher(&self) -> Option<MatcherView<'a>>; + fn next(&self) -> impl Iterator<Item = TransitionView<'a>>; + fn pre_effects(&self) -> &[EffectOp]; + fn post_effects(&self) -> &[EffectOp]; + fn is_pre_anchored(&self) -> bool; + fn is_post_anchored(&self) -> bool; + fn ref_marker(&self) -> Option<&RefTransition>; +} + +struct
MatcherView<'a> { + graph: &'a TransitionGraph, + raw: &'a Matcher, +} + +impl<'a> MatcherView<'a> { + fn kind(&self) -> MatcherKind; + fn node_kind(&self) -> Option<NodeTypeId>; + fn field(&self) -> Option<NodeFieldId>; + fn negated_fields(&self) -> &[NodeFieldId]; // resolved from Slice + fn matches(&self, cursor: &TreeCursor) -> bool; } + +enum MatcherKind { Node, Anonymous, Wildcard, Down, Up } ``` -Thompson construction creates epsilon transitions with optional `Enter`/`Exit` markers. Epsilon elimination propagates these markers to surviving transitions. At runtime, the engine uses markers to filter which `next` transitions are valid based on return stack state. Multiple transitions can share the same `RefId` after epsilon elimination. +**Execution Flow**: + +The engine traverses transitions following this pattern: + +1. **Pre-effects** execute unconditionally before any matching attempt +2. **Matching** determines whether to proceed: + - With matcher: Test against current cursor position + - Without matcher (epsilon): Always proceed +3. **On successful match**: Implicitly capture the node, execute post-effects +4. **Successors** are processed recursively, with appropriate backtracking + +The `TransitionView` abstraction hides all segment access complexity. The same logical flow applies to both execution modes—dynamic interpretation emits effects while proc-macro generation produces direct construction code. + +#### Slice Handle + +A compact, relative reference to a contiguous range within a segment. Replaces `&[T]` to keep structs self-contained. + +```rust +#[repr(C)] +struct Slice<T> { + start: u32, // Index within segment + len: u32, // Number of items + _phantom: PhantomData<T>, +} + +impl<T> Slice<T> { + const EMPTY: Self = Self { start: 0, len: 0, _phantom: PhantomData }; +} +``` + +Size: 8 bytes. Using `u32` for both fields fills the natural alignment with no padding waste, supporting up to 4B items per slice—well beyond any realistic query. + +#### Raw Transition + +Internal storage.
Engine code uses `TransitionView` instead of accessing this directly. + +```rust +struct Transition { + matcher: Option, + pre_anchored: bool, + post_anchored: bool, + pre_effects: Slice, + post_effects: Slice, + ref_marker: Option, + next: Slice, +} +``` + +The `TransitionView` resolves `Slice` using the graph's `get_slice` method, hiding all offset calculations from the engine code. + +**Design Note**: The `ref_marker` field is intentionally a single `Option` rather than a `Slice`. This means a transition can carry at most one Enter or Exit marker. While this prevents full epsilon elimination for nested reference sequences (e.g., `Enter(A) → Enter(B)`), we accept this limitation for simplicity. Such sequences remain as chains of epsilon transitions in the final graph. + +```rust +type TransitionId = u32; +type DataFieldId = u16; +type VariantTagId = u16; +type RefId = u16; +``` + +Each named definition has an entry point. The default entry is the last definition. Multiple entry points share the same transition graph. #### Matcher +Note: `NodeTypeId` and `NodeFieldId` are defined in `plotnik-core` (tree-sitter uses `u16` and `NonZeroU16` respectively). + ```rust enum Matcher { - // Matches named node like `identifier`, `function_declaration` Node { kind: NodeTypeId, - field: Option, // tree-sitter field constraint - negated_fields: Vec, // fields that must be absent + field: Option, + negated_fields: Slice, }, - // literal text: "(", "function", ";", etc., resolved to NodeTypeId Anonymous { kind: NodeTypeId, - field: Option, // tree-sitter field constraint + field: Option, }, - Wildcard, // matches any node - Down, // descend to first child - Up, // ascend to parent + Wildcard, + Down, + Up, } ``` Navigation variants `Down`/`Up` move the cursor without matching. They enable nested patterns like `(function_declaration (identifier) @name)` where we must descend into children. 

-#### Effects
+#### Reference Markers
+
+```rust
+enum RefTransition {
+    Enter(RefId), // push ref_id onto return stack
+    Exit(RefId),  // pop from return stack (must match ref_id)
+}
+```
+
+Thompson construction creates epsilon transitions with optional `Enter`/`Exit` markers. Epsilon elimination propagates these markers to surviving transitions. At runtime, the engine uses markers to filter which `next` transitions are valid based on return stack state. Multiple transitions can share the same `RefId` after epsilon elimination.
+
+#### Effect Operations
+
+Instructions stored in the transition graph. These are static, `Copy`, and contain no runtime data.

 ```rust
-enum Effect {
+#[derive(Clone, Copy)]
+enum EffectOp {
     StartArray,                 // push new [] onto container stack
     PushElement,                // move current value into top array
     EndArray,                   // pop array from stack, becomes current
@@ -124,37 +259,48 @@ enum Effect {
 }
 ```

-Note: Match transitions set `current` to the matched node (not an effect).
+Size: 4 bytes (1-byte discriminant + 2-byte payload + 1-byte padding).

-Effects capture structure only—nodes, arrays, objects. Type annotations (`:: str`, `:: Type`) are separate metadata applied during post-processing when constructing the final output.
+Note: There is no `CaptureNode` instruction. Node capture is implicit—a successful match automatically emits `RuntimeEffect::CaptureNode` to the builder (see below).

-### Data Construction
+Effects capture structure only—arrays, objects, variants. Type annotations (`:: str`, `:: Type`) are separate metadata applied during post-processing.

-Effects emit to a linear stream during matching. After a successful match, the effect stream is executed to build the output.
+### Data Construction (Dynamic Interpreter)

-#### Builder
+This section describes data construction for the dynamic interpreter. Proc-macro codegen uses direct construction instead (see [Direct Construction](#direct-construction-no-effect-stream)).
+
+The interpreter emits events to a linear stream during matching. After a successful match, the stream is executed to build the output.
+
+#### Runtime Effects
+
+Events emitted to the builder during interpretation. Unlike `EffectOp`, these carry runtime data.

 ```rust
-/// Accumulates effects during matching; supports rollback on backtrack
-struct Builder {
-    effects: Vec<Effect>,
+enum RuntimeEffect<'a> {
+    Op(EffectOp),          // forwarded instruction from graph
+    CaptureNode(Node<'a>), // emitted implicitly on successful match
 }
+```

-impl Builder {
-    fn emit(&mut self, effect: Effect) {
-        self.effects.push(effect);
-    }
+The `CaptureNode` variant is never stored in the graph—it's generated by the interpreter when a match succeeds. This separation keeps the graph static (no lifetimes) while allowing the runtime stream to carry actual node references.

-    fn watermark(&self) -> usize { // save point for backtracking
-        self.effects.len()
-    }
+#### Builder

-    fn rollback(&mut self, watermark: usize) { // discard effects after watermark
-        self.effects.truncate(watermark);
-    }
+```rust
+/// Accumulates runtime effects during matching; supports rollback on backtrack
+struct Builder<'a> {
+    effects: Vec<RuntimeEffect<'a>>,
 }
 ```

+The builder accumulates effects as a linear stream during matching. It provides:
+
+- **Effect emission**: Appends `EffectOp` instructions and `CaptureNode` events
+- **Watermarking**: Records position before attempting branches
+- **Rollback**: Truncates to saved position on backtrack
+
+This append-only design makes backtracking trivial—just truncate the vector. No complex undo logic needed.
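
The three builder operations listed above (emission, watermarking, rollback) can be sketched as follows. This is a minimal illustration rather than the project's implementation: `NodeId` is a hypothetical stand-in for tree-sitter's `Node<'a>` so the block is self-contained, only a subset of `EffectOp` is shown, and the method names `emit`, `watermark`, and `rollback` are taken from the earlier revision of this section.

```rust
/// Subset of the ADR's `EffectOp`, enough to build one array.
#[derive(Clone, Copy, Debug, PartialEq)]
enum EffectOp {
    StartArray,
    PushElement,
    EndArray,
}

/// Hypothetical stand-in for a tree-sitter `Node<'a>`.
type NodeId = u32;

#[derive(Clone, Copy, Debug, PartialEq)]
enum RuntimeEffect {
    Op(EffectOp),        // forwarded instruction from the graph
    CaptureNode(NodeId), // emitted implicitly on a successful match
}

#[derive(Default)]
struct Builder {
    effects: Vec<RuntimeEffect>,
}

impl Builder {
    /// Append one event to the linear stream.
    fn emit(&mut self, effect: RuntimeEffect) {
        self.effects.push(effect);
    }

    /// Save point: the current length of the stream.
    fn watermark(&self) -> usize {
        self.effects.len()
    }

    /// Discard everything emitted after the save point.
    fn rollback(&mut self, watermark: usize) {
        self.effects.truncate(watermark);
    }
}

/// Emits a failed branch, rolls it back, then emits the branch that succeeds.
fn demo_rollback() -> Vec<RuntimeEffect> {
    let mut builder = Builder::default();
    builder.emit(RuntimeEffect::Op(EffectOp::StartArray));

    let mark = builder.watermark();
    // First alternative: partially emitted, then abandoned.
    builder.emit(RuntimeEffect::CaptureNode(1));
    builder.emit(RuntimeEffect::Op(EffectOp::PushElement));
    builder.rollback(mark);

    // Second alternative starts from the same clean stream.
    builder.emit(RuntimeEffect::CaptureNode(2));
    builder.emit(RuntimeEffect::Op(EffectOp::PushElement));
    builder.emit(RuntimeEffect::Op(EffectOp::EndArray));
    builder.effects
}
```

Because the watermark is only a length, saving and restoring it costs O(1) and the abandoned branch leaves no trace in the stream.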
+
 #### Execution Model

 Two separate concepts during effect execution:
@@ -175,29 +321,40 @@ enum Value<'a> {
     Node(Node<'a>),            // AST node reference
     String(String),            // Text values (from @capture :: string)
     Array(Vec<Value<'a>>),     // completed array
-    Object(HashMap>), // completed object
+    Object(BTreeMap>), // completed object (BTreeMap for deterministic iteration)
     Variant(VariantTagId, Box<Value<'a>>), // tagged variant (tag + payload)
 }

 enum Container<'a> {
     Array(Vec<Value<'a>>),     // array under construction
-    Object(HashMap>), // object under construction
+    Object(BTreeMap>), // object under construction
     Variant(VariantTagId),     // variant tag; EndVariant wraps current value
 }
 ```

+Effect semantics on `current`:
+
+- `CaptureNode(node)` → sets `current` to `Value::Node(node)`
+- `Field(id)` → moves `current` into top object, clears to `None`
+- `PushElement` → moves `current` into top array, clears to `None`
+- `End*` → pops container from stack into `current`
+- `ToString` → replaces `current` Node with its source text as String
+
 #### Execution Pipeline

 For any given transition, the execution order is strict to ensure data consistency during backtracking:

 1. **Enter**: Push `Frame` with current `builder.watermark()`.
-2. **Pre-Effects**: Emit `pre_effects` (uses previous `current` value).
+2. **Pre-Effects**: Emit `pre_effects` as `RuntimeEffect::Op(...)`.
 3. **Match**: Validate node kind/fields. If fail, rollback to watermark and abort.
-4. **Post-Effects**: Emit `post_effects` (uses new `current` value).
-5. **Exit**: Pop `Frame` (validate return).
+4. **Capture**: Emit `RuntimeEffect::CaptureNode(matched_node)` — implicit, not from graph.
+5. **Post-Effects**: Emit `post_effects` as `RuntimeEffect::Op(...)`.
+6. **Exit**: Pop `Frame` (validate return).

 This order ensures correct behavior during epsilon elimination. Pre-effects run before the match overwrites `current`, allowing effects like `PushElement` to be safely merged from preceding epsilon transitions. Post-effects run after, for effects that need the newly matched node.

+The key insight: `CaptureNode` is generated by the interpreter on successful match, not stored as an instruction. The graph only contains structural operations (`EffectOp`); the runtime stream (`RuntimeEffect`) adds the actual node data.
+
 #### Example

 Query:
@@ -210,49 +367,49 @@ Func = (function_declaration

 Input: `function foo(a, b) {}`

-Effect stream (annotated with pre/post classification):
-
-```
-pre:  StartObject
-      (match "foo")
-post: Field("name")
-pre:  StartArray
-      (match "a")
-post: ToString
-post: PushElement
-      (match "b")
-post: ToString
-post: PushElement
-post: EndArray
-post: Field("params")
-post: EndObject
-```
-
-Note: In the raw graph, effects live on epsilon transitions between matches. The pre/post classification determines where they land after epsilon elimination. `StartObject` and `StartArray` are pre-effects (setup before matching). `Field`, `PushElement`, `ToString`, and `End*` are post-effects (consume the matched node or finalize containers).
-
-Execution trace:
-
-| Effect          | current     | stack                                    |
-| --------------- | ----------- | ---------------------------------------- |
-| StartObject     | -           | [{}]                                     |
-| (match "foo")   | Node(foo)   | [{}]                                     |
-| Field("name")   | -           | [{name: Node(foo)}]                      |
-| StartArray      | -           | [{name:...}, []]                         |
-| (match "a")     | Node(a)     | [{name:...}, []]                         |
-| ToString        | String("a") | [{name:...}, []]                         |
-| PushElement     | -           | [{name:...}, [String("a")]]              |
-| (match "b")     | Node(b)     | [{name:...}, [String("a")]]              |
-| ToString        | String("b") | [{name:...}, [String("a")]]              |
-| PushElement     | -           | [{name:...}, [String("a"), String("b")]] |
-| EndArray        | [...]       | [{name:...}]                             |
-| Field("params") | -           | [{name:..., params:[...]}]               |
-| EndObject       | {...}       | []                                       |
+Runtime effect stream (showing `EffectOp` from graph vs implicit `CaptureNode`):
+
+```
+graph pre:  Op(StartObject)
+implicit:   CaptureNode(foo)     ← from successful match
+graph post: Op(Field("name"))
+graph pre:  Op(StartArray)
+implicit:   CaptureNode(a)       ← from successful match
+graph post: Op(ToString)
+graph post: Op(PushElement)
+implicit:   CaptureNode(b)       ← from successful match
+graph post: Op(ToString)
+graph post: Op(PushElement)
+graph post: Op(EndArray)
+graph post: Op(Field("params"))
+graph post: Op(EndObject)
+```
+
+Note: The graph stores only `EffectOp` instructions. `CaptureNode` events are generated by the interpreter on each successful match—they never appear in `Transition.pre_effects` or `Transition.post_effects`.
+
+In the raw graph, `EffectOp`s live on epsilon transitions between matches. The pre/post classification determines where they land after epsilon elimination. `StartObject` and `StartArray` are pre-effects (setup before matching). `Field`, `PushElement`, `ToString`, and `End*` are post-effects (consume the matched node or finalize containers).
+
+Execution trace (key steps, second array element omitted):
+
+| RuntimeEffect       | current    | stack           |
+| ------------------- | ---------- | --------------- |
+| Op(StartObject)     | -          | [{}]            |
+| CaptureNode(foo)    | Node(foo)  | [{}]            |
+| Op(Field("name"))   | -          | [{name: Node}]  |
+| Op(StartArray)      | -          | [{...}, []]     |
+| CaptureNode(a)      | Node(a)    | [{...}, []]     |
+| Op(ToString)        | "a"        | [{...}, []]     |
+| Op(PushElement)     | -          | [{...}, ["a"]]  |
+| _(repeat for "b")_  | ...        | ...             |
+| Op(EndArray)        | ["a", "b"] | [{...}]         |
+| Op(Field("params")) | -          | [{..., params}] |
+| Op(EndObject)       | {...}      | []              |

 Final result:

 ```json
 {
-  "name": Node(foo),
+  "name": "",
   "params": ["a", "b"]
 }
 ```
@@ -265,55 +422,59 @@ Two mechanisms work together (same for both execution modes):

 2. **Effect watermark**: `builder.watermark()` before attempting a branch; `builder.rollback(watermark)` on failure.

-```rust
-// This logic appears in both modes:
-// - Proc macro: generated as literal Rust code
-// - Dynamic: executed by the interpreter
-
-let cursor_checkpoint = cursor.descendant_index();
-let builder_watermark = builder.watermark();
+Both execution modes save state before attempting branches:

-if try_first_branch(cursor, builder) {
-    return true;
-}
+- **Cursor checkpoint**: Current position in the AST (cheap to save, O(depth) to restore)
+- **Builder watermark**: Current effect count (O(1) save and restore)

-cursor.goto_descendant(cursor_checkpoint);
-builder.rollback(builder_watermark);
+The pattern is: attempt first branch, and on failure, restore both cursor and effects to their saved states before trying the next branch. This ensures each alternative starts from the same clean state.

-try_second_branch(cursor, builder)
 ```

 ### Quantifiers

 Quantifiers compile to epsilon transitions with specific `next` ordering:

-**Greedy `*`/`+`**:
+**Greedy `*`** (zero or more):

 ```
+
 Entry ─ε→ [try match first, then exit]
-          ↓
-          Match ─ε→ loop back to Entry
+↓
+Match ─ε→ loop back to Entry
+
 ```

-**Non-greedy `*?`/`+?`**:
+**Greedy `+`** (one or more):

 ```
-Entry ─ε→ [try exit first, then match]
+
+         ┌──────────────────────────┐
+         ↓                          │
+
+Entry ─→ Match ─ε→ Loop ─ε→ [try match first, then exit]
+
 ```

-Same structure, different `next` order. The first successor has priority.
+The `+` quantifier differs from `*`: it enters directly at `Match`, requiring at least one successful match before the exit path becomes available. After the first match, the `Loop` node behaves like `*` (match-first, exit-second).
+
+**Non-greedy `*?`/`+?`**:
+
+Same structures as above, but with reversed `next` ordering: exit path has priority over match path. For `+?`, after the mandatory first match, the loop prefers exiting over matching more.
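
The point that greedy and non-greedy quantifiers share one structure and differ only in `next` ordering can be illustrated with a toy model. This is a sketch under loud assumptions: `branch_successors` and `consumed` are hypothetical helpers, the input is a flat slice of characters instead of a syntax tree, and the continuation after the exit path is assumed to always succeed, so no backtracking is modelled.

```rust
type TransitionId = u32;

const MATCH: TransitionId = 0; // hypothetical id of the match transition
const EXIT: TransitionId = 1; // hypothetical id of the exit path

/// Successor list of a quantifier's branch point; the first entry has priority.
fn branch_successors(greedy: bool) -> [TransitionId; 2] {
    if greedy {
        [MATCH, EXIT]
    } else {
        [EXIT, MATCH]
    }
}

/// Counts how many leading `want` items a `*`-style loop consumes when the
/// first viable successor always wins.
fn consumed(input: &[char], want: char, greedy: bool) -> usize {
    let order = branch_successors(greedy);
    let mut pos = 0;
    'iteration: loop {
        for &transition in &order {
            if transition == EXIT {
                // The exit path is always viable in this toy model.
                return pos;
            }
            if input.get(pos) == Some(&want) {
                // Match succeeded: consume one item and loop back to the branch point.
                pos += 1;
                continue 'iteration;
            }
        }
        // Not reached, since EXIT is always one of the successors.
        return pos;
    }
}
```

With the input `['a', 'a', 'b']` and `want = 'a'`, the greedy ordering consumes both leading items before exiting, while the non-greedy ordering exits immediately. In the real engine a later failure would send the non-greedy loop back to try the match path.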

 ### Arrays

 Array construction uses epsilon transitions with effects:

 ```
-T0: ε + StartArray         next: [T1]      // pre-effect: setup array
-T1: ε (branch)             next: [T2, T4]  // try match or exit
-T2: Match(expr)            next: [T3]
-T3: ε + PushElement        next: [T1]      // post-effect: consume matched node
-T4: ε + EndArray           next: [T5]      // post-effect: finalize array
-T5: ε + Field("items")     next: [...]     // post-effect: assign to field
+
+T0: ε + StartArray next: [T1] // pre-effect: setup array
+T1: ε (branch) next: [T2, T4] // try match or exit
+T2: Match(expr) next: [T3]
+T3: ε + PushElement next: [T1] // post-effect: consume matched node
+T4: ε + EndArray next: [T5] // post-effect: finalize array
+T5: ε + Field("items") next: [...] // post-effect: assign to field
+
 ```

 After epsilon elimination, `PushElement` from T3 merges into T2 as a post-effect. `StartArray` from T0 merges into T2 as a pre-effect (first iteration only—loop iterations enter from T3, not T0).
@@ -325,10 +486,12 @@ Backtracking naturally handles partial arrays: truncating the effect stream remo
 Nested objects from `{...} @name` use `StartObject`/`EndObject` effects:

 ```
-T0: ε + StartObject          next: [T1]   // pre-effect: setup object
-T1: ... (sequence contents)  next: [T2]
-T2: ε + EndObject            next: [T3]   // post-effect: finalize object
-T3: ε + Field("name")        next: [...]  // post-effect: assign to field
+
+T0: ε + StartObject next: [T1] // pre-effect: setup object
+T1: ... (sequence contents) next: [T2]
+T2: ε + EndObject next: [T3] // post-effect: finalize object
+T3: ε + Field("name") next: [...] // post-effect: assign to field
+
 ```

 `StartObject` is a pre-effect (merges forward). `EndObject` and `Field` are post-effects (merge backward onto preceding match).
@@ -338,18 +501,22 @@ T3: ε + Field("name") next: [...] // post-effect: assign to field
 Tagged branches use `StartVariant` to create explicit tagged structures.

 ```
+
 [ A: (true) ]
+
 ```

 Effect stream:

 ```
+
 StartVariant("A")
 StartObject
 ...
 EndObject
 EndVariant
-```
+
+````

 The resulting `Value::Variant` preserves the tag distinct from the payload, preventing name collisions.

@@ -359,10 +526,16 @@ The resulting `Value::Variant` preserves the tag distinct from the payload, prev
 { "$tag": "A", "$data": { "x": 1, "y": 2 } }
 { "$tag": "B", "$data": [1, 2, 3] }
 { "$tag": "C", "$data": "foo" }
-```
+````

 The `$tag` and `$data` keys avoid collisions with user-defined captures. Uniform structure simplifies parsing (always access `.$data`) and eliminates conditional flatten-vs-wrap logic.

+**Nested variants** (variant containing variant) serialize naturally:
+
+```json
+{ "$tag": "Outer", "$data": { "$tag": "Inner", "$data": 42 } }
+```
+
 This mirrors Rust's serde adjacently-tagged representation and remains fully readable for LLMs. No query validation restriction—all payload types are valid.

 ### Definition References and Recursion
@@ -385,18 +558,7 @@ The `RefId` is semantic identity—"which reference in the query pattern"—dist

 **Proc macro**: Each definition becomes a Rust function. References become function calls. Rust's call stack serves as the return stack—`RefId` is implicit in the call site.

-```rust
-// Generated code
-fn match_expr(cursor: &mut TreeCursor, builder: &mut Builder) -> bool {
-    // ... alternation over Num, Binary, Call variants
-}
-
-fn match_binary(cursor: &mut TreeCursor, builder: &mut Builder) -> bool {
-    // ...
-    if !match_expr(cursor, builder) { return false; } // RefId implicit
-    // ...
-}
-```
+In proc-macro mode, each definition becomes a Rust function. References become direct function calls, with the Rust call stack serving as the implicit return stack. The `RefId` exists only in the IR—the generated code relies on Rust's natural call/return mechanism.

 **Dynamic**: The interpreter maintains an explicit return stack. On `Enter(ref_id)`:

@@ -405,9 +567,13 @@ fn match_binary(cursor: &mut TreeCursor, builder: &mut Builder) -> bool {

 On `Exit(ref_id)`:

-1. Verify top frame matches `ref_id`
-2. Filter `next` to only transitions reachable from the call site (same `ref_id` on their entry path)
-3. Pop frame on successful exit
+1. Verify top frame matches `ref_id` (invariant: mismatched ref_id indicates IR bug)
+2. Pop frame
+3. Continue to `next` successors unconditionally
+
+**Entry filtering mechanism**: The filtering happens when _entering_ an `Exit` transition, not when leaving it. After epsilon elimination, multiple `Exit` transitions with different `RefId`s may be reachable from the same point (merged from different call sites). The interpreter only takes an `Exit(ref_id)` transition if `ref_id` matches the current stack top. This ensures returns go to the correct call site.
+
+After taking an `Exit` and popping the frame, successors are followed unconditionally—they represent the continuation after the call. If a successor has an `Enter` marker, that's a _new_ call (e.g., `(A) (B)` where returning from A continues to calling B), not a return path.

 ```rust
 /// Return stack entry for definition calls
@@ -428,7 +594,7 @@ struct Interpreter<'a> {

 ### Epsilon Elimination (Optimization)

-After initial construction, epsilon transitions can be eliminated by computing epsilon closures. The `pre_effects`/`post_effects` split is essential for correctness here.
+After initial construction, epsilon transitions can be **partially** eliminated by computing epsilon closures. Full elimination is not always possible due to the single `ref_marker` limitation—sequences like `Enter(A) → Enter(B)` cannot be merged into one transition. The `pre_effects`/`post_effects` split is essential for correctness here.

 **Why the split matters**: A match transition overwrites `current` with the matched node. Effects from _preceding_ epsilon transitions (like `PushElement`) need the _previous_ `current` value. Without the split, merging them into a single post-match list would use the wrong value.

@@ -447,10 +613,10 @@ T3': Match(B) + [PushElement] // PushElement runs after Match(

 **Accumulation rules**:

-- Effects from incoming epsilon paths → accumulate into `pre_effects`
-- Effects from outgoing epsilon paths → accumulate into `post_effects`
+- `EffectOp`s from incoming epsilon paths → accumulate into `pre_effects`
+- `EffectOp`s from outgoing epsilon paths → accumulate into `post_effects`

-This is why both are `Vec<Effect>` rather than `Option<Effect>`.
+This is why both are `Slice<EffectOp>` rather than `Option<EffectOp>`.

 **Reference expansion**: For definition references, epsilon elimination propagates `Enter`/`Exit` markers to surviving transitions:

@@ -469,10 +635,12 @@ T3': Match(...) + Exit(0) next: [T5'] // marker propagated

 All expanded entry transitions share the same `RefId`. All expanded exit transitions share the same `RefId`. The engine filters valid continuations at runtime based on stack state—no explicit continuation storage needed.

+**Limitation**: Complete epsilon elimination is impossible when reference markers chain (e.g., nested calls). The single `ref_marker` slot prevents merging `Enter(A) → Enter(B)` sequences. These remain as epsilon transition chains in the final graph.
+
 This optimization benefits both modes:

-- **Proc macro**: Fewer transitions → less generated code
-- **Dynamic**: Fewer graph traversals → faster interpretation
+- **Proc macro**: Fewer transitions → less generated code (where elimination is possible)
+- **Dynamic**: Fewer graph traversals → faster interpretation (but must handle remaining epsilons)

 ### Proc Macro Code Generation

@@ -493,6 +661,12 @@ Generated code uses:

 At runtime, there is no graph—just plain Rust code.

+#### Direct Construction (No Effect Stream)
+
+Unlike the dynamic interpreter, proc-macro generated code constructs output values directly—no intermediate effect stream. Output structs are built in a single pass as matching proceeds.
+
+Backtracking in direct construction means dropping partially-built values and re-allocating. This is acceptable because modern allocators maintain thread-local free lists, making the alloc→drop→alloc pattern for small objects essentially O(1).
+
 ### Dynamic Execution

 When used dynamically, the transition graph is interpreted at runtime:
@@ -507,22 +681,29 @@ The interpreter maintains:
 - Current transition pointer
 - Explicit return stack for definition calls
 - Cursor position
-- Effect stream with watermarks
+- `RuntimeEffect` stream with watermarks
+
+Unlike proc-macro codegen, the dynamic interpreter uses the `RuntimeEffect` stream approach. This is necessary because:
+
+- We don't know the output structure at compile time
+- `RuntimeEffect` stream provides a uniform way to build any output shape
+- Backtracking via `truncate()` is simple and correct

-Trade-off: More flexible (runtime query construction), but slower than generated code.
+Trade-off: More flexible (runtime query construction), but slower than generated code due to interpretation overhead and the extra effect execution pass.
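
The extra effect execution pass named in the trade-off above can be pictured as a single loop over the recorded stream that maintains `current` and a container stack. The sketch below is illustrative only and rests on several assumptions: nodes are plain `u32` ids instead of tree-sitter nodes, `Field` carries a field name directly rather than a `DataFieldId`, object keys are `String`, and `ToString` and variants are left out. The function name `execute` is hypothetical.

```rust
use std::collections::BTreeMap;

#[derive(Clone, Copy)]
enum EffectOp {
    StartArray,
    PushElement,
    EndArray,
    StartObject,
    EndObject,
    Field(&'static str), // assumption: a name instead of a DataFieldId
}

enum RuntimeEffect {
    Op(EffectOp),
    CaptureNode(u32), // assumption: node id instead of Node<'a>
}

#[derive(Debug, PartialEq)]
enum Value {
    Node(u32),
    Array(Vec<Value>),
    Object(BTreeMap<String, Value>),
}

enum Container {
    Array(Vec<Value>),
    Object(BTreeMap<String, Value>),
}

/// Replays the stream; returns `None` if the stream is malformed.
fn execute(stream: &[RuntimeEffect]) -> Option<Value> {
    let mut current: Option<Value> = None;
    let mut stack: Vec<Container> = Vec::new();
    for effect in stream {
        match effect {
            RuntimeEffect::CaptureNode(id) => current = Some(Value::Node(*id)),
            RuntimeEffect::Op(op) => match op {
                EffectOp::StartArray => stack.push(Container::Array(Vec::new())),
                EffectOp::StartObject => stack.push(Container::Object(BTreeMap::new())),
                EffectOp::PushElement => {
                    // Move `current` into the array on top of the stack.
                    let value = current.take()?;
                    match stack.last_mut()? {
                        Container::Array(items) => items.push(value),
                        Container::Object(_) => return None,
                    }
                }
                EffectOp::Field(name) => {
                    // Move `current` into the object on top of the stack.
                    let value = current.take()?;
                    match stack.last_mut()? {
                        Container::Object(fields) => {
                            fields.insert(name.to_string(), value);
                        }
                        Container::Array(_) => return None,
                    }
                }
                EffectOp::EndArray => match stack.pop()? {
                    Container::Array(items) => current = Some(Value::Array(items)),
                    Container::Object(_) => return None,
                },
                EffectOp::EndObject => match stack.pop()? {
                    Container::Object(fields) => current = Some(Value::Object(fields)),
                    Container::Array(_) => return None,
                },
            },
        }
    }
    current
}

/// The `Func` example from this document, without the `ToString` steps.
fn demo_execute() -> Option<Value> {
    use EffectOp::*;
    use RuntimeEffect::*;
    execute(&[
        Op(StartObject),
        CaptureNode(1),
        Op(Field("name")),
        Op(StartArray),
        CaptureNode(2),
        Op(PushElement),
        CaptureNode(3),
        Op(PushElement),
        Op(EndArray),
        Op(Field("params")),
        Op(EndObject),
    ])
}
```

Since the pass only runs after a successful match, it never sees effects from abandoned branches; those were already truncated away by the builder.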

 ## Execution Mode Comparison

-| Aspect           | Proc Macro             | Dynamic                 |
-| ---------------- | ---------------------- | ----------------------- |
-| Query source     | Compile-time literal   | Runtime string          |
-| Graph lifetime   | Compile-time only      | Runtime                 |
-| Definition calls | Rust function calls    | Explicit return stack   |
-| Return stack     | Rust call stack        | `Vec<Frame>`            |
-| Backtracking     | Generated `if`/`else`  | Interpreter loop        |
-| Performance      | Zero dispatch overhead | Interpretation overhead |
-| Type safety      | Compile-time checked   | Runtime types           |
-| Use case         | Known queries          | User-provided queries   |
+| Aspect            | Proc Macro                 | Dynamic                      |
+| ----------------- | -------------------------- | ---------------------------- |
+| Query source      | Compile-time literal       | Runtime string               |
+| Graph lifetime    | Compile-time only          | Runtime                      |
+| Data construction | Direct (no effect stream)  | `RuntimeEffect` stream + exe |
+| Definition calls  | Rust function calls        | Explicit return stack        |
+| Return stack      | Rust call stack            | `Vec<Frame>`                 |
+| Backtracking      | Drop + re-alloc            | `truncate()` effects         |
+| Performance       | Zero dispatch, single pass | Interpretation + 2 pass      |
+| Type safety       | Compile-time checked       | Runtime types                |
+| Use case          | Known queries              | User-provided queries        |

 ## Consequences

@@ -530,18 +711,41 @@ Trade-off: More flexible (runtime query construction), but slower than generated

 - **Shared IR**: One representation serves both execution modes
 - **Proc macro zero-overhead**: Generated code is plain Rust with no dispatch
+- **Pre-allocated graph**: Single contiguous allocation
 - **Dynamic flexibility**: Queries can be constructed or modified at runtime
-- **Unified backtracking**: Same watermark mechanism for cursor and effects in both modes
 - **Optimizable**: Epsilon elimination benefits both modes
 - **Multiple entry points**: Same graph supports querying any definition
+- **Clean separation**: `EffectOp` (static instructions) vs `RuntimeEffect` (dynamic events) eliminates lifetime issues

 ### Negative

 - **Two code paths**: Must maintain both codegen and interpreter
+- **Different data construction**: Proc macro uses direct construction, dynamic uses `RuntimeEffect` stream
 - **Proc macro compile cost**: Complex queries generate more code
-- **Dynamic runtime cost**: Interpretation overhead vs. generated code
+- **Dynamic runtime cost**: Interpretation overhead + effect execution pass
 - **Testing burden**: Must verify both modes produce identical results

+### Runtime Safety
+
+Both execution modes require fuel mechanisms to prevent runaway execution:
+
+- **runtime_fuel**: Decremented on each transition, prevents infinite loops
+- **recursion_fuel**: Decremented on each `Enter` marker, prevents stack overflow
+
+These mechanisms deserve their own ADR (fuel budget design, configurable limits, error reporting on exhaustion). The IR itself carries no fuel-related data—fuel checking is purely an interpreter/codegen concern.
+
+**Note**: Static loop detection (e.g., direct recursion like `A = (A)` or mutual recursion like `A = (B)`, `B = (A)`) is handled at the query parser level before IR construction. The IR assumes well-formed input without infinite loops in the pattern structure itself.
+
+### WASM Compatibility
+
+The IR design is WASM-compatible:
+
+- **Single arena allocation**: No fragmentation concerns in linear memory
+- **`usize` offsets**: 32-bit on WASM32, limiting arena to 4GB (sufficient for any query)
+- **`BTreeMap` for objects**: Deterministic iteration order ensures reproducible output
+- **Per-type alignment**: Segment offsets computed at construction time, respecting target alignment requirements
+- **No platform-specific primitives**: All types are portable (`u16`, `u32`, `Box<[u8]>`)
+
 ### Considered Alternatives

 1. **Proc macro only**
@@ -561,7 +765,6 @@ Trade-off: More flexible (runtime query construction), but slower than generated

 ## References

-- Thompson, K. (1968). "Programming Techniques: Regular expression search algorithm." Communications of the ACM, 11(6), pp. 419-422. — fragment composition technique adapted here
-- Woods, W. A. (1970). "Transition network grammars for natural language analysis." Communications of the ACM, 13(10), pp. 591-606. — recursive transition networks
+- Bazaco, D. (2022). "Building a Regex Engine" blog series. https://www.abstractsyntaxseed.com/blog/regex-engine/introduction — NFA construction and modern regex features
 - Tree-sitter TreeCursor API: `descendant_index()`, `goto_descendant()`
 - [ADR-0001: Query Parser](ADR-0001-query-parser.md)