diff --git a/AGENTS.md b/AGENTS.md
index 053e77b4..7f04014e 100644
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -16,7 +16,10 @@
 - **Index**:
   - [ADR-0001: Query Parser](docs/adr/ADR-0001-query-parser.md)
   - [ADR-0002: Diagnostics System](docs/adr/ADR-0002-diagnostics-system.md)
-  - [ADR-0003: Query Intermediate Representation](docs/adr/ADR-0003-query-intermediate-representation.md)
+  - ADR-0003: Query Intermediate Representation (superseded by ADR-0004, ADR-0005, ADR-0006, available via git history)
+  - [ADR-0004: Query IR Binary Format](docs/adr/ADR-0004-query-ir-binary-format.md)
+  - [ADR-0005: Transition Graph Format](docs/adr/ADR-0005-transition-graph-format.md)
+  - [ADR-0006: Dynamic Query Execution](docs/adr/ADR-0006-dynamic-query-execution.md)
 - **Template**:
   ```markdown
 
@@ -40,6 +43,13 @@
   - **Considered Alternatives**: Describe rejected options and why.
   ```
 
+## How to write ADRs
+
+ADRs must be succinct and straight to the point.
+They must contain examples with high information density and pedagogical value.
+These are docs people usually don't want to read, but when they do, they find them quite fascinating.
+Avoid imperative code, describe structure definitions, their purpose and how to use them properly.
+
 # Plotnik Query Language
 
 Plotnik is a strongly-typed, whitespace-delimited pattern matching language for syntax trees (similar to Tree-sitter but stricter).
diff --git a/docs/adr/ADR-0003-query-intermediate-representation.md b/docs/adr/ADR-0003-query-intermediate-representation.md
deleted file mode 100644
index f16dfb74..00000000
--- a/docs/adr/ADR-0003-query-intermediate-representation.md
+++ /dev/null
@@ -1,924 +0,0 @@
-# ADR-0003: Query Intermediate Representation
-
-- **Status**: Accepted
-- **Date**: 2025-12-10
-
-## Context
-
-Plotnik needs to execute queries against tree-sitter syntax trees.
-The query language supports:
-
-- Named node and anonymous node matching
-- Field constraints and negated fields
-- Named definitions with mutual recursion
-- Quantifiers (`*`, `+`, `?`) with greedy/non-greedy variants
-- Alternations (tagged and untagged)
-- Sequences
-- Captures with type annotations
-- Anchors for strict positional matching
-
-Plotnik supports two execution modes:
-
-1. **Proc macro (compile-time)**: Query is compiled to specialized Rust functions. Zero runtime interpretation overhead. Used when query is known at compile time.
-
-2. **Dynamic (runtime)**: Query is parsed and executed at runtime via graph interpretation. Used when query is provided by user input or loaded from files.
-
-Both modes share the same intermediate representation (IR). The IR must support efficient execution in both contexts.
-
-The design evolved through several realizations:
-
-1. **Thompson-style fragment composition**: We adapted Thompson's technique for composing pattern fragments (alternation, sequence, quantifiers). However, unlike classic NFAs (which handle only regular languages), our representation supports recursion via a return stack.
-
-2. **Transitions do all the work**: Each transition performs navigation, matching, and effect emission. States are just junction points with no semantics.
-
-3. **Edge-centric model**: Transitions are primary, states are implicit. The IR is a flat array of transitions, each knowing its successors. The result is a _recursive transition network_, like an NFA but with call/return semantics for definition references.
-
-## Decision
-
-We adopt an edge-centric intermediate representation where:
-
-1. **Transitions are primary**: Each transition carries matching logic, effects, and successor links
-2. **States are implicit**: No explicit state objects; transitions point directly to successor transitions
-3. **Effects are append-only**: Data construction emits effects to a linear stream.
-Backtracking truncates to a saved watermark: no complex undo logic, just `Vec::truncate`
-4. **Shared IR, different executors**: The same `TransitionGraph` serves both proc macro codegen and dynamic interpretation
-
-### Core Data Structures
-
-These structures are used by both execution modes.
-
-#### Transition Graph Container
-
-The graph is immutable after construction. We use a single contiguous allocation sliced into typed segments with proper alignment handling.
-
-```rust
-struct TransitionGraph {
-    data: Arena, // custom type, see Memory Layout & Alignment
-    // segment offsets (aligned for each type)
-    successors_offset: u32,
-    effects_offset: u32,
-    negated_fields_offset: u32,
-    data_fields_offset: u32,
-    variant_tags_offset: u32,
-    entrypoint_names_offset: u32,
-    entrypoints_offset: u32,
-    default_entrypoint: TransitionId,
-}
-
-impl TransitionGraph {
-    fn new() -> Self;
-    fn get(&self, id: TransitionId) -> TransitionView<'_>;
-    fn entry(&self, name: &str) -> Option<TransitionView<'_>>;
-    fn default_entry(&self) -> TransitionView<'_>;
-    fn field_name(&self, id: DataFieldId) -> &str;
-    fn tag_name(&self, id: VariantTagId) -> &str;
-    fn entrypoint_name(&self, entry: &Entrypoint) -> &str;
-}
-```
-
-##### Memory Arena Design
-
-The single contiguous allocation is divided into typed segments. Each segment is properly aligned for its type, ensuring safe access across all architectures (x86, ARM, WASM).
-
-**Segment Layout**:
-
-| Segment          | Type                | Offset                    | Alignment |
-| ---------------- | ------------------- | ------------------------- | --------- |
-| Transitions      | `[Transition; N]`   | 0 (implicit)              | 4 bytes   |
-| Successors       | `[TransitionId; M]` | `successors_offset`       | 4 bytes   |
-| Effects          | `[EffectOp; P]`     | `effects_offset`          | 2 bytes   |
-| Negated Fields   | `[NodeFieldId; Q]`  | `negated_fields_offset`   | 2 bytes   |
-| Data Fields      | `[u8; R]`           | `data_fields_offset`      | 1 byte    |
-| Variant Tags     | `[u8; S]`           | `variant_tags_offset`     | 1 byte    |
-| Entrypoint Names | `[u8; U]`           | `entrypoint_names_offset` | 1 byte    |
-| Entrypoints      | `[Entrypoint; T]`   | `entrypoints_offset`      | 4 bytes   |
-
-Transitions always start at offset 0, so no explicit offset is stored. The arena base address is allocated with 4-byte alignment, satisfying `Transition`'s requirement.
-
-Note: `entry(&str)` performs a linear scan: O(n) where n = definition count (typically <20).
-
-##### Memory Layout & Alignment
-
-Casting `&u8` to `&T` when the address is not aligned to `T` is undefined behavior in Rust and can fault on strict-alignment targets such as some ARM cores. We enforce alignment explicitly via a custom arena type.
-
-**A. Base Allocation Alignment**
-
-The arena must be allocated with alignment equal to the maximum of all segment types.
-We use a custom `Arena` wrapper that tracks the allocation layout for correct deallocation:
-
-```rust
-const ARENA_ALIGN: usize = 4; // align_of::<Transition>()
-
-struct Arena {
-    ptr: *mut u8,
-    len: usize,
-}
-
-impl Arena {
-    fn new(len: usize) -> Self {
-        // `alloc` with a zero-size layout is undefined behavior, so reject it up front.
-        assert!(len > 0, "arena length must be non-zero");
-        let layout = std::alloc::Layout::from_size_align(len, ARENA_ALIGN).unwrap();
-        let ptr = unsafe { std::alloc::alloc(layout) };
-        if ptr.is_null() {
-            std::alloc::handle_alloc_error(layout);
-        }
-        Self { ptr, len }
-    }
-}
-
-impl Drop for Arena {
-    fn drop(&mut self) {
-        // Rebuild the exact layout used in `new`; `dealloc` requires a matching layout.
-        let layout = std::alloc::Layout::from_size_align(self.len, ARENA_ALIGN).unwrap();
-        unsafe { std::alloc::dealloc(self.ptr, layout) };
-    }
-}
-```
-
-Standard `Box<[u8]>` cannot be used here: it assumes 1-byte alignment and would pass incorrect layout to `dealloc`, causing undefined behavior.
-
-**B. Segment Offset Calculation**
-
-Each segment offset is rounded up to its type's alignment:
-
-```rust
-fn align_up(offset: usize, align: usize) -> usize {
-    (offset + align - 1) & !(align - 1)
-}
-
-// Example: if Transitions end at byte 103, Successors (align 4) start at 104
-let successors_offset = align_up(transitions_end, align_of::<TransitionId>());
-```
-
-**C. Entrypoints Structure**
-
-Entrypoints use fixed-size metadata with indirect string storage:
-
-```rust
-#[repr(C)]
-struct Entrypoint {
-    name_offset: u32, // index into entrypoint_names segment
-    name_len: u32,
-    target: TransitionId, // Index into transitions segment (offset 0)
-}
-// Size: 12 bytes, Align: 4 bytes
-```
-
-The `name_offset` points into the `entrypoint_names` segment (u8 array), where alignment is irrelevant. This avoids the alignment hazards of inline variable-length strings. Note: entrypoint names are stored separately from data field names because they serve different purposes: entrypoint names identify subqueries for lookup, while data field names are used in output object construction.
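The offset calculation from step B can be followed end to end for the first few segments. This is a minimal sketch, not the project's builder: `SegmentLayout` and `compute_layout` are hypothetical names, and the element sizes and alignments are simply the ones quoted in this ADR (48-byte transitions, 4-byte successors, 4-byte effects with 2-byte alignment).

```rust
/// Round `offset` up to the next multiple of `align` (`align` must be a power of two).
fn align_up(offset: usize, align: usize) -> usize {
    debug_assert!(align.is_power_of_two());
    (offset + align - 1) & !(align - 1)
}

/// Hypothetical summary of the computed layout; field names mirror the ADR's offsets.
#[derive(Debug, PartialEq)]
struct SegmentLayout {
    successors_offset: usize,
    effects_offset: usize,
    total_len: usize,
}

/// Compute offsets for the first three segments from element counts.
/// Sizes and alignments are assumptions taken from the ADR text, not measured.
fn compute_layout(transitions: usize, successors: usize, effects: usize) -> SegmentLayout {
    const TRANSITION_SIZE: usize = 48;
    const SUCCESSOR_SIZE: usize = 4;
    const SUCCESSOR_ALIGN: usize = 4;
    const EFFECT_SIZE: usize = 4;
    const EFFECT_ALIGN: usize = 2;

    // Transitions start at offset 0, so their end is just count * size.
    let transitions_end = transitions * TRANSITION_SIZE;
    let successors_offset = align_up(transitions_end, SUCCESSOR_ALIGN);
    let successors_end = successors_offset + successors * SUCCESSOR_SIZE;
    let effects_offset = align_up(successors_end, EFFECT_ALIGN);
    let total_len = effects_offset + effects * EFFECT_SIZE;

    SegmentLayout { successors_offset, effects_offset, total_len }
}
```

Calling `compute_layout(2, 3, 5)` gives the offsets for two transitions, three successors and five effects; `total_len` is the value one would pass to `Arena::new` if only these three segments existed.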
-
-##### Construction Process
-
-The graph is built in two passes, with a layout step in between, to ensure a single contiguous allocation:
-
-1. **Analysis Pass**: Traverse the Query AST to count all elements (transitions, effects, negated fields, strings).
-2. **Layout & Allocation**: Compute aligned offsets for all segments and allocate the `Arena` once.
-3. **Emission Pass**: Serialize data into the arena.
-   - **String Tables**: Written sequentially to `u8` segments.
-   - **Slices**: `Slice<T>` fields are populated with `start` indices relative to their segment base.
-   - **Structs**: `Transition`, `Entrypoint`, etc., are written using `std::ptr::write` to their calculated offsets.
-
-This approach eliminates dynamic resizing (`realloc`) and fragmentation.
-
-##### Slice Resolution
-
-`Slice<T>` handles are resolved to actual slices by combining:
-
-1. The segment's base offset (e.g., `effects_offset` for `Slice<EffectOp>`)
-2. The slice's `start` field (element index within segment)
-3. The slice's `len` field
-
-The `TransitionView` methods (`pre_effects()`, `post_effects()`, `next()`) perform this resolution internally, returning standard `&[T]` slices. Engine code never performs offset arithmetic directly.
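The resolution steps listed above can be illustrated in safe Rust. This is a sketch under a simplifying assumption: each segment is already available as a typed `&[T]` (the real arena would first derive that from the segment's base offset), and `resolve` is a hypothetical helper name, not an API from the codebase.

```rust
use std::marker::PhantomData;

/// Relative handle: element index and element count within one segment.
struct Slice<T> {
    start: u32,
    len: u32,
    _phantom: PhantomData<T>,
}

impl<T> Slice<T> {
    fn new(start: u32, len: u32) -> Self {
        Self { start, len, _phantom: PhantomData }
    }

    /// Resolve the handle against the segment it was created for.
    /// Returns `None` if the range does not fit inside the segment.
    fn resolve<'a>(&self, segment: &'a [T]) -> Option<&'a [T]> {
        let start = self.start as usize;
        let end = start.checked_add(self.len as usize)?;
        segment.get(start..end)
    }
}
```

For a successors segment `[10, 20, 30, 40]`, `Slice::<u32>::new(1, 2).resolve(&segment)` yields the sub-slice `[20, 30]`. A view type in the style of `TransitionView` would wrap this call so that engine code only ever sees `&[T]`.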
-
-**Access Pattern**:
-
-The `TransitionView` and `MatcherView` types provide safe access by:
-
-- Resolving `Slice<T>` handles to actual slices within the appropriate segment
-- Converting relative indices to absolute pointers
-- Hiding all offset arithmetic from the query engine
-
-This design achieves:
-
-- **Cache efficiency**: All graph data in one contiguous allocation
-- **Memory efficiency**: No per-node allocations, minimal overhead
-- **Type safety**: Phantom types ensure slices point to correct segments
-- **Zero-copy**: Direct references into the arena, no cloning
-
-#### Transition View
-
-`TransitionView` bundles a graph reference with a transition, enabling ergonomic access without explicit slice resolution:
-
-```rust
-struct TransitionView<'a> {
-    graph: &'a TransitionGraph,
-    raw: &'a Transition,
-}
-
-impl<'a> TransitionView<'a> {
-    fn matcher(&self) -> MatcherView<'a>;
-    fn next(&self) -> impl Iterator<Item = TransitionView<'a>>;
-    fn pre_effects(&self) -> &[EffectOp];
-    fn post_effects(&self) -> &[EffectOp];
-    fn is_pre_anchored(&self) -> bool;
-    fn is_post_anchored(&self) -> bool;
-    fn ref_marker(&self) -> Option<&RefTransition>;
-}
-
-struct MatcherView<'a> {
-    graph: &'a TransitionGraph,
-    raw: &'a Matcher,
-}
-
-impl<'a> MatcherView<'a> {
-    fn kind(&self) -> MatcherKind;
-    fn node_kind(&self) -> Option<NodeTypeId>;
-    fn field(&self) -> Option<NodeFieldId>;
-    fn negated_fields(&self) -> &[NodeFieldId]; // resolved from Slice<NodeFieldId>
-    fn matches(&self, cursor: &mut TreeCursor) -> bool;
-}
-
-enum MatcherKind { Epsilon, Node, Anonymous, Wildcard, Down, Up }
-```
-
-**Execution Flow**:
-
-The engine traverses transitions following this pattern:
-
-1. **Pre-effects** execute unconditionally before any matching attempt
-2. **Matching** determines whether to proceed:
-   - With matcher: Test against current cursor position
-   - Without matcher (epsilon): Always proceed
-3. **On successful match**: Implicitly capture the node, execute post-effects
-4.
-   **Successors** are processed recursively, with appropriate backtracking
-
-The `TransitionView` abstraction hides all segment access complexity. The same logical flow applies to both execution modes: dynamic interpretation emits effects while proc-macro generation produces direct construction code.
-
-#### Slice Handle
-
-A compact, relative reference to a contiguous range within a segment. Replaces `&[T]` to keep structs self-contained.
-
-```rust
-#[repr(C)]
-struct Slice<T> {
-    start: u32, // Index within segment
-    len: u32, // Number of items
-    _phantom: PhantomData<T>,
-}
-
-impl<T> Slice<T> {
-    const EMPTY: Self = Self { start: 0, len: 0, _phantom: PhantomData };
-}
-
-// Note: PhantomData does not prevent crossing segment boundaries (e.g. passing a
-// Slice over one segment where a Slice over another segment with the same element
-// type is expected). Implementers should consider wrapping these in newtypes if
-// stricter compile-time safety is required.
-```
-
-Size: 8 bytes. Using `u32` for both fields fills the natural alignment with no padding waste, supporting up to 4B items per slice, well beyond any realistic query.
-
-#### Raw Transition
-
-Internal storage. Engine code uses `TransitionView` instead of accessing this directly.
-
-```rust
-#[repr(C)]
-struct Transition {
-    matcher: Matcher, // 16 bytes (Epsilon variant for epsilon-transitions)
-    pre_anchored: bool, // 1 byte
-    post_anchored: bool, // 1 byte
-    _pad1: [u8; 2], // 2 bytes padding
-    pre_effects: Slice<EffectOp>, // 8 bytes
-    post_effects: Slice<EffectOp>, // 8 bytes
-    ref_marker: RefTransition, // 4 bytes (explicit None variant)
-    next: Slice<TransitionId>, // 8 bytes
-}
-// Size: 48 bytes, Align: 4 bytes
-```
-
-The `TransitionView` resolves `Slice<T>` by combining the graph's segment offset with the slice's start/len fields.
-
-**Design Note**: The `ref_marker` field is intentionally a single `RefTransition` rather than a `Slice<RefTransition>`. This means a transition can carry at most one Enter or Exit marker.
-While this prevents full epsilon elimination for nested reference sequences (e.g., `Enter(A) → Enter(B)`), we accept this limitation for simplicity. Such sequences remain as chains of epsilon transitions in the final graph.
-
-```rust
-type TransitionId = u32;
-type DataFieldId = u16;
-type VariantTagId = u16;
-type RefId = u16;
-```
-
-Each named definition has an entry point. The default entry is the last definition. Multiple entry points share the same transition graph.
-
-#### Matcher
-
-Note: `NodeTypeId` is `u16`. `NodeFieldId` is `NonZeroU16`, which guarantees `Option<NodeFieldId>` uses value `0` for `None`, a stable layout suitable for raw serialization. Both types are defined in `plotnik-core`.
-
-```rust
-#[repr(C, u32)]
-enum Matcher {
-    Epsilon, // no payload
-    Node {
-        kind: NodeTypeId, // 2 bytes
-        field: Option<NodeFieldId>, // 2 bytes
-        negated_fields: Slice<NodeFieldId>, // 8 bytes
-    },
-    Anonymous {
-        kind: NodeTypeId, // 2 bytes
-        field: Option<NodeFieldId>, // 2 bytes
-    },
-    Wildcard,
-    Down,
-    Up,
-}
-// Size: 16 bytes (4-byte discriminant + 12-byte largest variant). Align: 4 bytes
-```
-
-Navigation variants `Down`/`Up` move the cursor without matching. They enable nested patterns like `(function_declaration (identifier) @name)` where we must descend into children.
-
-#### Reference Markers
-
-```rust
-#[repr(C, u8)] // Explicit 1-byte discriminant for stable serialization
-enum RefTransition {
-    None, // no marker (discriminant 0)
-    Enter(RefId), // push ref_id onto return stack (discriminant 1)
-    Exit(RefId), // pop from return stack, must match ref_id (discriminant 2)
-}
-// Size: 4 bytes (1-byte discriminant + 1-byte padding + 2-byte RefId), Align: 2 bytes
-```
-
-The explicit `None` variant (rather than `Option`) ensures stable binary layout for raw arena serialization. `Option` relies on compiler-specific niche optimization whose bit-pattern is unspecified.
-
-Thompson construction creates epsilon transitions with `Enter`/`Exit` markers.
-Epsilon elimination propagates these markers to surviving transitions. At runtime, the engine uses markers to filter which `next` transitions are valid based on return stack state. Multiple transitions can share the same `RefId` after epsilon elimination.
-
-#### Effect Operations
-
-Instructions stored in the transition graph. These are static, `Copy`, and contain no runtime data.
-
-```rust
-#[derive(Clone, Copy)]
-#[repr(C, u8)] // explicit 1-byte discriminant; plain repr(C) would use a C int tag
-enum EffectOp {
-    StartArray, // push new [] onto container stack
-    PushElement, // move current value into top array
-    EndArray, // pop array from stack, becomes current
-    StartObject, // push new {} onto container stack
-    EndObject, // pop object from stack, becomes current
-    Field(DataFieldId), // move current value into field on top object
-    StartVariant(VariantTagId), // push variant tag onto container stack
-    EndVariant, // pop variant from stack, wrap current, becomes current
-    ToString, // convert current Node value to String (source text)
-}
-// Size: 4 bytes (1-byte discriminant + 1-byte padding + 2-byte payload), Align: 2 bytes
-```
-
-Note: There is no `CaptureNode` instruction. Node capture is implicit: a successful match automatically emits `RuntimeEffect::CaptureNode` to the effect stream (see below).
-
-Effects capture structure only: arrays, objects, variants. Type annotations (`:: str`, `:: Type`) are separate metadata applied during post-processing.
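The size and alignment quoted for `EffectOp` can be checked mechanically. The sketch below assumes an explicit `u8` tag (`#[repr(C, u8)]`), which is what the 1-byte discriminant in the size comment implies; with plain `#[repr(C)]` the tag would be a C `int` and the enum would be larger. `effect_op_layout` is a hypothetical helper used only for this check.

```rust
use std::mem::{align_of, size_of};

type DataFieldId = u16;
type VariantTagId = u16;

/// Same variants as the ADR's `EffectOp`, with an explicit `u8` tag (assumed).
#[allow(dead_code)]
#[derive(Clone, Copy)]
#[repr(C, u8)]
enum EffectOp {
    StartArray,
    PushElement,
    EndArray,
    StartObject,
    EndObject,
    Field(DataFieldId),
    StartVariant(VariantTagId),
    EndVariant,
    ToString,
}

/// Returns `(size, align)` of `EffectOp` as computed by the compiler.
fn effect_op_layout() -> (usize, usize) {
    (size_of::<EffectOp>(), align_of::<EffectOp>())
}
```

A compile-time variant of the same check, `const _: () = assert!(size_of::<EffectOp>() == 4);`, would catch accidental layout changes before anything is serialized into the arena.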
-
-##### Effect Placement Rules
-
-After epsilon elimination, effects are classified as pre or post based on when they must execute relative to the match:
-
-| Effect         | Placement | Reason                                     |
-| -------------- | --------- | ------------------------------------------ |
-| `StartArray`   | Pre       | Container must exist before elements added |
-| `StartObject`  | Pre       | Container must exist before fields added   |
-| `StartVariant` | Pre       | Tag must be set before payload captured    |
-| `PushElement`  | Post      | Consumes the just-matched node             |
-| `Field`        | Post      | Consumes the just-matched node             |
-| `EndArray`     | Post      | Finalizes after last element matched       |
-| `EndObject`    | Post      | Finalizes after last field matched         |
-| `EndVariant`   | Post      | Wraps payload after it's captured          |
-| `ToString`     | Post      | Converts the just-matched node to text     |
-
-Pre-effects from incoming epsilon paths accumulate in order. Post-effects from outgoing epsilon paths accumulate in order. This ordering is deterministic and essential for correct data construction.
-
-### Data Construction (Dynamic Interpreter)
-
-This section describes data construction for the dynamic interpreter. Proc-macro codegen uses direct construction instead (see [Direct Construction](#direct-construction-no-effect-stream)).
-
-The interpreter emits events to a linear stream during matching. After a successful match, the stream is executed to build the output.
-
-#### Runtime Effects
-
-Events emitted to the effect stream during interpretation. Unlike `EffectOp`, these carry runtime data.
-
-```rust
-enum RuntimeEffect<'a> {
-    Op(EffectOp), // forwarded instruction from graph
-    CaptureNode(Node<'a>), // emitted implicitly on successful match
-}
-```
-
-The `CaptureNode` variant is never stored in the graph; it is generated by the interpreter when a match succeeds. This separation keeps the graph static (no lifetimes) while allowing the runtime stream to carry actual node references.
-
-#### Effect Stream
-
-```rust
-/// Accumulates runtime effects during matching; supports rollback on backtrack
-struct EffectStream<'a> {
-    effects: Vec<RuntimeEffect<'a>>,
-}
-```
-
-The effect stream accumulates effects linearly during matching. It provides:
-
-- **Effect emission**: Appends `EffectOp` instructions and `CaptureNode` events
-- **Watermarking**: Records position before attempting branches
-- **Rollback**: Truncates to saved position on backtrack
-
-This append-only design makes backtracking trivial: just truncate the vector. No complex undo logic needed.
-
-#### Execution Model
-
-Two separate concepts during effect execution:
-
-1. **Current value**: the last matched node or just-completed container
-2. **Container stack**: objects and arrays being built
-
-```rust
-struct Executor<'a> {
-    current: Option<Value<'a>>, // last matched node or completed container
-    stack: Vec<Container<'a>>, // objects/arrays being built
-}
-
-// Result is the final `current` value after execution completes.
-// This allows returning any value type: Object, Array, Node, String, or Variant.
-
-enum Value<'a> {
-    Node(Node<'a>), // AST node reference
-    String(String), // Text values (from @capture :: string)
-    Array(Vec<Value<'a>>), // completed array
-    Object(BTreeMap<DataFieldId, Value<'a>>), // completed object (BTreeMap for deterministic iteration)
-    Variant(VariantTagId, Box<Value<'a>>), // tagged variant (tag + payload)
-}
-
-enum Container<'a> {
-    Array(Vec<Value<'a>>), // array under construction
-    Object(BTreeMap<DataFieldId, Value<'a>>), // object under construction
-    Variant(VariantTagId), // variant tag; EndVariant wraps current value
-}
-```
-
-Effect semantics on `current`:
-
-- `CaptureNode(node)` → sets `current` to `Value::Node(node)`
-- `Field(id)` → moves `current` into top object, clears to `None`
-- `PushElement` → moves `current` into top array, clears to `None`
-- `End*` → pops container from stack into `current`
-- `ToString` → replaces `current` Node with its source text as String
-
-**Error Handling**
-
-The interpreter assumes the effect stream is well-formed (guaranteed by the query compiler).
-
-- **Panic**: Any operation on invalid state (e.g., `Field` when `current` is `None`, `EndArray` with empty stack, `ToString` on non-Node). These indicate bugs in the IR construction.
-
-#### Execution Pipeline
-
-For any given transition, the execution order is strict to ensure data consistency during backtracking:
-
-1. **Enter**: Push `Frame` with current `effect_stream.watermark()`.
-2. **Pre-Effects**: Emit `pre_effects` as `RuntimeEffect::Op(...)`.
-3. **Match**: Validate node kind/fields. If fail, rollback to watermark and abort.
-4. **Capture**: Emit `RuntimeEffect::CaptureNode(matched_node)` (implicit, not from graph).
-5. **Post-Effects**: Emit `post_effects` as `RuntimeEffect::Op(...)`.
-6. **Exit**: Pop `Frame` (validate return).
-
-This order ensures correct behavior during epsilon elimination. Pre-effects run before the match overwrites `current`, allowing effects like `PushElement` to be safely merged from preceding epsilon transitions.
-Post-effects run after, for effects that need the newly matched node.
-
-The key insight: `CaptureNode` is generated by the interpreter on successful match, not stored as an instruction. The graph only contains structural operations (`EffectOp`); the runtime stream (`RuntimeEffect`) adds the actual node data.
-
-#### Example
-
-Query:
-
-```
-Func = (function_declaration
-  name: (identifier) @name
-  parameters: (parameters (identifier)* @params :: string))
-```
-
-Input: `function foo(a, b) {}`
-
-Runtime effect stream (showing `EffectOp` from graph vs implicit `CaptureNode`):
-
-```
-graph pre:  Op(StartObject)
-implicit:   CaptureNode(foo)     ← from successful match
-graph post: Op(Field("name"))
-graph pre:  Op(StartArray)
-implicit:   CaptureNode(a)       ← from successful match
-graph post: Op(ToString)
-graph post: Op(PushElement)
-implicit:   CaptureNode(b)       ← from successful match
-graph post: Op(ToString)
-graph post: Op(PushElement)
-graph post: Op(EndArray)
-graph post: Op(Field("params"))
-graph post: Op(EndObject)
-```
-
-Note: The graph stores only `EffectOp` instructions. `CaptureNode` events are generated by the interpreter on each successful match; they never appear in `Transition.pre_effects` or `Transition.post_effects`.
-
-In the raw graph, `EffectOp`s live on epsilon transitions between matches. The pre/post classification determines where they land after epsilon elimination. `StartObject` and `StartArray` are pre-effects (setup before matching). `Field`, `PushElement`, `ToString`, and `End*` are post-effects (consume the matched node or finalize containers).
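The effect stream above can be replayed with a small executor that follows the `current` / container-stack model. This is a simplified sketch rather than the interpreter itself: nodes are stood in for by their source text, field names are plain strings instead of `DataFieldId`, variants are omitted, and `Effect`, `execute` and `example_stream` are hypothetical names introduced for illustration.

```rust
use std::collections::BTreeMap;

#[derive(Debug, Clone, PartialEq)]
enum Value {
    Node(String), // stand-in for a tree-sitter node: just its source text
    String(String),
    Array(Vec<Value>),
    Object(BTreeMap<String, Value>),
}

enum Container {
    Array(Vec<Value>),
    Object(BTreeMap<String, Value>),
}

/// Flattened stand-in for `RuntimeEffect`: graph ops plus the implicit capture.
enum Effect {
    StartObject,
    EndObject,
    StartArray,
    EndArray,
    PushElement,
    Field(&'static str),
    ToString,
    CaptureNode(&'static str),
}

/// Replays a well-formed stream; panics on invalid state, as the ADR specifies.
fn execute(stream: &[Effect]) -> Option<Value> {
    let mut current: Option<Value> = None;
    let mut stack: Vec<Container> = Vec::new();
    for effect in stream {
        match effect {
            Effect::StartObject => stack.push(Container::Object(BTreeMap::new())),
            Effect::StartArray => stack.push(Container::Array(Vec::new())),
            Effect::CaptureNode(text) => current = Some(Value::Node(text.to_string())),
            Effect::ToString => {
                current = match current.take() {
                    Some(Value::Node(text)) => Some(Value::String(text)),
                    other => panic!("ToString on non-node: {other:?}"),
                };
            }
            Effect::PushElement => {
                let value = current.take().expect("PushElement without current");
                match stack.last_mut() {
                    Some(Container::Array(items)) => items.push(value),
                    _ => panic!("PushElement without array on stack"),
                }
            }
            Effect::Field(name) => {
                let value = current.take().expect("Field without current");
                match stack.last_mut() {
                    Some(Container::Object(fields)) => {
                        fields.insert(name.to_string(), value);
                    }
                    _ => panic!("Field without object on stack"),
                }
            }
            Effect::EndArray => match stack.pop() {
                Some(Container::Array(items)) => current = Some(Value::Array(items)),
                _ => panic!("EndArray without array on stack"),
            },
            Effect::EndObject => match stack.pop() {
                Some(Container::Object(fields)) => current = Some(Value::Object(fields)),
                _ => panic!("EndObject without object on stack"),
            },
        }
    }
    current
}

/// The effect stream from the `Func` example, in emission order.
fn example_stream() -> Vec<Effect> {
    vec![
        Effect::StartObject,
        Effect::CaptureNode("foo"),
        Effect::Field("name"),
        Effect::StartArray,
        Effect::CaptureNode("a"),
        Effect::ToString,
        Effect::PushElement,
        Effect::CaptureNode("b"),
        Effect::ToString,
        Effect::PushElement,
        Effect::EndArray,
        Effect::Field("params"),
        Effect::EndObject,
    ]
}
```

Running `execute(&example_stream())` produces an object whose `name` field holds the captured node and whose `params` field holds the two strings. Because the stream is a plain vector, backtracking during matching needs nothing more than truncating it before the executor ever runs.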
-
-Execution trace (key steps, second array element omitted):
-
-| RuntimeEffect       | current    | stack           |
-| ------------------- | ---------- | --------------- |
-| Op(StartObject)     | -          | [{}]            |
-| CaptureNode(foo)    | Node(foo)  | [{}]            |
-| Op(Field("name"))   | -          | [{name: Node}]  |
-| Op(StartArray)      | -          | [{...}, []]     |
-| CaptureNode(a)      | Node(a)    | [{...}, []]     |
-| Op(ToString)        | "a"        | [{...}, []]     |
-| Op(PushElement)     | -          | [{...}, ["a"]]  |
-| _(repeat for "b")_  | ...        | ...             |
-| Op(EndArray)        | ["a", "b"] | [{...}]         |
-| Op(Field("params")) | -          | [{..., params}] |
-| Op(EndObject)       | {...}      | []              |
-
-Final result:
-
-```json
-{
-  "name": "",
-  "params": ["a", "b"]
-}
-```
-
-### Backtracking
-
-Two mechanisms work together (same for both execution modes):
-
-1. **Cursor checkpoint**: `cursor.descendant_index()` returns a `usize` position; `cursor.goto_descendant(pos)` restores it. O(1) save, O(depth) restore, no allocation.
-
-2. **Effect watermark**: `effect_stream.watermark()` before attempting a branch; `effect_stream.rollback(watermark)` on failure.
-
-Both execution modes follow the same pattern: save state before attempting a branch; on failure, restore both cursor and effects before trying the next branch. This ensures each alternative starts from the same clean state.
-
-```
-
-### Quantifiers
-
-Quantifiers compile to epsilon transitions with specific `next` ordering:
-
-**Greedy `*`** (zero or more):
-
-```
-
-Entry ─ε→ [try match first, then exit]
-↓
-Match ─ε→ loop back to Entry
-
-```
-
-**Greedy `+`** (one or more):
-
-```
-
- ┌──────────────────────────┐
- ↓ │
-
-Entry ─→ Match ─ε→ Loop ─ε→ [try match first, then exit]
-
-```
-
-The `+` quantifier differs from `*`: it enters directly at `Match`, requiring at least one successful match before the exit path becomes available. After the first match, the `Loop` node behaves like `*` (match-first, exit-second).
-
-**Non-greedy `*?`/`+?`**:
-
-Same structures as above, but with reversed `next` ordering: exit path has priority over match path. For `+?`, after the mandatory first match, the loop prefers exiting over matching more.
-
-### Arrays
-
-Array construction uses epsilon transitions with effects:
-
-```
-
-T0: ε + StartArray next: [T1] // pre-effect: setup array
-T1: ε (branch) next: [T2, T4] // try match or exit
-T2: Match(expr) next: [T3]
-T3: ε + PushElement next: [T1] // post-effect: consume matched node
-T4: ε + EndArray next: [T5] // post-effect: finalize array
-T5: ε + Field("items") next: [...] // post-effect: assign to field
-
-```
-
-After epsilon elimination, `PushElement` from T3 merges into T2 as a post-effect. `StartArray` from T0 merges into T2 as a pre-effect (first iteration only; loop iterations enter from T3, not T0).
-
-Backtracking naturally handles partial arrays: truncating the effect stream removes uncommitted `PushElement` effects.
-
-### Scopes
-
-Nested objects from `{...} @name` use `StartObject`/`EndObject` effects:
-
-```
-
-T0: ε + StartObject next: [T1] // pre-effect: setup object
-T1: ... (sequence contents) next: [T2]
-T2: ε + EndObject next: [T3] // post-effect: finalize object
-T3: ε + Field("name") next: [...] // post-effect: assign to field
-
-```
-
-`StartObject` is a pre-effect (merges forward). `EndObject` and `Field` are post-effects (merge backward onto preceding match).
-
-### Tagged Alternations
-
-Tagged branches use `StartVariant` to create explicit tagged structures.
-
-```
-
-[ A: (true) ]
-
-```
-
-Effect stream:
-
-```
-
-StartVariant("A")
-StartObject
-...
-EndObject
-EndVariant
-
-````
-
-The resulting `Value::Variant` preserves the tag distinct from the payload, preventing name collisions.
-
-**JSON serialization** always uses `$data` wrapper for uniformity:
-
-```json
-{ "$tag": "A", "$data": { "x": 1, "y": 2 } }
-{ "$tag": "B", "$data": [1, 2, 3] }
-{ "$tag": "C", "$data": "foo" }
-````
-
-The `$tag` and `$data` keys avoid collisions with user-defined captures. Uniform structure simplifies parsing (always access `.$data`) and eliminates conditional flatten-vs-wrap logic.
-
-**Nested variants** (variant containing variant) serialize naturally:
-
-```json
-{ "$tag": "Outer", "$data": { "$tag": "Inner", "$data": 42 } }
-```
-
-This mirrors Rust's serde adjacently-tagged representation and remains fully readable for LLMs. No query validation restriction: all payload types are valid.
-
-### Definition References and Recursion
-
-When a pattern references another definition (e.g., `(Expr)` inside `Binary`), the IR uses `RefId` to identify the call site. Each `Ref` node in the query AST gets a unique `RefId`, which is preserved through epsilon elimination.
-
-```
-Expr = [ (Num) (Binary) ]
-Binary = (binary_expression
-  left: (Expr) // RefId = 0
-  right: (Expr)) // RefId = 1
-```
-
-The `RefId` is semantic identity ("which reference in the query pattern"), distinct from `TransitionId`, which is structural identity ("which slot in the transition array").
-
-**Why RefId matters**: Epsilon elimination creates multiple transitions from a single reference. If a reference has 2 input epsilon paths and 3 output epsilon paths, elimination produces 2×3 = 6 transitions. All share the same `RefId` because they represent the same call site. The return stack uses `RefId` so that:
-
-- Entry can occur via any input path
-- Exit can continue via any output path
-
-**Proc macro**: Each definition becomes a Rust function. References become function calls. Rust's call stack serves as the return stack; `RefId` is implicit in the call site.
-
-In proc-macro mode, each definition becomes a Rust function.
-References become direct function calls, with the Rust call stack serving as the implicit return stack. The `RefId` exists only in the IR; the generated code relies on Rust's natural call/return mechanism.
-
-**Dynamic**: The interpreter maintains an explicit return stack. On `Enter(ref_id)`:
-
-1. Push frame with `ref_id`, cursor checkpoint, effect stream watermark
-2. Follow `next` into the definition body
-
-On `Exit(ref_id)`:
-
-1. Verify top frame matches `ref_id` (invariant: mismatched ref_id indicates IR bug)
-2. Pop frame
-3. Continue to `next` successors unconditionally
-
-**Entry filtering mechanism**: After epsilon elimination, multiple `Exit` transitions with different `RefId`s may be reachable from the same point (merged from different call sites). The interpreter only takes an `Exit(ref_id)` transition if `ref_id` matches the current stack top. This ensures returns go to the correct call site.
-
-After taking an `Exit` and popping the frame, successors are followed unconditionally; they represent the continuation after the call. If a successor has an `Enter` marker, that's a _new_ call (e.g., `(A) (B)` where returning from A continues to calling B), not a return path.
-
-```rust
-/// Return stack entry for definition calls
-struct Frame {
-    ref_id: RefId, // which call site we're inside
-    cursor_checkpoint: usize, // cursor position before call
-    effect_stream_watermark: usize, // effect count before call
-}
-
-/// Runtime query executor
-struct Interpreter<'a> {
-    graph: &'a TransitionGraph,
-    return_stack: Vec<Frame>, // call stack for definition references
-    cursor: TreeCursor<'a>, // current position in AST
-    effect_stream: EffectStream<'a>, // effect accumulator
-}
-```
-
-### Epsilon Elimination (Optimization)
-
-After initial construction, epsilon transitions can be **partially** eliminated by computing epsilon closures.
-Full elimination is not always possible due to the single `ref_marker` limitation: sequences like `Enter(A) → Enter(B)` cannot be merged into one transition. The `pre_effects`/`post_effects` split is essential for correctness here.
-
-**Why the split matters**: A match transition overwrites `current` with the matched node. Effects from _preceding_ epsilon transitions (like `PushElement`) need the _previous_ `current` value. Without the split, merging them into a single post-match list would use the wrong value.
-
-```
-Before (raw graph):
-T1: Match(A) next: [T2] // current = A
-T2: ε + PushElement next: [T3] // pushes A (correct)
-T3: Match(B) next: [...] // current = B
-
-After elimination (with split):
-T3': pre: [PushElement], Match(B), post: [] // PushElement runs before Match(B), pushes A ✓
-
-Wrong (without split, effects merged as post):
-T3': Match(B) + [PushElement] // PushElement runs after Match(B), pushes B ✗
-```
-
-**Accumulation rules**:
-
-- `EffectOp`s from incoming epsilon paths → accumulate into `pre_effects`
-- `EffectOp`s from outgoing epsilon paths → accumulate into `post_effects`
-
-This is why both are `Slice<EffectOp>` rather than `Option<EffectOp>`.
-
-**Reference expansion**: For definition references, epsilon elimination propagates `Enter`/`Exit` markers to surviving transitions:
-
-```
-Before:
-T0: ε next: [T1]
-T1: ε + Enter(0) next: [T2] // enter Expr
-T2: ... (Expr body) ... next: [T3]
-T3: ε + Exit(0) next: [T4] // exit Expr
-T4: ε next: [T5]
-
-After:
-T0': Match(...) + Enter(0) next: [T2'] // marker propagated
-T3': Match(...) + Exit(0) next: [T5'] // marker propagated
-```
-
-All expanded entry transitions share the same `RefId`. All expanded exit transitions share the same `RefId`. The engine filters valid continuations at runtime based on stack state; no explicit continuation storage is needed.
-
-**Limitation**: Complete epsilon elimination is impossible when reference markers chain (e.g., nested calls).
The single `ref_marker` slot prevents merging `Enter(A) → Enter(B)` sequences. These remain as epsilon transition chains in the final graph. - -This optimization benefits both modes: - -- **Proc macro**: Fewer transitions → less generated code (where elimination is possible) -- **Dynamic**: Fewer graph traversals → faster interpretation (but must handle remaining epsilons) - -### Proc Macro Code Generation - -When used as a proc macro, the transition graph is a compile-time artifact: - -1. Parses query source at compile time -2. Builds transition graph (Thompson-style construction) -3. Optionally eliminates epsilons -4. Generates Rust functions for each definition - -Generated code uses: - -- `if`/`else` chains for alternations -- `while` loops for quantifiers -- Direct function calls for definition references -- `TreeCursor` navigation methods -- `descendant_index()`/`goto_descendant()` for backtracking - -At runtime, there is no graph—just plain Rust code. - -#### Direct Construction (No Effect Stream) - -Unlike the dynamic interpreter, proc-macro generated code constructs output values directly—no intermediate effect stream. Output structs are built in a single pass as matching proceeds. - -Backtracking in direct construction means dropping partially-built values and re-allocating. This is acceptable because modern allocators maintain thread-local free lists, making the alloc→drop→alloc pattern for small objects essentially O(1). - -### Dynamic Execution - -When used dynamically, the transition graph is interpreted at runtime: - -1. Parses query source at runtime -2. Builds transition graph -3. Optionally eliminates epsilons (can be skipped for faster startup) -4. 
Interpreter walks the graph, executing transitions - -The interpreter maintains: - -- Current transition pointer -- Explicit return stack for definition calls -- Cursor position -- `RuntimeEffect` stream with watermarks - -Unlike proc-macro codegen, the dynamic interpreter uses the `RuntimeEffect` stream approach. This is necessary because: - -- We don't know the output structure at compile time -- `RuntimeEffect` stream provides a uniform way to build any output shape -- Backtracking via `truncate()` is simple and correct - -Trade-off: More flexible (runtime query construction), but slower than generated code due to interpretation overhead and the extra effect execution pass. - -## Execution Mode Comparison - -| Aspect | Proc Macro | Dynamic | -| ----------------- | -------------------------- | ----------------------------- | -| Query source | Compile-time literal | Runtime string | -| Graph lifetime | Compile-time only | Runtime | -| Data construction | Direct (no effect stream) | `RuntimeEffect` stream + exec | -| Definition calls | Rust function calls | Explicit return stack | -| Return stack | Rust call stack | `Vec` | -| Backtracking | Drop + re-alloc | `truncate()` effects | -| Performance | Zero dispatch, single pass | Interpretation + 2 pass | -| Type safety | Compile-time checked | Runtime types | -| Use case | Known queries | User-provided queries | - -## Consequences - -### Positive - -- **Shared IR**: One representation serves both execution modes -- **Proc macro zero-overhead**: Generated code is plain Rust with no dispatch -- **Pre-allocated graph**: Single contiguous allocation -- **Dynamic flexibility**: Queries can be constructed or modified at runtime -- **Optimizable**: Epsilon elimination benefits both modes -- **Multiple entry points**: Same graph supports querying any definition -- **Clean separation**: `EffectOp` (static instructions) vs `RuntimeEffect` (dynamic events) eliminates lifetime issues - -### Negative - -- **Two code paths**: Must 
maintain both codegen and interpreter -- **Different data construction**: Proc macro uses direct construction, dynamic uses `RuntimeEffect` stream -- **Proc macro compile cost**: Complex queries generate more code -- **Dynamic runtime cost**: Interpretation overhead + effect execution pass -- **Testing burden**: Must verify both modes produce identical results - -### Runtime Safety - -Both execution modes require fuel mechanisms to prevent runaway execution: - -- **runtime_fuel**: Decremented on each transition, prevents infinite loops -- **recursion_fuel**: Decremented on each `Enter` marker, prevents stack overflow - -These mechanisms deserve their own ADR (fuel budget design, configurable limits, error reporting on exhaustion). The IR itself carries no fuel-related data—fuel checking is purely an interpreter/codegen concern. - -**Note**: Static loop detection (e.g., direct recursion like `A = (A)` or mutual recursion like `A = (B)`, `B = (A)`) is handled at the query parser level before IR construction. The IR assumes well-formed input without infinite loops in the pattern structure itself. - -### WASM Compatibility - -The IR design is WASM-compatible: - -- **Single arena allocation**: No fragmentation concerns in linear memory. Note: WASM linear memory grows in 64KB pages; the arena coexists with other allocations (e.g., tree-sitter's memory) but this is standard for any WASM allocation. -- **Explicit alignment**: Arena allocated with `std::alloc::Layout`, segment offsets computed with `align_up()`. Prevents misaligned access traps on WASM and strict ARM. -- **`u32` offsets**: All segment offsets are `u32`, matching WASM32's pointer size. 4GB arena limit is sufficient for any query. -- **`BTreeMap` for objects**: Deterministic iteration order ensures reproducible output across platforms. -- **Fixed-size Entrypoints**: The `Entrypoint` struct (12 bytes, align 4) avoids variable-length inline strings that would cause alignment hazards. 
-- **No platform-specific primitives**: All types are portable (`u16`, `u32`, byte arrays). -- **Allocator Independence**: Uses `std::alloc::alloc` via `Layout`. On `wasm32-unknown-unknown`, this defaults to the system allocator. Implementers targeting other environments (e.g., Emscripten) must ensure a global allocator is configured. - -#### Serialization Format - -The arena uses a simple binary format for caching compiled queries to disk. The current scope is limited to same-machine, same-version usage (e.g., caching a compiled query between CLI invocations). Cross-architecture portability and version migration are explicitly out of scope for this ADR and will be addressed in future work if needed. - -- **Validation**: The `magic` bytes must be `b"PLNK"`. The `version` field must match the exact compiler version AND platform ABI hash (pointer width + endianness). Any mismatch invalidates the cache. -- **Byte order**: Native (little-endian on x86/ARM/WASM). No byte-swapping is performed. -- **String encoding**: UTF-8 for all string data (entrypoint names, data field names, variant tags). -- **Layout**: Header followed by raw arena bytes: - -``` -Header (16 bytes): - magic: [u8; 4] // "PLNK" - version: u32 // format version (must match exactly) - arena_len: u32 // byte length of arena data - segment_count: u32 // number of segment offset entries - -Segment Offsets (segment_count × 4 bytes): - [u32; segment_count] // successors_offset, effects_offset, ... - -Arena Data (arena_len bytes): - [u8; arena_len] // raw arena bytes, used directly without fixup -``` - -**Loading**: The loader verifies magic, version, and arena length. If any mismatch occurs, the cache is invalidated and the query is recompiled. No byte-swapping or layout fixup is performed—mismatched architectures simply trigger recompilation. - -### Considered Alternatives - -1. **Proc macro only** - - Rejected: Need runtime query support for tooling and user-defined queries - -2. 
**Dynamic only** - - Rejected: Unacceptable performance overhead for known queries - -3. **Separate IRs for each mode** - - Rejected: Duplication; harder to ensure semantic equivalence - -4. **State-centric graph representation** - - Rejected: States carry no semantic weight; edge-centric is simpler - -5. **Vectorized Reference Markers (`Vec`)** - - Rejected: Optimized for alias chains (e.g. `A = B`, `B = C`) to allow full epsilon elimination. However, this bloats the `Transition` struct for all other cases. Standard epsilon elimination is sufficient; traversing a few remaining epsilon transitions for aliases is cheaper than increasing memory pressure on the whole graph. - -6. **Portable binary format** - - Deferred: Cross-architecture serialization would require byte-swapping and layout fixups. Current scope is same-machine caching only; portability can be added later if needed. - -## References - -- Bazaco, D. (2022). "Building a Regex Engine" blog series. https://www.abstractsyntaxseed.com/blog/regex-engine/introduction — NFA construction and modern regex features -- Tree-sitter TreeCursor API: `descendant_index()`, `goto_descendant()` -- [ADR-0001: Query Parser](ADR-0001-query-parser.md) diff --git a/docs/adr/ADR-0004-query-ir-binary-format.md b/docs/adr/ADR-0004-query-ir-binary-format.md new file mode 100644 index 00000000..d6e3da95 --- /dev/null +++ b/docs/adr/ADR-0004-query-ir-binary-format.md @@ -0,0 +1,148 @@ +# ADR-0004: Query IR Binary Format + +- **Status**: Accepted +- **Date**: 2025-12-12 +- **Supersedes**: Parts of ADR-0003 + +## Context + +The Query IR lives in a single contiguous allocation—cache-friendly, zero fragmentation, portable to WASM. This ADR defines the binary layout. Graph structures are in [ADR-0005](ADR-0005-transition-graph-format.md). 
+ +## Decision + +### Container + +```rust +struct QueryIR { + ir_buffer: QueryIRBuffer, + successors_offset: u32, + effects_offset: u32, + negated_fields_offset: u32, + string_refs_offset: u32, + string_bytes_offset: u32, + type_info_offset: u32, + entrypoints_offset: u32, +} +``` + +Transitions start at offset 0. Default entrypoint is always at offset 0. + +### QueryIRBuffer + +```rust +const BUFFER_ALIGN: usize = 64; // cache-line alignment for transitions + +struct QueryIRBuffer { + ptr: *mut u8, + len: usize, +} +``` + +Allocated via `Layout::from_size_align(len, BUFFER_ALIGN)`. Standard `Box<[u8]>` won't work: it assumes 1-byte alignment, so freeing it would call `dealloc` with a layout that does not match the allocation. The 64-byte alignment ensures transitions never straddle cache lines. + +### Segments + +| Segment | Type | Offset | Align | +| -------------- | ------------------- | ----------------------- | ----- | +| Transitions | `[Transition; N]` | 0 | 64 | +| Successors | `[TransitionId; M]` | `successors_offset` | 4 | +| Effects | `[EffectOp; P]` | `effects_offset` | 2 | +| Negated Fields | `[NodeFieldId; Q]` | `negated_fields_offset` | 2 | +| String Refs | `[StringRef; R]` | `string_refs_offset` | 4 | +| String Bytes | `[u8; S]` | `string_bytes_offset` | 1 | +| Type Info | `[TypeInfo; U]` | `type_info_offset` | 4 | +| Entrypoints | `[Entrypoint; T]` | `entrypoints_offset` | 4 | + +Each offset is aligned: `(offset + align - 1) & !(align - 1)`. + +### Strings + +Single pool for all strings (field names, variant tags, entrypoint names): + +```rust +#[repr(C)] +struct StringRef { + offset: u32, // into string_bytes + len: u16, + _pad: u16, +} + +#[repr(C)] +struct Entrypoint { + name_id: u16, // into string_refs + _pad: u16, + target: TransitionId, +} +``` + +`DataFieldId(u16)` and `VariantTagId(u16)` index into `string_refs`. Distinct types, same table. + +Strings are interned during construction, so identical strings share storage and ID.
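The interning rule can be sketched as follows. This is a minimal illustration, assuming a hypothetical builder-side `StringInterner` helper that is not part of the ADR; it only mirrors the shape of the `string_refs` and `string_bytes` tables with plain vectors.

```rust
use std::collections::HashMap;

/// Hypothetical builder-side helper; the ADR only specifies the resulting tables.
#[derive(Default)]
struct StringInterner {
    ids: HashMap<String, u16>, // string -> index into `refs`
    refs: Vec<(u32, u16)>,     // (offset, len) pairs, mirroring `StringRef`
    bytes: Vec<u8>,            // concatenated UTF-8, mirroring `string_bytes`
}

impl StringInterner {
    /// Returns the existing ID for `s`, or appends `s` to the pool.
    fn intern(&mut self, s: &str) -> u16 {
        if let Some(&id) = self.ids.get(s) {
            return id;
        }
        let id = self.refs.len() as u16;
        self.refs.push((self.bytes.len() as u32, s.len() as u16));
        self.bytes.extend_from_slice(s.as_bytes());
        self.ids.insert(s.to_owned(), id);
        id
    }

    /// Resolves an ID back to its text.
    fn resolve(&self, id: u16) -> &str {
        let (offset, len) = self.refs[id as usize];
        let start = offset as usize;
        std::str::from_utf8(&self.bytes[start..start + len as usize])
            .expect("pool only holds whole UTF-8 strings")
    }
}
```

Interning `"name"` twice yields the same ID and stores the bytes once, which is the property the `@name` captures in the example query rely on.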
+ +### Serialization + +``` +Header (44 bytes): + magic: [u8; 4] b"PLNK" + version: u32 format version + ABI hash + checksum: u32 CRC32(offsets || buffer_data) + buffer_len: u32 + successors_offset: u32 + effects_offset: u32 + negated_fields_offset: u32 + string_refs_offset: u32 + string_bytes_offset: u32 + type_info_offset: u32 + entrypoints_offset: u32 + +Buffer Data (buffer_len bytes) +``` + +Little-endian always. UTF-8 strings. Version mismatch or checksum failure → recompile. + +### Construction + +Three passes: + +1. **Analysis**: Count elements, intern strings +2. **Layout**: Compute aligned offsets, allocate once +3. **Emission**: Write via `ptr::write` + +No `realloc`. + +### Example + +Query: + +``` +Func = (function_declaration name: (identifier) @name) +Expr = [ Ident: (identifier) @name Num: (number) @value ] +``` + +Buffer layout: + +``` +0x0000 Transitions [T0, T1, T2, ...] +0x0180 Successors [1, 2, 3, ...] +0x0200 Effects [StartObject, Field(0), ...] +0x0280 Negated Fields [] +0x0280 String Refs [{0,4}, {4,5}, {9,5}, ...] +0x02C0 String Bytes "namevalueIdentNumFuncExpr" +0x0300 Type Info [...] +0x0340 Entrypoints [{4, T0}, {5, T3}] +``` + +`"name"` stored once, used by both `@name` captures. + +## Consequences + +**Positive**: Cache-efficient, O(1) string lookup, zero-copy access, simple validation. + +**Negative**: Format changes require rebuild. No version migration. + +**WASM**: Explicit alignment prevents traps. `u32` offsets fit WASM32. 
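The alignment rule from the Segments section, applied during the layout pass, might look like the sketch below. The `layout` function and its `(size, align)` input are assumptions made for illustration; only the `align_up` formula comes from the ADR.

```rust
/// Rounds `offset` up to the next multiple of `align` (which must be a power of two).
const fn align_up(offset: u32, align: u32) -> u32 {
    (offset + align - 1) & !(align - 1)
}

/// Hypothetical layout pass: takes each segment's byte size and alignment in
/// order, returns the segment start offsets and the total buffer length.
fn layout(segments: &[(u32, u32)]) -> (Vec<u32>, u32) {
    let mut offsets = Vec::with_capacity(segments.len());
    let mut cursor = 0u32;
    for &(size, align) in segments {
        cursor = align_up(cursor, align);
        offsets.push(cursor);
        cursor += size;
    }
    (offsets, cursor)
}
```

Because every offset is computed before the single allocation, an empty segment simply shares its offset with the next one, as `Negated Fields` and `String Refs` do in the example layout.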
+ +## References + +- [ADR-0005: Transition Graph Format](ADR-0005-transition-graph-format.md) +- [ADR-0006: Dynamic Query Execution](ADR-0006-dynamic-query-execution.md) diff --git a/docs/adr/ADR-0005-transition-graph-format.md b/docs/adr/ADR-0005-transition-graph-format.md new file mode 100644 index 00000000..8da2807b --- /dev/null +++ b/docs/adr/ADR-0005-transition-graph-format.md @@ -0,0 +1,310 @@ +# ADR-0005: Transition Graph Format + +- **Status**: Accepted +- **Date**: 2025-12-12 +- **Supersedes**: Parts of ADR-0003 + +## Context + +Edge-centric IR: transitions carry all semantics (matching, effects, successors). States are implicit junction points. The result is a recursive transition network: an NFA with call/return for definition references. + +## Decision + +### Types + +```rust +type TransitionId = u32; +type NodeTypeId = u16; // from tree-sitter, do not change +type NodeFieldId = NonZeroU16; // from tree-sitter, Option<NodeFieldId> uses 0 for None +type DataFieldId = u16; +type VariantTagId = u16; +type RefId = u16; +``` + +### Slice + +Relative range within a segment: + +```rust +#[repr(C)] +struct Slice<T> { + start: u32, + len: u32, + _phantom: PhantomData<T>, +} +``` + +### Transition + +```rust +#[repr(C, align(64))] +struct Transition { + // --- 40 bytes metadata --- + matcher: Matcher, // 16 + pre_anchored: bool, // 1 + post_anchored: bool, // 1 + _pad1: [u8; 2], // 2 + pre_effects: Slice<EffectOp>, // 8 + post_effects: Slice<EffectOp>, // 8 + ref_marker: RefTransition, // 4 + + // --- 24 bytes control flow --- + successor_count: u32, // 4 + successor_data: [u32; 5], // 20 +} +// 64 bytes, align 64 (cache-line aligned) +``` + +Single `ref_marker` slot: sequences like `Enter(A) → Enter(B)` remain as epsilon chains.
+ +### Inline Successors (SSO-style) + +Successors use a small-size optimization to avoid indirection for the common case: + +| `successor_count` | Layout | +| ----------------- | ------------------------------------------------------------------------------------ | +| 0–5 | `successor_data[0..count]` contains `TransitionId` values directly | +| > 5 | `successor_data[0]` is offset into `successors` segment, `successor_count` is length | + +Why 5 slots: 24 available bytes / 4 bytes per `TransitionId` = 6 slots, minus 1 for the count field leaves 5. + +Coverage: + +- Linear sequences: 1 successor +- Simple branches, quantifiers: 2 successors +- Most alternations: 2–5 branches + +Only massive alternations (6+ branches) spill to the external buffer. + +Cache benefits: + +- 64 bytes = L1 cache line on x86/ARM64 +- No transition straddles cache lines +- No pointer chase for 99%+ of transitions + +### Matcher + +```rust +#[repr(C, u32)] +enum Matcher { + Epsilon, + Node { + kind: NodeTypeId, // 2 + field: Option<NodeFieldId>, // 2 + negated_fields: Slice<NodeFieldId>, // 8 + }, + Anonymous { + kind: NodeTypeId, // 2 + field: Option<NodeFieldId>, // 2 + negated_fields: Slice<NodeFieldId>, // 8 + }, + Wildcard, + Down, // cursor to first child + Up, // cursor to parent +} +// 16 bytes, align 4 +``` + +`Option<NodeFieldId>` uses 0 for `None` (niche optimization). + +### RefTransition + +```rust +#[repr(C, u8)] +enum RefTransition { + None, + Enter(RefId), // push call frame with returns + Exit(RefId), // pop frame, use stored returns +} +// 4 bytes, align 2 +``` + +Explicit `None` ensures stable binary layout (`Option` niche is unspecified). + +### Enter/Exit Semantics + +**Problem**: A definition can be called from multiple sites. Naively, `Exit.next` would contain all possible return points from all call sites, requiring O(N) filtering at runtime to find which return is valid for the current call. + +**Solution**: Store return transitions at `Enter` time (in the call frame), retrieve at `Exit` time. O(1) exit, no filtering.
+ +For `Enter(ref_id)` transitions, `successor_data` has special structure: + +- `successor_data[0]`: definition entry point (where to jump) +- `successor_data[1..count]`: return transitions (stored in call frame) + +For `Exit(ref_id)` transitions, successors are **ignored**. Return transitions come from the call frame pushed at `Enter`. See [ADR-0006](ADR-0006-dynamic-query-execution.md) for execution details. + +``` +Call site: +T1: ε + Enter(Func) successors=[T10, T2, T3] + │ └─────┴─── return transitions (stored in frame) + └─────────────── definition entry +``` + +``` +Definition: +T10: Match(...) successors=[T11] +T11: ε + Exit(Func) successors=[] (ignored, returns from frame) +``` + +### EffectOp + +```rust +#[repr(C, u16)] +enum EffectOp { + StartArray, + PushElement, + EndArray, + StartObject, + EndObject, + Field(DataFieldId), + StartVariant(VariantTagId), + EndVariant, + ToString, +} +// 4 bytes, align 2 +``` + +No `CaptureNode`: it is implicit on successful match. + +### Effect Placement + +| Effect | Placement | Why | +| -------------- | --------- | -------------------------- | +| `StartArray` | Pre | Container before elements | +| `StartObject` | Pre | Container before fields | +| `StartVariant` | Pre | Tag before payload | +| `PushElement` | Post | Consumes matched node | +| `Field` | Post | Consumes matched node | +| `End*` | Post | Finalizes after last match | +| `ToString` | Post | Converts matched node | + +### View Types + +```rust +struct TransitionView<'a> { + query_ir: &'a QueryIR, + raw: &'a Transition, +} + +struct MatcherView<'a> { + query_ir: &'a QueryIR, + raw: &'a Matcher, +} + +enum MatcherKind { Epsilon, Node, Anonymous, Wildcard, Down, Up } +``` + +Views resolve `Slice<T>` to `&[T]`. `TransitionView::successors()` returns `&[TransitionId]`, hiding the inline/spilled distinction: callers see a uniform slice regardless of storage location. Engine code never touches offsets or `successor_data` directly.
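How a view might hide the inline/spilled distinction can be sketched as below. This is an illustration under stated assumptions, not the actual `TransitionView` implementation: `Successors` is a hypothetical stand-in holding only the control-flow fields, the `spilled` parameter stands in for the `successors` segment, and the spilled offset is treated as an element index (the ADR does not say whether it is an element index or a byte offset).

```rust
type TransitionId = u32;

const INLINE_SLOTS: usize = 5;

/// Hypothetical stand-in for the control-flow half of `Transition`.
struct Successors {
    successor_count: u32,
    successor_data: [u32; 5],
}

impl Successors {
    /// Returns a uniform slice whether the successors are inline or spilled.
    /// Assumption: `successor_data[0]` is an element index into `spilled`.
    fn resolve<'a>(&'a self, spilled: &'a [TransitionId]) -> &'a [TransitionId] {
        let count = self.successor_count as usize;
        if count <= INLINE_SLOTS {
            // Inline case: the IDs live directly in the transition.
            &self.successor_data[..count]
        } else {
            // Spilled case: slot 0 locates the run in the external segment.
            let start = self.successor_data[0] as usize;
            &spilled[start..start + count]
        }
    }
}
```

Callers iterate the returned slice the same way in both cases, so the branch on `successor_count` stays confined to one place.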
+ +### Quantifiers + +**Greedy `*`**: + +``` + ┌─────────────────┐ + ↓ │ +Entry ─ε→ Branch ─ε→ Match ─┘ + │ + └─ε→ Exit + +Branch.next = [match, exit] +``` + +**Greedy `+`**: + +``` + ┌─────────────────┐ + ↓ │ +Entry ─→ Match ─ε→ Branch ─┘ + │ + └─ε→ Exit + +Branch.next = [match, exit] +``` + +**Non-greedy `*?`/`+?`**: Same, but `Branch.next = [exit, match]`. + +### Example: Array + +Query: `(parameters (identifier)* @params)` + +Before elimination: + +``` +T0: ε + StartArray → [T1] +T1: ε (branch) → [T2, T4] +T2: Match(identifier) → [T3] +T3: ε + PushElement → [T1] +T4: ε + EndArray → [T5] +T5: ε + Field("params") → [...] +``` + +After: + +``` +T2': pre:[StartArray] Match(identifier) post:[PushElement] → [T2', T4'] +T4': post:[EndArray, Field("params")] → [...] +``` + +First iteration gets `StartArray` from T0's path. Loop iterations skip it. + +### Example: Object + +Query: `{ (identifier) @name (number) @value } @pair` + +``` +T0: ε + StartObject → [T1] +T1: Match(identifier) → [T2] +T2: ε + Field("name") → [T3] +T3: Match(number) → [T4] +T4: ε + Field("value") → [T5] +T5: ε + EndObject → [T6] +T6: ε + Field("pair") → [...] +``` + +### Example: Tagged Alternation + +Query: `[ A: (true) @val B: (false) @val ]` + +``` +T0: ε (branch) → [T1, T4] +T1: ε + StartVariant("A") → [T2] +T2: Match(true) → [T3] +T3: ε + Field("val") + EndVariant → [T7] +T4: ε + StartVariant("B") → [T5] +T5: Match(false) → [T6] +T6: ε + Field("val") + EndVariant → [T7] +``` + +### Epsilon Elimination + +Partial—full elimination impossible due to single `ref_marker`. + +Why pre/post split matters: + +``` +Before: +T1: Match(A) → [T2] // current = A +T2: ε + PushElement → [T3] // push A ✓ +T3: Match(B) → [...] // current = B + +After (correct): +T3': pre:[PushElement] Match(B) // push A, then match B ✓ + +Wrong (no split): +T3': Match(B) post:[PushElement] // match B, push B ✗ +``` + +Incoming epsilon effects → `pre_effects`. Outgoing → `post_effects`. 
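The ordering argument above can be checked with a toy model. The sketch below is an assumption-laden simplification: `Step` and `pushed_nodes` are hypothetical names, nodes are plain `char`s, and the only effect modeled is `PushElement`, placed either before or after the match.

```rust
/// Toy transition: one match, with an optional `PushElement` before or after it.
struct Step {
    pre_push: bool,  // `PushElement` in pre_effects
    node: char,      // node this transition matches
    post_push: bool, // `PushElement` in post_effects
}

/// Returns the nodes that `PushElement` pushed, in order.
fn pushed_nodes(steps: &[Step]) -> Vec<char> {
    let mut current: Option<char> = None;
    let mut pushed = Vec::new();
    for step in steps {
        if step.pre_push {
            pushed.extend(current); // sees the previous match
        }
        current = Some(step.node); // a match overwrites `current`
        if step.post_push {
            pushed.extend(current); // sees this match
        }
    }
    pushed
}
```

With `Match(A)` followed by `Match(B)`, a pre-placed `PushElement` on the second step pushes `A`, while a post-placed one pushes `B`, matching the correct and wrong cases shown in the diagram.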
+ +## Consequences + +**Positive**: No state objects. Cache-line aligned 64-byte transitions eliminate cache straddling. Inline successors remove pointer chasing for common cases. Views hide offset arithmetic and inline/spilled distinction. + +**Negative**: Single `ref_marker` leaves some epsilon chains. 33% size increase over minimal layout (acceptable for KB-scale query binaries). + +## References + +- [ADR-0004: Query IR Binary Format](ADR-0004-query-ir-binary-format.md) +- [ADR-0006: Dynamic Query Execution](ADR-0006-dynamic-query-execution.md) diff --git a/docs/adr/ADR-0006-dynamic-query-execution.md b/docs/adr/ADR-0006-dynamic-query-execution.md new file mode 100644 index 00000000..160deaca --- /dev/null +++ b/docs/adr/ADR-0006-dynamic-query-execution.md @@ -0,0 +1,198 @@ +# ADR-0006: Dynamic Query Execution + +- **Status**: Accepted +- **Date**: 2025-12-12 +- **Supersedes**: Parts of ADR-0003 + +## Context + +Runtime interpretation of the transition graph ([ADR-0005](ADR-0005-transition-graph-format.md)). Proc-macro compilation is a future ADR. + +## Decision + +### Execution Order + +For each transition: + +1. Emit `pre_effects` +2. Match (epsilon always succeeds) +3. On success: emit `CaptureNode`, emit `post_effects` +4. Process successors with backtracking + +### Effect Stream + +```rust +struct EffectStream<'a> { + effects: Vec<RuntimeEffect<'a>>, // append-only, backtrack via truncate +} + +enum RuntimeEffect<'a> { + Op(EffectOp), + CaptureNode(Node<'a>), // implicit on match, never in IR +} +``` + +### Executor + +Converts effect stream to output value.
+ +```rust +struct Executor<'a> { + current: Option>, + stack: Vec>, +} + +enum Value<'a> { + Node(Node<'a>), + String(String), + Array(Vec>), + Object(BTreeMap>), + Variant(VariantTagId, Box>), +} + +enum Container<'a> { + Array(Vec>), + Object(BTreeMap>), + Variant(VariantTagId), +} +``` + +| Effect | Action | +| ------------------- | ------------------------------------ | +| `CaptureNode(n)` | `current = Node(n)` | +| `StartArray` | push `Array([])` onto stack | +| `PushElement` | move `current` into top array | +| `EndArray` | pop array into `current` | +| `StartObject` | push `Object({})` onto stack | +| `Field(id)` | move `current` into top object field | +| `EndObject` | pop object into `current` | +| `StartVariant(tag)` | push `Variant(tag)` onto stack | +| `EndVariant` | pop, wrap `current`, set as current | +| `ToString` | replace `current` Node with text | + +Invalid state = IR bug → panic. + +### Interpreter + +```rust +struct Interpreter<'a> { + query_ir: &'a QueryIR, + backtrack_stack: BacktrackStack, + recursion_stack: RecursionStack, + cursor: TreeCursor<'a>, // created at tree root, never reset + effects: EffectStream<'a>, +} +``` + +**Cursor constraint**: The cursor must be created once at the tree root and never call `reset()`. This preserves `descendant_index` validity for backtracking checkpoints. + +Two stacks interact: backtracking can restore to a point inside a previously-exited call, so the recursion stack must preserve frames. 
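Why the recursion stack must preserve frames can be seen in a small sketch of the append-only design described in the Recursion section. The types below are simplified assumptions: `returns` is a `Vec` instead of a slice into the buffer so the example stands alone, and the frame index type is assumed to be `u32`.

```rust
type TransitionId = u32;
type RefId = u16;

struct CallFrame {
    parent: Option<u32>,        // index of the caller's frame
    ref_id: RefId,              // checked on Exit
    returns: Vec<TransitionId>, // assumption: owned here, a slice in the real IR
}

#[derive(Default)]
struct RecursionStack {
    frames: Vec<CallFrame>, // append-only, never truncated
    current: Option<u32>,   // index into `frames`, not a depth
}

impl RecursionStack {
    /// Pushes a frame whose parent is the current frame and makes it current.
    fn enter(&mut self, ref_id: RefId, returns: Vec<TransitionId>) {
        self.frames.push(CallFrame { parent: self.current, ref_id, returns });
        self.current = Some(self.frames.len() as u32 - 1);
    }

    /// Moves `current` to the parent and yields the stored return transitions.
    /// The exited frame stays in `frames`, so a saved index remains valid.
    fn exit(&mut self, ref_id: RefId) -> &[TransitionId] {
        let index = self.current.expect("Exit without matching Enter") as usize;
        let frame = &self.frames[index];
        assert_eq!(frame.ref_id, ref_id, "mismatched Exit indicates an IR bug");
        self.current = frame.parent;
        &frame.returns
    }
}
```

A backtrack point only needs to save and restore `current`; since frames are never removed, restoring an older index can re-enter a call that has already been exited.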
+ +### Backtracking + +```rust +struct BacktrackStack { + points: Vec, +} + +struct BacktrackPoint { + cursor_checkpoint: u32, // tree-sitter descendant_index + effect_watermark: u32, + recursion_frame: Option, // saved frame index + alternatives: Slice, +} +``` + +| Operation | Action | +| --------- | ------------------------------------------------------ | +| Save | `cursor_checkpoint = cursor.descendant_index()` — O(1) | +| Restore | `cursor.goto_descendant(cursor_checkpoint)` — O(depth) | + +Restore also truncates `effects` to `effect_watermark` and sets `recursion_stack.current` to `recursion_frame`. + +### Recursion + +**Problem**: A definition can be called from N sites. Naively, Exit's successors contain all N return points, requiring O(N) filtering. + +**Solution**: Store returns in call frame at `Enter`, retrieve at `Exit`. O(1), no filtering. + +```rust +struct RecursionStack { + frames: Vec, // append-only + current: Option, // index into frames, not depth +} + +struct CallFrame { + parent: Option, // index of caller's frame + ref_id: RefId, // verify Exit matches Enter + returns: Slice, // from Enter.successors()[1..] +} +``` + +**Append-only invariant**: Frames are never removed. On `Exit`, set `current` to parent index. Backtracking restores `current`; the original frame is still accessible via its index. + +| Operation | Action | +| ----------------- | ------------------------------------------------------------------------------ | +| `Enter(ref_id)` | Push frame (parent = `current`), set `current = len-1`, follow `successors[0]` | +| `Exit(ref_id)` | Verify ref_id, set `current = frame.parent`, continue with `frame.returns` | +| Save backtrack | Store `current` | +| Restore backtrack | Set `current` to saved value | + +**Why index instead of depth?** Using logical depth breaks on Enter-Exit-Enter sequences: + +``` +Main = [(A) (B)] +A = (identifier) +B = (number) +Input: boolean + +# Broken (depth-based): +1. Save BP depth=0 +2. 
Enter(A) push FA, depth=1 +3. Match identifier ✗ +4. Exit(A) depth=0 +5. Restore BP depth=0 +6. Enter(B) push FB, frames=[FA,FB], depth=1 +7. frames[depth-1] = FA, not FB! ← wrong frame + +# Correct (index-based): +1. Save BP current=None +2. Enter(A) push FA{parent=None}, current=0 +3. Match identifier ✗ +4. Exit(A) current=None +5. Restore BP current=None +6. Enter(B) push FB{parent=None}, current=1 +7. frames[current] = FB ✓ +``` + +Frames form a forest of call chains. Each backtrack point references an exact frame, not a depth. + +### Atomic Groups (Future) + +Cut/commit (discard backtrack points) works correctly: unreachable frames become garbage but cause no issues. + +### Variant Serialization + +```json +{ "$tag": "A", "$data": { ... } } +``` + +`$tag`/`$data` avoid capture name collisions. + +### Fuel + +- `transition_fuel`: decremented per transition +- `recursion_fuel`: decremented per `Enter` + +Details deferred. + +## Consequences + +**Positive**: Append-only stacks make backtracking trivial. O(1) exit via stored returns. Two-phase separation is clean. + +**Negative**: Interpretation overhead. Recursion stack memory grows monotonically (bounded by `recursion_fuel`). + +## References + +- [ADR-0004: Query IR Binary Format](ADR-0004-query-ir-binary-format.md) +- [ADR-0005: Transition Graph Format](ADR-0005-transition-graph-format.md)