diff --git a/AGENTS.md b/AGENTS.md index c1e91de5..587080e8 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -21,6 +21,7 @@ - [ADR-0005: Transition Graph Format](docs/adr/ADR-0005-transition-graph-format.md) - [ADR-0006: Dynamic Query Execution](docs/adr/ADR-0006-dynamic-query-execution.md) - [ADR-0007: Type Metadata Format](docs/adr/ADR-0007-type-metadata-format.md) + - [ADR-0008: Tree Navigation](docs/adr/ADR-0008-tree-navigation.md) - **Template**: ```markdown diff --git a/docs/adr/ADR-0004-query-ir-binary-format.md b/docs/adr/ADR-0004-query-ir-binary-format.md index 12aa8f1a..b21cfcd8 100644 --- a/docs/adr/ADR-0004-query-ir-binary-format.md +++ b/docs/adr/ADR-0004-query-ir-binary-format.md @@ -1,7 +1,7 @@ # ADR-0004: Query IR Binary Format - **Status**: Accepted -- **Date**: 2025-12-12 +- **Date**: 2024-12-12 - **Supersedes**: Parts of ADR-0003 ## Context @@ -23,6 +23,7 @@ struct QueryIR { type_defs_offset: u32, type_members_offset: u32, entrypoints_offset: u32, + ignored_kinds_offset: u32, // 0 = no ignored kinds } ``` @@ -36,12 +37,22 @@ const BUFFER_ALIGN: usize = 64; // cache-line alignment for transitions struct QueryIRBuffer { ptr: *mut u8, len: usize, + owned: bool, // true if allocated, false if mmap'd } ``` Allocated via `Layout::from_size_align(len, BUFFER_ALIGN)`. Standard `Box<[u8]>` won't work—it assumes 1-byte alignment and corrupts `dealloc`. The 64-byte alignment ensures transitions never straddle cache lines. -**Deallocation**: `QueryIRBuffer` must implement `Drop` to reconstruct the exact `Layout` (size + 64-byte alignment) and call `std::alloc::dealloc`. Using `Box::from_raw` or similar would assume align=1 and cause undefined behavior. 
+**Ownership semantics**: + +| `owned` | Source | `Drop` action | +| ------- | ------------------- | ------------------------------------------------ | +| `true` | `std::alloc::alloc` | Reconstruct `Layout`, call `std::alloc::dealloc` | +| `false` | `mmap` / external | No-op (caller manages lifetime) | + +For mmap'd queries, the OS maps file pages directly into address space. The 64-byte header ensures buffer data starts aligned. `QueryIRBuffer` with `owned: false` provides a view without taking ownership—the backing file mapping must outlive the `QueryIR`. + +**Deallocation**: When `owned: true`, `Drop` must reconstruct the exact `Layout` (size + 64-byte alignment) and call `std::alloc::dealloc`. Using `Box::from_raw` or similar would assume align=1 and cause undefined behavior. ### Segments @@ -56,6 +67,7 @@ Allocated via `Layout::from_size_align(len, BUFFER_ALIGN)`. Standard `Box<[u8]>` | Type Defs | `[TypeDef; T]` | `type_defs_offset` | 4 | | Type Members | `[TypeMember; U]` | `type_members_offset` | 2 | | Entrypoints | `[Entrypoint; V]` | `entrypoints_offset` | 4 | +| Ignored Kinds | `[NodeTypeId; W]` | `ignored_kinds_offset` | 2 | Each offset is aligned: `(offset + align - 1) & !(align - 1)`. @@ -103,7 +115,7 @@ struct Entrypoint { ### Serialization ``` -Header (48 bytes): +Header (64 bytes): magic: [u8; 4] b"PLNK" version: u32 format version + ABI hash checksum: u32 CRC32(offsets || buffer_data) @@ -116,10 +128,14 @@ Header (48 bytes): type_defs_offset: u32 type_members_offset: u32 entrypoints_offset: u32 + ignored_kinds_offset: u32 + _pad: [u8; 12] reserved, zero-filled Buffer Data (buffer_len bytes) ``` +Header is 64 bytes to ensure buffer data starts at a 64-byte aligned offset. This enables true zero-copy `mmap` usage where transitions at offset 0 within the buffer are correctly aligned. + Little-endian always. UTF-8 strings. Version mismatch or checksum failure → recompile. 
### Construction diff --git a/docs/adr/ADR-0005-transition-graph-format.md b/docs/adr/ADR-0005-transition-graph-format.md index 1b003e95..da062d3a 100644 --- a/docs/adr/ADR-0005-transition-graph-format.md +++ b/docs/adr/ADR-0005-transition-graph-format.md @@ -1,7 +1,7 @@ # ADR-0005: Transition Graph Format - **Status**: Accepted -- **Date**: 2025-12-12 +- **Date**: 2024-12-12 - **Supersedes**: Parts of ADR-0003 ## Context @@ -38,22 +38,22 @@ struct Slice { ```rust #[repr(C, align(64))] struct Transition { - // --- 40 bytes metadata --- + // --- 32 bytes metadata --- matcher: Matcher, // 16 - pre_anchored: bool, // 1 - post_anchored: bool, // 1 + pre_nav: PreNav, // 2 (see ADR-0008) _pad1: [u8; 2], // 2 - pre_effects: Slice, // 8 - post_effects: Slice, // 8 + effects: Slice, // 8 ref_marker: RefTransition, // 4 - // --- 24 bytes control flow --- + // --- 32 bytes control flow --- successor_count: u32, // 4 - successor_data: [u32; 5], // 20 + successor_data: [u32; 7], // 28 } // 64 bytes, align 64 (cache-line aligned) ``` +Navigation is fully determined by `pre_nav`—no runtime dispatch based on previous matcher. See [ADR-0008](ADR-0008-tree-navigation.md) for `PreNav` definition and semantics. + Single `ref_marker` slot—sequences like `Enter(A) → Enter(B)` remain as epsilon chains. 
### Inline Successors (SSO-style) @@ -62,18 +62,18 @@ Successors use a small-size optimization to avoid indirection for the common cas | `successor_count` | Layout | | ----------------- | ----------------------------------------------------------------------------------- | -| 0–5 | `successor_data[0..count]` contains `TransitionId` values directly | -| > 5 | `successor_data[0]` is index into `successors` segment, `successor_count` is length | +| 0–7 | `successor_data[0..count]` contains `TransitionId` values directly | +| > 7 | `successor_data[0]` is index into `successors` segment, `successor_count` is length | -Why 5 slots: 24 available bytes / 4 bytes per `TransitionId` = 6 slots, minus 1 for the count field leaves 5. +Why 7 slots: 32 available bytes / 4 bytes per `TransitionId` = 8 slots, minus 1 for the count field leaves 7. Coverage: - Linear sequences: 1 successor - Simple branches, quantifiers: 2 successors -- Most alternations: 2–5 branches +- Most alternations: 2–7 branches -Only massive alternations (6+ branches) spill to the external buffer. +Only massive alternations (8+ branches) spill to the external buffer. Cache benefits: @@ -98,14 +98,14 @@ enum Matcher { negated_fields: Slice, // 8 }, Wildcard, - Down, // cursor to first child - Up, // cursor to parent } // 16 bytes, align 4 ``` `Option` uses 0 for `None` (niche optimization). +Navigation (descend/ascend) is handled by `PreNav`, not matchers. Matchers are purely for node matching. + ### RefTransition ```rust @@ -118,6 +118,8 @@ enum RefTransition { // 4 bytes, align 2 ``` +Layout: 1-byte discriminant + 1-byte padding + 2-byte `RefId` payload = 4 bytes. Alignment is 2 (from `RefId: u16`). Fits comfortably in the 64-byte `Transition` struct with room to spare. + Explicit `None` ensures stable binary layout (`Option` niche is unspecified). 
### Enter/Exit Semantics @@ -126,10 +128,12 @@ Explicit `None` ensures stable binary layout (`Option` niche is unspecifie **Solution**: Store return transitions at `Enter` time (in the call frame), retrieve at `Exit` time. O(1) exit, no filtering. -For `Enter(ref_id)` transitions, `successor_data` has special structure: +For `Enter(ref_id)` transitions, the **logical** successor list (accessed via `TransitionView::successors()`) has special structure: -- `successor_data[0]`: definition entry point (where to jump) -- `successor_data[1..count]`: return transitions (stored in call frame) +- `successors()[0]`: definition entry point (where to jump) +- `successors()[1..]`: return transitions (stored in call frame) + +This structure applies to the view, not raw `successor_data` memory. The SSO optimization (inline vs spilled storage) is orthogonal—the view abstracts it away. An `Enter` with 8+ returns spills to the external segment like any other transition; the interpreter accesses the logical list uniformly. For `Exit(ref_id)` transitions, successors are **ignored**. Return transitions come from the call frame pushed at `Enter`. See [ADR-0006](ADR-0006-dynamic-query-execution.md) for execution details. @@ -149,6 +153,7 @@ T11: ε + Exit(Func) successors=[] (ignored, returns from frame) ```rust #[repr(C, u16)] enum EffectOp { + CaptureNode, // store matched node as current value StartArray, PushElement, EndArray, @@ -162,19 +167,9 @@ enum EffectOp { // 4 bytes, align 2 ``` -No `CaptureNode`—implicit on successful match. - -### Effect Placement +`CaptureNode` is explicit—graph construction places it at the correct position relative to container effects. 
-| Effect | Placement | Why | -| -------------- | --------- | -------------------------- | -| `StartArray` | Pre | Container before elements | -| `StartObject` | Pre | Container before fields | -| `StartVariant` | Pre | Tag before payload | -| `PushElement` | Post | Consumes matched node | -| `Field` | Post | Consumes matched node | -| `End*` | Post | Finalizes after last match | -| `ToString` | Post | Converts matched node | +**Invariant**: The interpreter clears `matched_node` slot on `Enter` and backtrack restore. This prevents stale captures if a graph construction bug produces `Epsilon → CaptureNode` without a preceding `Match`. With proper graphs, `CaptureNode` always follows a successful match that populates the slot. ### View Types @@ -189,13 +184,15 @@ struct MatcherView<'a> { raw: &'a Matcher, } -enum MatcherKind { Epsilon, Node, Anonymous, Wildcard, Down, Up } +enum MatcherKind { Epsilon, Node, Anonymous, Wildcard } ``` Views resolve `Slice` to `&[T]`. `TransitionView::successors()` returns `&[TransitionId]`, hiding the inline/spilled distinction—callers see a uniform slice regardless of storage location. Engine code never touches offsets or `successor_data` directly. ### Quantifiers +Examples in this section show graph structure and effects. Navigation (`pre_nav`) is omitted for brevity—see [ADR-0008](ADR-0008-tree-navigation.md) for full transition examples with navigation. + **Greedy `*`**: ``` @@ -229,22 +226,22 @@ Query: `(parameters (identifier)* @params)` Before elimination: ``` -T0: ε + StartArray → [T1] -T1: ε (branch) → [T2, T4] -T2: Match(identifier) → [T3] -T3: ε + PushElement → [T1] -T4: ε + EndArray → [T5] -T5: ε + Field("params") → [...] +T0: ε [StartArray] → [T1] +T1: ε (branch) → [T2, T4] +T2: Match(identifier) → [T3] +T3: ε [CaptureNode, PushElement] → [T1] +T4: ε [EndArray] → [T5] +T5: ε [Field("params")] → [...] 
``` After: ``` -T2': pre:[StartArray] Match(identifier) post:[PushElement] → [T2', T4'] -T4': post:[EndArray, Field("params")] → [...] +T2': Match(identifier) [StartArray, CaptureNode, PushElement] → [T2', T4'] +T4': ε [EndArray, Field("params")] → [...] ``` -First iteration gets `StartArray` from T0's path. Loop iterations skip it. +First iteration gets `StartArray` from T0's path. Loop iterations skip it. Note T4' remains epsilon—effects cannot merge into T2' without breaking semantics. ### Example: Object @@ -276,32 +273,34 @@ T6: ε + Field("val") + EndVariant → [T7] ### Epsilon Elimination -Partial—full elimination impossible due to single `ref_marker`. +Partial—full elimination impossible due to single `ref_marker` and effect ordering constraints. **Execution order** (all transitions, including epsilon): -1. Emit `pre_effects` -2. Execute matcher (epsilon always succeeds) -3. On success: emit implicit `CaptureNode`, emit `post_effects` +1. Execute `pre_nav` and matcher +2. On success: emit `effects` in order + +With explicit `CaptureNode`, effect order is unambiguous. When eliminating epsilon chains, concatenate effect lists in traversal order. + +**When epsilon nodes must remain**: -An epsilon transition with `pre: [StartObject]` and `post: [EndObject]` legitimately creates an empty object. To avoid accidental empty structures in graph rewrites, move effects to the destination's `pre` or source's `post` as appropriate. +1. **Ref markers**: A transition can hold at most one `Enter`/`Exit`. Sequences like `Enter(A) → Enter(B)` need epsilon. +2. **Branch points**: An epsilon with multiple successors cannot merge into predecessors without duplicating effects. +3. **Effect ordering conflicts**: When incoming and outgoing effects cannot be safely reordered. -Why pre/post split matters: +Example of safe elimination: ``` Before: -T1: Match(A) → [T2] // current = A -T2: ε + PushElement → [T3] // push A ✓ -T3: Match(B) → [...] 
// current = B +T1: Match(A) [CaptureNode] → [T2] +T2: ε [PushElement] → [T3] +T3: Match(B) [CaptureNode, Field("b")] → [...] -After (correct): -T3': pre:[PushElement] Match(B) // push A, then match B ✓ - -Wrong (no split): -T3': Match(B) post:[PushElement] // match B, push B ✗ +After: +T3': Match(B) [PushElement, CaptureNode, Field("b")] → [...] ``` -Incoming epsilon effects → `pre_effects`. Outgoing → `post_effects`. +`PushElement` consumes T1's captured value before T3 overwrites `current`. ## Consequences diff --git a/docs/adr/ADR-0006-dynamic-query-execution.md b/docs/adr/ADR-0006-dynamic-query-execution.md index f70f5fad..f14c15d3 100644 --- a/docs/adr/ADR-0006-dynamic-query-execution.md +++ b/docs/adr/ADR-0006-dynamic-query-execution.md @@ -1,7 +1,7 @@ # ADR-0006: Dynamic Query Execution - **Status**: Accepted -- **Date**: 2025-12-12 +- **Date**: 2024-12-12 - **Supersedes**: Parts of ADR-0003 ## Context @@ -14,24 +14,32 @@ Runtime interpretation of the transition graph ([ADR-0005](ADR-0005-transition-g For each transition: -1. Emit `pre_effects` -2. Match (epsilon always succeeds) -3. On success: emit `CaptureNode`, emit `post_effects` +1. Execute `pre_nav` initial movement (e.g., goto_first_child, goto_next_sibling) +2. Search loop: try matcher, on fail apply skip policy (advance or fail) +3. On match success: store matched node, execute `effects` sequentially 4. Process successors with backtracking +For `Up*` variants, step 2 becomes: validate exit constraint, ascend N levels (no search loop). + +Navigation is fully determined by `pre_nav`—no runtime dispatch based on previous matcher. See [ADR-0008](ADR-0008-tree-navigation.md) for detailed semantics. + +The matched node is stored in a temporary slot (`matched_node`) accessible to `CaptureNode` effect. Effects execute in order—`CaptureNode` reads from this slot and sets `executor.current`. 
+ +**Slot invariant**: The `matched_node` slot is cleared (set to `None`) at the start of each transition execution, before `pre_nav`. This prevents stale captures if a transition path has `Epsilon → CaptureNode` without a preceding match—such a path indicates a graph construction bug, and the clear-on-entry invariant ensures it manifests as a predictable panic rather than silently capturing a wrong node. + ### Effect Stream ```rust struct EffectStream<'a> { - effects: Vec<RuntimeEffect<'a>>, // append-only, backtrack via truncate -} - -enum RuntimeEffect<'a> { - Op(EffectOp), - CaptureNode(Node<'a>), // implicit on match, never in IR + ops: Vec<EffectOp>, // effect log, backtrack via truncate + nodes: Vec<Node<'a>>, // captured nodes, one per CaptureNode op } ``` +Effects are **recorded**, not eagerly executed. On match success, the transition's `effects` list is appended to `ops`. For each `CaptureNode`, the `matched_node` is also appended to `nodes`. + +On backtrack, both vectors truncate to their watermarks. On full match success, the executor replays `ops` sequentially, consuming from `nodes` for each `CaptureNode`. + ### Executor Converts effect stream to output value. 
@@ -57,18 +65,18 @@ enum Container<'a> { } ``` -| Effect | Action | -| ------------------- | ------------------------------------ | -| `CaptureNode(n)` | `current = Node(n)` | -| `StartArray` | push `Array([])` onto stack | -| `PushElement` | move `current` into top array | -| `EndArray` | pop array into `current` | -| `StartObject` | push `Object({})` onto stack | -| `Field(id)` | move `current` into top object field | -| `EndObject` | pop object into `current` | -| `StartVariant(tag)` | push `Variant(tag)` onto stack | -| `EndVariant` | pop, wrap `current`, set as current | -| `ToString` | replace `current` Node with text | +| Effect | Action | +| ------------------- | ----------------------------------------- | +| `CaptureNode` | `current = Node(nodes.next())` (consumes) | +| `StartArray` | push `Array([])` onto stack | +| `PushElement` | move `current` into top array | +| `EndArray` | pop array into `current` | +| `StartObject` | push `Object({})` onto stack | +| `Field(id)` | move `current` into top object field | +| `EndObject` | pop object into `current` | +| `StartVariant(tag)` | push `Variant(tag)` onto stack | +| `EndVariant` | pop, wrap `current`, set as current | +| `ToString` | replace `current` Node with text | Invalid state = IR bug → panic. @@ -78,7 +86,7 @@ Invalid state = IR bug → panic. struct Interpreter<'a> { query_ir: &'a QueryIR, backtrack_stack: BacktrackStack, - recursion_stack: RecursionStack, + frame_arena: CallFrameArena, cursor: TreeCursor<'a>, // created at tree root, never reset effects: EffectStream<'a>, } @@ -86,31 +94,35 @@ struct Interpreter<'a> { **Cursor constraint**: The cursor must be created once at the tree root and never call `reset()`. This preserves `descendant_index` validity for backtracking checkpoints. -Two stacks interact: backtracking can restore to a point inside a previously-exited call, so the recursion stack must preserve frames. 
+No `prev_matcher` tracking needed—each transition's `pre_nav` encodes the exact navigation to perform. + +Two stacks interact: backtracking can restore to a point inside a previously-exited call, so the frame arena must preserve frames. ### Backtracking ```rust struct BacktrackStack { points: Vec<BacktrackPoint>, + max_frame_watermark: Option, // highest frame index referenced by any point } struct BacktrackPoint { cursor_checkpoint: u32, // tree-sitter descendant_index effect_watermark: u32, recursion_frame: Option, // saved frame index - alternatives: Slice, // view into IR successors, not owned + transition_id: TransitionId, // source transition for alternatives + next_alt: u32, // index of next alternative to try } ``` -`alternatives` references the IR's successor data (inline or spilled)—no runtime allocation per backtrack point. +Alternatives are retrieved via `TransitionView::successors()[next_alt..]`. This avoids the `Slice` incompatibility with inline successors (SSO stores successors inside the `Transition` struct, not in the `Successors` segment). | Operation | Action | | --------- | ------------------------------------------------------ | | Save | `cursor_checkpoint = cursor.descendant_index()` — O(1) | | Restore | `cursor.goto_descendant(cursor_checkpoint)` — O(depth) | -Restore also truncates `effects` to `effect_watermark` and sets `recursion_stack.current` to `recursion_frame`. +Restore also truncates `effects` to `effect_watermark` and sets `frame_arena.current` to `recursion_frame`. ### Recursion @@ -119,18 +131,20 @@ Restore also truncates `effects` to `effect_watermark` and sets `recursion_stack **Solution**: Store returns in call frame at `Enter`, retrieve at `Exit`. O(1), no filtering. 
```rust -struct RecursionStack { - frames: Vec<CallFrame>, // append-only - current: Option, // index into frames, not depth +struct CallFrameArena { + frames: Vec<CallFrame>, // append-only, pruned by watermark + current: Option, // index into frames (the "stack pointer") } struct CallFrame { parent: Option, // index of caller's frame ref_id: RefId, // verify Exit matches Enter - returns: Slice, // from Enter.successors()[1..] + enter_transition: TransitionId, // to retrieve returns via successors()[1..] } ``` +Returns are retrieved via `TransitionView::successors()[1..]` on the `enter_transition`. Same rationale as `BacktrackPoint`—avoids `Slice` incompatibility with inline successors. + **Append-only invariant**: Frames persist for backtracking correctness. On `Exit`, set `current` to parent index. Backtracking restores `current`; the original frame is still accessible via its index. **Frame pruning**: After `Exit`, frames at the stack top may be reclaimed if: @@ -140,7 +154,60 @@ struct CallFrame { This bounds memory by `max(recursion_depth, backtrack_depth)` rather than total call count. Without pruning, `(Rule)*` over N items allocates N frames; with pruning, it remains O(1) for non-backtracking iteration. -The `BacktrackPoint.recursion_frame` field establishes a "high-water mark"—the minimum frame index that must be preserved. Frames above this mark with no active reference can be popped. 
+**O(1) watermark tracking**: The `max_frame_watermark` is maintained incrementally: + +```rust +impl BacktrackStack { + fn push(&mut self, point: BacktrackPoint) { + if let Some(frame) = point.recursion_frame { + self.max_frame_watermark = Some(match self.max_frame_watermark { + Some(max) => max.max(frame), + None => frame, + }); + } + self.points.push(point); + } + + fn pop(&mut self) -> Option<BacktrackPoint> { + let point = self.points.pop()?; + // Recompute watermark only if popped point held the max + if point.recursion_frame == self.max_frame_watermark { + self.max_frame_watermark = self.points.iter() + .filter_map(|p| p.recursion_frame) + .max(); + } + Some(point) + } +} + +fn prune_high_water_mark( + current: Option, + backtrack_stack: &BacktrackStack, +) -> Option { + match (current, backtrack_stack.max_frame_watermark) { + (None, None) => None, + (Some(c), None) => Some(c), + (None, Some(m)) => Some(m), + (Some(c), Some(m)) => Some(c.max(m)), + } +} +``` + +Frames with index > high-water mark can be truncated. + +**Why not just check the last backtrack point?** Backtrack points are _not_ chronologically ordered by frame depth. After an Enter-Exit sequence, a new backtrack point may reference a shallower frame than earlier points: + +``` +1. Enter(A) → frames=[F0], current=0 +2. Save BP1 → BP1.recursion_frame = Some(0) +3. Exit(A) → current = None +4. Save BP2 → BP2.recursion_frame = None + +# BP2 is last, but BP1 still references F0 +# Checking only last point would incorrectly allow pruning F0 +``` + +The `max_frame_watermark` tracks the true maximum across all live points. Push is O(1). Pop is amortized O(1)—the O(n) rescan only triggers when popping the point that held the maximum, which can happen at most once per frame. | Operation | Action | | ----------------- | ------------------------------------------------------------------------------ | @@ -199,7 +266,7 @@ Details deferred. ## Consequences -**Positive**: Append-only stacks make backtracking trivial. 
O(1) exit via stored returns. Two-phase separation is clean. +**Positive**: Append-only stacks make backtracking trivial. O(1) exit via stored returns. Navigation fully determined by `pre_nav`—no state tracking between transitions. **Negative**: Interpretation overhead. Recursion stack memory grows monotonically (bounded by `recursion_fuel`). @@ -208,3 +275,4 @@ Details deferred. - [ADR-0004: Query IR Binary Format](ADR-0004-query-ir-binary-format.md) - [ADR-0005: Transition Graph Format](ADR-0005-transition-graph-format.md) - [ADR-0007: Type Metadata Format](ADR-0007-type-metadata-format.md) +- [ADR-0008: Tree Navigation](ADR-0008-tree-navigation.md) diff --git a/docs/adr/ADR-0008-tree-navigation.md b/docs/adr/ADR-0008-tree-navigation.md new file mode 100644 index 00000000..e70c53db --- /dev/null +++ b/docs/adr/ADR-0008-tree-navigation.md @@ -0,0 +1,336 @@ +# ADR-0008: Tree Navigation + +- **Status**: Accepted +- **Date**: 2025-01-13 + +## Context + +Plotnik's query execution engine ([ADR-0006](ADR-0006-dynamic-query-execution.md)) navigates tree-sitter syntax trees. This ADR covers: + +1. Which tree-sitter API to use (TreeCursor vs Node) +2. How `PreNav` encodes navigation and anchor constraints +3. How transitions execute navigation deterministically + +Key insight: navigation decisions can be resolved at graph construction time, not runtime. Each transition carries its own `PreNav` instruction—no need to track previous matcher state. + +## Decision + +### API Choice: TreeCursor with `descendant_index` Checkpoints + +```rust +struct InterpreterState<'tree> { + cursor: TreeCursor<'tree>, // created once at tree root, never reset +} + +struct BacktrackCheckpoint { + descendant_index: u32, // 4 bytes, O(1) save + // ... other state from ADR-0006 +} +``` + +**Critical constraint**: The cursor must be created at the tree root and never call `reset()`. The `descendant_index` is relative to the cursor's root—`reset(node)` invalidates all checkpoints. 
+ +### PreNav + +Navigation and anchor constraints unified into a single enum: + +```rust +#[repr(C)] +struct PreNav { + kind: PreNavKind, // 1 byte + level: u8, // 1 byte - ascent level count for Up*, ignored otherwise +} +// 2 bytes total + +#[repr(u8)] +enum PreNavKind { + // No movement (first transition only, cursor at root) + Stay = 0, + + // Sibling traversal (horizontal) + Next = 1, // skip any nodes to find match + NextSkipTrivia = 2, // skip trivia only, fail if non-trivia skipped + NextExact = 3, // no skipping, current sibling must match + + // Enter children (descend) + Down = 4, // skip any among children + DownSkipTrivia = 5, // skip trivia only among children + DownExact = 6, // first child must match, no skip + + // Exit children (ascend) + Up = 7, // ascend `level` levels, no constraint + UpSkipTrivia = 8, // validate last non-trivia, ascend `level` levels + UpExact = 9, // validate last child, ascend `level` levels +} +``` + +For non-Up variants, `level` is ignored (conventionally 0). For Up variants, `level >= 1`. + +**Design note**: Multi-level `Up(n)` with n>1 is an optimization for the common case (no intermediate anchors). When anchors exist at intermediate nesting levels, decompose into separate `Up*` transitions at each level. + +### Trivia + +**Trivia** = anonymous nodes + language-specific ignored named nodes (e.g., `comment`). + +The ignored kinds list is populated from the `Lang` binding during IR construction and stored in the `ignored_kinds` segment ([ADR-0004](ADR-0004-query-ir-binary-format.md)). Zero offset means no ignored kinds. + +**Skip invariant**: A node is never skipped if its kind matches the current transition's matcher target. This ensures `(comment)` explicitly in a query still matches comment nodes, even though comments are typically ignored. + +### Execution Semantics + +Navigation and matching are intertwined in a search loop. The `PreNav` determines initial movement and skip policy for the loop. 
+ +**Stay**: No cursor movement. Used only for the first transition when cursor is already positioned at root. Then attempt match. + +**Next variants**: Move to next sibling, enter search loop: + +- `Next`: Try match; on fail, advance to next sibling and retry; exhausted → fail +- `NextSkipTrivia`: Try match; on fail, if current node is non-trivia → fail, else advance and retry +- `NextExact`: Try match; on fail → fail (no retry) + +**Down variants**: Move to first child, enter search loop: + +- `Down`: Try match; on fail, advance to next sibling and retry; exhausted → fail +- `DownSkipTrivia`: Try match; on fail, if current node is non-trivia → fail, else advance and retry +- `DownExact`: Try match; on fail → fail (no retry) + +**Up variants**: Validate exit constraint, then ascend N levels (no search loop): + +- `Up`: No constraint, ascend +- `UpSkipTrivia`: Fail if non-trivia siblings follow current position, then ascend +- `UpExact`: Fail if any siblings follow current position, then ascend + +Example: `(foo (bar))` matching `(foo (foo) (foo) (bar))`: + +1. `[Down]` → goto_first_child (cursor at first `foo` child) +2. Try match `bar` → fail +3. Mode is `Down` (skip any) → goto_next_sibling (cursor at second `foo`) +4. Try match `bar` → fail +5. goto_next_sibling (cursor at `bar`) +6. Try match `bar` → success, exit loop + +### Skip Mode Symmetry + +| Mode | Entry/Search (Next/Down) | Exit (Up) | +| ---------- | --------------------------------------- | -------------------------------- | +| None | skip any nodes | no constraint on siblings | +| SkipTrivia | skip trivia, fail if non-trivia skipped | must be at last non-trivia child | +| Exact | no skip, immediate position | must be at last child | + +### Anchor Lowering + +The anchor operator (`.`) in the query language compiles to `PreNav` variants: + +| Query Pattern | PreNav on Following Transition | +| -------------------- | ------------------------------ | +| `(foo) (bar)` | `Next` | +| `(foo) . 
(bar)` | `NextSkipTrivia` | +| `"x" . (bar)` | `NextExact` | +| `(parent (child))` | `Down` on child's transition | +| `(parent . (child))` | `DownSkipTrivia` | +| `(parent (child) .)` | `UpSkipTrivia` on exit | +| `(parent "x" .)` | `UpExact` on exit | + +Mode determined by what **precedes** the anchor: + +| Precedes `.` | Mode | +| -------------------------------- | ---------- | +| Named node `(foo)`, wildcard `_` | SkipTrivia | +| String literal `"foo"` | Exact | +| Start of children (prefix `.`) | SkipTrivia | + +### Multi-Level Ascent + +Closing multiple nesting levels uses `Up` with a level count. For `(a (b (c (d))))`: + +``` +T3: [Down] Node(d) → T4 +T4: [Up level=3] Epsilon → Accept +``` + +When anchors exist at intermediate levels, decompose. For `(a (b (c) .) .)`: + +``` +T2: [Down] Node(c) → T3 +T3: [UpSkipTrivia] Epsilon → T4 // c must be last non-trivia in b +T4: [UpSkipTrivia] Epsilon → Accept // b must be last non-trivia in a +``` + +Cannot combine into `UpSkipTrivia(2)` because constraints apply at each level. + +### Execution Flow + +``` +1. MOVE pre_nav → initial cursor movement +2. SEARCH loop: try matcher, on fail check skip policy, advance or fail +3. EFFECTS on match success: execute effects list (including explicit CaptureNode) +``` + +For `Up*` variants, step 2 is replaced by: validate exit constraint, ascend N levels. + +No post-validation phase. Exit constraints are front-loaded into `Up*` variants. + +### Field Handling + +**Field constraints** are part of the match attempt within the search loop. A node that doesn't satisfy field constraints is treated as a match failure, triggering the skip policy: + +```rust +// Inside search loop, before structural match: +if let Some(required) = pattern.field { + if cursor.field_id() != Some(required) { + // Field mismatch = match fail, apply skip policy + continue; + } +} +// Then check node kind, negated fields, etc. 
+``` + +**Negated fields** are also part of match—checked after field/kind match succeeds: + +```rust +// After node kind matches: +for &fid in pattern.negated_fields { + if node.child_by_field_id(fid).is_some() { + // Negated field present = match fail, apply skip policy + continue; + } +} +// Match succeeds, exit search loop +``` + +### Examples + +**Simple**: `(function (identifier) @name)` + +``` +T0: [Stay] Node(function) → T1 +T1: [Down] Node(identifier) [CaptureNode] → T2 +T2: [Up] Epsilon [Field("name")] → Accept +``` + +**Anchored first child**: `(function . (identifier))` + +``` +T0: [Stay] Node(function) → T1 +T1: [DownSkipTrivia] Node(identifier) → T2 +T2: [Up] Epsilon → Accept +``` + +**Anchored last child**: `(function (identifier) .)` + +``` +T0: [Stay] Node(function) → T1 +T1: [Down] Node(identifier) → T2 +T2: [UpSkipTrivia] Epsilon → Accept +``` + +**Siblings**: `(block (stmt) (stmt))` + +``` +T0: [Stay] Node(block) → T1 +T1: [Down] Node(stmt) → T2 +T2: [Next] Node(stmt) → T3 +T3: [Up] Epsilon → Accept +``` + +**Adjacent siblings**: `(block (stmt) . (stmt))` + +``` +T0: [Stay] Node(block) → T1 +T1: [Down] Node(stmt) → T2 +T2: [NextSkipTrivia] Node(stmt) → T3 +T3: [Up] Epsilon → Accept +``` + +**Deep nesting**: `(a (b (c (d))))` + +``` +T0: [Stay] Node(a) → T1 +T1: [Down] Node(b) → T2 +T2: [Down] Node(c) → T3 +T3: [Down] Node(d) → T4 +T4: [Up level=3] Epsilon → Accept +``` + +**Mixed anchors**: `(a (b) . (c) .)` + +``` +T0: [Stay] Node(a) → T1 +T1: [Down] Node(b) → T2 +T2: [NextSkipTrivia] Node(c) → T3 // . before (c): adjacent to b +T3: [UpSkipTrivia] Epsilon → Accept // . 
after (c): c is last non-trivia +``` + +**Intermediate anchor**: `(foo (foo (bar) .)) (baz)` + +``` +T0: [Stay] Node(foo_outer) → T1 +T1: [Down] Node(foo_inner) → T2 +T2: [Down] Node(bar) → T3 +T3: [UpSkipTrivia] Epsilon → T4 // bar must be last non-trivia in foo_inner +T4: [Up] Epsilon → T5 // no constraint on foo_inner in foo_outer +T5: [Next] Node(baz) → Accept +``` + +## Alternatives Considered + +### Pure Node API + +Rejected: `next_sibling()` is O(siblings), no efficient backtracking. + +### Cursor Cloning + +Rejected: `TreeCursor::clone()` heap-allocates, O(depth) memory per checkpoint. + +### Runtime Navigation Dispatch + +Previous design used `(prev_matcher, curr_matcher)` pairs to determine movement at runtime. Rejected: + +- Required tracking `prev_matcher` in interpreter state and backtrack checkpoints +- Complex dispatch table +- Navigation decisions can be resolved at compile time + +### Separate Post-Anchor Validation + +Previous design had `post_anchor` field validated after match. 
Rejected: + +- Extra phase in execution loop +- Exit constraints naturally encode as `Up*` variants +- "Must be last child" is validated before ascending, not after matching + +## Complexity Analysis + +| Operation | Cursor | Node | +| ----------------------- | ------------ | ----------- | +| `goto_first_child()` | O(1) | — | +| `goto_next_sibling()` | O(1) | O(siblings) | +| `goto_parent()` | O(1) | O(1) | +| `field_id()` | O(field_map) | — | +| `child_by_field_id(id)` | — | O(children) | +| `descendant_index()` | O(1) | — | +| `goto_descendant(idx)` | O(depth) | — | + +- Checkpoint save: O(1) +- Checkpoint restore: O(depth)—cold path only + +## Consequences + +**Positive**: + +- O(1) sibling traversal +- 4-byte checkpoints +- No `prev_matcher` tracking—navigation fully determined by `PreNav` +- Simpler execution loop: navigate → search → match (no post-validation) +- Anchor constraints resolved at graph construction time + +**Negative**: + +- Single cursor constraint requires careful state management +- O(depth) restore cost on backtrack +- Intermediate anchors prevent multi-level `Up(n)` optimization + +## References + +- [ADR-0005: Transition Graph Format](ADR-0005-transition-graph-format.md) +- [ADR-0006: Dynamic Query Execution](ADR-0006-dynamic-query-execution.md) +- `tree-sitter/lib/src/tree_cursor.c`
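
The `Next*`/`Down*` skip policies that ADR-0008 describes under "Execution Semantics" can be sketched as below. This is a minimal illustration under stated assumptions, not Plotnik's implementation: `Sibling`, `SkipPolicy`, and `search` are hypothetical names, and a plain slice of siblings stands in for the tree-sitter cursor. The match check runs before the skip check, which is one way to honor the skip invariant (a node whose kind equals the matcher target is never skipped).

```rust
/// Hypothetical stand-in for one sibling node; real code would read
/// the kind and trivia status from a tree-sitter cursor.
#[derive(Clone, Copy)]
struct Sibling {
    kind: &'static str,
    is_trivia: bool,
}

/// Hypothetical skip policy, mirroring the three modes in the ADR.
#[derive(Clone, Copy)]
enum SkipPolicy {
    Any,        // Next / Down: skip any non-matching node
    TriviaOnly, // NextSkipTrivia / DownSkipTrivia: skip trivia only
    Exact,      // NextExact / DownExact: no skipping at all
}

/// Search `siblings` from index `start` for the first node of kind `target`.
/// Returns the index of the match, or `None` if the policy forbids skipping
/// a non-matching node or the siblings are exhausted.
fn search(
    siblings: &[Sibling],
    start: usize,
    target: &str,
    policy: SkipPolicy,
) -> Option<usize> {
    let mut i = start;
    while let Some(node) = siblings.get(i) {
        // Try the matcher first, so a matching node is never skipped.
        if node.kind == target {
            return Some(i);
        }
        // Match failed: consult the skip policy.
        match policy {
            SkipPolicy::Any => {}
            SkipPolicy::TriviaOnly if node.is_trivia => {}
            SkipPolicy::TriviaOnly | SkipPolicy::Exact => return None,
        }
        i += 1;
    }
    None
}
```

With children `foo, foo, bar` (the ADR's `(foo (bar))` example), `SkipPolicy::Any` finds `bar` at index 2, while `TriviaOnly` and `Exact` both fail on the first non-trivia `foo`.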