From e75fd07e7c970663f4bdb40f4210d073d9ecb79a Mon Sep 17 00:00:00 2001 From: Sergei Zharinov Date: Sun, 21 Dec 2025 12:25:05 -0300 Subject: [PATCH] docs: Update type system --- AGENTS.md | 42 +- crates/plotnik-cli/docs/lang-reference.md | 835 ---------------------- docs/binary-format/04-types.md | 6 +- docs/binary-format/06-transitions.md | 100 ++- docs/lang-reference.md | 156 ++-- docs/runtime-engine.md | 40 +- docs/type-system.md | 328 +++++---- 7 files changed, 411 insertions(+), 1096 deletions(-) delete mode 100644 crates/plotnik-cli/docs/lang-reference.md diff --git a/AGENTS.md b/AGENTS.md index 9ef1a9bb..140b7883 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -26,7 +26,7 @@ | `!field` | Negated field (assert absent) | | `?` `*` `+` | Quantifiers (0-1, 0+, 1+) | | `??` `*?` `+?` | Non-greedy variants | -| `.` | Anchor (adjacency) | +| `.` | Anchor (adjacency, see below) | | `{...}` | Sequence (siblings in order) | | `[...]` | Alternation (first match wins) | | `Name = ...` | Named definition (entrypoint) | @@ -36,7 +36,9 @@ - Captures are flat by default: nesting in pattern ≠ nesting in output - `{...} @x` or `[...] @x` creates a nested scope -- Quantifier on captured pattern → array: `(x)* @a` → `a: T[]` +- Scalar list (no internal captures): `(x)* @a` → `a: T[]` +- Row list (with internal captures): `{(x) @x}* @rows` → `rows: { x: T }[]` +- **Strict dimensionality**: `*`/`+` with internal captures requires row capture ## Alternations @@ -68,6 +70,18 @@ Labeled (tagged union): Nested = (call function: [(id) @name (Nested) @inner]) ``` +## Anchor Strictness + +The `.` anchor adapts to what it's anchoring: + +| Pattern | Behavior | +| ----------- | ------------------------------------------- | +| `(a) . (b)` | Skip trivia, no named nodes between | +| `"x" . (b)` | Strict—nothing between (anonymous involved) | +| `(a) . "x"` | Strict—nothing between (anonymous involved) | + +Rule: anchor is as strict as its strictest operand. 
+ ## Anti-patterns ``` @@ -81,21 +95,21 @@ Nested = (call function: [(id) @name (Nested) @inner]) (id) @x (#eq? @x "foo") ``` -## Type System Gotchas +## Type System Rules -**Columnar output**: Quantifiers produce parallel arrays, not list of objects: +**Strict dimensionality**: Quantifiers with internal captures require explicit row capture: ``` -{(A) @a (B) @b}* → { a: Node[], b: Node[] } // NOT [{a,b}, {a,b}] +{(a) @a (b) @b}* ; ERROR: internal captures, no row capture +{(a) @a (b) @b}* @rows ; OK: rows: { a: Node, b: Node }[] +(func (id) @name)* ; ERROR: internal capture without row +{(func (id) @name) @f}* @funcs ; OK: funcs: { f: Node, name: Node }[] ``` -For list of objects, wrap in sequence: `({(A) @a (B) @b} @row)*` - -**Row integrity**: Can't mix `*`/`+` with `1`/`?` in same quantified scope: +**Optional bubbling**: `?` does NOT require row capture (no dimensionality added): ``` -{(A)* @a (B) @b}* ; ERROR: @a desync, @b sync -{(A)? @a (B) @b}* ; OK: both synchronized (? emits null) +{(a) @a (b) @b}? ; OK: a?: Node, b?: Node (bubbles to parent) ``` **Recursion rules**: @@ -110,9 +124,7 @@ A = (foo (B)) B = (bar (A)) ; OK: descends each step ## ⚠️ Sequence Syntax (Tree-sitter vs Plotnik) -Tree-sitter: `((a) (b))` — Plotnik: `{(a) (b)}`. The #1 syntax mistake. - -`((a) (b))` in Plotnik means "node `(a)` with child `(b)`", NOT a sequence. +Tree-sitter: `((a) (b))` — Plotnik: `{(a) (b)}`. The #1 syntax error. # Architecture Decision Records (ADRs) @@ -122,10 +134,6 @@ Tree-sitter: `((a) (b))` — Plotnik: `{(a) (b)}`. The #1 syntax mistake. 
- _(no ADRs yet)_ - **Template**: -[ADR-0001](docs/adr/ADR-0001-query-parser.md) | [ADR-0002](docs/adr/ADR-0002-diagnostics-system.md) | [ADR-0004](docs/adr/ADR-0004-query-ir-binary-format.md) | [ADR-0005](docs/adr/ADR-0005-transition-graph-format.md) | [ADR-0006](docs/adr/ADR-0006-dynamic-query-execution.md) | [ADR-0007](docs/adr/ADR-0007-type-metadata-format.md) | [ADR-0008](docs/adr/ADR-0008-tree-navigation.md) | [ADR-0009](docs/adr/ADR-0009-type-system.md) | [ADR-0010](docs/adr/ADR-0010-type-system-v2.md) | [ADR-0012](docs/adr/ADR-0012-variable-length-ir.md) - -## Template - ```markdown # ADR-XXXX: Title diff --git a/crates/plotnik-cli/docs/lang-reference.md b/crates/plotnik-cli/docs/lang-reference.md deleted file mode 100644 index e201a752..00000000 --- a/crates/plotnik-cli/docs/lang-reference.md +++ /dev/null @@ -1,835 +0,0 @@ -# Plotnik Query Language Reference - -Plotnik is a pattern-matching language for tree-sitter syntax trees. It extends [tree-sitter's query syntax](https://tree-sitter.github.io/tree-sitter/using-parsers/queries/1-syntax.html) with named expressions, recursion, and static type inference. - -Predicates (`#eq?`, `#match?`) and directives (`#set!`) are intentionally unsupported—filtering logic belongs in your host language. - ---- - -## Execution Model - -NFA-based cursor walk with backtracking. - -### Key Properties - -- **Root-anchored**: Matches the entire tree structure (like `^...$` in regex) -- **Backtracking**: Failed branches restore state and try alternatives -- **Ordered choice**: `[A B C]` tries branches left-to-right; first match wins - -### Trivia Handling - -Comments and "extra" nodes (per tree-sitter grammar) are automatically skipped unless explicitly matched. 
- -```plotnik/docs/lang-reference.md#L24-24 -(function_declaration (identifier) @name (block) @body) -``` - -Matches even with comments between children: - -```plotnik/docs/lang-reference.md#L28-31 -function foo /* comment */() { - /* body */ -} -``` - -The `.` anchor enforces strict adjacency: - -```plotnik/docs/lang-reference.md#L35-35 -(array . (identifier) @first) ; must be immediately after bracket -``` - -### Partial Matching - -Node patterns are open—unmentioned children are ignored: - -```plotnik/docs/lang-reference.md#L46-46 -(binary_expression left: (identifier) @left) -``` - -Matches any `binary_expression` with an `identifier` in `left`, regardless of `operator`, `right`, etc. - -Sequences `{...}` advance through siblings in order, skipping non-matching nodes. - -### Field Constraints - -`field: pattern` requires the child to have that field AND match the pattern: - -```plotnik/docs/lang-reference.md#L58-61 -(binary_expression - left: (identifier) @x - right: (number) @y -) -``` - -Fields participate in sequential matching—they're not independent lookups. - ---- - -## File Structure - -A `.ptk` file contains definitions: - -```plotnik/docs/lang-reference.md#L78-82 -; Internal (mixin/fragment) -Expr = [(identifier) (number) (string)] - -; Public entrypoint -pub Stmt = (statement) @stmt -``` - -### Visibility - -| Syntax | Role | In Binary | -| --------------- | ----------------- | --------- | -| `Def = ...` | Internal mixin | No | -| `pub Def = ...` | Public entrypoint | Yes | - -Internal definitions exist only to support `pub` definitions. - -### Script vs Module Mode - -**Script** (`-q` flag): Anonymous expressions allowed, auto-wrapped in language root. - -```sh -plotnik exec -q '(identifier) @id' -s app.js -``` - -**Module** (`.ptk` files): Only named definitions allowed. 
- -```plotnik/docs/lang-reference.md#L106-110 -; ERROR in .ptk file -(identifier) @id - -; OK -pub Query = (identifier) @id -``` - ---- - -## Workspace - -A directory of `.ptk` files loaded as a single compilation unit. - -### Properties - -- **Flat namespace**: `Foo` in `a.ptk` visible in `b.ptk` without imports -- **Global uniqueness**: Duplicate names are errors -- **Non-recursive**: Subdirectories are separate workspaces -- **Dead code elimination**: Unreachable internals stripped - -### Language Inference - -Inferred from directory name (`queries.ts/` → TypeScript, `java-checks/` → Java). Override with `-l/--lang`. - -### Execution - -- Single `pub`: Default entrypoint -- Multiple `pub`: Use `--entry ` -- No `pub`: Compilation error - -### Example - -`helpers.ptk`: - -```plotnik/docs/lang-reference.md#L147-153 -Ident = (identifier) - -DeepSearch = [ - (Ident) @target - (_ (DeepSearch)*) -] -``` - -`main.ptk`: - -```plotnik/docs/lang-reference.md#L157-158 -pub AllIdentifiers = (program (DeepSearch)*) -``` - ---- - -## Naming Conventions - -| Kind | Case | Examples | -| -------------------------- | ------------ | ------------------------------------ | -| Definitions, labels, types | `PascalCase` | `Expr`, `Statement`, `BinaryOp` | -| Node kinds | `snake_case` | `function_declaration`, `identifier` | -| Captures, fields | `snake_case` | `@name`, `@func_body` | - -Tree-sitter allows `@function.name`; Plotnik requires `@function_name` because captures map to struct fields. - ---- - -## Data Model - -Plotnik infers output types from your query. The key rule may surprise you—but it's intentional for schema stability. - -### Flat by Default - -Query nesting does NOT create output nesting. All captures become fields in a single flat record. - -**Why?** Adding a new `@capture` to an existing query shouldn't break downstream code using other captures. Flat output makes capture additions non-breaking. 
See [Type System](type-system.md#design-philosophy) for the full rationale. - -``` -(function_declaration - name: (identifier) @name - body: (block - (return_statement (expression) @retval))) -``` - -Output type: - -```typescript -{ name: Node, retval: Node } // flat, not nested -``` - -The pattern is 4 levels deep, but the output is flat. This is intentional: you're usually extracting specific pieces from an AST, not reconstructing its shape. - -### The Node Type - -Default capture type—a reference to a tree-sitter node: - -```plotnik/docs/lang-reference.md#L205-210 -interface Node { - kind: string; // e.g. "identifier" - text: string; // source text - start: Position; // { row, column } - end: Position; -} -``` - -### Cardinality: Quantifiers → Arrays - -Quantifiers on the captured pattern determine whether a field is singular, optional, or an array: - -| Pattern | Output Type | Meaning | -| --------- | ---------------- | ------------ | -| `(x) @a` | `a: T` | exactly one | -| `(x)? @a` | `a?: T` | zero or one | -| `(x)* @a` | `a: T[]` | zero or more | -| `(x)+ @a` | `a: [T, ...T[]]` | one or more | - -### Creating Nested Structure - -Capture a sequence `{...}` or alternation `[...]` to create a new scope. Braces alone don't introduce nesting: - -``` -{ - (function_declaration - name: (identifier) @name - body: (_) @body - ) @node -} @func -``` - -Output type: - -```typescript -{ func: { node: Node, name: Node, body: Node } } -``` - -The `@func` capture on the group creates a nested scope. All captures inside (`@node`, `@name`, `@body`) become fields of that nested object. - -### Type Annotations - -`::` after a capture controls the output type: - -| Annotation | Effect | -| -------------- | ----------------------------- | -| `@x` | Inferred (usually `Node`) | -| `@x :: string` | Extract `node.text` as string | -| `@x :: T` | Name the type `T` in codegen | - -Only `:: string` changes data; other `:: T` affect only generated type names. 
- -Example: - -``` -{ - (function_declaration - name: (identifier) @name :: string - body: (_) @body - ) @node -} @func :: FunctionDeclaration -``` - -Output type: - -```typescript -interface FunctionDeclaration { - node: Node; - name: string; // :: string converted this - body: Node; -} - -{ - func: FunctionDeclaration; -} -``` - -### Summary - -| Pattern | Output | -| ----------------------- | ------------------------- | -| `@name` | Field in current scope | -| `(x)? @a` | Optional field | -| `(x)* @a` | Array field | -| `{...} @x` / `[...] @x` | Nested object (new scope) | -| `@x :: string` | String value | -| `@x :: T` | Custom type name | - ---- - -## Nodes - -### Named Nodes - -Match named nodes (non-terminals and named terminals) by type: - -``` -(function_declaration) -(binary_expression (identifier) (number)) -``` - -Children can be partial—this matches any `binary_expression` with at least one `string_literal` child: - -``` -(binary_expression (string_literal)) -``` - -With captures: - -``` -(binary_expression - (identifier) @left - (number) @right) -``` - -Output type: - -```typescript -{ left: Node, right: Node } -``` - -### Anonymous Nodes - -Match literal tokens (operators, keywords, punctuation) with double or single quotes: - -``` -(binary_expression operator: "!=") -(return_statement "return") -``` - -Single quotes are equivalent to double quotes, useful when the query itself is wrapped in double quotes (e.g., in tool calls or JSON): - -``` -(return_statement 'return') -``` - -Anonymous nodes can be captured directly: - -``` -(binary_expression "+" @op) -"return" @keyword -``` - -Output type: - -```typescript -{ - op: Node; -} -{ - keyword: Node; -} -``` - -### Wildcards - -| Syntax | Matches | -| ------ | ----------------------------- | -| `(_)` | Any named node | -| `_` | Any node (named or anonymous) | - -```plotnik/docs/lang-reference.md#L370-371 -(call_expression function: (_) @fn) -(pair key: _ @key value: _ @value) -``` - -### Special Nodes 
- -- `(ERROR)` — matches parser error nodes -- `(MISSING)` — matches nodes inserted by error recovery -- `(MISSING identifier)` — matches a specific missing node type -- `(MISSING ";")` — matches a missing anonymous node - -``` -(ERROR) @syntax_error -(MISSING ";") @missing_semicolon -``` - -Output type: - -```typescript -{ - syntax_error: Node; -} -{ - missing_semicolon: Node; -} -``` - -### Supertypes - -Query abstract node types directly, or narrow with `/`: - -```plotnik/docs/lang-reference.md#L406-409 -(expression) @expr -(expression/binary_expression) @binary -(expression/"()") @empty_parens -``` - ---- - -## Fields - -Constrain children to named fields. A field value must be a node pattern, an alternation, or a quantifier applied to one of these. Groups `{...}` are not allowed as direct field values. - -``` -(assignment_expression - left: (identifier) @target - right: (call_expression) @value) -``` - -Output type: - -```typescript -{ target: Node, value: Node } -``` - -With type annotations: - -``` -(assignment_expression - left: (identifier) @target :: string - right: (call_expression) @value) -``` - -Output type: - -```typescript -{ target: string, value: Node } -``` - -### Negated Fields - -Assert a field is absent with `!`: - -``` -(function_declaration - name: (identifier) @name - !type_parameters) -``` - -Negated fields don't affect the output type—they're purely structural constraints: - -```typescript -{ - name: Node; -} -``` - ---- - -## Quantifiers - -- `?` — zero or one (optional) -- `*` — zero or more -- `+` — one or more (non-empty) - -``` -(function_declaration (decorator)? @decorator) -(function_declaration (decorator)* @decorators) -(function_declaration (decorator)+ @decorators) -``` - -Output types: - -```typescript -{ decorator?: Node } -{ decorators: Node[] } -{ decorators: [Node, ...Node[]] } -``` - -The `+` quantifier always produces non-empty arrays—no opt-out. 
- -Plotnik also supports non-greedy variants: `*?`, `+?`, `??` - ---- - -## Sequences - -Match sibling patterns in order with braces. - -> **⚠️ Syntax Difference from Tree-sitter** -> -> Tree-sitter: `((a) (b))` — parentheses for sequences -> Plotnik: `{(a) (b)}` — braces for sequences -> -> This avoids ambiguity: `(foo)` is always a node, `{...}` is always a sequence. -> Using tree-sitter's `((a) (b))` syntax in Plotnik is a parse error. - -Plotnik uses `{...}` to visually distinguish grouping from node patterns, and adds scope creation when captured (`{...} @name`). - -``` -{ - (comment) - (function_declaration) -} -``` - -Quantifiers apply to sequences: - -``` -{ - (number) - {"," (number)}* -} -``` - -### Sequences with Captures - -Capture elements inside a sequence: - -``` -{ - (decorator)* @decorators - (function_declaration) @fn -} -``` - -Output type: - -```typescript -{ decorators: Node[], fn: Node } -``` - -Capture the entire sequence with a type name: - -``` -{ - (comment)+ - (function_declaration) @fn -}+ @sections :: Section -``` - -Output type: - -```typescript -interface Section { - fn: Node; -} - -{ sections: [Section, ...Section[]] } -``` - ---- - -## Alternations - -Match alternatives with `[...]`: - -- **Untagged**: Fields merge across branches -- **Tagged** (with labels): Discriminated union - -```plotnik/docs/lang-reference.md#L570-573 -[ - (identifier) - (string_literal) -] @value -``` - -### Merge Style (Unlabeled) - -Captures merge: present in all branches → required; some branches → optional. Same-name captures must have compatible types. - -Branches must be type-compatible. Bare nodes are auto-promoted to single-field structs when mixed with structured branches. 
- -``` -(statement - [ - (assignment_expression left: (identifier) @left) - (call_expression function: (identifier) @func) - ]) -``` - -Output type: - -```typescript -{ left?: Node, func?: Node } // each appears in one branch only -``` - -When the same capture appears in all branches: - -``` -[ - (identifier) @name - (string) @name -] -``` - -Output type: - -```typescript -{ - name: Node; -} // required: present in all branches, same type -``` - -Mixed presence: - -``` -[ - (binary_expression - left: (_) @x - right: (_) @y) - (identifier) @x -] -``` - -The second branch `(identifier) @x` is auto-promoted to a structure `{ x: Node }`, making it compatible with the first branch. - -Output type: - -```typescript -{ x: Node, y?: Node } // x in all branches (required), y in one (optional) -``` - -Type mismatch is an error: - -``` -[(identifier) @x :: string (number) @x :: number] // ERROR: @x has different types -``` - -With a capture on the alternation itself, the type is non-optional since exactly one branch must match: - -``` -[ - (identifier) - (number) -] @value -``` - -Output type: - -```typescript -{ - value: Node; -} -``` - -### Tagged Style (Labeled) - -Labels create a discriminated union (`$tag` + `$data`): - -```plotnik/docs/lang-reference.md#L657-660 -[ - Assign: (assignment_expression left: (identifier) @left) - Call: (call_expression function: (identifier) @func) -] @stmt :: Stmt -``` - -```plotnik/docs/lang-reference.md#L664-667 -type Stmt = - | { $tag: "Assign"; $data: { left: Node } } - | { $tag: "Call"; $data: { func: Node } }; -``` - -### Alternations with Type Annotations - -When a merge alternation produces a structure (branches have internal captures), the capture on the alternation must have an explicit type annotation for codegen: - -``` -(call_expression - function: [ - (identifier) @fn - (member_expression property: (property_identifier) @method) - ] @target :: Target) -``` - -Output type: - -```typescript -interface Target { - fn?: Node; - 
method?: Node; -} - -{ - target: Target; -} -``` - ---- - -## Anchors - -The anchor `.` constrains sibling positions. Anchors don't affect types—they're structural constraints. - -First child: - -``` -(array . (identifier) @first) -``` - -Last child: - -``` -(block (_) @last .) -``` - -Immediate adjacency: - -``` -(dotted_name (identifier) @a . (identifier) @b) -``` - -Without the anchor, `@a` and `@b` would match non-adjacent pairs too. - -Output type for all examples: - -```typescript -{ first: Node } -{ last: Node } -{ a: Node, b: Node } -``` - -Anchors ignore anonymous nodes. - ---- - -## Named Expressions - -Define reusable patterns: - -```plotnik/docs/lang-reference.md#L744-748 -BinaryOp = - (binary_expression - left: (_) @left - operator: _ @op - right: (_) @right) -``` - -Use as node types: - -```plotnik/docs/lang-reference.md#L752-752 -(return_statement (BinaryOp) @expr) -``` - -**Encapsulation**: `(Name)` matches but extracts nothing. You must capture (`(Name) @x`) to access fields. This separates structural reuse from data extraction. - -Named expressions define both pattern and type: - -```plotnik/docs/lang-reference.md#L764-764 -Expr = [(BinaryOp) (UnaryOp) (identifier) (number)] -``` - ---- - -## Recursion - -Named expressions can self-reference: - -```plotnik/docs/lang-reference.md#L794-798 -NestedCall = - (call_expression - function: [(identifier) @name (NestedCall) @inner] - arguments: (arguments)) -``` - -Matches `a()`, `a()()`, `a()()()`, etc. 
→ `{ name?: Node, inner?: NestedCall }` - -Tagged recursive example: - -```plotnik/docs/lang-reference.md#L810-815 -MemberChain = [ - Base: (identifier) @name - Access: (member_expression - object: (MemberChain) @object - property: (property_identifier) @property) -] -``` - ---- - -## Full Example - -``` -Statement = [ - Assign: (assignment_expression - left: (identifier) @target :: string - right: (Expression) @value) - Call: (call_expression - function: (identifier) @func :: string - arguments: (arguments (Expression)* @args)) - Return: (return_statement - (Expression)? @value) -] - -Expression = [ - Ident: (identifier) @name :: string - Num: (number) @value :: string - Str: (string) @value :: string -] - -(program (Statement)+ @statements) -``` - -Output types: - -```typescript -type Statement = - | { $tag: "Assign"; $data: { target: string; value: Expression } } - | { $tag: "Call"; $data: { func: string; args: Expression[] } } - | { $tag: "Return"; $data: { value?: Expression } }; - -type Expression = - | { $tag: "Ident"; $data: { name: string } } - | { $tag: "Num"; $data: { value: string } } - | { $tag: "Str"; $data: { value: string } }; - -type Root = { - statements: [Statement, ...Statement[]]; -}; -``` - ---- - -## Quick Reference - -| Feature | Tree-sitter | Plotnik | -| -------------------- | ---------------- | ------------------------- | -| Capture | `@name` | `@name` (snake_case only) | -| Type annotation | | `@x :: T` | -| Text extraction | | `@x :: string` | -| Named node | `(type)` | `(type)` | -| Anonymous node | `"text"` | `"text"` | -| Any node | `_` | `_` | -| Any named node | `(_)` | `(_)` | -| Field constraint | `field: pattern` | `field: pattern` | -| Negated field | `!field` | `!field` | -| Quantifiers | `?` `*` `+` | `?` `*` `+` | -| Non-greedy | | `??` `*?` `+?` | -| Sequence | `((a) (b))` | `{(a) (b)}` | -| Alternation | `[a b]` | `[a b]` | -| Tagged alternation | | `[A: (a) B: (b)]` | -| Anchor | `.` | `.` | -| Named expression | | `Name 
= pattern` | -| Public entrypoint | | `pub Name = pattern` | -| Use named expression | | `(Name)` | - ---- - -## Diagnostics - -Priority-based suppression: when diagnostics overlap, lower-priority ones are hidden. You see the root cause, not cascading symptoms. diff --git a/docs/binary-format/04-types.md b/docs/binary-format/04-types.md index 84153bdd..fb3d0d83 100644 --- a/docs/binary-format/04-types.md +++ b/docs/binary-format/04-types.md @@ -39,7 +39,7 @@ The **TypeMeta** section contains two contiguous arrays: 1. **Definitions**: `[TypeDef; header.type_defs_count]` 2. **Members**: `[TypeMember; header.type_members_count]` -Both `header.type_members_count` and `Slice.ptr` are `u16`, so the addressable range (0..65535) is identical—no capacity mismatch is possible by construction. +**Validation**: For `Struct`/`Enum` kinds, loaders must verify: `(ptr as u32) + (len as u32) ≤ type_members_count`. This prevents out-of-bounds reads from malformed binaries (e.g., `ptr=65000, len=1000` overflows u16 arithmetic). ### 2.1. TypeDef (8 bytes) @@ -104,10 +104,10 @@ Recursive types reference themselves via TypeId. Since types are addressed by in Example query: -```plotnik +``` List = [ Nil: (nil) - Cons: (cons (T) @head (List) @tail) + Cons: (cons (a) @head (List) @tail) ] ``` diff --git a/docs/binary-format/06-transitions.md b/docs/binary-format/06-transitions.md index bd49518f..41dafaae 100644 --- a/docs/binary-format/06-transitions.md +++ b/docs/binary-format/06-transitions.md @@ -67,8 +67,8 @@ EffectOp (u16) └──────────────┴─────────────────────┘ ``` -- **Opcode**: 6 bits (0-63), currently 13 defined -- **Payload**: 10 bits (0-1023), member/variant index +- **Opcode**: 6 bits (0-63), currently 12 defined +- **Payload**: 10 bits (0-1023), member/variant index. Limits struct/enum members to 1024. 
| Opcode | Name | Payload (10b) | | :----- | :------------- | :--------------------- | @@ -79,15 +79,50 @@ EffectOp (u16) | 4 | `StartObject` | - | | 5 | `EndObject` | - | | 6 | `SetField` | Member index (0-1023) | -| 7 | `PushField` | Member index (0-1023) | -| 8 | `StartVariant` | Variant index (0-1023) | -| 9 | `EndVariant` | - | -| 10 | `ToString` | - | -| 11 | `ClearCurrent` | - | -| 12 | `PushNull` | - | +| 7 | `StartVariant` | Variant index (0-1023) | +| 8 | `EndVariant` | - | +| 9 | `ToString` | - | +| 10 | `ClearCurrent` | - | +| 11 | `PushNull` | - | + +**Object vs Scalar List Context**: + +The VM builds **Array of Structs** (AoS), not Structure of Arrays (SoA). This affects opcode usage: + +- **Scalar lists** (`(x)* @items`): `StartArray` → loop(`CaptureNode`, `PushElement`) → `EndArray`, `SetField` +- **Row lists** (`{ (x) @x }* @rows`): `StartArray` → loop(`StartObject`, `CaptureNode`, `SetField`, `EndObject`, `PushElement`) → `EndArray`, `SetField` + +Arrays are built on a value stack and assigned to fields via `SetField`. + +`PushNull` emits explicit null values for: + +- Optional fields when the optional branch is skipped +- Alternation branches missing a capture present in other branches Member/variant indices are resolved via `type_members[struct_or_enum.members.start + index]`. +### Opcode Ranges (Future Extensibility) + +Opcodes are partitioned by argument size: + +| Range | Format | Payload | +| :---- | :---------- | :----------------------------- | +| 0-31 | Single word | 10-bit payload in same word | +| 32-63 | Extended | Next u16 word is full argument | + +Current opcodes (0-11) fit in the single-word range. Future predicates needing `StringId` (u16) use extended format: + +``` +// Single word (current) +SetField: [opcode=6 | member_idx] + +// Extended (future) +AssertEqText: [opcode=32 | reserved], [StringId] +AssertMatch: [opcode=33 | flags], [RegexId] +``` + +This maintains backwards compatibility—existing binaries use only opcodes < 32. 
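The bit split described above (6-bit opcode, 10-bit payload, extended range starting at opcode 32) can be sketched as a small packing helper. This is a minimal illustration, not the project's actual code: the `EffectOp` newtype, its method names, and the choice to place the opcode in the high 6 bits are all assumptions made for the example.

```rust
/// Hypothetical wrapper around the u16 effect word.
/// Assumption: opcode occupies the high 6 bits, payload the low 10 bits.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct EffectOp(u16);

const PAYLOAD_BITS: u32 = 10;
const PAYLOAD_MASK: u16 = (1 << PAYLOAD_BITS) - 1; // 0x03FF, i.e. 0-1023
const MAX_OPCODE: u8 = 63; // 6 bits
const EXTENDED_START: u8 = 32; // opcodes 32-63 carry a following u16 argument

impl EffectOp {
    /// Pack an opcode and payload; returns None if either exceeds its bit width.
    fn new(opcode: u8, payload: u16) -> Option<Self> {
        if opcode > MAX_OPCODE || payload > PAYLOAD_MASK {
            return None;
        }
        Some(EffectOp(((opcode as u16) << PAYLOAD_BITS) | payload))
    }

    /// Extract the 6-bit opcode.
    fn opcode(self) -> u8 {
        (self.0 >> PAYLOAD_BITS) as u8
    }

    /// Extract the 10-bit payload (member or variant index).
    fn payload(self) -> u16 {
        self.0 & PAYLOAD_MASK
    }

    /// True when the next u16 word in the stream is this op's argument.
    fn is_extended(self) -> bool {
        self.opcode() >= EXTENDED_START
    }
}
```

Under these assumptions, `EffectOp::new(6, 3)` would represent `SetField` for member index 3, and a decoder would consult `is_extended()` to decide whether to consume one more word before reading the next op.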
+ ## 4. Instructions All instructions are exactly 8 bytes. @@ -96,7 +131,7 @@ All instructions are exactly 8 bytes. **Epsilon Transitions**: A `MatchExt` with `node_type: None`, `node_field: None`, and `nav: Stay` is an **epsilon transition**—it succeeds unconditionally without cursor interaction. This is critical for: -- **Branching at EOF**: `(A)?` must succeed when no node exists to match +- **Branching at EOF**: `(a)?` must succeed when no node exists to match - **Trailing navigation**: Many queries end with epsilon + `Up(n)` to restore cursor position after matching descendants Epsilon transitions bypass the normal "check node exists → check type → check field" logic entirely. They execute effects and select successors without touching the cursor. @@ -195,13 +230,18 @@ struct MatchPayloadHeader { } ``` -**Body Layout** (contiguous, u16 aligned): +**Body Layout** (contiguous, u16 aligned, matches header order): 1. `pre_effects`: `[EffectOp; pre_count]` -2. `post_effects`: `[EffectOp; post_count]` -3. `negated_fields`: `[u16; neg_count]` +2. `negated_fields`: `[u16; neg_count]` +3. `post_effects`: `[EffectOp; post_count]` 4. `successors`: `[u16; succ_count]` (StepIds) +**Pre vs Post Effects**: + +- `pre_effects`: Execute before match attempt. Used for scope openers (`StartObject`, `StartArray`, `StartVariant`) that must run regardless of which branch succeeds. +- `post_effects`: Execute after successful match. Used for capture/assignment ops (`CaptureNode`, `SetField`, `EndObject`, etc.) that depend on `matched_node`. + **Continuation Logic**: | `succ_count` | Behavior | Use case | @@ -268,29 +308,26 @@ Entry ─ε→ Branch ─ε→ Match ─ε→ Exit Branch.successors = [match, skip] // try match first ``` -The `PushNull` effect on the skip path is required for **Row Integrity** (see [type-system.md](../type-system.md#4-row-integrity)). When `?` captures a synchronized field, the skip branch must emit a null placeholder to keep parallel arrays aligned. 
+The `PushNull` effect on the skip path emits an explicit null value when the optional pattern doesn't match. This distinguishes "not present" (`null`) from "not attempted." In alternations and optional captures, downstream consumers can differentiate between a missing match and a match that produced no value. ## 7. Alternation Compilation -Untagged alternations `[ A B ]` compile to branching with **symmetric effect injection** for row integrity. +Untagged alternations `[ A B ]` compile to branching with **symmetric null injection** for type consistency. -### Row Integrity in Alternations +### Null Injection in Alternations -When a capture appears in some branches but not others, the compiler injects `PushNull` into branches missing that capture: +When a capture appears in some branches but not others, the type system produces an optional field (`x?: T`). The compiler injects `PushNull` into branches missing that capture: ``` -Query: [ (A) @x (B) ] +Query: [ (a) @x (b) ] +Type: { x?: Node } -Branch 1 (A): [CaptureNode, PushField(x)] → Exit -Branch 2 (B): [PushNull, PushField(x)] → Exit +Branch 1 (a): [CaptureNode, SetField(x)] → Exit +Branch 2 (b): [PushNull, SetField(x)] → Exit ↑ injected ``` -In columnar context `([ (A) @x (B) ])*`: - -- Iteration 1 matches A: `x` array gets the node -- Iteration 2 matches B: `x` array gets null placeholder -- Result: `x` array length equals iteration count +The output object always has the `x` field set—either to a node or to null. This matches the type system's merged struct model. 
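The null-injection rule can be sketched as a small function over capture names. This is a simplified illustration under stated assumptions: the `Effect` enum and `inject_nulls` are hypothetical names, each branch is reduced to the list of captures it binds, and fields are emitted in sorted name order rather than interleaved with real match steps as a compiler would do.

```rust
use std::collections::BTreeSet;

/// Hypothetical subset of effect ops, enough to show the injection rule.
#[derive(Debug, Clone, PartialEq, Eq)]
enum Effect {
    CaptureNode,
    PushNull,
    SetField(String),
}

/// For every branch, emit a value plus `SetField` for each capture in the
/// merged struct: `CaptureNode` if the branch binds it, `PushNull` otherwise.
fn inject_nulls(branches: &[Vec<&str>]) -> Vec<Vec<Effect>> {
    // Union of capture names across all branches (sorted for determinism).
    let all: BTreeSet<&str> = branches.iter().flatten().copied().collect();

    branches
        .iter()
        .map(|branch| {
            all.iter()
                .flat_map(|name| {
                    let value = if branch.contains(name) {
                        Effect::CaptureNode
                    } else {
                        Effect::PushNull // injected: branch lacks this capture
                    };
                    [value, Effect::SetField(name.to_string())]
                })
                .collect()
        })
        .collect()
}
```

Calling `inject_nulls(&[vec!["x"], vec![]])` models `[ (a) @x (b) ]`: the first branch gets `CaptureNode, SetField(x)` and the second gets `PushNull, SetField(x)`, so every branch sets the same fields.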
### Multiple Captures @@ -298,17 +335,18 @@ Each missing capture gets its own `PushNull`: ``` Query: [ - { (A) @x (B) @y } - { (C) @x } - (D) + { (a) @x (b) @y } + { (c) @x } + (d) ] +Type: { x?: Node, y?: Node } -Branch 1: [CaptureNode, PushField(x), CaptureNode, PushField(y)] -Branch 2: [CaptureNode, PushField(x), PushNull, PushField(y)] -Branch 3: [PushNull, PushField(x), PushNull, PushField(y)] +Branch 1: [CaptureNode, SetField(x), CaptureNode, SetField(y)] +Branch 2: [CaptureNode, SetField(x), PushNull, SetField(y)] +Branch 3: [PushNull, SetField(x), PushNull, SetField(y)] ``` -This ensures all synchronized fields maintain identical array lengths across iterations. +This ensures the output object has all fields defined, with nulls for unmatched captures. ### Non-Greedy `??` diff --git a/docs/lang-reference.md b/docs/lang-reference.md index f11eebbd..d90d387b 100644 --- a/docs/lang-reference.md +++ b/docs/lang-reference.md @@ -20,29 +20,46 @@ NFA-based cursor walk with backtracking. Comments and "extra" nodes (per tree-sitter grammar) are automatically skipped unless explicitly matched. -```plotnik/docs/lang-reference.md#L24-24 +``` (function_declaration (identifier) @name (block) @body) ``` Matches even with comments between children: -```plotnik/docs/lang-reference.md#L28-31 +```javascript function foo /* comment */() { /* body */ } ``` -The `.` anchor enforces strict adjacency: +### Anchor Behavior + +The `.` anchor enforces adjacency, but its strictness depends on what's being anchored: + +**Between named nodes** — skips trivia, disallows other named nodes: -```plotnik/docs/lang-reference.md#L35-35 -(array . (identifier) @first) ; must be immediately after bracket ``` +(dotted_name (identifier) @a . (identifier) @b) +``` + +Matches `a.b` even if there's a comment like `a /* x */ .b` (trivia skipped), but won't match if another named node appears between them. + +**With anonymous nodes** — strict, nothing skipped: + +``` +(array "[" . 
(identifier) @first) ; must be immediately after bracket +(call_expression (identifier) @fn . "(") ; no trivia between name and paren +``` + +When any side of the anchor is an anonymous node (literal token), the match is exact—no trivia allowed. + +**Rule**: The anchor is as strict as its strictest operand. Anonymous nodes demand precision; named nodes tolerate trivia. ### Partial Matching Node patterns are open—unmentioned children are ignored: -```plotnik/docs/lang-reference.md#L46-46 +``` (binary_expression left: (identifier) @left) ``` @@ -54,7 +71,7 @@ Sequences `{...}` advance through siblings in order, skipping non-matching nodes `field: pattern` requires the child to have that field AND match the pattern: -```plotnik/docs/lang-reference.md#L58-61 +``` (binary_expression left: (identifier) @x right: (number) @y @@ -69,8 +86,8 @@ Fields participate in sequential matching—they're not independent lookups. A `.ptk` file contains definitions: -````plotnik/docs/lang-reference.md#L78-82 -```plotnik +```` +``` ; Helper (can also be used as entrypoint) Expr = [(identifier) (number) (string)] @@ -90,7 +107,7 @@ plotnik exec -q '(identifier) @id' -s app.js **Module** (`.ptk` files): Only named definitions allowed. -```plotnik +``` ; ERROR in .ptk file (identifier) @id @@ -124,7 +141,7 @@ Inferred from directory name (`queries.ts/` → TypeScript, `java-checks/` → J `helpers.ptk`: -```plotnik/docs/lang-reference.md#L147-153 +``` Ident = (identifier) DeepSearch = [ @@ -135,7 +152,7 @@ DeepSearch = [ `main.ptk`: -```plotnik +``` AllIdentifiers = (program (DeepSearch)*) ``` @@ -155,13 +172,11 @@ Tree-sitter allows `@function.name`; Plotnik requires `@function_name` because c ## Data Model -Plotnik infers output types from your query. The key rule may surprise you—but it's intentional for schema stability. +Plotnik infers output types from your query. See [Type System](type-system.md) for full details. ### Flat by Default -Query nesting does NOT create output nesting. 
All captures become fields in a single flat record. - -**Why?** Adding a new `@capture` to an existing query shouldn't break downstream code using other captures. Flat output makes capture additions non-breaking. See [Type System](type-system.md#design-philosophy) for the full rationale. +Query nesting does NOT create output nesting. All captures bubble up to the nearest scope boundary: ``` (function_declaration @@ -176,13 +191,28 @@ Output type: { name: Node, retval: Node } // flat, not nested ``` -The pattern is 4 levels deep, but the output is flat. This is intentional: you're usually extracting specific pieces from an AST, not reconstructing its shape. +The pattern is 4 levels deep, but the output is flat. You're extracting specific pieces from an AST, not reconstructing its shape. + +### Strict Dimensionality + +**Quantifiers (`*`, `+`) containing internal captures require an explicit row capture.** + +``` +// ERROR: internal capture without row structure +(method_definition name: (identifier) @name)* + +// OK: explicit row capture +{ (method_definition name: (identifier) @name) @method }* @methods +→ { methods: { method: Node, name: Node }[] } +``` + +This prevents association loss—each row is a distinct object, not parallel arrays that lose per-iteration grouping. See [Type System: Strict Dimensionality](type-system.md#1-strict-dimensionality). ### The Node Type Default capture type—a reference to a tree-sitter node: -```plotnik/docs/lang-reference.md#L205-210 +``` interface Node { kind: string; // e.g. "identifier" text: string; // source text @@ -193,14 +223,22 @@ interface Node { ### Cardinality: Quantifiers → Arrays -Quantifiers on the captured pattern determine whether a field is singular, optional, or an array: +Quantifiers determine whether a field is singular, optional, or an array: + +| Pattern | Output Type | Meaning | +| --------- | ---------------- | -------------------------- | +| `(x) @a` | `a: T` | exactly one | +| `(x)? 
@a` | `a?: T` | zero or one | +| `(x)* @a` | `a: T[]` | zero or more (scalar list) | +| `(x)+ @a` | `a: [T, ...T[]]` | one or more (scalar list) | + +Scalar lists work when the quantified pattern has **no internal captures**. For patterns with internal captures, use row lists: -| Pattern | Output Type | Meaning | -| --------- | ---------------- | ------------ | -| `(x) @a` | `a: T` | exactly one | -| `(x)? @a` | `a?: T` | zero or one | -| `(x)* @a` | `a: T[]` | zero or more | -| `(x)+ @a` | `a: [T, ...T[]]` | one or more | +| Pattern | Output Type | Meaning | +| -------------- | ---------------- | ------------------------------------ | +| `{...}* @rows` | `rows: T[]` | zero or more rows | +| `{...}+ @rows` | `rows: [T, ...]` | one or more rows | +| `{...}? @row` | `row?: T` | optional row (bubbles if uncaptured) | ### Creating Nested Structure @@ -262,14 +300,15 @@ interface FunctionDeclaration { ### Summary -| Pattern | Output | -| ----------------------- | ------------------------- | -| `@name` | Field in current scope | -| `(x)? @a` | Optional field | -| `(x)* @a` | Array field | -| `{...} @x` / `[...] @x` | Nested object (new scope) | -| `@x :: string` | String value | -| `@x :: T` | Custom type name | +| Pattern | Output | +| ----------------------- | ----------------------------------- | +| `@name` | Field in current scope | +| `(x)? @a` | Optional field | +| `(x)* @a` | Scalar array (no internal captures) | +| `{...}* @rows` | Row array (with internal captures) | +| `{...} @x` / `[...] 
@x` | Nested object (new scope) | +| `@x :: string` | String value | +| `@x :: T` | Custom type name | --- @@ -344,7 +383,7 @@ Output type: | `(_)` | Any named node | | `_` | Any node (named or anonymous) | -```plotnik/docs/lang-reference.md#L370-371 +``` (call_expression function: (_) @fn) (pair key: _ @key value: _ @value) ``` @@ -376,7 +415,7 @@ Output type: Query abstract node types directly, or narrow with `/`: -```plotnik/docs/lang-reference.md#L406-409 +``` (expression) @expr (expression/binary_expression) @binary (expression/"()") @empty_parens @@ -535,7 +574,7 @@ Match alternatives with `[...]`: - **Untagged**: Fields merge across branches - **Tagged** (with labels): Discriminated union -```plotnik/docs/lang-reference.md#L570-573 +``` [ (identifier) (string_literal) @@ -625,14 +664,14 @@ Output type: Labels create a discriminated union (`$tag` + `$data`): -```plotnik/docs/lang-reference.md#L657-660 +``` [ Assign: (assignment_expression left: (identifier) @left) Call: (call_expression function: (identifier) @func) ] @stmt :: Stmt ``` -```plotnik/docs/lang-reference.md#L664-667 +``` type Stmt = | { $tag: "Assign"; $data: { left: Node } } | { $tag: "Call"; $data: { func: Node } }; @@ -669,6 +708,21 @@ interface Target { The anchor `.` constrains sibling positions. Anchors don't affect types—they're structural constraints. +### Anchor Strictness + +Anchor behavior depends on the node types being anchored: + +| Pattern | Trivia Between | Named Nodes Between | +| ----------- | -------------- | ------------------- | +| `(a) . (b)` | Allowed | Disallowed | +| `"x" . (b)` | Disallowed | Disallowed | +| `(a) . "x"` | Disallowed | Disallowed | +| `"x" . "y"` | Disallowed | Disallowed | + +When anchoring named nodes, trivia (comments, whitespace) is skipped but no other named nodes may appear between. When any operand is an anonymous node (literal token), the anchor enforces exact adjacency—nothing in between. 
+ +### Position Anchors + First child: ``` @@ -681,15 +735,25 @@ Last child: (block (_) @last .) ``` -Immediate adjacency: +### Adjacency Anchors ``` (dotted_name (identifier) @a . (identifier) @b) ``` -Without the anchor, `@a` and `@b` would match non-adjacent pairs too. +Without the anchor, `@a` and `@b` would match non-adjacent pairs too. With the anchor, only consecutive identifiers match (trivia like comments between them is tolerated). + +For strict token-level adjacency: + +``` +(call_expression (identifier) @fn . "(") +``` + +Here, no trivia is allowed between the function name and the opening parenthesis because `"("` is an anonymous node. -Output type for all examples: +### Output Types + +Anchors are structural constraints only—they don't affect output types: ```typescript { first: Node } @@ -705,7 +769,7 @@ Anchors ignore anonymous nodes. Define reusable patterns: -```plotnik/docs/lang-reference.md#L744-748 +``` BinaryOp = (binary_expression left: (_) @left @@ -715,7 +779,7 @@ BinaryOp = Use as node types: -```plotnik/docs/lang-reference.md#L752-752 +``` (return_statement (BinaryOp) @expr) ``` @@ -723,7 +787,7 @@ Use as node types: Named expressions define both pattern and type: -```plotnik/docs/lang-reference.md#L764-764 +``` Expr = [(BinaryOp) (UnaryOp) (identifier) (number)] ``` @@ -733,7 +797,7 @@ Expr = [(BinaryOp) (UnaryOp) (identifier) (number)] Named expressions can self-reference: -```plotnik/docs/lang-reference.md#L794-798 +``` NestedCall = (call_expression function: [(identifier) @name (NestedCall) @inner] @@ -744,7 +808,7 @@ Matches `a()`, `a()()`, `a()()()`, etc. 
→ `{ name?: Node, inner?: NestedCall } Tagged recursive example: -```plotnik/docs/lang-reference.md#L810-815 +``` MemberChain = [ Base: (identifier) @name Access: (member_expression diff --git a/docs/runtime-engine.md b/docs/runtime-engine.md index 405f2295..5b5f8a1c 100644 --- a/docs/runtime-engine.md +++ b/docs/runtime-engine.md @@ -42,11 +42,11 @@ Fetch block at `ip` → dispatch by `type_id` → execute → update `ip`. ### Epsilon Transitions -A `MatchExt` with `node_type: None` and `nav: Stay` is an **epsilon transition**—it succeeds unconditionally without cursor interaction. This enables pure control-flow decisions (branching for quantifiers) even when the cursor is exhausted (EOF). +A `MatchExt` with `node_type: None`, `node_field: None`, and `nav: Stay` is an **epsilon transition**—it succeeds unconditionally without cursor interaction. This enables pure control-flow decisions (branching for quantifiers) even when the cursor is exhausted (EOF). Common patterns: -- **Quantifier branches**: `(A)?` uses epsilon to decide match-or-skip +- **Quantifier branches**: `(a)?` uses epsilon to decide match-or-skip - **Trailing cleanup**: Many queries end with epsilon + `Up(n)` to restore cursor position after matching, regardless of tree depth ### Call (0x02) @@ -101,15 +101,24 @@ struct Frame { ### Pruning -Problem: `(A)+` accumulates frames forever. Solution: high-water mark pruning after `Return`: +Problem: `(a)+` accumulates frames forever. Solution: high-water mark pruning after `Return`: ``` -high_water = max(current_frame_idx, max_checkpoint_watermark) +high_water = max(current_frame_idx, checkpoint_stack.max_frame_ref) arena.truncate(high_water + 1) ``` Bounds arena to O(max_checkpoint_depth + current_call_depth). +**O(1) Invariant**: The checkpoint stack maintains `max_frame_ref`—the highest `frame_index` referenced by any active checkpoint. 
+ +| Operation | Invariant Update | Complexity | +| --------- | ---------------------------------------------------- | -------------- | +| Push | `max_frame_ref = max(max_frame_ref, cp.frame_index)` | O(1) | +| Pop | Recompute only if popping the max holder | O(1) amortized | + +Amortized analysis: each checkpoint contributes to at most one recomputation over its lifetime. + ### Call/Return Each call site stores its return address in the pushed frame. The `ref_id` check catches stack corruption (malformed IR or VM bug). @@ -146,18 +155,17 @@ struct EffectStream<'a> { } ``` -| Effect | Action | -| ------------------- | ---------------------------------- | -| CaptureNode | Push `matched_node` | -| Start/EndObject | Object boundaries | -| SetField(id) | Assign to field | -| PushField(id) | Append to array field (columnar) | -| Start/EndArray | Array boundaries | -| PushElement | Append to array | -| Start/EndVariant(t) | Tagged union boundaries | -| ToString | Node → source text | -| ClearCurrent | Reset current value | -| PushNull | Null placeholder (`?` in columnar) | +| Effect | Action | +| ------------------- | --------------------------------------- | +| CaptureNode | Push `matched_node` | +| Start/EndObject | Object boundaries | +| SetField(id) | Assign to field | +| Start/EndArray | Array boundaries | +| PushElement | Append to array | +| Start/EndVariant(t) | Tagged union boundaries | +| ToString | Node → source text | +| ClearCurrent | Reset current value | +| PushNull | Null placeholder (optional/alternation) | ### Materialization diff --git a/docs/type-system.md b/docs/type-system.md index 6f221277..53b0ee01 100644 --- a/docs/type-system.md +++ b/docs/type-system.md @@ -4,43 +4,39 @@ Plotnik infers static types from query structure. This governs how captures mate ## Design Philosophy -Plotnik prioritizes **schema evolution** and **refactoring safety** over local intuition. +Plotnik prioritizes **predictability** and **structural clarity** over terseness. 
Two principles guide the type system: -1. **Additive captures are non-breaking**: Adding a new `@capture` to an existing query should not invalidate downstream code that uses other captures. +1. **Explicit structure**: Captures bubble up to the nearest scope boundary. To create nested output, you must explicitly capture a group (`{...} @name`). -2. **Extract-refactor equivalence**: Moving a pattern fragment into a named definition should not change the output shape. +2. **Strict dimensionality**: Quantifiers (`*`, `+`) containing captures require an explicit row capture. This prevents parallel arrays where `a[i]` and `b[i]` lose their per-iteration association. -These constraints produce designs that may initially surprise users (parallel arrays instead of row objects, transparent scoping instead of nesting), but enable queries to evolve without breaking consumers. +### Why Strictness -### Why Parallel Arrays - -Traditional row-oriented output breaks when queries evolve: +Permissive systems create surprises: ``` -// v1: Extract names -(identifier)* @names -→ { names: Node[] } +// Permissive: implicit parallel arrays +{ (key) @k (value) @v }* +→ { k: Node[], v: Node[] } // Are k[0] and v[0] related? Maybe... -// v2: Also extract types (row-oriented would require restructuring) -{ (identifier) @name (type) @type }* @items -→ { items: [{ name, type }, ...] 
} // BREAKING: names[] is gone +// Iteration 1: k="a", v="1" +// Iteration 2: k="b", v="2" +// Output: { k: ["a","b"], v: ["1","2"] } // Association lost in flat arrays ``` -Plotnik's columnar approach: +Plotnik's strict approach: ``` -// v1 -(identifier)* @names -→ { names: Node[] } +// Strict: explicit row structure +{ (key) @k (value) @v }* @pairs +→ { pairs: { k: Node, v: Node }[] } // Each pair is a distinct object -// v2: Add types alongside -{ (identifier) @names (type) @types }* -→ { names: Node[], types: Node[] } // NON-BREAKING: names[] unchanged +// Output: { pairs: [{ k: "a", v: "1" }, { k: "b", v: "2" }] } ``` -Existing code using `result.names[i]` continues to work. +The explicit `@pairs` capture tells both the compiler and reader: "this is a list of structured rows." ### Why Transparent Scoping @@ -59,22 +55,125 @@ Func = (function name: (identifier) @name) If definitions created implicit boundaries, extraction would wrap output in a new struct, breaking downstream types. -## Mental Model +## 1. Strict Dimensionality + +This is the core rule that prevents association loss. + +### The Rule + +**Any quantified pattern (`*`, `+`) containing captures must have an explicit row capture.** + +| Pattern | Status | Reason | +| --------------------------------- | ------- | ------------------------------------------ | +| `(identifier)* @ids` | ✓ Valid | No internal captures → scalar list | +| `{ (a) @a (b) @b }* @rows` | ✓ Valid | Internal captures + row capture → row list | +| `{ (a) @a (b) @b }*` | ✗ Error | Internal captures, no row capture | +| `(func (id) @name)*` | ✗ Error | Internal capture, no row structure | +| `(func (id) @name)* @funcs` | ✗ Error | `@funcs` captures nodes, not rows | +| `(Item)*` where Item has captures | ✗ Error | Transitive: definition's captures count | + +### Transitive Application + +Strict dimensionality applies **transitively through definitions**. 
Since definitions are transparent (captures bubble up), quantifying a definition that contains captures is equivalent to quantifying those captures directly: + +``` +// Definition with capture +Item = (pair (key) @k (value) @v) + +// These are equivalent after expansion: +(Item)* // ✗ Error +(pair (key) @k (value) @v)* // ✗ Error (same thing) + +// Fix: wrap in row capture +{ (Item) @item }* @items // ✓ Valid +``` + +The compiler expands definitions before validating strict dimensionality. This prevents a loophole where extracting a pattern into a definition would bypass the rule. + +### Scalar Lists -| Operation | Nested (tree-sitter) | Transparent (Plotnik) | -| ------------------ | -------------------- | --------------------- | -| Extract definition | `res.def.x` | `res.x` (unchanged) | -| List of items | Implicit row struct | Explicit `{...} @row` | -| Capture collision | Silent data loss | Compiler error | -| Fix collision | Manual re-capture | Wrap: `(Def) @alias` | +When the quantified pattern has **no internal captures**, the outer capture collects nodes directly: -## 1. Transparent Graph Model +``` +(decorator)* @decorators +→ { decorators: Node[] } + +(identifier)+ @names +→ { names: [Node, ...Node[]] } // Non-empty array +``` + +Use case: collecting simple tokens (identifiers, keywords, literals). + +### Row Lists + +When the quantified pattern **has internal captures**, wrap in a sequence and capture the sequence: + +``` +{ + (decorator) @dec + (function_declaration) @fn +}* @items +→ { items: { dec: Node, fn: Node }[] } +``` + +For node patterns with internal captures, wrap explicitly: + +``` +// ERROR: internal capture without row structure +(parameter (identifier) @name)* + +// OK: explicit row +{ (parameter (identifier) @name) @param }* @params +→ { params: { param: Node, name: Node }[] } +``` + +### Optional Bubbling + +The `?` quantifier does **not** add dimensionality—it produces at most one value, not a list. 
Therefore, optional groups without a row capture are allowed: + +``` +{ (decorator) @dec }? +→ { dec?: Node } // Bubbles to parent as optional field + +{ (modifier) @mod (decorator) @dec }? +→ { mod?: Node, dec?: Node } // Both bubble as optional +``` + +This lets optional fragments contribute fields directly to the parent struct without forcing an extra wrapper object. + +### Why This Matters + +Consider extracting methods from classes: + +``` +// What we want: list of method objects +(class_declaration + body: (class_body + { (method_definition + name: (property_identifier) @name + parameters: (formal_parameters) @params + ) @method + }* @methods)) +→ { methods: { method: Node, name: Node, params: Node }[] } + +// Without strict dimensionality, you might write: +(class_declaration + body: (class_body + (method_definition + name: (property_identifier) @name + parameters: (formal_parameters) @params)*)) +→ { name: Node[], params: Node[] } // Parallel arrays—which name goes with which params? +``` + +The strict rule forces you to think about structure upfront. + +## 2. Scope Model ### Universal Bubbling Scopes are transparent by default. Captures bubble up through definitions and containers until hitting an explicit scope boundary. -This enables reusable fragments ("mixins") that contribute fields to parent output without creating nesting. +This enables reusable pattern fragments that contribute fields directly to parent output without creating nesting. - **Definitions (`Def = ...`)**: Transparent (macro-like) - **Uncaptured Containers (`{...}`, `[...]`)**: Transparent @@ -88,7 +187,7 @@ New data structures are created only when explicitly requested: 2. **Captured Alternations**: `[...] @name` → Union 3. **Tagged Alternations**: `[ L: ... ] @name` → Tagged Union -## 2. Data Shapes +## 3. Data Shapes ### Structs @@ -96,26 +195,28 @@ Created by `{ ... 
} @name`: | Captures | Result | | -------- | ---------------------------------- | -| 0 | `Void` | +| 0 | `Struct {}` (Empty) | | 1+ | `Struct { field_1, ..., field_N }` | -**No Implicit Unwrap**: `(node) @x` produces `{ x: Node }`, never bare `Node`. Adding fields later is non-breaking. +**No Implicit Unwrap**: `(node) @x` produces `{ x: Node }`, never bare `Node`. + +**Empty Structs**: `{ ... } @x` with no internal captures produces `{ x: {} }`. This ensures `x` is always an object, so adding fields later is non-breaking. ### Unions Created by `[ ... ]`: -- **Tagged**: `[ L1: (A) @a L2: (B) @b ]` → `{ "$tag": "L1", "$data": { a: Node } }` -- **Untagged**: `[ (A) @a (B) @b ]` → `{ a?: Node, b?: Node }` (merged) +- **Tagged**: `[ L1: (a) @a L2: (b) @b ]` → `{ "$tag": "L1", "$data": { a: Node } }` +- **Untagged**: `[ (a) @a (b) @b ]` → `{ a?: Node, b?: Node }` (merged) ### Enum Variants -| Captures | Payload | -| -------- | --------- | -| 0 | Unit/Void | -| 1+ | Struct | +| Captures | Payload | +| -------- | ------------------- | +| 0 | `Struct {}` (Empty) | +| 1+ | Struct | -```plotnik/docs/type-system.md#L58-61 +``` Result = [ Ok: (value) @val Err: (error (code) @code (message) @msg) @@ -124,80 +225,40 @@ Result = [ Single-capture variants stay wrapped (`result.$data.val`), making field additions non-breaking. -## 3. Parallel Arrays (Columnar Output) - -Quantifiers (`*`, `+`) produce arrays per-field, not lists of objects: - -```plotnik/docs/type-system.md#L75-75 -{ (Key) @k (Value) @v }* -``` - -Output: `{ "k": ["key1", "key2"], "v": ["val1", "val2"] }` - -This Struct-of-Arrays layout enables non-breaking schema evolution: adding `@newfield` to an existing loop doesn't restructure existing fields. It also avoids implicit row creation and is efficient for columnar analysis. - -For List-of-Objects, wrap explicitly: - -```plotnik/docs/type-system.md#L84-84 -( { (Key) @k (Value) @v } @entry )* -``` - -Output: `{ "entry": [{ "k": "key1", "v": "val1" }, ...] }` - -## 4. 
Row Integrity +## 4. Cardinality -Parallel arrays require `a[i]` to correspond to `b[i]`. The compiler enforces this: +Quantifiers determine whether a field is singular, optional, or an array: -**Rule**: Quantified scopes cannot mix synchronized and desynchronized fields. +| Pattern | Output Type | Meaning | +| --------- | ---------------- | ------------ | +| `(x) @a` | `a: T` | exactly one | +| `(x)? @a` | `a?: T` | zero or one | +| `(x)* @a` | `a: T[]` | zero or more | +| `(x)+ @a` | `a: [T, ...T[]]` | one or more | -| Type | Cardinality | Behavior | -| -------------- | ----------- | ---------------------------------------------------- | -| Synchronized | `1` or `?` | One value per iteration (`?` emits null when absent) | -| Desynchronized | `*` or `+` | Variable values per iteration | +### Row Cardinality -`?` is synchronized because it emits null placeholders—like nullable columns in Arrow/Parquet. +When using row lists, the outer quantifier determines list cardinality: -### Nested Quantifiers - -Cardinality multiplies through nesting: - -| Outer | Inner | Result | -| ----- | ----- | ------ | -| `1` | `*` | `*` | -| `*` | `1` | `*` | -| `*` | `*` | `*` | -| `+` | `+` | `+` | -| `?` | `+` | `*` | - -Example: - -```plotnik/docs/type-system.md#L123-123 -{ (A)* @a (B) @b }* // ERROR: @a is *, @b is 1 -{ (A)? @a (B) @b }* // OK: both synchronized ``` - -Fixes: - -```plotnik/docs/type-system.md#L128-129 -{ (A)* @a (B)* @b }* // Both columnar -{ { (A)* @a (B) @b } @row }* // Wrap for rows +{ (a) @a (b) @b }* @rows → rows: { a: T, b: T }[] +{ (a) @a (b) @b }+ @rows → rows: [{ a: T, b: T }, ...] +{ (a) @a (b) @b }? 
@row → row?: { a: T, b: T } ``` -### Multiple Desynchronized Fields - -When multiple `*`/`+` fields coexist, each produces an independent array with no alignment guarantee: - -``` -{ (A)* @a (B)* @b }* -``` +### Nested Quantifiers -If iteration 1 yields `a: [1,2,3], b: [x]` and iteration 2 yields `a: [4], b: [y,z]`, the result is: +Within a row, inner quantifiers apply to fields: ``` -{ a: [1,2,3,4], b: [x,y,z] } // lengths differ, no row correspondence +{ + (decorator)* @decs // Array field within each row + (function) @fn // Singular field within each row +}* @items +→ { items: { decs: Node[], fn: Node }[] } ``` -This is valid columnar concatenation—arrays are independent streams. If you need per-iteration grouping, wrap with `{...} @row`. +Each row has its own `decs` array—no cross-row mixing. ## 5. Type Unification in Alternations @@ -209,20 +270,20 @@ Shallow unification across untagged branches: | Same capture, some branches | Optional | | Type mismatch | Compile error | -```plotnik/docs/type-system.md#L140-160 +``` [ - (A) @x - (B) @x + (a) @x + (b) @x ] // x: Node (required) [ - (_ (A) @x (B) @y) - (_ (A) @x) + (_ (a) @x (b) @y) + (_ (a) @x) ] // x: Node, y?: Node [ - (A) @x ::string - (B) @x + (a) @x ::string + (b) @x ] // ERROR: String vs Node ``` @@ -230,36 +291,21 @@ Shallow unification across untagged branches: When a quantified capture appears in some branches but not others, the result is `Array | null`: -```plotnik/docs/type-system.md#L166-170 +``` [ - (A)+ @x - (B) + (a)+ @x + (b) ] // x: Node[] | null ``` -The missing branch emits `PushNull`, not an empty array. This distinction matters for columnar output—`null` indicates "branch didn't match" vs `[]` meaning "matched zero times." - -Note the `*` vs `+` difference: - -``` -[ (A)+ @x (B) ] // x: Node[] | null — null means B branch -[ (A)* @x (B) ] // x: Node[] | null — null means B branch, [] means A matched zero times -``` - -In the `*` case, `null` and `[]` are semantically distinct. 
Check explicitly: - -```typescript -if (result.x !== null) { - // A branch matched (possibly zero times if x.length === 0) -} -``` +The missing branch emits `null`, not an empty array. This distinction matters: `null` means "branch didn't match" vs `[]` meaning "matched zero times." For type conflicts, use tagged alternations: -```plotnik/docs/type-system.md#L157-160 +``` [ - Str: (A) @x ::string - Node: (B) @x + Str: (a) @x ::string + Node: (b) @x ] @result ``` @@ -274,7 +320,7 @@ For type conflicts, use tagged alternations: Top-level fields merge with optionality; nested mismatches are errors: -```/dev/null/merge.txt#L1-8 +``` // OK: top-level merge { x: Node, y: Node } ∪ { x: Node, z: String } → { x: Node, y?: Node, z?: String } @@ -298,7 +344,7 @@ Self-referential types via: ### Example -```plotnik/docs/type-system.md#L213-219 +``` Expr = [ Lit: (number) @value ::string Binary: (binary_expression @@ -310,7 +356,7 @@ Expr = [ ### Requirements -```plotnik/docs/type-system.md#L226-232 +``` Loop = (Loop) // ERROR: no escape path Expr = [ Lit: (n) @n Rec: (Expr) @e ] // OK: Lit escapes @@ -325,25 +371,11 @@ B = (bar (A)) // OK: descends each step Recursive definitions get automatic type boundaries: -```plotnik/docs/type-system.md#L240-241 +``` NestedCall = (call_expression function: [(identifier) @name (NestedCall) @inner]) ``` -### Recursive Deep Search - -Combines recursion with bubbling for flat output: - -```plotnik/docs/type-system.md#L249-253 -DeepSearch = [ - (identifier) @target - (_ (DeepSearch)*) -] -AllIdentifiers = (program (DeepSearch)*) -``` - -Output: `{ target: Node[] }` — flat array regardless of tree depth. - ## 7. Type Metadata For codegen, types are named: