| 1 | +# Zig Engine Roadmap |
| 2 | + |
| 3 | +Learnings from sibling Zig repos, prioritized by impact on QueryMode's WASM engine. |
| 4 | + |
| 5 | +## P0: Selection Vectors + Late Materialization (from lanceql) — PARTIALLY DONE |
| 6 | + |
| 7 | +**Source:** `../lanceql/src/sql/late_materialization.zig`, `../lanceql/src/query/vector_engine.zig` |
| 8 | + |
| 9 | +**Already exists in Zig:** `wasm/src/query/vector_engine.zig` has SelectionVector, DataChunk, Vector types (DuckDB-style, VECTOR_SIZE=2048). `wasm/src/columnar_ops.zig` has SIMD filter ops returning row indices. |
| 10 | + |
| 11 | +**Done (TS layer):** |
| 12 | +- ScanOperator now applies filters during scan using WASM SIMD (`filterFloat64Buffer`/`filterInt32Buffer`) before row materialization |
| 13 | +- `buildPipeline` skips FilterOperator when ScanOperator handles filters |
| 14 | +- Parquet bounded path registers decoded columns in WASM and uses `executeQuery()` (SIMD filter/sort/agg) instead of JS row-by-row |
| 15 | + |
| 16 | +**Remaining:** |
| 17 | +- True two-phase execution: decode only filter columns first, get matching indices, then decode projection columns only for matches (saves string decode cost) |
| 18 | +- Connect TS pipeline to Zig SelectionVector/DataChunk types for full columnar execution |
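
The two-phase idea in the remaining work can be sketched roughly as below. This is a minimal illustration, not the engine's code: `LazyColumn`, `ArrayColumn`, and `twoPhaseScan` are hypothetical names, and a real implementation would run the filter in WASM rather than with a JS predicate.

```typescript
// Hypothetical column abstraction: none of these names are QueryMode APIs.
interface LazyColumn<T> {
  decodeAll(): T[];                 // full decode (cheap for numeric filter columns)
  decodeAt(rows: Uint32Array): T[]; // decode only the requested row indices
}

// In-memory stand-in that counts how many values were materialized.
class ArrayColumn<T> implements LazyColumn<T> {
  decoded = 0;
  private data: T[];
  constructor(data: T[]) { this.data = data; }
  decodeAll(): T[] {
    this.decoded += this.data.length;
    return this.data.slice();
  }
  decodeAt(rows: Uint32Array): T[] {
    this.decoded += rows.length;
    return Array.from(rows, (r) => this.data[r]);
  }
}

function twoPhaseScan<F, P>(
  filterCol: LazyColumn<F>,
  predicate: (v: F) => boolean,
  projectionCol: LazyColumn<P>,
): { rows: Uint32Array; values: P[] } {
  // Phase 1: decode only the filter column and collect matching row indices.
  const filterValues = filterCol.decodeAll();
  const matches: number[] = [];
  for (let i = 0; i < filterValues.length; i++) {
    if (predicate(filterValues[i])) matches.push(i);
  }
  const rows = Uint32Array.from(matches);
  // Phase 2: decode projection values for the matching rows only.
  return { rows, values: projectionCol.decodeAt(rows) };
}
```

With a selective filter, the projection column (for example a string column) is decoded only for the surviving rows, which is where the decode savings would come from.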
| 19 | + |
| 20 | +**Impact:** Peak memory drops from ~128MB to ~12MB on 1M row queries. |
| 21 | + |
| 22 | +**Files modified:** `src/operators.ts`, `src/wasm-engine.ts`, `src/query-do.ts` |
| 23 | + |
| 24 | +## P1: SIMD128 Filter Predicates (from vectorjson + edgebox) — DONE (numeric) |
| 25 | + |
| 26 | +**Source:** `../vectorjson/src/zig/simd.zig`, `../edgebox/src/simd_utils.zig` |
| 27 | + |
| 28 | +**Done:** |
| 29 | +- `filterFloat64Buffer`: SIMD128 with `@Vector(2, f64)`, comparing 2 f64 values per vector operation, plus a scalar tail |
| 30 | +- `filterInt32Buffer`: SIMD128 with `@Vector(4, i32)`, comparing 4 i32 values per vector operation, plus a scalar tail |
| 31 | +- `intersectIndices`: O(n+m) sorted merge (was O(n*m) nested loop) |
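
For reference, the O(n+m) sorted-merge intersection mentioned above looks roughly like this. The actual `intersectIndices` lives in Zig; this TypeScript version is only an illustrative sketch, and the name `intersectSorted` is an assumption. It assumes both inputs are sorted ascending with no duplicates.

```typescript
// Sketch of a two-pointer merge over sorted row-index arrays.
// Assumes both inputs are ascending and duplicate-free.
function intersectSorted(a: Uint32Array, b: Uint32Array): Uint32Array {
  const out = new Uint32Array(Math.min(a.length, b.length));
  let i = 0;
  let j = 0;
  let n = 0;
  while (i < a.length && j < b.length) {
    if (a[i] < b[j]) {
      i++;            // a is behind, advance it
    } else if (a[i] > b[j]) {
      j++;            // b is behind, advance it
    } else {
      out[n++] = a[i]; // common index, emit once and advance both
      i++;
      j++;
    }
  }
  return out.subarray(0, n);
}
```

Each loop iteration advances at least one cursor, so the total work is bounded by `a.length + b.length`.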
| 32 | + |
| 33 | +**Remaining:** |
| 34 | +- Comptime `anyMatch` pattern for string column scanning |
| 35 | +- SIMD null bitmap evaluation |
| 36 | + |
| 37 | +**Files modified:** `wasm/src/wasm/aggregates.zig` |
| 38 | + |
| 39 | +## P1: Arena Allocator Per Batch (from edgebox) — ALREADY SOLVED |
| 40 | + |
| 41 | +**Source:** `../edgebox/src/native_arena.zig` |
| 42 | + |
| 43 | +**Status:** Already effectively solved. WASM engine uses `std.heap.WasmAllocator` (bump allocator — linear memory, no free, no fragmentation). TS calls `resetHeap()` between queries. This is equivalent to arena-per-query. See `wasm/src/wasm/memory.zig`. |
| 44 | + |
| 45 | +**No action needed.** |
| 46 | + |
| 47 | +## P2: Vectorized WHERE Evaluation (from lanceql) — PARTIALLY DONE |
| 48 | + |
| 49 | +**Source:** `../lanceql/src/sql/where_eval.zig` |
| 50 | + |
| 51 | +**Done:** TS `scanFilterIndices()` handles compound AND (intersect index arrays via WASM `intersectIndices`). WASM SQL path (`executeSql`) already evaluates WHERE vectorized in Zig. |
| 52 | + |
| 53 | +**Remaining:** |
| 54 | +- OR support in TS scan-time filter (union index arrays via `unionIndices`) |
| 55 | +- Short-circuit evaluation: if first AND filter returns 0 matches, skip remaining filters |
| 56 | +- Complex expressions (BETWEEN, LIKE) in the WASM filter fast path |
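
The first two remaining items could be sketched as follows. This is a hedged illustration: `unionSorted` stands in for the planned `unionIndices`, `evalAnd` is a hypothetical helper, and filters are modeled as thunks so that skipped filters are never evaluated. Inputs are assumed sorted ascending and duplicate-free.

```typescript
// OR: merge two sorted index arrays, emitting shared indices once.
function unionSorted(a: Uint32Array, b: Uint32Array): Uint32Array {
  const out = new Uint32Array(a.length + b.length);
  let i = 0;
  let j = 0;
  let n = 0;
  while (i < a.length && j < b.length) {
    if (a[i] < b[j]) out[n++] = a[i++];
    else if (a[i] > b[j]) out[n++] = b[j++];
    else { out[n++] = a[i++]; j++; }
  }
  while (i < a.length) out[n++] = a[i++];
  while (j < b.length) out[n++] = b[j++];
  return out.subarray(0, n);
}

// AND with short-circuit: stop as soon as the running result is empty.
// `intersect` is passed in; in the engine it would be the WASM `intersectIndices`.
type IndexFilter = () => Uint32Array;

function evalAnd(
  filters: IndexFilter[],
  intersect: (a: Uint32Array, b: Uint32Array) => Uint32Array,
): Uint32Array {
  if (filters.length === 0) throw new Error("evalAnd needs at least one filter");
  let acc = filters[0]();
  for (let k = 1; k < filters.length && acc.length > 0; k++) {
    acc = intersect(acc, filters[k]());
  }
  return acc;
}
```

Ordering the most selective filter first makes the short-circuit more likely to fire early.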
| 57 | + |
| 58 | +**Files modified:** `src/operators.ts` (scanFilterIndices) |
| 59 | + |
| 60 | +## P2: VIP Pinning with Safety Locks (from zell) — DONE |
| 61 | + |
| 62 | +**Source:** `../zell/src/expert_cache.zig` |
| 63 | + |
| 64 | +**Done:** Added acquire/release reference counting to VipCache: |
| 65 | +- `acquire(key)` — like get() but increments refCount, prevents eviction |
| 66 | +- `release(key)` — decrements refCount, deletes if pending eviction and refCount=0 |
| 67 | +- `evict()` skips entries with refCount > 0; lets map grow temporarily if all locked |
| 68 | +- `stats()` includes `lockedCount` (entries with refCount > 0) |
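
A minimal sketch of the acquire/release behavior described above, not the actual `VipCache` code. The class name `PinnedCache`, the `pendingEviction` flag being set by `evict()`, and the `keep` parameter that protects a just-inserted key are all assumptions made for illustration.

```typescript
interface Entry<V> { value: V; refCount: number; pendingEviction: boolean }

class PinnedCache<V> {
  private map = new Map<string, Entry<V>>();
  private capacity: number;
  constructor(capacity: number) { this.capacity = capacity; }

  set(key: string, value: V): void {
    const existing = this.map.get(key);
    if (existing) { existing.value = value; return; } // keep refCount intact
    this.map.set(key, { value, refCount: 0, pendingEviction: false });
    if (this.map.size > this.capacity) this.evict(key);
  }

  // Like a get, but pins the entry so evict() will skip it.
  acquire(key: string): V | undefined {
    const e = this.map.get(key);
    if (!e) return undefined;
    e.refCount++;
    return e.value;
  }

  // Unpins; deletes the entry if eviction was deferred while it was locked.
  release(key: string): void {
    const e = this.map.get(key);
    if (!e || e.refCount === 0) return;
    e.refCount--;
    if (e.refCount === 0 && e.pendingEviction) this.map.delete(key);
  }

  // Walks entries in insertion order; locked entries are only marked (assumption),
  // so the map may temporarily exceed capacity when everything is locked.
  evict(keep?: string): void {
    for (const [key, e] of this.map) {
      if (this.map.size <= this.capacity) break;
      if (key === keep) continue;
      if (e.refCount > 0) { e.pendingEviction = true; continue; }
      this.map.delete(key);
    }
  }

  stats(): { size: number; lockedCount: number } {
    let lockedCount = 0;
    for (const e of this.map.values()) if (e.refCount > 0) lockedCount++;
    return { size: this.map.size, lockedCount };
  }
}
```

Callers would pair every `acquire` with a `release` (for example in a `try`/`finally`) so that deferred evictions eventually complete.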
| 69 | + |
| 70 | +**Files modified:** `src/vip-cache.ts` |
| 71 | + |
| 72 | +## P3: Host Import Pattern for R2 I/O (from gitmode) |
| 73 | + |
| 74 | +**Source:** `../gitmode/wasm/src/r2_backend.zig`, `../gitmode/wasm/src/main.zig` |
| 75 | + |
| 76 | +**Problem:** WASM engine currently receives data pushed from TypeScript. |
| 77 | + |
| 78 | +**Solution:** WASM calls host-imported functions to request R2 reads: |
| 79 | +- `extern fn r2_read(key_ptr: [*]u8, key_len: u32, offset: u64, len: u32) i32` |
| 80 | +- WASM engine drives its own I/O, enabling prefetch decisions inside Zig |
| 81 | + |
| 82 | +**Files to modify:** `wasm/src/main.zig`, `src/wasm-engine.ts` |
| 83 | + |
| 84 | +## P3: Comptime Type Marshaling (from metal0) |
| 85 | + |
| 86 | +**Source:** `../metal0/packages/c_interop/src/comptime_wrapper.zig` |
| 87 | + |
| 88 | +**Problem:** Format-specific decoders have repetitive type conversion code. |
| 89 | + |
| 90 | +**Solution:** Use Zig comptime to auto-generate type converters: |
| 91 | +```zig |
| 92 | +fn MarshalColumn(comptime T: type) type { |
| 93 | + return struct { |
| 94 | + pub fn decode(buf: []const u8) []T { ... } |
| 95 | + pub fn encode(values: []const T) []u8 { ... } |
| 96 | + }; |
| 97 | +} |
| 98 | +``` |
| 99 | + |
| 100 | +Generate Arrow<->Lance<->Parquet converters from one template. |
| 101 | + |
| 102 | +**Files to modify:** `wasm/src/decode.zig` |
| 103 | + |
| 104 | +## P3: Canonical ABI for WASM Boundary (from edgebox) |
| 105 | + |
| 106 | +**Source:** `../edgebox/src/component/canonical_abi.zig` |
| 107 | + |
| 108 | +**Problem:** Column data exchange between TS and WASM uses manual pointer math. |
| 109 | + |
| 110 | +**Solution:** Type-safe lift/lower functions: |
| 111 | +- Lower (Host->WASM): allocate in WASM memory, copy column data |
| 112 | +- Lift (WASM->Host): validate, copy to host |
| 113 | +- Handles strings, lists, nested types |
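
On the TypeScript side, lower and lift for a single Float64 column might look roughly like the sketch below. Everything here is hypothetical: `LinearMemory` is a plain `ArrayBuffer` stand-in for the instance's exported memory with a toy bump allocator, and `ColumnRef`, `lowerF64`, and `liftF64` are illustrative names. Strings, lists, and nested types are not covered.

```typescript
// Stand-in for the WASM instance's linear memory plus an exported allocator.
class LinearMemory {
  buffer = new ArrayBuffer(64 * 1024);
  private top = 8; // offset 0 is reserved so it can act as a null pointer
  alloc(bytes: number, align: number): number {
    const ptr = Math.ceil(this.top / align) * align;
    if (ptr + bytes > this.buffer.byteLength) throw new RangeError("out of linear memory");
    this.top = ptr + bytes;
    return ptr;
  }
}

interface ColumnRef { ptr: number; len: number } // len counts elements, not bytes

// Lower (Host->WASM): allocate in linear memory, then copy the column in.
function lowerF64(mem: LinearMemory, values: Float64Array): ColumnRef {
  const ptr = mem.alloc(values.byteLength, 8);
  new Float64Array(mem.buffer, ptr, values.length).set(values);
  return { ptr, len: values.length };
}

// Lift (WASM->Host): validate alignment and bounds, then copy out to the host.
function liftF64(mem: LinearMemory, ref: ColumnRef): Float64Array {
  const bytes = ref.len * 8;
  if (ref.ptr <= 0 || ref.ptr % 8 !== 0 || ref.ptr + bytes > mem.buffer.byteLength) {
    throw new RangeError("invalid column reference");
  }
  // slice() copies, so the result stays valid if WASM memory is later reset or grown.
  return new Float64Array(mem.buffer, ref.ptr, ref.len).slice();
}
```

Copying on lift costs one extra pass over the data but avoids holding views into memory that a heap reset could invalidate.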
| 114 | + |
| 115 | +**Files to modify:** `src/wasm-engine.ts`, `wasm/src/main.zig` |
| 116 | + |
| 117 | +## Reference: Key Files in Sibling Repos |
| 118 | + |
| 119 | +| Repo | File | What to learn | |
| 120 | +|------|------|---------------| |
| 121 | +| lanceql | `src/sql/late_materialization.zig` | Two-phase execution, streaming batches | |
| 122 | +| lanceql | `src/query/vector_engine.zig` | SelectionVector, DataChunk, Vector types | |
| 123 | +| lanceql | `src/sql/where_eval.zig` | Vectorized compound filter evaluation | |
| 124 | +| lanceql | `src/simd.zig` | Threshold-based SIMD dispatch | |
| 125 | +| vectorjson | `src/zig/simd.zig` | Comptime SIMD128 anyMatch pattern | |
| 126 | +| edgebox | `src/native_arena.zig` | Bump allocator with LIFO + in-place realloc | |
| 127 | +| edgebox | `src/simd_utils.zig` | SIMD byte scanning patterns | |
| 128 | +| gitmode | `wasm/src/r2_backend.zig` | Host import R2 I/O from WASM | |
| 129 | +| gitmode | `wasm/src/simd.zig` | SIMD128 memchr/memeql | |
| 130 | +| zell | `src/expert_cache.zig` | VIP pinning + LRU + safety locks | |
| 131 | +| metal0 | `packages/c_interop/src/comptime_wrapper.zig` | Comptime code generation | |