Commit 21adcba

docs: document query engine as code — operator composability and pipeline API
Add new section showing how every SQL clause maps to a composable operator class, with examples of direct operator composition, DataFrame sugar, and memory-bounded R2 spill. Update test/benchmark counts to reflect conformance suite and CI benchmarks.
1 parent 1e26a28 commit 21adcba

README.md

Lines changed: 90 additions & 2 deletions
@@ -48,6 +48,94 @@ Your app code IS the query execution. The WASM engine is a library function your
4848
- No fixed operator set — your code can do anything between query steps
4949
- Same binary everywhere — browser, Node/Bun, Cloudflare DO
5050

51+
## Query engine as code
52+
53+
Every SQL clause is a composable code primitive. They all implement the same pull-based `Operator` interface — `next() → RowBatch | null` — so you chain them however you want, not how a SQL planner decides.
54+
55+
```
SQL clause        Operator class           What it does
──────────        ──────────────           ────────────
WHERE             FilterOperator           Predicate pushdown on rows
SELECT            ProjectOperator          Column projection
ORDER BY          ExternalSortOperator     Disk-spilling merge sort
                  InMemorySortOperator     In-memory sort (small datasets)
GROUP BY + agg    AggregateOperator        Hash aggregate (sum/avg/min/max/count/stddev/median/percentile)
LIMIT / OFFSET    LimitOperator            Row limiting with offset
                  TopKOperator             Heap-based top-K (no full sort)
JOIN              HashJoinOperator         Grace hash join with R2 spill
PARTITION BY      WindowOperator           row_number, rank, dense_rank, lag, lead, rolling aggregates
DISTINCT          DistinctOperator         Hash-based deduplication
UNION/INTERSECT   SetOperator              Set operations (union, union_all, intersect, except)
computed column   ComputedColumnOperator   Arbitrary (row: Row) => value transforms
IN (subquery)     SubqueryInOperator       Semi-join filter against a value set
```
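
To make the pull-based contract concrete, here is a minimal self-contained sketch of an operator interface with one filter stage and a drain loop. The names `Row`, `fromBatches`, `SketchFilter`, and `drain` are hypothetical stand-ins written for illustration; the library's actual `Operator` and `RowBatch` definitions may differ.

```typescript
// Hypothetical stand-ins for the library's types; the real definitions may differ.
type Row = Record<string, unknown>
type RowBatch = Row[]

interface Operator {
  next(): Promise<RowBatch | null> // null signals end of stream
  close(): Promise<void>
}

// Source that yields pre-built batches one at a time.
function fromBatches(batches: RowBatch[]): Operator {
  let i = 0
  return {
    async next() {
      return i < batches.length ? batches[i++] : null
    },
    async close() {},
  }
}

// Filter stage: pulls from upstream and keeps rows matching the predicate.
class SketchFilter implements Operator {
  private upstream: Operator
  private pred: (row: Row) => boolean

  constructor(upstream: Operator, pred: (row: Row) => boolean) {
    this.upstream = upstream
    this.pred = pred
  }

  async next(): Promise<RowBatch | null> {
    // Keep pulling until a batch has survivors or upstream is exhausted,
    // so downstream never receives an empty batch.
    for (;;) {
      const batch = await this.upstream.next()
      if (batch === null) return null
      const kept = batch.filter(this.pred)
      if (kept.length > 0) return kept
    }
  }

  async close(): Promise<void> {
    await this.upstream.close()
  }
}

// Pull every batch from the last operator and flatten the result into rows.
async function drain(op: Operator): Promise<Row[]> {
  const out: Row[] = []
  try {
    for (let b = await op.next(); b !== null; b = await op.next()) out.push(...b)
  } finally {
    await op.close()
  }
  return out
}
```

Because every stage exposes the same two methods, any stage can wrap any other, and only the final `drain` call drives execution.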
72+
73+
### Compose operators directly
74+
75+
```typescript
import {
  FilterOperator, AggregateOperator, HashJoinOperator,
  WindowOperator, TopKOperator, drainPipeline,
  type Operator, type RowBatch,
} from "querymode"

// Your data source: any async batch producer
const source: Operator = {
  async next(): Promise<RowBatch | null> {
    return null // placeholder: return the next RowBatch, or null when exhausted
  },
  async close() {},
}

// Chain operators like function calls: no query planner, no SQL string
const filtered = new FilterOperator(source, [{ column: "age", op: "gt", value: 25 }])
const aggregated = new AggregateOperator(filtered, {
  table: "users", filters: [], projections: [],
  groupBy: ["region"],
  aggregates: [{ fn: "sum", column: "amount", alias: "total" }],
})
const top10 = new TopKOperator(aggregated, "total", true, 10)

// Pull results: zero-copy, no serialization between stages
const rows = await drainPipeline(top10)
```
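
The operator table describes `TopKOperator` as heap-based top-K with no full sort. The following is a rough sketch of that general technique, not the library's implementation: a min-heap capped at `k` entries, so each input row costs at most O(log k) work. The function name `topK` and its signature are assumptions made for this example.

```typescript
type Row = Record<string, unknown>

// Keep the k largest rows by a numeric column using a min-heap of size k.
function topK(rows: Iterable<Row>, column: string, k: number): Row[] {
  const key = (r: Row) => r[column] as number
  const heap: Row[] = [] // heap[0] is always the smallest of the current top k

  const swap = (a: number, b: number) => {
    const tmp = heap[a]
    heap[a] = heap[b]
    heap[b] = tmp
  }
  const siftUp = (start: number) => {
    let i = start
    while (i > 0) {
      const parent = (i - 1) >> 1
      if (key(heap[parent]) <= key(heap[i])) break
      swap(parent, i)
      i = parent
    }
  }
  const siftDown = (start: number) => {
    let i = start
    for (;;) {
      const left = 2 * i + 1
      const right = left + 1
      let min = i
      if (left < heap.length && key(heap[left]) < key(heap[min])) min = left
      if (right < heap.length && key(heap[right]) < key(heap[min])) min = right
      if (min === i) return
      swap(min, i)
      i = min
    }
  }

  for (const row of rows) {
    if (heap.length < k) {
      heap.push(row)
      siftUp(heap.length - 1)
    } else if (k > 0 && key(row) > key(heap[0])) {
      heap[0] = row // evict the current minimum
      siftDown(0)
    }
  }
  // The final sort touches only k rows, never the whole input.
  return heap.sort((a, b) => key(b) - key(a))
}
```

Memory stays proportional to `k` regardless of input size, which is why this shape suits an `ORDER BY ... LIMIT` query better than a full sort followed by a limit.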
100+
101+
### Or use the DataFrame API
102+
103+
The fluent API is built on the same operators. `.filter()` becomes `FilterOperator`, `.sort()` becomes `ExternalSortOperator`, and so on:
104+
105+
```typescript
const qm = QueryMode.local()
const results = await qm
  .table("orders")
  .filter("amount", "gt", 100)
  .groupBy("region")
  .aggregate("sum", "amount", "total")
  .sort("total", "desc")
  .limit(10)
  .exec()
```
116+
117+
Both paths produce the same pull-based pipeline. The DataFrame API is sugar; the operators are the engine.
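
One way to picture the "sugar over operators" relationship is a builder that records each fluent call as a stage and only assembles the operator chain in `exec()`. This is a simplified sketch under that assumption; `SketchFrame`, `Stage`, and the two supported methods are hypothetical and do not mirror the library's internals.

```typescript
type Row = Record<string, unknown>
type RowBatch = Row[]
interface Operator {
  next(): Promise<RowBatch | null>
  close(): Promise<void>
}

// A stage wraps an upstream operator and returns a new operator.
type Stage = (upstream: Operator) => Operator

class SketchFrame {
  private source: Operator
  private stages: Stage[]

  constructor(source: Operator, stages: Stage[] = []) {
    this.source = source
    this.stages = stages
  }

  // Each fluent call appends a stage and returns a new frame; nothing runs yet.
  filter(column: string, op: "gt" | "lt" | "eq", value: number): SketchFrame {
    const test = (r: Row) => {
      const v = r[column] as number
      return op === "gt" ? v > value : op === "lt" ? v < value : v === value
    }
    const stage: Stage = up => ({
      async next() {
        const b = await up.next()
        return b === null ? null : b.filter(test)
      },
      async close() { await up.close() },
    })
    return new SketchFrame(this.source, [...this.stages, stage])
  }

  limit(n: number): SketchFrame {
    const stage: Stage = up => {
      let remaining = n // per-pipeline state, created when the chain is built
      return {
        async next() {
          if (remaining <= 0) return null
          const b = await up.next()
          if (b === null) return null
          const out = b.slice(0, remaining)
          remaining -= out.length
          return out
        },
        async close() { await up.close() },
      }
    }
    return new SketchFrame(this.source, [...this.stages, stage])
  }

  // Lower the recorded stages into one operator chain, then pull it dry.
  async exec(): Promise<Row[]> {
    const pipeline = this.stages.reduce((op, stage) => stage(op), this.source)
    const rows: Row[] = []
    for (let b = await pipeline.next(); b !== null; b = await pipeline.next()) rows.push(...b)
    await pipeline.close()
    return rows
  }
}
```

In this sketch the fluent methods never touch data; they only decide which operator wraps which, which is the same decision you make by hand when composing operators directly.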
118+
119+
### Memory-bounded with R2 spill
120+
121+
Operators that accumulate state (sort, join, aggregate) accept a memory budget. When the budget is exceeded, they spill through a `SpillBackend`. The interface is the same whether the backend is R2 on the Cloudflare edge or local disk:
122+
123+
```typescript
import { HashJoinOperator, ExternalSortOperator, R2SpillBackend, drainPipeline } from "querymode"

// `left` and `right` are upstream operators; `env.DATA_BUCKET` is the Worker's R2 binding
const spill = new R2SpillBackend(env.DATA_BUCKET, "__spill/query-123")
const join = new HashJoinOperator(left, right, "user_id", "id", "inner", 32 * 1024 * 1024, spill)
const sorted = new ExternalSortOperator(join, "created_at", true, 0, 32 * 1024 * 1024, spill)
const rows = await drainPipeline(sorted)
await spill.cleanup()
```
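
The spill-on-budget idea can be illustrated with a small external sort: buffer rows until a budget is reached, write a sorted run to the backend, then merge the runs at the end. This is only a sketch of the general technique. `SketchSpill`, `MapSpill`, and `sortWithSpill` are hypothetical names, the budget is counted in rows rather than bytes for simplicity, and the real `SpillBackend` interface may look quite different.

```typescript
type Row = Record<string, unknown>

// Hypothetical spill interface; the library's SpillBackend may differ.
interface SketchSpill {
  write(key: string, rows: Row[]): Promise<void>
  read(key: string): Promise<Row[]>
  cleanup(): Promise<void>
}

// In-memory stand-in for R2 or local disk.
class MapSpill implements SketchSpill {
  private runs = new Map<string, string>()
  async write(key: string, rows: Row[]): Promise<void> { this.runs.set(key, JSON.stringify(rows)) }
  async read(key: string): Promise<Row[]> { return JSON.parse(this.runs.get(key) ?? "[]") }
  async cleanup(): Promise<void> { this.runs.clear() }
  get size(): number { return this.runs.size }
}

// Sort by a numeric column, spilling a sorted run whenever the buffer reaches budgetRows.
async function sortWithSpill(
  batches: Row[][], column: string, budgetRows: number, spill: SketchSpill,
): Promise<Row[]> {
  const cmp = (a: Row, b: Row) => (a[column] as number) - (b[column] as number)
  let buffer: Row[] = []
  const runKeys: string[] = []

  const flush = async () => {
    if (buffer.length === 0) return
    const key = `run-${runKeys.length}`
    await spill.write(key, buffer.sort(cmp))
    runKeys.push(key)
    buffer = []
  }

  for (const batch of batches) {
    buffer.push(...batch)
    if (buffer.length >= budgetRows) await flush() // budget reached: spill a sorted run
  }

  // Everything fit within the budget: sort in memory, no spill.
  if (runKeys.length === 0) return buffer.sort(cmp)
  await flush()

  // Merge phase. For brevity this loads every run at once; a real external
  // sort would stream the runs and hold only one block per run in memory.
  const runs = await Promise.all(runKeys.map(k => spill.read(k)))
  const heads = runs.map(() => 0)
  const out: Row[] = []
  for (;;) {
    let best = -1
    for (let i = 0; i < runs.length; i++) {
      if (heads[i] >= runs[i].length) continue
      if (best === -1 || cmp(runs[i][heads[i]], runs[best][heads[best]]) < 0) best = i
    }
    if (best === -1) return out
    out.push(runs[best][heads[best]])
    heads[best] += 1
  }
}
```

Swapping `MapSpill` for a disk-backed or object-store-backed class would leave `sortWithSpill` unchanged, which is the property the shared spill interface is meant to provide.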
132+
133+
### Why this matters
134+
135+
Traditional engines give you SQL or a DataFrame API. You can't put a window function before a join, run custom logic between pipeline stages, or swap the sort implementation. The planner decides.
136+
137+
With QueryMode, operators are building blocks. Your code assembles the pipeline, controls the memory budget, decides when to spill. The query engine isn't a service you call — it's a library your code composes.
138+
51139
## What exists
52140

53141
- **TypeScript orchestration** — Durable Object lifecycle, R2 range reads, footer caching, request routing
@@ -61,13 +149,13 @@ Your app code IS the query execution. The WASM engine is a library function your
61149
- **Multi-format support** — Lance, Parquet, and Iceberg tables
62150
- **Local mode** — same API reads Lance/Parquet files from disk or HTTP (Node/Bun)
63151
- **Fragment DO pool** — fan-out parallel scanning for multi-fragment datasets (max 20 slots per datacenter)
64-
- **112 unit tests + 20 integration tests** — footer parsing, column decoding, Parquet/Thrift, merging, aggregates, VIP cache, WASM integration (skipped without binary)
152+
- **112 unit tests + 26 conformance tests** — unit tests cover footer parsing, column decoding, Parquet/Thrift, merging, aggregates, VIP cache, WASM integration; conformance tests validate every operator against DuckDB at 1M-5M row scale
153+
- **CI benchmarks** — head-to-head QueryMode (Miniflare) vs DuckDB (native) on every push, results posted to GitHub Actions summary
65154

66155
## What doesn't exist yet
67156

68157
- No deployed instance
69158
- No browser mode
70-
- No benchmarks against real data
71159
- No npm package published
72160

73161
## Architecture
