
Commit f490ab1

docs: trim README from 351 to 89 lines — link to docs instead of duplicating
Kept: quickstart, pitch, build commands, status. Removed: operator table, architecture diagram, full API reference, examples, MapReduce explanation — all live in the docs site.
1 parent 04e1200 commit f490ab1

File tree

1 file changed

+37
-299
lines changed


README.md

Lines changed: 37 additions & 299 deletions
Original file line number · Diff line number · Diff line change
@@ -2,10 +2,13 @@
22

33
> **Experimental** — early prototype, not production-ready. Architecture and API will change.
44
5+
A pluggable columnar query library — not a query engine you push data to, but a query capability your code uses directly. No data materialization, no engine boundary, no SQL transpilation.
6+
7+
**[Docs](https://teamchong.github.io/querymode/)** · **[Why QueryMode?](https://teamchong.github.io/querymode/why-querymode/)** · **[Architecture](https://teamchong.github.io/querymode/architecture/)**
8+
59
## Quickstart
610

711
```bash
8-
# Clone and use from source (not yet published to npm)
912
git clone https://github.com/teamchong/querymode.git
1013
cd querymode && pnpm install
1114
```
@@ -14,337 +17,72 @@ cd querymode && pnpm install
1417
import { QueryMode } from "querymode/local"
1518

1619
// Zero-config: demo data, no files needed
17-
const demo = QueryMode.demo()
18-
const top5 = await demo
20+
const top5 = await QueryMode.demo()
1921
.filter("category", "eq", "Electronics")
2022
.sort("amount", "desc")
2123
.limit(5)
2224
.collect()
2325

24-
console.log(top5.rows)
25-
26-
// Or query your own files — Parquet, Lance, CSV, JSON, Arrow
26+
// Query your own files — Parquet, Lance, CSV, JSON
2727
const qm = QueryMode.local()
2828
const result = await qm
2929
.table("./data/events.parquet")
3030
.filter("status", "eq", "active")
31-
.filter("amount", "gte", 100)
32-
.filter("amount", "lte", 500)
3331
.select("id", "amount", "region")
3432
.sort("amount", "desc")
3533
.limit(20)
3634
.collect()
37-
```
38-
39-
A pluggable columnar query library — not a query engine you push data to, but a query capability your code uses directly. No data materialization, no engine boundary, no SQL transpilation.
40-
41-
**[Why QueryMode?](https://teamchong.github.io/querymode/why-querymode/)** — Agents need dynamic pipelines, not pre-built ETL. QueryMode lets the agent define both query and business logic in the same code, at query time, with no serialization boundary between stages.
42-
43-
## Why "mode" not "engine"
44-
45-
Every query engine — Spark, DataFusion, DuckDB, Polars — has a boundary between your code and the engine:
46-
47-
```
48-
Traditional engine:
49-
50-
Your Code Engine
51-
───────── ──────
52-
filter(age > 25) ──────► translate to internal plan
53-
materialize data into Arrow/DataFrame
54-
run engine's fixed operators
55-
serialize results
56-
◄────── return results to your code
57-
58-
Your code CANNOT cross the boundary.
59-
Custom business logic? Pull data out, process in your code, push back in.
60-
That round-trip IS data materialization.
61-
```
62-
63-
LINQ and ORMs look code-first, but they transpile expressions to SQL strings sent to a separate database. The database still materializes your data into its format, runs its fixed operators, and sends results back.
64-
65-
QueryMode has no boundary:
66-
67-
```typescript
68-
// This IS the execution — not a description translated to SQL
69-
const orders = await qm.table("orders").filter("amount", "gt", 100).exec()
70-
const userIds = orders.rows.map(r => r.user_id) // your code, zero materialization
71-
const users = await qm.table("users").filter("id", "in", userIds).exec()
72-
73-
// JOIN logic, business rules, ML scoring — all your code
74-
// WASM handles byte-level column decode + SIMD filtering
75-
// But orchestration is YOUR code, not a query planner's fixed operators
76-
```
77-
78-
Your app code IS the query execution. The WASM engine is a library function your code calls — column decoding, SIMD filtering, vector search happen in-process, on raw bytes, zero-copy. There's no "register UDF → engine materializes data → calls your function → collects results" boundary.
79-
80-
**What this means in practice:**
81-
- No data materialization — data stays in R2/disk, only the exact matching bytes are read
82-
- No engine boundary — your business logic runs directly, not as a registered UDF
83-
- No SQL transpilation — the API calls ARE the execution, not a description sent elsewhere
84-
- No fixed operator set — your code can do anything between query steps
85-
- Same binary everywhere — browser, Node/Bun, Cloudflare DO
86-
87-
## Query engine as code
88-
89-
Every query operation is a composable code primitive. All operators implement the same pull-based `Operator` interface — `next() → RowBatch | null` — so you can chain them however you want.
90-
91-
```
92-
Operation Operator class What it does
93-
───────── ────────────── ────────────
94-
95-
Filtering
96-
predicate FilterOperator 14 ops: eq, neq, gt, gte, lt, lte, in, not_in,
97-
between, not_between, like, not_like, is_null, is_not_null
98-
membership SubqueryInOperator Semi-join filter against a value set
99-
100-
Projection
101-
select ProjectOperator Column selection
102-
transform ComputedColumnOperator Arbitrary (row: Row) => value per row
103-
104-
Aggregation
105-
group + reduce AggregateOperator sum, avg, min, max, count, count_distinct,
106-
stddev, variance, median, percentile
107-
having FilterOperator Filter after AggregateOperator — same primitive, you control order
108-
109-
Sorting
110-
full sort ExternalSortOperator Disk-spilling merge sort with R2 spill
111-
in-memory sort InMemorySortOperator In-memory sort (small datasets)
112-
top-K TopKOperator Heap-based top-K without full sort
113-
114-
Joining
115-
hash join HashJoinOperator inner, left, right, full, cross — Grace hash join with R2 spill
116-
117-
Windowing
118-
partition WindowOperator row_number, rank, dense_rank, lag, lead,
119-
rolling sum/avg/min/max/count
120-
121-
Deduplication
122-
distinct DistinctOperator Hash-based deduplication on column set
123-
124-
Set operations
125-
combine SetOperator union, union_all, intersect, except
126-
127-
Limiting
128-
limit/offset LimitOperator Row limiting with offset
129-
sample DataFrame.sample() Random sampling (Fisher-Yates)
130-
131-
Similarity
132-
vector near (planned) NEAR topK as composable operator — currently in scan layer
133-
```
134-
135-
### Compose operators directly
136-
137-
```typescript
138-
import {
139-
FilterOperator, AggregateOperator, HashJoinOperator,
140-
WindowOperator, TopKOperator, drainPipeline,
141-
type Operator, type RowBatch,
142-
} from "querymode"
143-
144-
// Your data source — any async batch producer
145-
const source: Operator = {
146-
async next() { /* return RowBatch or null */ },
147-
async close() {},
148-
}
149-
150-
// Chain operators — no query planner, no SQL string
151-
const filtered = new FilterOperator(source, [{ column: "age", op: "gt", value: 25 }])
152-
const aggregated = new AggregateOperator(filtered, {
153-
table: "users", filters: [], projections: [],
154-
groupBy: ["region"],
155-
aggregates: [{ fn: "sum", column: "amount", alias: "total" }],
156-
})
157-
// "HAVING" is just a filter after aggregation — same operator, you control order
158-
const having = new FilterOperator(aggregated, [{ column: "total", op: "gt", value: 1000 }])
159-
const top10 = new TopKOperator(having, "total", true, 10)
160-
161-
// Pull results — zero-copy, no serialization between stages
162-
const rows = await drainPipeline(top10)
163-
```
164-
165-
### Or use the DataFrame API
166-
167-
The same operators power the fluent API — `.filter()` becomes `FilterOperator`, `.sort()` becomes `ExternalSortOperator`, etc:
168-
169-
```typescript
170-
const qm = QueryMode.local()
171-
const results = await qm
172-
.table("orders")
173-
.filter("amount", "gt", 100)
174-
.groupBy("region")
175-
.aggregate("sum", "amount", "total")
176-
.sort("total", "desc")
177-
.limit(10)
178-
.exec()
179-
```
180-
181-
Both paths produce the same pull-based pipeline. The DataFrame API is sugar; the operators are the engine.
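Either way, consuming the pipeline is just a pull loop. A hand-rolled sketch of what a drain helper has to do (types reduced for illustration; `drainPipeline` is the library's real helper and its actual batches are columnar):

```typescript
// Minimal pull loop: call next() until the pipeline is exhausted.
// Row/RowBatch/Operator here are simplified stand-ins, not the real types.
type Row = Record<string, unknown>
type RowBatch = { rows: Row[] }
interface Operator {
  next(): Promise<RowBatch | null> // null signals end-of-stream
  close(): Promise<void>
}

async function drainAll(op: Operator): Promise<Row[]> {
  const out: Row[] = []
  try {
    for (let batch = await op.next(); batch !== null; batch = await op.next()) {
      out.push(...batch.rows)
    }
  } finally {
    await op.close() // release buffers/spill state even if a stage throws
  }
  return out
}
```

The consumer never knows, or cares, whether it is draining a fluent-API pipeline or a hand-assembled operator chain.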
182-
183-
### Memory-bounded with R2 spill
184-
185-
Operators that accumulate state (sort, join, aggregate) accept a memory budget. When exceeded, they spill to R2 via `SpillBackend` — same interface whether running on Cloudflare edge or local disk:
186-
187-
```typescript
188-
import { HashJoinOperator, ExternalSortOperator, R2SpillBackend } from "querymode"
189-
190-
const spill = new R2SpillBackend(env.DATA_BUCKET, "__spill/query-123")
191-
const join = new HashJoinOperator(left, right, "user_id", "id", "inner", 32 * 1024 * 1024, spill)
192-
const sorted = new ExternalSortOperator(join, "created_at", true, 0, 32 * 1024 * 1024, spill)
193-
const rows = await drainPipeline(sorted)
194-
await spill.cleanup()
195-
```
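The `SpillBackend` seam is what makes that budget portable. A hypothetical in-memory backend for tests might look like this (the method names `write`/`read`/`cleanup` are illustrative assumptions, not the library's actual interface):

```typescript
// Hypothetical in-memory spill backend — method names are illustrative
// assumptions for this sketch, not QueryMode's real SpillBackend contract.
class MemorySpillBackend {
  private parts = new Map<string, Uint8Array>()

  async write(key: string, bytes: Uint8Array): Promise<void> {
    this.parts.set(key, bytes)
  }

  async read(key: string): Promise<Uint8Array> {
    const bytes = this.parts.get(key)
    if (bytes === undefined) throw new Error(`missing spill part: ${key}`)
    return bytes
  }

  async cleanup(): Promise<void> {
    this.parts.clear()
  }
}
```

The point of the seam: an operator that exceeds its budget hands bytes to the backend and never knows whether they landed in R2, on local disk, or in a test map.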
196-
197-
### SQL frontend
198-
199-
SQL is another way in — same operator pipeline underneath:
200-
201-
```typescript
202-
const qm = QueryMode.local()
203-
const results = await qm
204-
.sql("SELECT region, SUM(amount) AS total FROM orders WHERE status = 'active' GROUP BY region ORDER BY total DESC LIMIT 10")
205-
.collect()
20635

207-
// SQL and DataFrame compose — chain further operations after SQL
208-
const filtered = await qm
209-
.sql("SELECT * FROM events WHERE created_at > '2026-01-01'")
210-
.filter("country", "eq", "US")
211-
.sort("amount", "desc")
212-
.limit(50)
36+
// SQL works too — same operator pipeline underneath
37+
const sql = await qm
38+
.sql("SELECT region, SUM(amount) AS total FROM orders GROUP BY region ORDER BY total DESC")
21339
.collect()
214-
```
215-
216-
Supports: SELECT, WHERE (AND/OR/NOT, LIKE, NOT LIKE, IN, NOT IN, BETWEEN, NOT BETWEEN, IS NULL, IS NOT NULL), GROUP BY, HAVING, ORDER BY (multi-column), LIMIT/OFFSET, DISTINCT, CASE/CAST, arithmetic expressions, JOINs, window functions (ROW_NUMBER, RANK, LAG, LEAD), UNION/INTERSECT/EXCEPT.
217-
218-
### Why this matters
21940

220-
Traditional engines give you a fixed query language. You can't put a window function before a join, run custom logic between pipeline stages, or swap the sort implementation. The planner decides.
221-
222-
With QueryMode, operators are building blocks. Your code assembles the pipeline, controls the memory budget, decides when to spill. The query engine isn't a service you call — it's a library your code composes.
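Concretely, "code between stages" can be an ordinary function that wraps one operator in another. Here is a sketch of a hypothetical scoring stage (the stage name and scoring callback are invented for illustration; types are simplified stand-ins):

```typescript
type Row = Record<string, unknown>
type RowBatch = { rows: Row[] }
interface Operator {
  next(): Promise<RowBatch | null>
  close(): Promise<void>
}

// A custom pipeline stage: plain code, no UDF registration, no round-trip
// out of the engine. score() stands in for any business logic or model call.
function scoringStage(child: Operator, score: (r: Row) => number): Operator {
  return {
    async next() {
      const batch = await child.next()
      if (batch === null) return null
      return { rows: batch.rows.map(r => ({ ...r, score: score(r) })) }
    },
    close: () => child.close(),
  }
}
```

Downstream operators see the added `score` column like any other, because the stage lives inside the pipeline rather than behind an engine boundary.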
223-
224-
### Beyond traditional engines
225-
226-
These examples show what's possible when operators are composable building blocks, not a fixed plan:
227-
228-
| Example | What it shows | Why DuckDB/Polars can't |
229-
|---------|--------------|------------------------|
230-
| [`examples/ml-scoring-pipeline.ts`](examples/ml-scoring-pipeline.ts) | Custom scoring runs **inside** the pipeline between Filter and TopK | UDFs serialize data across the engine boundary |
231-
| [`examples/adaptive-search.ts`](examples/adaptive-search.ts) | Vector search with adaptive threshold — recompose if too few results | Fixed query planner can't dynamically widen search |
232-
| [`examples/custom-spill-backend.ts`](examples/custom-spill-backend.ts) | Pluggable spill storage (memory, R2, S3) at 4KB budget | DuckDB: disk only. Polars: no spill at all |
233-
| [`examples/nextjs-api-route.ts`](examples/nextjs-api-route.ts) | Next.js/Vinext API route — query Parquet files, deploy to edge | DuckDB needs a sidecar process, can't run in Workers |
234-
235-
Run any example:
236-
```bash
237-
npx tsx examples/ml-scoring-pipeline.ts
238-
npx tsx examples/adaptive-search.ts
239-
npx tsx examples/custom-spill-backend.ts
240-
npx tsx examples/nextjs-api-route.ts
41+
// Edge mode — same API, WASM runs inside regional DOs
42+
const edge = QueryMode.remote(env.QUERY_DO, { region: "SJC" })
24143
```
24244

243-
## What exists
244-
245-
- **TypeScript orchestration** — Durable Object lifecycle, R2 range reads, footer caching, request routing
246-
- **Zig WASM engine** (`wasm/`) — column decoding, SIMD ops, SQL execution, vector search, fragment writing, compiles to `querymode.wasm`
247-
- **Code-first query API** — `.table().filter().select().sort().limit().exec()` or `.sql("SELECT ...")`, with `.toCode()` decompiler for logging and LLM context compression
248-
- **Write path** — `append(rows, { path, metadata })` with CAS-based manifest coordination via Master DO, `dropTable()` for cleanup
249-
- **Master/Query DO split** — single-writer Master broadcasts footer invalidations to per-region Query DOs
250-
- **Footer caching** — table footers (~4KB each) cached in DO memory with VIP eviction (hot tables protected from eviction)
251-
- **Bounded prefetch pipeline** — R2 range fetches overlap I/O (fetch page N+1 while WASM processes page N)
252-
- **IVF-PQ vector search** — index-aware routing in Query DO, falls back to flat SIMD search when no index present
253-
- **Multi-format support** — Lance, Parquet, and Iceberg tables
254-
- **Local mode** — same API reads Lance/Parquet files from disk or HTTP (Node/Bun)
255-
- **Fragment DO pool** — fan-out parallel scanning for multi-fragment datasets (one DO per fragment, scales with data)
256-
- **600+ tests** — unit tests cover footer parsing, column decoding, Parquet/Thrift, merging, aggregates, VIP cache, WASM integration, SQL, partition catalog, materialized executor, toCode decompiler; 110+ conformance tests validate every operator against DuckDB at 1M-5M row scale
257-
- **CI benchmarks** — head-to-head QueryMode (Miniflare) vs DuckDB (native) on every push, results posted to [GitHub Actions summary](https://github.com/teamchong/querymode/actions/workflows/ci.yml)
45+
## What it is
25846

259-
## What doesn't exist yet
47+
Operators are composable building blocks, not a fixed query plan. Your code assembles the pipeline, controls the memory budget, decides when to spill. The query engine isn't a service you call — it's a library your code composes.
26048

261-
- No deployed instance
262-
- No browser mode
263-
- No npm package published (install from source via git clone)
49+
| Layer | What |
50+
|-------|------|
51+
| **Zig WASM engine** | Column decoding, SIMD filtering, SQL execution, vector search |
52+
| **TypeScript orchestration** | DO lifecycle, R2 range reads, footer caching, request routing |
53+
| **Code-first API** | `.table().filter().sort().exec()` or `.sql("SELECT ...")` |
54+
| **Edge runtime** | Master/Query/Fragment DOs, R2 spill, multi-bucket sharding |
26455

265-
## Architecture
266-
![querymode-architecture](docs/architecture/querymode-architecture.svg)
56+
14 operators (filter, project, aggregate, sort, join, window, distinct, set ops, limit, sample, computed columns, subquery-in, top-K, vector search), all pull-based with the same `next() → RowBatch | null` interface.
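That shared contract can be reduced to a few lines (an illustrative sketch, not the library's exact type declarations):

```typescript
// Illustrative reduction of the pull-based contract described above —
// the real QueryMode types are richer (schemas, columnar batches, etc).
type Row = Record<string, unknown>
type RowBatch = { rows: Row[] }

interface Operator {
  next(): Promise<RowBatch | null> // null signals end-of-stream
  close(): Promise<void>
}

// A source operator that yields one batch, then ends.
function fromRows(rows: Row[]): Operator {
  let done = false
  return {
    async next() {
      if (done) return null
      done = true
      return { rows }
    },
    async close() {},
  }
}

// A filter operator: pulls from its child, keeps matching rows.
function filterOp(child: Operator, pred: (r: Row) => boolean): Operator {
  return {
    async next() {
      const batch = await child.next()
      if (batch === null) return null
      return { rows: batch.rows.filter(pred) }
    },
    close: () => child.close(),
  }
}
```

Because every stage exposes the same two methods, any async batch producer — a file scan, a network stream, your own generator — slots into a pipeline unchanged.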
26757

26858
## Build
26959

27060
```bash
271-
pnpm install # install dependencies
272-
pnpm build:ts # typecheck only (no WASM rebuild needed — pre-built WASM included)
273-
pnpm test:node # run node tests (~2 min)
274-
pnpm test:workers # run workerd tests
275-
pnpm test # run all tests (~8 min)
61+
pnpm build:ts # typecheck (pre-built WASM included)
62+
pnpm test:node # node tests (~2 min)
63+
pnpm test:workers # workerd tests
64+
pnpm test # all tests (~8 min)
27665
pnpm dev # local dev with wrangler
27766

27867
# Rebuild WASM from Zig source (requires zig toolchain)
279-
# Install: https://ziglang.org/download/
280-
pnpm wasm # cd wasm && zig build wasm && cp to src/wasm/
281-
```
282-
283-
## Query API
284-
285-
```typescript
286-
import { QueryMode } from "querymode"
287-
288-
// Local mode — query files directly where they sit
289-
const qm = QueryMode.local()
290-
const results = await qm
291-
.table("./data/users.lance")
292-
.filter("age", "gt", 25)
293-
.select("name", "email")
294-
.exec()
295-
296-
// Edge mode — same API, WASM runs inside regional DOs
297-
const qm = QueryMode.remote(env.QUERY_DO, { region: "SJC" })
298-
const results = await qm
299-
.table("users")
300-
.filter("age", "gt", 25)
301-
.select("name", "email")
302-
.sort("age", "desc")
303-
.limit(100)
304-
.exec()
305-
306-
// JOINs are code, not SQL — your logic, zero materialization
307-
const orders = await qm.table("orders").filter("amount", "gt", 100).exec()
308-
const userIds = orders.rows.map(r => r.user_id)
309-
const users = await qm.table("users").filter("id", "in", userIds).exec()
310-
const enriched = orders.rows.map(o => ({
311-
...o,
312-
user: users.rows.find(u => u.id === o.user_id)
313-
}))
314-
315-
// Write path (append rows)
316-
await qm.table("users").append([
317-
{ id: 1, name: "Alice", age: 30 },
318-
{ id: 2, name: "Bob", age: 25 },
319-
])
320-
321-
// Write to specific path with metadata (catalog-friendly)
322-
await qm.table("enriched").append(rows, {
323-
path: "pipelines/job-abc/enriched.lance/",
324-
metadata: { pipelineId: "job-abc", sourceTables: "orders,users", ttl: "7d" },
325-
})
326-
327-
// Drop table (cleanup)
328-
await qm.table("enriched").dropTable()
329-
330-
// Vector search (flat or IVF-PQ accelerated)
331-
const similar = await qm
332-
.table("images")
333-
.vector("embedding", queryVec, 10)
334-
.exec()
68+
pnpm wasm
33569
```
33670

337-
## MapReduce over the network
338-
339-
WASM is slower than native (~1.3–1.5× overhead), and a single Durable Object has hard memory and CPU caps. You can't build a competitive query engine by running everything in one WASM instance on one node.
71+
## What exists
34072

341-
QueryMode doesn't try. It uses the network as a distributed compute fabric — like biological cells, not a brain:
73+
- 600+ tests, 110+ conformance tests validated against DuckDB at 1M-5M row scale
74+
- CI benchmarks: QueryMode (Miniflare) vs DuckDB (native) on every push
75+
- Multi-format: Lance, Parquet, Iceberg, CSV, JSON
76+
- Memory-bounded operators with R2 spill (sort, join, aggregate)
77+
- IVF-PQ vector search with flat SIMD fallback
78+
- Zero-copy columnar pipeline (QMCB binary format, no Row[] until response boundary)
79+
- Local mode (Node/Bun) and edge mode (Cloudflare Workers)
34280

343-
- **DOs as cells** — every Fragment DO carries the same WASM binary (DNA). They activate on signal, scan their fragment, and go dormant. More data → more cells. Idle cells cost nothing (they hibernate).
344-
- **R2 as virtual memory** — when a single DO's 128MB fills up, operators spill to R2. The pipeline doesn't care if data is in-memory or spilled — same interface, unbounded capacity.
345-
- **Fan-out as bandwidth** — more fragments = more parallel R2 reads = more aggregate throughput. No cell coordinates with another — they all respond to the same signal independently.
81+
## What doesn't exist yet
34682

347-
QueryDO **maps** fragments to Fragment DOs, each DO runs WASM SIMD on its shard, then QueryDO **reduces** via k-way merge. No single node does heavy work. The code is the DNA — scale comes from more cells, not smarter ones. See [Architecture](https://teamchong.github.io/querymode/architecture/) for the full deep dive.
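The reduce step is a k-way merge of per-fragment sorted runs. A minimal sketch of the idea, minus streaming, heaps, and the network hop:

```typescript
// Merge k sorted arrays into one sorted array — the shape of the reduce
// step described above, reduced to numbers for illustration.
function kWayMerge(runs: number[][]): number[] {
  const idx = runs.map(() => 0) // cursor into each run
  const out: number[] = []
  for (;;) {
    let best = -1
    for (let k = 0; k < runs.length; k++) {
      if (idx[k] < runs[k].length &&
          (best === -1 || runs[k][idx[k]] < runs[best][idx[best]])) {
        best = k
      }
    }
    if (best === -1) return out // every run exhausted
    out.push(runs[best][idx[best]++])
  }
}
```

Each Fragment DO produces one already-sorted run, so the coordinator's work is proportional to the output size, not to any single fragment's size.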
83+
- No deployed instance
84+
- No browser mode
85+
- No npm package published (install from source)
34886

34987
## License
35088
