
Commit f490ab1

docs: trim README from 351 to 89 lines — link to docs instead of duplicating
Kept: quickstart, pitch, build commands, status. Removed: operator table, architecture diagram, full API reference, examples, MapReduce explanation — all live in the docs site.
1 parent 04e1200 commit f490ab1

File tree

1 file changed

+37
-299
lines changed


README.md

Lines changed: 37 additions & 299 deletions
Original file line number · Diff line number · Diff line change
@@ -2,10 +2,13 @@
22

33
> **Experimental** — early prototype, not production-ready. Architecture and API will change.
44
5+
A pluggable columnar query library — not a query engine you push data to, but a query capability your code uses directly. No data materialization, no engine boundary, no SQL transpilation.
6+
7+
**[Docs](https://teamchong.github.io/querymode/)** · **[Why QueryMode?](https://teamchong.github.io/querymode/why-querymode/)** · **[Architecture](https://teamchong.github.io/querymode/architecture/)**
8+
59
## Quickstart
610

711
```bash
8-
# Clone and use from source (not yet published to npm)
912
git clone https://github.com/teamchong/querymode.git
1013
cd querymode && pnpm install
1114
```
@@ -14,337 +17,72 @@ cd querymode && pnpm install
1417
import { QueryMode } from "querymode/local"
1518

1619
// Zero-config: demo data, no files needed
17-
const demo = QueryMode.demo()
18-
const top5 = await demo
20+
const top5 = await QueryMode.demo()
1921
.filter("category", "eq", "Electronics")
2022
.sort("amount", "desc")
2123
.limit(5)
2224
.collect()
2325

24-
console.log(top5.rows)
25-
26-
// Or query your own files — Parquet, Lance, CSV, JSON, Arrow
26+
// Query your own files — Parquet, Lance, CSV, JSON
2727
const qm = QueryMode.local()
2828
const result = await qm
2929
.table("./data/events.parquet")
3030
.filter("status", "eq", "active")
31-
.filter("amount", "gte", 100)
32-
.filter("amount", "lte", 500)
3331
.select("id", "amount", "region")
3432
.sort("amount", "desc")
3533
.limit(20)
3634
.collect()
37-
```
38-
39-
A pluggable columnar query library — not a query engine you push data to, but a query capability your code uses directly. No data materialization, no engine boundary, no SQL transpilation.
40-
41-
**[Why QueryMode?](https://teamchong.github.io/querymode/why-querymode/)** — Agents need dynamic pipelines, not pre-built ETL. QueryMode lets the agent define both query and business logic in the same code, at query time, with no serialization boundary between stages.
42-
43-
## Why "mode" not "engine"
44-
45-
Every query engine — Spark, DataFusion, DuckDB, Polars — has a boundary between your code and the engine:
46-
47-
```
48-
Traditional engine:
49-
50-
Your Code Engine
51-
───────── ──────
52-
filter(age > 25) ──────► translate to internal plan
53-
materialize data into Arrow/DataFrame
54-
run engine's fixed operators
55-
serialize results
56-
◄────── return results to your code
57-
58-
Your code CANNOT cross the boundary.
59-
Custom business logic? Pull data out, process in your code, push back in.
60-
That round-trip IS data materialization.
61-
```
62-
63-
LINQ and ORMs look code-first, but they transpile expressions to SQL strings sent to a separate database. The database still materializes your data into its format, runs its fixed operators, and sends results back.
64-
65-
QueryMode has no boundary:
66-
67-
```typescript
68-
// This IS the execution — not a description translated to SQL
69-
const orders = await qm.table("orders").filter("amount", "gt", 100).exec()
70-
const userIds = orders.rows.map(r => r.user_id) // your code, zero materialization
71-
const users = await qm.table("users").filter("id", "in", userIds).exec()
72-
73-
// JOIN logic, business rules, ML scoring — all your code
74-
// WASM handles byte-level column decode + SIMD filtering
75-
// But orchestration is YOUR code, not a query planner's fixed operators
76-
```
77-
78-
Your app code IS the query execution. The WASM engine is a library function your code calls — column decoding, SIMD filtering, vector search happen in-process, on raw bytes, zero-copy. There's no "register UDF → engine materializes data → calls your function → collects results" boundary.
79-
80-
**What this means in practice:**
81-
- No data materialization — data stays in R2/disk, only the exact matching bytes are read
82-
- No engine boundary — your business logic runs directly, not as a registered UDF
83-
- No SQL transpilation — the API calls ARE the execution, not a description sent elsewhere
84-
- No fixed operator set — your code can do anything between query steps
85-
- Same binary everywhere — browser, Node/Bun, Cloudflare DO
86-
87-
## Query engine as code
88-
89-
Every query operation is a composable code primitive. All operators implement the same pull-based `Operator` interface — `next() → RowBatch | null` — so you can chain them however you want.
90-
91-
```
92-
Operation Operator class What it does
93-
───────── ────────────── ────────────
94-
95-
Filtering
96-
predicate FilterOperator 14 ops: eq, neq, gt, gte, lt, lte, in, not_in,
97-
between, not_between, like, not_like, is_null, is_not_null
98-
membership SubqueryInOperator Semi-join filter against a value set
99-
100-
Projection
101-
select ProjectOperator Column selection
102-
transform ComputedColumnOperator Arbitrary (row: Row) => value per row
103-
104-
Aggregation
105-
group + reduce AggregateOperator sum, avg, min, max, count, count_distinct,
106-
stddev, variance, median, percentile
107-
having FilterOperator Filter after AggregateOperator — same primitive, you control order
108-
109-
Sorting
110-
full sort ExternalSortOperator Disk-spilling merge sort with R2 spill
111-
in-memory sort InMemorySortOperator In-memory sort (small datasets)
112-
top-K TopKOperator Heap-based top-K without full sort
113-
114-
Joining
115-
hash join HashJoinOperator inner, left, right, full, cross — Grace hash join with R2 spill
116-
117-
Windowing
118-
partition WindowOperator row_number, rank, dense_rank, lag, lead,
119-
rolling sum/avg/min/max/count
120-
121-
Deduplication
122-
distinct DistinctOperator Hash-based deduplication on column set
123-
124-
Set operations
125-
combine SetOperator union, union_all, intersect, except
126-
127-
Limiting
128-
limit/offset LimitOperator Row limiting with offset
129-
sample DataFrame.sample() Random sampling (Fisher-Yates)
130-
131-
Similarity
132-
vector near (planned) NEAR topK as composable operator — currently in scan layer
133-
```
134-
135-
### Compose operators directly
136-
137-
```typescript
138-
import {
139-
FilterOperator, AggregateOperator, HashJoinOperator,
140-
WindowOperator, TopKOperator, drainPipeline,
141-
type Operator, type RowBatch,
142-
} from "querymode"
143-
144-
// Your data source — any async batch producer
145-
const source: Operator = {
146-
async next() { /* return RowBatch or null */ },
147-
async close() {},
148-
}
149-
150-
// Chain operators — no query planner, no SQL string
151-
const filtered = new FilterOperator(source, [{ column: "age", op: "gt", value: 25 }])
152-
const aggregated = new AggregateOperator(filtered, {
153-
table: "users", filters: [], projections: [],
154-
groupBy: ["region"],
155-
aggregates: [{ fn: "sum", column: "amount", alias: "total" }],
156-
})
157-
// "HAVING" is just a filter after aggregation — same operator, you control order
158-
const having = new FilterOperator(aggregated, [{ column: "total", op: "gt", value: 1000 }])
159-
const top10 = new TopKOperator(having, "total", true, 10)
160-
161-
// Pull results — zero-copy, no serialization between stages
162-
const rows = await drainPipeline(top10)
163-
```
164-
165-
### Or use the DataFrame API
166-
167-
The same operators power the fluent API — `.filter()` becomes `FilterOperator`, `.sort()` becomes `ExternalSortOperator`, etc:
168-
169-
```typescript
170-
const qm = QueryMode.local()
171-
const results = await qm
172-
.table("orders")
173-
.filter("amount", "gt", 100)
174-
.groupBy("region")
175-
.aggregate("sum", "amount", "total")
176-
.sort("total", "desc")
177-
.limit(10)
178-
.exec()
179-
```
180-
181-
Both paths produce the same pull-based pipeline. The DataFrame API is sugar; the operators are the engine.
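Either way, consuming the pipeline is just a pull loop. A hand-rolled sketch of what a drain helper has to do (types reduced for illustration; `drainPipeline` is the library's real helper and its actual batches are columnar):

```typescript
// Minimal pull loop: call next() until the pipeline is exhausted.
// Row/RowBatch/Operator here are simplified stand-ins, not the real types.
type Row = Record<string, unknown>
type RowBatch = { rows: Row[] }
interface Operator {
  next(): Promise<RowBatch | null> // null signals end-of-stream
  close(): Promise<void>
}

async function drainAll(op: Operator): Promise<Row[]> {
  const out: Row[] = []
  try {
    for (let batch = await op.next(); batch !== null; batch = await op.next()) {
      out.push(...batch.rows)
    }
  } finally {
    await op.close() // release buffers/spill state even if a stage throws
  }
  return out
}
```

The consumer never knows, or cares, whether it is draining a fluent-API pipeline or a hand-assembled operator chain.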
182-
183-
### Memory-bounded with R2 spill
184-
185-
Operators that accumulate state (sort, join, aggregate) accept a memory budget. When exceeded, they spill to R2 via `SpillBackend` — same interface whether running on Cloudflare edge or local disk:
186-
187-
```typescript
188-
import { HashJoinOperator, ExternalSortOperator, R2SpillBackend } from "querymode"
189-
190-
const spill = new R2SpillBackend(env.DATA_BUCKET, "__spill/query-123")
191-
const join = new HashJoinOperator(left, right, "user_id", "id", "inner", 32 * 1024 * 1024, spill)
192-
const sorted = new ExternalSortOperator(join, "created_at", true, 0, 32 * 1024 * 1024, spill)
193-
const rows = await drainPipeline(sorted)
194-
await spill.cleanup()
195-
```
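The `SpillBackend` seam is what makes that budget portable. A hypothetical in-memory backend for tests might look like this (the method names `write`/`read`/`cleanup` are illustrative assumptions, not the library's actual interface):

```typescript
// Hypothetical in-memory spill backend — method names are illustrative
// assumptions for this sketch, not QueryMode's real SpillBackend contract.
class MemorySpillBackend {
  private parts = new Map<string, Uint8Array>()

  async write(key: string, bytes: Uint8Array): Promise<void> {
    this.parts.set(key, bytes)
  }

  async read(key: string): Promise<Uint8Array> {
    const bytes = this.parts.get(key)
    if (bytes === undefined) throw new Error(`missing spill part: ${key}`)
    return bytes
  }

  async cleanup(): Promise<void> {
    this.parts.clear()
  }
}
```

The point of the seam: an operator that exceeds its budget hands bytes to the backend and never knows whether they landed in R2, on local disk, or in a test map.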
196-
197-
### SQL frontend
198-
199-
SQL is another way in — same operator pipeline underneath:
200-
201-
```typescript
202-
const qm = QueryMode.local()
203-
const results = await qm
204-
.sql("SELECT region, SUM(amount) AS total FROM orders WHERE status = 'active' GROUP BY region ORDER BY total DESC LIMIT 10")
205-
.collect()
20635

207-
// SQL and DataFrame compose — chain further operations after SQL
208-
const filtered = await qm
209-
.sql("SELECT * FROM events WHERE created_at > '2026-01-01'")
210-
.filter("country", "eq", "US")
211-
.sort("amount", "desc")
212-
.limit(50)
36+
// SQL works too — same operator pipeline underneath
37+
const sql = await qm
38+
.sql("SELECT region, SUM(amount) AS total FROM orders GROUP BY region ORDER BY total DESC")
21339
.collect()
214-
```
215-
216-
Supports: SELECT, WHERE (AND/OR/NOT, LIKE, NOT LIKE, IN, NOT IN, BETWEEN, NOT BETWEEN, IS NULL, IS NOT NULL), GROUP BY, HAVING, ORDER BY (multi-column), LIMIT/OFFSET, DISTINCT, CASE/CAST, arithmetic expressions, JOINs, window functions (ROW_NUMBER, RANK, LAG, LEAD), UNION/INTERSECT/EXCEPT.
217-
218-
### Why this matters
21940

220-
Traditional engines give you a fixed query language. You can't put a window function before a join, run custom logic between pipeline stages, or swap the sort implementation. The planner decides.
221-
222-
With QueryMode, operators are building blocks. Your code assembles the pipeline, controls the memory budget, decides when to spill. The query engine isn't a service you call — it's a library your code composes.
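Concretely, "code between stages" can be an ordinary function that wraps one operator in another. Here is a sketch of a hypothetical scoring stage (the stage name and scoring callback are invented for illustration; types are simplified stand-ins):

```typescript
type Row = Record<string, unknown>
type RowBatch = { rows: Row[] }
interface Operator {
  next(): Promise<RowBatch | null>
  close(): Promise<void>
}

// A custom pipeline stage: plain code, no UDF registration, no round-trip
// out of the engine. score() stands in for any business logic or model call.
function scoringStage(child: Operator, score: (r: Row) => number): Operator {
  return {
    async next() {
      const batch = await child.next()
      if (batch === null) return null
      return { rows: batch.rows.map(r => ({ ...r, score: score(r) })) }
    },
    close: () => child.close(),
  }
}
```

Downstream operators see the added `score` column like any other, because the stage lives inside the pipeline rather than behind an engine boundary.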
223-
224-
### Beyond traditional engines
225-
226-
These examples show what's possible when operators are composable building blocks, not a fixed plan:
227-
228-
| Example | What it shows | Why DuckDB/Polars can't |
229-
|---------|--------------|------------------------|
230-
| [`examples/ml-scoring-pipeline.ts`](examples/ml-scoring-pipeline.ts) | Custom scoring runs **inside** the pipeline between Filter and TopK | UDFs serialize data across the engine boundary |
231-
| [`examples/adaptive-search.ts`](examples/adaptive-search.ts) | Vector search with adaptive threshold — recompose if too few results | Fixed query planner can't dynamically widen search |
232-
| [`examples/custom-spill-backend.ts`](examples/custom-spill-backend.ts) | Pluggable spill storage (memory, R2, S3) at 4KB budget | DuckDB: disk only. Polars: no spill at all |
233-
| [`examples/nextjs-api-route.ts`](examples/nextjs-api-route.ts) | Next.js/Vinext API route — query Parquet files, deploy to edge | DuckDB needs a sidecar process, can't run in Workers |
234-
235-
Run any example:
236-
```bash
237-
npx tsx examples/ml-scoring-pipeline.ts
238-
npx tsx examples/adaptive-search.ts
239-
npx tsx examples/custom-spill-backend.ts
240-
npx tsx examples/nextjs-api-route.ts
41+
// Edge mode — same API, WASM runs inside regional DOs
42+
const edge = QueryMode.remote(env.QUERY_DO, { region: "SJC" })
24143
```
24244

243-
## What exists
244-
245-
- **TypeScript orchestration** — Durable Object lifecycle, R2 range reads, footer caching, request routing
246-
- **Zig WASM engine** (`wasm/`) — column decoding, SIMD ops, SQL execution, vector search, fragment writing, compiles to `querymode.wasm`
247-
- **Code-first query API** — `.table().filter().select().sort().limit().exec()` or `.sql("SELECT ...")`, with `.toCode()` decompiler for logging and LLM context compression
248-
- **Write path** — `append(rows, { path, metadata })` with CAS-based manifest coordination via Master DO, `dropTable()` for cleanup
249-
- **Master/Query DO split** — single-writer Master broadcasts footer invalidations to per-region Query DOs
250-
- **Footer caching** — table footers (~4KB each) cached in DO memory with VIP eviction (hot tables protected from eviction)
251-
- **Bounded prefetch pipeline** — R2 range fetches overlap I/O (fetch page N+1 while WASM processes page N)
252-
- **IVF-PQ vector search** — index-aware routing in Query DO, falls back to flat SIMD search when no index present
253-
- **Multi-format support** — Lance, Parquet, and Iceberg tables
254-
- **Local mode** — same API reads Lance/Parquet files from disk or HTTP (Node/Bun)
255-
- **Fragment DO pool** — fan-out parallel scanning for multi-fragment datasets (one DO per fragment, scales with data)
256-
- **600+ tests** — unit tests cover footer parsing, column decoding, Parquet/Thrift, merging, aggregates, VIP cache, WASM integration, SQL, partition catalog, materialized executor, toCode decompiler; 110+ conformance tests validate every operator against DuckDB at 1M-5M row scale
257-
- **CI benchmarks** — head-to-head QueryMode (Miniflare) vs DuckDB (native) on every push, results posted to [GitHub Actions summary](https://github.com/teamchong/querymode/actions/workflows/ci.yml)
45+
## What it is
25846

259-
## What doesn't exist yet
47+
Operators are composable building blocks, not a fixed query plan. Your code assembles the pipeline, controls the memory budget, decides when to spill. The query engine isn't a service you call — it's a library your code composes.
26048

261-
- No deployed instance
262-
- No browser mode
263-
- No npm package published (install from source via git clone)
49+
| Layer | What |
50+
|-------|------|
51+
| **Zig WASM engine** | Column decoding, SIMD filtering, SQL execution, vector search |
52+
| **TypeScript orchestration** | DO lifecycle, R2 range reads, footer caching, request routing |
53+
| **Code-first API** | `.table().filter().sort().exec()` or `.sql("SELECT ...")` |
54+
| **Edge runtime** | Master/Query/Fragment DOs, R2 spill, multi-bucket sharding |
26455

265-
## Architecture
266-
![querymode-architecture](docs/architecture/querymode-architecture.svg)
56+
14 operators (filter, project, aggregate, sort, join, window, distinct, set ops, limit, sample, computed columns, subquery-in, top-K, vector search), all pull-based with the same `next() → RowBatch | null` interface.
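That shared contract can be reduced to a few lines (an illustrative sketch, not the library's exact type declarations):

```typescript
// Illustrative reduction of the pull-based contract described above —
// the real QueryMode types are richer (schemas, columnar batches, etc).
type Row = Record<string, unknown>
type RowBatch = { rows: Row[] }

interface Operator {
  next(): Promise<RowBatch | null> // null signals end-of-stream
  close(): Promise<void>
}

// A source operator that yields one batch, then ends.
function fromRows(rows: Row[]): Operator {
  let done = false
  return {
    async next() {
      if (done) return null
      done = true
      return { rows }
    },
    async close() {},
  }
}

// A filter operator: pulls from its child, keeps matching rows.
function filterOp(child: Operator, pred: (r: Row) => boolean): Operator {
  return {
    async next() {
      const batch = await child.next()
      if (batch === null) return null
      return { rows: batch.rows.filter(pred) }
    },
    close: () => child.close(),
  }
}
```

Because every stage exposes the same two methods, any async batch producer — a file scan, a network stream, your own generator — slots into a pipeline unchanged.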
26757

26858
## Build
26959

27060
```bash
271-
pnpm install # install dependencies
272-
pnpm build:ts # typecheck only (no WASM rebuild needed — pre-built WASM included)
273-
pnpm test:node # run node tests (~2 min)
274-
pnpm test:workers # run workerd tests
275-
pnpm test # run all tests (~8 min)
61+
pnpm build:ts # typecheck (pre-built WASM included)
62+
pnpm test:node # node tests (~2 min)
63+
pnpm test:workers # workerd tests
64+
pnpm test # all tests (~8 min)
27665
pnpm dev # local dev with wrangler
27766

27867
# Rebuild WASM from Zig source (requires zig toolchain)
279-
# Install: https://ziglang.org/download/
280-
pnpm wasm # cd wasm && zig build wasm && cp to src/wasm/
281-
```
282-
283-
## Query API
284-
285-
```typescript
286-
import { QueryMode } from "querymode"
287-
288-
// Local mode — query files directly where they sit
289-
const qm = QueryMode.local()
290-
const results = await qm
291-
.table("./data/users.lance")
292-
.filter("age", "gt", 25)
293-
.select("name", "email")
294-
.exec()
295-
296-
// Edge mode — same API, WASM runs inside regional DOs
297-
const qm = QueryMode.remote(env.QUERY_DO, { region: "SJC" })
298-
const results = await qm
299-
.table("users")
300-
.filter("age", "gt", 25)
301-
.select("name", "email")
302-
.sort("age", "desc")
303-
.limit(100)
304-
.exec()
305-
306-
// JOINs are code, not SQL — your logic, zero materialization
307-
const orders = await qm.table("orders").filter("amount", "gt", 100).exec()
308-
const userIds = orders.rows.map(r => r.user_id)
309-
const users = await qm.table("users").filter("id", "in", userIds).exec()
310-
const enriched = orders.rows.map(o => ({
311-
...o,
312-
user: users.rows.find(u => u.id === o.user_id)
313-
}))
314-
315-
// Write path (append rows)
316-
await qm.table("users").append([
317-
{ id: 1, name: "Alice", age: 30 },
318-
{ id: 2, name: "Bob", age: 25 },
319-
])
320-
321-
// Write to specific path with metadata (catalog-friendly)
322-
await qm.table("enriched").append(rows, {
323-
path: "pipelines/job-abc/enriched.lance/",
324-
metadata: { pipelineId: "job-abc", sourceTables: "orders,users", ttl: "7d" },
325-
})
326-
327-
// Drop table (cleanup)
328-
await qm.table("enriched").dropTable()
329-
330-
// Vector search (flat or IVF-PQ accelerated)
331-
const similar = await qm
332-
.table("images")
333-
.vector("embedding", queryVec, 10)
334-
.exec()
68+
pnpm wasm
33569
```
33670

337-
## MapReduce over the network
338-
339-
WASM is slower than native (~1.3–1.5× overhead), and a single Durable Object has hard memory and CPU caps. You can't build a competitive query engine by running everything in one WASM instance on one node.
71+
## What exists
34072

341-
QueryMode doesn't try. It uses the network as a distributed compute fabric — like biological cells, not a brain:
73+
- 600+ tests, 110+ conformance tests validated against DuckDB at 1M-5M row scale
74+
- CI benchmarks: QueryMode (Miniflare) vs DuckDB (native) on every push
75+
- Multi-format: Lance, Parquet, Iceberg, CSV, JSON
76+
- Memory-bounded operators with R2 spill (sort, join, aggregate)
77+
- IVF-PQ vector search with flat SIMD fallback
78+
- Zero-copy columnar pipeline (QMCB binary format, no Row[] until response boundary)
79+
- Local mode (Node/Bun) and edge mode (Cloudflare Workers)
34280

343-
- **DOs as cells** — every Fragment DO carries the same WASM binary (DNA). They activate on signal, scan their fragment, and go dormant. More data → more cells. Idle cells cost nothing (they hibernate).
344-
- **R2 as virtual memory** — when a single DO's 128MB fills up, operators spill to R2. The pipeline doesn't care if data is in-memory or spilled — same interface, unbounded capacity.
345-
- **Fan-out as bandwidth** — more fragments = more parallel R2 reads = more aggregate throughput. No cell coordinates with another — they all respond to the same signal independently.
81+
## What doesn't exist yet
34682

347-
QueryDO **maps** fragments to Fragment DOs, each DO runs WASM SIMD on its shard, then QueryDO **reduces** via k-way merge. No single node does heavy work. The code is the DNA — scale comes from more cells, not smarter ones. See [Architecture](https://teamchong.github.io/querymode/architecture/) for the full deep dive.
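The reduce step is a k-way merge of per-fragment sorted runs. A minimal sketch of the idea, minus streaming, heaps, and the network hop:

```typescript
// Merge k sorted arrays into one sorted array — the shape of the reduce
// step described above, reduced to numbers for illustration.
function kWayMerge(runs: number[][]): number[] {
  const idx = runs.map(() => 0) // cursor into each run
  const out: number[] = []
  for (;;) {
    let best = -1
    for (let k = 0; k < runs.length; k++) {
      if (idx[k] < runs[k].length &&
          (best === -1 || runs[k][idx[k]] < runs[best][idx[best]])) {
        best = k
      }
    }
    if (best === -1) return out // every run exhausted
    out.push(runs[best][idx[best]++])
  }
}
```

Each Fragment DO produces one already-sorted run, so the coordinator's work is proportional to the output size, not to any single fragment's size.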
83+
- No deployed instance
84+
- No browser mode
85+
- No npm package published (install from source)
34886

34987
## License
35088
