
Commit 958ea66

feat: zero-config entry, observability, introspection, and composability examples
- Add fromJSON(), fromCSV(), demo() convenience factories for zero-config usage
- Add formatResultSummary() and formatExplain() observability formatters
- Add describe() and head() methods to DataFrame for data introspection
- Add onProgress callback to collect() for long-running query feedback
- Add per-phase timing (scanMs, pipelineMs, metaMs) to local executor
- Add sort support to MaterializedExecutor
- Export parseCsvFull from csv-reader for reuse
- Improve TABLE_NOT_FOUND and INVALID_FORMAT error messages
- Add 4 composability examples: ML scoring, adaptive search, custom spill, Next.js/Vinext
- Fix README: remove non-existent .whereBetween(), link CI benchmarks
- 25 new tests (convenience + format)
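For illustration, a minimal self-contained sketch of the kind of one-line summary the new formatResultSummary() formatter could produce from the per-phase timings (scanMs, pipelineMs, metaMs) named above — the result shape and exact output format here are assumptions, not the library's actual API:

```typescript
// Hypothetical result shape — field names are assumptions based on the
// commit message, not the library's real types.
interface ResultSummary {
  rows: unknown[];
  timings: { scanMs: number; pipelineMs: number; metaMs: number };
}

// One-line summary: row count plus the per-phase timings.
function formatResultSummary(result: ResultSummary): string {
  const { scanMs, pipelineMs, metaMs } = result.timings;
  const totalMs = scanMs + pipelineMs + metaMs;
  return `${result.rows.length} rows in ${totalMs}ms (scan ${scanMs}ms, pipeline ${pipelineMs}ms, meta ${metaMs}ms)`;
}

console.log(formatResultSummary({
  rows: [{ id: 1 }, { id: 2 }],
  timings: { scanMs: 12, pipelineMs: 5, metaMs: 1 },
}));
// → "2 rows in 18ms (scan 12ms, pipeline 5ms, meta 1ms)"
```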
1 parent e4e1af7 commit 958ea66

16 files changed, +1282 -16 lines

README.md

Lines changed: 52 additions & 1 deletion
@@ -2,6 +2,38 @@
> **Experimental** — early prototype, not production-ready. Architecture and API will change.

## Quickstart

```bash
pnpm add querymode
```

```typescript
import { QueryMode } from "querymode/local"

// Zero-config: demo data, no files needed
const demo = QueryMode.demo()
const top5 = await demo
  .filter("category", "eq", "Electronics")
  .sort("amount", "desc")
  .limit(5)
  .collect()

console.log(top5.rows)

// Or query your own files — Parquet, Lance, CSV, JSON, Arrow
const qm = QueryMode.local()
const result = await qm
  .table("./data/events.parquet")
  .filter("status", "eq", "active")
  .filter("amount", "gte", 100)
  .filter("amount", "lte", 500)
  .select("id", "amount", "region")
  .sort("amount", "desc")
  .limit(20)
  .collect()
```
A pluggable columnar query library — not a query engine you push data to, but a query capability your code uses directly. No data materialization, no engine boundary, no SQL transpilation.

## Why "mode" not "engine"
@@ -167,6 +199,25 @@ Traditional engines give you a fixed query language. You can't put a window func
With QueryMode, operators are building blocks. Your code assembles the pipeline, controls the memory budget, decides when to spill. The query engine isn't a service you call — it's a library your code composes.

### Beyond traditional engines

These examples show what's possible when operators are composable building blocks, not a fixed plan:

| Example | What it shows | Why DuckDB/Polars can't |
|---------|---------------|-------------------------|
| [`examples/ml-scoring-pipeline.ts`](examples/ml-scoring-pipeline.ts) | Custom scoring runs **inside** the pipeline between Filter and TopK | UDFs serialize data across the engine boundary |
| [`examples/adaptive-search.ts`](examples/adaptive-search.ts) | Vector search with adaptive threshold — recompose if too few results | Fixed query planner can't dynamically widen search |
| [`examples/custom-spill-backend.ts`](examples/custom-spill-backend.ts) | Pluggable spill storage (memory, R2, S3) at a 4KB budget | DuckDB: disk only. Polars: no spill at all |
| [`examples/nextjs-api-route.ts`](examples/nextjs-api-route.ts) | Next.js/Vinext API route — query Parquet files, deploy to edge | DuckDB needs a sidecar process, can't run in Workers |

Run any example:

```bash
npx tsx examples/ml-scoring-pipeline.ts
npx tsx examples/adaptive-search.ts
npx tsx examples/custom-spill-backend.ts
npx tsx examples/nextjs-api-route.ts
```
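The composability these examples rely on can be sketched in a few self-contained lines: a custom operator wraps an upstream operator in a pull-based pipeline and transforms each batch it pulls. The `Operator` interface below is a simplified stand-in for illustration, not the exact shape in `src/operators.ts`:

```typescript
// Simplified stand-in for a pull-based operator interface — an assumption
// for illustration, not the library's real types.
type Row = Record<string, unknown>;
interface Operator {
  next(): Promise<Row[] | null>; // pull one batch; null when exhausted
  close(): Promise<void>;
}

// A custom operator composes like any stock one: wrap an upstream,
// transform each batch, pass it along. Here: add a derived _score column.
class ScoreOperator implements Operator {
  constructor(private upstream: Operator) {}
  async next(): Promise<Row[] | null> {
    const batch = await this.upstream.next();
    if (!batch) return null;
    return batch.map(r => ({ ...r, _score: (r.amount as number) * 2 }));
  }
  async close() { await this.upstream.close(); }
}

// Minimal source: emits all rows as a single batch, then null.
class ArraySource implements Operator {
  private done = false;
  constructor(private rows: Row[]) {}
  async next() { if (this.done) return null; this.done = true; return this.rows; }
  async close() {}
}

// Drain: pull batches until the pipeline is exhausted.
async function drain(op: Operator): Promise<Row[]> {
  const out: Row[] = [];
  for (let b = await op.next(); b; b = await op.next()) out.push(...b);
  await op.close();
  return out;
}

const rows = await drain(new ScoreOperator(new ArraySource([{ amount: 3 }, { amount: 5 }])));
console.log(rows); // [{ amount: 3, _score: 6 }, { amount: 5, _score: 10 }]
```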
## What exists

- **TypeScript orchestration** — Durable Object lifecycle, R2 range reads, footer caching, request routing

@@ -181,7 +232,7 @@ With QueryMode, operators are building blocks. Your code assembles the pipeline,

- **Local mode** — same API reads Lance/Parquet files from disk or HTTP (Node/Bun)
- **Fragment DO pool** — fan-out parallel scanning for multi-fragment datasets (max 20 slots per datacenter)
- **112 unit tests + 26 conformance tests** — unit tests cover footer parsing, column decoding, Parquet/Thrift, merging, aggregates, VIP cache, WASM integration; conformance tests validate every operator against DuckDB at 1M-5M row scale

```diff
-- **CI benchmarks** — head-to-head QueryMode (Miniflare) vs DuckDB (native) on every push, results posted to GitHub Actions summary
+- **CI benchmarks** — head-to-head QueryMode (Miniflare) vs DuckDB (native) on every push, results posted to [GitHub Actions summary](https://github.com/teamchong/querymode/actions/workflows/ci.yml)
```
## What doesn't exist yet

examples/adaptive-search.ts

Lines changed: 150 additions & 0 deletions
@@ -0,0 +1,150 @@
```typescript
/**
 * Adaptive Vector Search — dynamically widen search when too few results match.
 *
 * Traditional query planners have a fixed execution plan. Here, the pipeline
 * is recomposed at runtime: if the initial distance threshold yields too few
 * results, we widen it and search again. Impossible with a fixed query planner.
 *
 * Pipeline (per iteration):
 *   MockOperator(vectors) → VectorDistanceOperator(_distance) → TopKOperator(top-50) → FilterOperator(_distance < threshold)
 */

import {
  FilterOperator,
  TopKOperator,
  drainPipeline,
  type Operator,
  type RowBatch,
} from "../src/operators.js";
import { cosineDistance } from "../src/hnsw.js";
import type { Row } from "../src/types.js";

// ─── Mock vector data ───────────────────────────────────────────────────

class MockOperator implements Operator {
  private rows: Row[];
  private cursor = 0;
  private batchSize: number;
  constructor(rows: Row[], batchSize = 100) {
    this.rows = rows;
    this.batchSize = batchSize;
  }
  async next(): Promise<RowBatch | null> {
    if (this.cursor >= this.rows.length) return null;
    const batch = this.rows.slice(this.cursor, this.cursor + this.batchSize);
    this.cursor += this.batchSize;
    return batch;
  }
  async close() {}
}

/**
 * Custom operator: compute cosine distance for each row and add a _distance column.
 * This is the kind of operator you can compose freely in QueryMode.
 */
class VectorDistanceOperator implements Operator {
  private upstream: Operator;
  private column: string;
  private queryVec: Float32Array;
  constructor(upstream: Operator, column: string, queryVec: Float32Array) {
    this.upstream = upstream;
    this.column = column;
    this.queryVec = queryVec;
  }
  async next(): Promise<RowBatch | null> {
    const batch = await this.upstream.next();
    if (!batch) return null;
    return batch.map(row => ({
      ...row,
      _distance: cosineDistance(row[this.column] as Float32Array, this.queryVec),
    }));
  }
  async close() { await this.upstream.close(); }
}

// Deterministic PRNG
let prngState = 12345;
function xorshift32(): number {
  prngState ^= prngState << 13;
  prngState ^= prngState >>> 17;
  prngState ^= prngState << 5;
  return (prngState >>> 0) / 0xFFFFFFFF;
}

// Generate 200 items with 8-dim embeddings
const DIM = 8;
const items: Row[] = Array.from({ length: 200 }, (_, i) => {
  const embedding = new Float32Array(DIM);
  for (let d = 0; d < DIM; d++) embedding[d] = xorshift32() * 2 - 1;
  // Normalize
  let norm = 0;
  for (let d = 0; d < DIM; d++) norm += embedding[d] * embedding[d];
  norm = Math.sqrt(norm);
  for (let d = 0; d < DIM; d++) embedding[d] /= norm;
  return { id: i + 1, label: `item_${i + 1}`, embedding };
});

// Query vector (normalized)
const queryVec = new Float32Array(DIM);
for (let d = 0; d < DIM; d++) queryVec[d] = xorshift32() * 2 - 1;
let qNorm = 0;
for (let d = 0; d < DIM; d++) qNorm += queryVec[d] * queryVec[d];
qNorm = Math.sqrt(qNorm);
for (let d = 0; d < DIM; d++) queryVec[d] /= qNorm;

// ─── Adaptive search ────────────────────────────────────────────────────

const MIN_RESULTS = 10;
const INITIAL_THRESHOLD = 0.3;
const WIDEN_STEP = 0.15;
const MAX_THRESHOLD = 1.5;

async function adaptiveSearch(): Promise<Row[]> {
  let threshold = INITIAL_THRESHOLD;

  while (threshold <= MAX_THRESHOLD) {
    console.log(`  Searching with distance threshold: ${threshold.toFixed(2)}`);

    // Recompose pipeline with current threshold
    const source = new MockOperator(items);
    const withDistance = new VectorDistanceOperator(source, "embedding", queryVec);
    const topK = new TopKOperator(withDistance, "_distance", false, 50);
    const filtered = new FilterOperator(topK, [
      { column: "_distance", op: "lt", value: threshold },
    ]);

    const results = await drainPipeline(filtered);
    console.log(`  → Found ${results.length} results`);

    if (results.length >= MIN_RESULTS) {
      return results;
    }

    // Widen threshold and recompose
    threshold += WIDEN_STEP;
    console.log(`  Too few results, widening threshold...\n`);
  }

  // Final pass with no distance filter — just top 50
  console.log("  Final pass: returning all top-50 results");
  const source = new MockOperator(items);
  const withDistance = new VectorDistanceOperator(source, "embedding", queryVec);
  const topK = new TopKOperator(withDistance, "_distance", false, 50);
  return drainPipeline(topK);
}

async function main() {
  console.log("Adaptive Vector Search\n");
  console.log("Pipeline is recomposed at runtime based on result quality.\n");

  const results = await adaptiveSearch();

  console.log(`\nFinal results: ${results.length} items`);
  for (const row of results.slice(0, 5)) {
    console.log(`  ${row.label} | distance: ${(row._distance as number).toFixed(4)}`);
  }
  console.log(`\nKey insight: the pipeline was dynamically recomposed based on result count.`);
  console.log("A fixed query planner cannot do this — the plan IS the execution.");
}

main().catch(console.error);
```

examples/custom-spill-backend.ts

Lines changed: 127 additions & 0 deletions
@@ -0,0 +1,127 @@
```typescript
/**
 * Custom Spill Backend — plug your own storage into sort and join operators.
 *
 * This example implements an InMemorySpillBackend that satisfies the SpillBackend
 * interface. It's used with ExternalSortOperator and HashJoinOperator at a tiny
 * 4KB memory budget to force spilling.
 *
 * DuckDB: disk-only spill. Polars: no spill at all.
 * QueryMode: any SpillBackend — R2, disk, memory, S3, you decide.
 */

import {
  ExternalSortOperator,
  HashJoinOperator,
  drainPipeline,
  type Operator,
  type RowBatch,
} from "../src/operators.js";
import type { SpillBackend } from "../src/r2-spill.js";
import type { Row } from "../src/types.js";

// ─── In-memory spill backend ────────────────────────────────────────────

class InMemorySpillBackend implements SpillBackend {
  private runs = new Map<string, Row[]>();
  private nextId = 0;
  bytesWritten = 0;
  bytesRead = 0;

  async writeRun(rows: Row[]): Promise<string> {
    const id = `mem-run-${this.nextId++}`;
    const copy = rows.map(r => ({ ...r }));
    this.runs.set(id, copy);
    const size = rows.length * 100; // rough estimate
    this.bytesWritten += size;
    return id;
  }

  async *streamRun(spillId: string): AsyncGenerator<Row> {
    const rows = this.runs.get(spillId);
    if (!rows) throw new Error(`Spill run not found: ${spillId}`);
    const size = rows.length * 100;
    this.bytesRead += size;
    for (const row of rows) yield row;
  }

  async cleanup(): Promise<void> {
    this.runs.clear();
  }
}

// ─── Mock data source ───────────────────────────────────────────────────

class MockOperator implements Operator {
  private rows: Row[];
  private cursor = 0;
  private batchSize: number;
  constructor(rows: Row[], batchSize = 50) {
    this.rows = rows;
    this.batchSize = batchSize;
  }
  async next(): Promise<RowBatch | null> {
    if (this.cursor >= this.rows.length) return null;
    const batch = this.rows.slice(this.cursor, this.cursor + this.batchSize);
    this.cursor += this.batchSize;
    return batch;
  }
  async close() {}
}

// ─── Demo ───────────────────────────────────────────────────────────────

async function main() {
  console.log("Custom Spill Backend Demo\n");

  // Generate 200 orders and 100 users
  const orders: Row[] = Array.from({ length: 200 }, (_, i) => ({
    order_id: i + 1,
    user_id: (i % 100) + 1,
    amount: ((i * 7 + 13) % 1000) + 1,
  }));

  const users: Row[] = Array.from({ length: 100 }, (_, i) => ({
    user_id: i + 1,
    name: `User_${i + 1}`,
  }));

  const spill = new InMemorySpillBackend();
  const TINY_BUDGET = 4 * 1024; // 4KB — forces spilling

  // 1. External sort with custom spill backend
  console.log("1. External sort (4KB budget, forces spill to memory backend):");
  const sortSource = new MockOperator(orders);
  const sorted = new ExternalSortOperator(
    sortSource, "amount", true, 0, TINY_BUDGET, spill,
  );
  const sortedRows = await drainPipeline(sorted);
  console.log(`  Sorted ${sortedRows.length} rows by amount desc`);
  console.log(`  Spill stats: ${spill.bytesWritten} bytes written, ${spill.bytesRead} bytes read`);
  console.log(`  Top 3: ${sortedRows.slice(0, 3).map(r => r.amount).join(", ")}\n`);

  // Reset spill
  await spill.cleanup();
  spill.bytesWritten = 0;
  spill.bytesRead = 0;

  // 2. Hash join with custom spill backend
  console.log("2. Hash join (4KB budget, forces Grace hash partitioning):");
  const leftSource = new MockOperator(orders);
  const rightSource = new MockOperator(users);
  const joined = new HashJoinOperator(
    leftSource, rightSource, "user_id", "user_id", "inner",
    TINY_BUDGET, spill,
  );
  const joinedRows = await drainPipeline(joined);
  console.log(`  Joined ${joinedRows.length} rows`);
  console.log(`  Spill stats: ${spill.bytesWritten} bytes written, ${spill.bytesRead} bytes read`);
  console.log(`  Sample: order #${joinedRows[0]?.order_id} → ${joinedRows[0]?.name}\n`);

  await spill.cleanup();

  console.log("Key insight: spill storage is pluggable — R2, disk, memory, S3.");
  console.log("DuckDB: disk only. Polars: no spill at all.");
  console.log("QueryMode: implement SpillBackend and plug it in.");
}

main().catch(console.error);
```
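For contrast with the in-memory backend above, here is a hedged sketch of a disk-backed variant implementing the same three methods this example exercises (writeRun, streamRun, cleanup), using JSON-lines files under a temp directory. The real SpillBackend interface in `src/r2-spill.ts` may carry additional members; this mirrors only what is visible in the example:

```typescript
import { promises as fsp, mkdtempSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

type Row = Record<string, unknown>;

// Disk-backed spill sketch: one JSON-lines file per run. The interface shape
// (writeRun/streamRun/cleanup) is copied from the in-memory example above;
// anything beyond that is an assumption.
class DiskSpillBackend {
  private dir = mkdtempSync(join(tmpdir(), "qm-spill-"));
  private nextId = 0;

  async writeRun(rows: Row[]): Promise<string> {
    const id = `run-${this.nextId++}`;
    const lines = rows.map(r => JSON.stringify(r)).join("\n");
    await fsp.writeFile(join(this.dir, id), lines, "utf8");
    return id;
  }

  async *streamRun(spillId: string): AsyncGenerator<Row> {
    const text = await fsp.readFile(join(this.dir, spillId), "utf8");
    for (const line of text.split("\n")) {
      if (line) yield JSON.parse(line) as Row;
    }
  }

  async cleanup(): Promise<void> {
    await fsp.rm(this.dir, { recursive: true, force: true });
  }
}

// Round-trip a run through disk and back.
const spill = new DiskSpillBackend();
const id = await spill.writeRun([{ a: 1 }, { a: 2 }]);
const back: Row[] = [];
for await (const row of spill.streamRun(id)) back.push(row);
console.log(back); // [{ a: 1 }, { a: 2 }]
await spill.cleanup();
```

Swapping this in for InMemorySpillBackend would move spill runs from heap to disk without touching the sort or join operators themselves.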
