If the executor supports cursors, `.stream()` fetches batches incrementally. Otherwise it falls back to `.collect()` and yields slices — still useful for processing without holding all rows in your code at once.
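The fallback can be pictured with a small self-contained sketch; the `Executor` shape here is hypothetical, not the library's actual interface:

```typescript
// Sketch of the stream-or-collect fallback. `Executor` is a made-up shape
// for illustration; the real executor interface is library-internal.
type Row = Record<string, unknown>

interface Executor {
  openCursor?(batchSize: number): AsyncIterable<Row[]>
  collect(): Promise<Row[]>
}

async function* stream(exec: Executor, batchSize: number): AsyncGenerator<Row[]> {
  if (exec.openCursor) {
    // Cursor path: each batch is fetched incrementally on demand.
    yield* exec.openCursor(batchSize)
    return
  }
  // Fallback path: one full collect, then re-yield fixed-size slices so
  // caller code still processes bounded batches.
  const rows = await exec.collect()
  for (let i = 0; i < rows.length; i += batchSize) {
    yield rows.slice(i, i + batchSize)
  }
}
```

Either way the caller sees the same async-iterable of batches; only the I/O pattern differs.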
`.stream()` is also available on `LazyResultHandle`:

```typescript
const handle = await qm.table("events").lazy()

for await (const batch of handle.stream(500)) {
  process(batch)
}
```
## Cursor
`.cursor()` is the low-level streaming primitive. It requires an executor with cursor support (e.g., edge mode) and throws if not available:
`.after(value)` translates to a `gt` filter on the sort column (or `lt` for descending sorts), which benefits from page-level skip. Every page is equally fast regardless of depth.
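A sketch of that translation; the `Filter` shape and `afterToFilter` helper are hypothetical, for illustration only:

```typescript
// Hypothetical sketch: resuming after a cursor value is just a strict
// range filter on the sort column, so min/max page pruning applies to
// every page request at any depth.
type Order = "asc" | "desc"

interface Filter { column: string; op: "gt" | "lt"; value: number }

function afterToFilter(sortColumn: string, order: Order, last: number): Filter {
  // Ascending sorts resume strictly above the last seen value,
  // descending sorts strictly below it.
  return { column: sortColumn, op: order === "asc" ? "gt" : "lt", value: last }
}
```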
2. Fragment-level skip → skip files by column min/max across all pages (canSkipFragment)
3. Page-level skip → skip pages within a file by per-page stats (canSkipPage)
4. WASM SIMD scan → decode + filter in one pass (no Row[] intermediate)
5. Columnar merge → k-way merge on typed arrays (no Row[] until exit)
6. Row materialization → only at final response boundary
```
The most expensive step is always I/O (R2 reads). Everything else optimizes around reducing I/O.
Operators that accumulate state accept a memory budget (bytes). When exceeded, t…
| Operator | Default budget | What it accumulates |
|----------|----------------|---------------------|
| `ExternalSortOperator` | 256 MB | All rows until sorted |
| `HashJoinOperator` | 256 MB | Build side hash table |
| `AggregateOperator` | unbounded | Group states (usually small) |
| `DistinctOperator` | unbounded | Seen-values hash set |
**Sizing guidance:**
- **256 MB** (default) works for most queries — covers ~20M rows of numeric data or ~5M string rows
- Reduce to **64–128 MB** on Cloudflare DOs to leave headroom for page buffers and WASM memory (DO limit is 128 MB)
- Local mode (Node/Bun) has no practical limit — set budget to available RAM
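To illustrate the spill behavior a budget controls, here is a self-contained sketch of budget-bounded sorting. It is not the library's `ExternalSortOperator`, and the 8-bytes-per-value estimate is an assumption:

```typescript
// Illustrative only: flush a sorted run whenever the estimated buffer size
// exceeds the byte budget, then k-way merge the runs at the end. Peak
// in-memory state is bounded by the budget, not the input size.
function externalSort(values: number[], budgetBytes: number): number[] {
  const BYTES_PER_VALUE = 8 // rough float64 estimate (an assumption)
  const runs: number[][] = []
  let buffer: number[] = []

  for (const v of values) {
    buffer.push(v)
    if (buffer.length * BYTES_PER_VALUE >= budgetBytes) {
      runs.push([...buffer].sort((a, b) => a - b)) // spill a sorted run
      buffer = []
    }
  }
  if (buffer.length) runs.push([...buffer].sort((a, b) => a - b))

  // Simple k-way merge over the sorted runs.
  const cursors = runs.map(() => 0)
  const out: number[] = []
  while (out.length < values.length) {
    let best = -1
    for (let r = 0; r < runs.length; r++) {
      if (cursors[r] < runs[r].length &&
          (best === -1 || runs[r][cursors[r]] < runs[best][cursors[best]])) {
        best = r
      }
    }
    out.push(runs[best][cursors[best]++])
  }
  return out
}
```

A smaller budget means more, shorter runs and therefore more merge passes over spilled data, which is why the guidance above trades budget size against DO memory headroom.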
The DataFrame API picks automatically: `.sort().limit(k)` uses TopK when k is small.
| Pattern | Behavior |
|---------|----------|
| GROUP BY with few groups (< 10K) | Hash map of accumulators — fast |
| GROUP BY with many groups (> 100K) | Memory grows with cardinality — consider pre-filtering |
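For intuition on why `.sort().limit(k)` stays cheap when k is small, here is a rough TopK sketch; a real operator would use a heap rather than re-sorting the kept slice:

```typescript
// Illustrative TopK: keep only the k smallest values seen so far, so memory
// stays O(k) instead of O(n) for a full sort.
function topK(values: Iterable<number>, k: number): number[] {
  const kept: number[] = [] // sorted ascending, at most k entries
  for (const v of values) {
    if (kept.length < k) {
      kept.push(v)
      kept.sort((a, b) => a - b)
    } else if (v < kept[k - 1]) {
      kept[k - 1] = v // replace the current worst candidate
      kept.sort((a, b) => a - b)
    }
  }
  return kept
}
```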
## Fragment-level skip
Before reading any page data, `canSkipFragment` aggregates min/max/nullCount across all pages in a fragment and checks if the entire fragment can be eliminated. This reuses the same `canSkipPage` logic but on fragment-wide stats — one check to skip potentially thousands of pages.

Fragment-level skip is automatic and costs nothing — it runs before any R2 I/O. For datasets with many small fragments (e.g., append-heavy workloads), this is often more effective than page-level skip because it eliminates entire R2 reads rather than individual pages within a read.
In explain output, `fragmentsSkipped` counts fragments eliminated by both partition pruning and fragment-level skip combined.
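A sketch of the idea, with a hypothetical `PageStats` shape and a single `gt` filter for brevity:

```typescript
// Hypothetical sketch: fragment-wide stats are the fold of per-page stats,
// so one range check can rule out every page in the fragment before any I/O.
interface PageStats { min: number; max: number }

function fragmentStats(pages: PageStats[]): PageStats {
  return pages.reduce((acc, p) => ({
    min: Math.min(acc.min, p.min),
    max: Math.max(acc.max, p.max),
  }))
}

// e.g. for a `value > threshold` filter, the fragment is skippable when
// even its largest value cannot pass.
function canSkipFragmentGt(pages: PageStats[], threshold: number): boolean {
  return fragmentStats(pages).max <= threshold
}
```

Note the fold can be pessimistic: pages with gaps between their ranges may pass the fragment-wide check even when each page individually would be skipped, which is why per-page checks still run afterwards.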
## Page-level skip
Each Lance page stores min/max stats per column. The scan layer checks these before reading page data:
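A minimal sketch of such a check, with a hypothetical `PageStats` shape (the real stats also include null counts):

```typescript
// A page can be skipped when its [min, max] range cannot satisfy the filter.
interface PageStats { min: number; max: number }

function canSkipPage(stats: PageStats, op: "gt" | "lt" | "eq", value: number): boolean {
  switch (op) {
    case "gt": return stats.max <= value // no row can exceed `value`
    case "lt": return stats.min >= value // no row can be below `value`
    case "eq": return value < stats.min || value > stats.max // `value` outside range
  }
}
```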
Key fields in `ExplainResult`:
| Field | Meaning |
|-------|---------|
| `totalRows` | Total rows in the table |
| `estimatedRows` | Rows remaining after pruning |
| `fragments` / `fragmentsSkipped` | Fragments eliminated by partition pruning + fragment-level skip |
| `pagesTotal` / `pagesSkipped` | How many pages min/max pruning eliminated |
| `estimatedBytes` / `estimatedR2Reads` | Actual I/O cost (bytes and coalesced R2 reads) |
| `filters[].pushable` | Whether each filter is pushed to the scan layer |
CI runs head-to-head benchmarks against DuckDB on every push. Typical results at 1M-5M rows: