Commit ac7fb6f

docs: add NDJSON support, Lance v1/v2 clarification, IVF-PQ read-only

- formats.mdx: mention NDJSON/JSONL support, file-based CSV/JSON reading, Lance v1+v2 both supported (was only "v2")
- vector-search.mdx: clarify IVF-PQ is load-only (build externally), HNSW is the buildable index type, add comparison row

1 parent 61ff0ab commit ac7fb6f

2 files changed

Lines changed: 20 additions & 12 deletions

docs/src/content/docs/formats.mdx

Lines changed: 13 additions & 6 deletions
@@ -1,6 +1,6 @@
 ---
 title: Formats
-description: Parquet, Lance v2, Iceberg, CSV, JSON, Arrow — same API regardless of format.
+description: Parquet, Lance, Iceberg, CSV, JSON/NDJSON, Arrow — same API regardless of format.
 ---
 
 QueryMode reads columnar and row-oriented formats with the same API. Format detection is automatic based on file magic bytes.
@@ -10,10 +10,10 @@ QueryMode reads columnar and row-oriented formats with the same API. Format dete
 | Format | Type | Page skip | Compression | Status |
 |--------|------|-----------|-------------|--------|
 | **Parquet** | Columnar | Min/max stats | Snappy, ZSTD, GZIP, LZ4 | Full support |
-| **Lance v2** | Columnar | Min/max stats | None (raw pages) | Full support |
+| **Lance v1/v2** | Columnar | Min/max stats | None (raw pages) | Full support |
 | **Iceberg** | Table format | Via Parquet | Via Parquet | Metadata + Parquet data |
 | **CSV** | Row | No | No | Via `fromCSV()` |
-| **JSON** | Row | No | No | Via `fromJSON()` |
+| **JSON / NDJSON** | Row | No | No | Via `fromJSON()` or file path |
 | **Arrow** | Columnar | No | No | In-memory only |
 
 ## Parquet
@@ -45,9 +45,9 @@ const result = await qm
 
 Snappy decompression is pure TypeScript. ZSTD, GZIP, and LZ4 use the WASM engine.
 
-## Lance v2
+## Lance v1 / v2
 
-Native Lance v2 format reader. Parses the 40-byte footer, column metadata protobuf, and page data.
+Native Lance format reader supporting both v1 and v2 layouts. Parses the 40-byte footer, column metadata protobuf, and page data.
 
 ```typescript
 const result = await qm
@@ -75,7 +75,7 @@ const result = await qm
 
 Supports Iceberg v1 and v2 metadata, type mapping from Iceberg schema to QueryMode types.
 
-## CSV and JSON
+## CSV, JSON, and NDJSON
 
 In-memory materialization for small datasets:
 
@@ -85,8 +85,15 @@ const df = await QueryMode.fromCSV(csvString, "my_table")
 
 // From JSON array
 const df = QueryMode.fromJSON(jsonArray, "my_table")
+
+// From file — auto-detected by extension and content
+const df = await qm.table("./data/events.json").collect() // JSON array
+const df = await qm.table("./data/events.ndjson").collect() // newline-delimited JSON
+const df = await qm.table("./data/events.csv").collect() // CSV/TSV/PSV (auto-detect delimiter)
 ```
 
+CSV auto-detects delimiter (comma, tab, pipe) and infers column types from the data. JSON supports both `[{...}, {...}]` arrays and NDJSON (`{...}\n{...}`) based on the first non-whitespace byte.
+
 These materialize all data in memory. Use Parquet or Lance for large datasets.
 
 ## Format detection
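
The formats.mdx text says format detection is based on file magic bytes. As a rough illustration only, the sketch below checks two markers: Parquet files begin and end with the ASCII bytes `PAR1`, and Lance files are assumed here to end with `LANC`. The names `sniffFormat` and `asciiAt` are hypothetical and are not QueryMode APIs; this commit does not show QueryMode's actual detection code.

```typescript
type SniffedFormat = "parquet" | "lance" | "unknown";

// Hypothetical helper: decode `length` bytes starting at `start` as ASCII.
function asciiAt(bytes: Uint8Array, start: number, length: number): string {
  let out = "";
  for (let i = start; i < start + length; i++) {
    out += String.fromCharCode(bytes[i]);
  }
  return out;
}

// Hypothetical sniffer: inspects the first and last four bytes of a whole file.
function sniffFormat(bytes: Uint8Array): SniffedFormat {
  if (bytes.length < 4) return "unknown";
  const head = asciiAt(bytes, 0, 4);
  const tail = asciiAt(bytes, bytes.length - 4, 4);
  // Parquet: "PAR1" at both the start and the end of the file.
  if (bytes.length >= 8 && head === "PAR1" && tail === "PAR1") return "parquet";
  // Lance (assumption): the footer ends with "LANC".
  if (tail === "LANC") return "lance";
  return "unknown";
}
```

For remote objects, a reader would typically fetch only the leading and trailing byte ranges rather than the whole file before running a check like this.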

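The added formats.mdx paragraph says JSON input is treated as an array or as NDJSON depending on the first non-whitespace byte. A minimal sketch of that rule follows, assuming `[` means a JSON array and `{` means one object per line. `detectJsonLayout` and `parseJsonRows` are hypothetical names for illustration, not QueryMode functions.

```typescript
type JsonLayout = "array" | "ndjson";

// Hypothetical: classify the input by its first non-whitespace character.
function detectJsonLayout(text: string): JsonLayout {
  const match = /\S/.exec(text);
  const first = match ? match[0] : "";
  if (first === "[") return "array";
  if (first === "{") return "ndjson";
  throw new Error("Input is neither a JSON array nor NDJSON");
}

// Hypothetical: materialize all rows in memory, as the docs describe.
function parseJsonRows(text: string): unknown[] {
  if (detectJsonLayout(text) === "array") {
    return JSON.parse(text) as unknown[];
  }
  // NDJSON: one JSON value per non-empty line.
  return text
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => JSON.parse(line) as unknown);
}
```

One consequence of this rule is that a single pretty-printed object spanning several lines starts with `{` and would be misread as NDJSON, so a real reader may need extra handling for that case.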
docs/src/content/docs/vector-search.mdx

Lines changed: 7 additions & 6 deletions
@@ -1,9 +1,9 @@
 ---
 title: Vector Search
-description: Similarity search with SIMD acceleration, IVF-PQ indexes, and text-to-vector encoding.
+description: Similarity search with SIMD acceleration, HNSW and IVF-PQ indexes, and text-to-vector encoding.
 ---
 
-QueryMode supports vector similarity search on embedding columns stored in Lance format. Searches use WASM SIMD for acceleration and IVF-PQ indexes when available.
+QueryMode supports vector similarity search on embedding columns stored in Lance format. Searches use WASM SIMD for acceleration and HNSW or IVF-PQ indexes when available.
 
 ## DataFrame API
 
@@ -60,14 +60,14 @@ The `NEAR` operator performs vector similarity search. `TOPK` limits results to
 
 Without an index, QueryMode performs brute-force SIMD-accelerated distance computation across all vectors. Fast for datasets under ~100K vectors.
 
-### IVF-PQ
+### IVF-PQ (load pre-built)
 
-For larger datasets, create an IVF-PQ (Inverted File with Product Quantization) index:
+IVF-PQ (Inverted File with Product Quantization) indexes can be loaded from R2 for search:
 
 - **IVF** partitions vectors into clusters. At query time, only `nprobe` clusters are searched.
 - **PQ** compresses vectors into compact codes, reducing memory and I/O.
 
-IVF-PQ indexes are stored alongside data in R2 and loaded on first query.
+IVF-PQ indexes must be built externally (e.g. with LanceDB or FAISS) and stored in R2 alongside the data. QueryMode loads and searches them via the WASM engine. For indexes you can build directly in QueryMode, use HNSW.
 
 ### HNSW

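The IVF bullet in the hunk above says only `nprobe` clusters are searched at query time. The sketch below illustrates that idea under simplifying assumptions: the index is a plain in-memory object, distances are squared L2, and full vectors are kept instead of PQ codes. `IvfIndex`, `squaredL2`, and `ivfSearch` are hypothetical names and do not reflect QueryMode's WASM implementation.

```typescript
// Hypothetical in-memory IVF layout: one centroid and one member list per cluster.
interface IvfIndex {
  centroids: Float32Array[];
  clusters: { id: number; vector: Float32Array }[][];
}

function squaredL2(a: Float32Array, b: Float32Array): number {
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    const d = a[i] - b[i];
    sum += d * d;
  }
  return sum;
}

function ivfSearch(
  index: IvfIndex,
  query: Float32Array,
  k: number,
  nprobe: number
): { id: number; distance: number }[] {
  // Rank clusters by centroid distance and keep the nprobe closest.
  const probed = index.centroids
    .map((centroid, cluster) => ({ cluster, d: squaredL2(centroid, query) }))
    .sort((x, y) => x.d - y.d)
    .slice(0, nprobe);

  // Brute-force scan inside the probed clusters only.
  const hits: { id: number; distance: number }[] = [];
  for (const { cluster } of probed) {
    for (const member of index.clusters[cluster]) {
      hits.push({ id: member.id, distance: squaredL2(member.vector, query) });
    }
  }
  return hits.sort((x, y) => x.distance - y.distance).slice(0, k);
}
```

Raising `nprobe` scans more clusters, which tends to improve recall at the cost of more distance computations; setting it to the number of clusters degenerates to a full brute-force scan.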
@@ -128,7 +128,8 @@ const results = restored.search(queryVec, 10)
 | **Speed** | Fast (quantized distances) | Fast (graph traversal) |
 | **Memory** | Low (compressed codes) | High (full vectors + graph) |
 | **Recall** | Good with enough probes | Excellent |
-| **Build time** | Requires training (k-means) | Incremental (add one at a time) |
+| **Build time** | External (k-means training) | Incremental (add one at a time) |
+| **Build in QueryMode** | No (load pre-built) | Yes (`HnswIndex`) |
 | **Best for** | Large datasets (>1M vectors) | Medium datasets (<1M vectors) |
 
 ## Combining with filters