docs: add NDJSON support, Lance v1/v2 clarification, IVF-PQ read-only

teamchong · teamchong · commit ac7fb6f53526 · 2026-03-18T00:29:04.000-04:00
- formats.mdx: mention NDJSON/JSONL support, file-based CSV/JSON reading,
  Lance v1+v2 both supported (was only "v2")
- vector-search.mdx: clarify IVF-PQ is load-only (build externally),
  HNSW is the buildable index type, add comparison row
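
Note on the formats.mdx change: the detection rules it documents (JSON array vs NDJSON chosen by the first non-whitespace byte, CSV delimiter chosen from comma, tab, or pipe) can be sketched as below. This is a minimal illustration under stated assumptions, not QueryMode's implementation: `detectJsonLayout`, `parseJsonRows`, and `detectDelimiter` are hypothetical names, and the count-based delimiter heuristic is an assumption about how such detection might work.

```typescript
// Hypothetical helpers for illustration only; none of these names are QueryMode APIs.
type JsonLayout = "array" | "ndjson" | "unknown";

// Classify JSON text by its first non-whitespace character:
// "[" suggests a JSON array, "{" suggests one object per line (NDJSON).
function detectJsonLayout(text: string): JsonLayout {
  const match = text.match(/\S/);
  const first = match ? match[0] : "";
  if (first === "[") return "array";
  if (first === "{") return "ndjson";
  return "unknown";
}

// Parse either layout into an array of row values.
function parseJsonRows(text: string): unknown[] {
  switch (detectJsonLayout(text)) {
    case "array":
      return JSON.parse(text) as unknown[];
    case "ndjson":
      return text
        .split("\n")
        .map((line) => line.trim())
        .filter((line) => line.length > 0) // skip blank lines
        .map((line) => JSON.parse(line) as unknown);
    default:
      throw new Error("input is neither a JSON array nor NDJSON");
  }
}

const DELIMITERS = [",", "\t", "|"] as const;
type Delimiter = (typeof DELIMITERS)[number];

// Assumed heuristic: pick the candidate that occurs most often in the header line,
// falling back to a comma when none of the candidates appear.
function detectDelimiter(headerLine: string): Delimiter {
  let best: Delimiter = ",";
  let bestCount = 0;
  for (const candidate of DELIMITERS) {
    const count = headerLine.split(candidate).length - 1;
    if (count > bestCount) {
      best = candidate;
      bestCount = count;
    }
  }
  return best;
}
```

With this sketch, `parseJsonRows('{"a":1}\n{"a":2}')` yields two rows and `detectDelimiter("a\tb\tc")` picks the tab. A single pretty-printed object spanning several lines would be misclassified as NDJSON here, so a real reader would need extra handling for that case.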
diff --git a/docs/src/content/docs/formats.mdx b/docs/src/content/docs/formats.mdx
@@ -1,6 +1,6 @@
 ---
 title: Formats
-description: Parquet, Lance v2, Iceberg, CSV, JSON, Arrow — same API regardless of format.
+description: Parquet, Lance, Iceberg, CSV, JSON/NDJSON, Arrow — same API regardless of format.
 ---
 
 QueryMode reads columnar and row-oriented formats with the same API. Format detection is automatic based on file magic bytes.
@@ -10,10 +10,10 @@ QueryMode reads columnar and row-oriented formats with the same API. Format dete
 | Format | Type | Page skip | Compression | Status |
 |--------|------|-----------|-------------|--------|
 | **Parquet** | Columnar | Min/max stats | Snappy, ZSTD, GZIP, LZ4 | Full support |
-| **Lance v2** | Columnar | Min/max stats | None (raw pages) | Full support |
+| **Lance v1/v2** | Columnar | Min/max stats | None (raw pages) | Full support |
 | **Iceberg** | Table format | Via Parquet | Via Parquet | Metadata + Parquet data |
 | **CSV** | Row | No | No | Via `fromCSV()` |
-| **JSON** | Row | No | No | Via `fromJSON()` |
+| **JSON / NDJSON** | Row | No | No | Via `fromJSON()` or file path |
 | **Arrow** | Columnar | No | No | In-memory only |
 
 ## Parquet
@@ -45,9 +45,9 @@ const result = await qm
 
 Snappy decompression is pure TypeScript. ZSTD, GZIP, and LZ4 use the WASM engine.
 
-## Lance v2
+## Lance v1 / v2
 
-Native Lance v2 format reader. Parses the 40-byte footer, column metadata protobuf, and page data.
+Native Lance format reader supporting both v1 and v2 layouts. Parses the 40-byte footer, column metadata protobuf, and page data.
 
 ```typescript
 const result = await qm
@@ -75,7 +75,7 @@ const result = await qm
 
 Supports Iceberg v1 and v2 metadata, type mapping from Iceberg schema to QueryMode types.
 
-## CSV and JSON
+## CSV, JSON, and NDJSON
 
 In-memory materialization for small datasets:
 
@@ -85,8 +85,15 @@ const df = await QueryMode.fromCSV(csvString, "my_table")
 
 // From JSON array
 const df = QueryMode.fromJSON(jsonArray, "my_table")
+
+// From file: format auto-detected by extension and content
+const jsonRows = await qm.table("./data/events.json").collect()     // JSON array
+const ndjsonRows = await qm.table("./data/events.ndjson").collect() // newline-delimited JSON
+const csvRows = await qm.table("./data/events.csv").collect()       // CSV/TSV/PSV (delimiter auto-detected)
 ```
 
+CSV reading auto-detects the delimiter (comma, tab, or pipe) and infers column types from the data. JSON reading accepts both `[{...}, {...}]` arrays and NDJSON (`{...}\n{...}`), choosing between them by the first non-whitespace byte.
+
 These materialize all data in memory. Use Parquet or Lance for large datasets.
 
 ## Format detection
diff --git a/docs/src/content/docs/vector-search.mdx b/docs/src/content/docs/vector-search.mdx
@@ -1,9 +1,9 @@
 ---
 title: Vector Search
-description: Similarity search with SIMD acceleration, IVF-PQ indexes, and text-to-vector encoding.
+description: Similarity search with SIMD acceleration, HNSW and IVF-PQ indexes, and text-to-vector encoding.
 ---
 
-QueryMode supports vector similarity search on embedding columns stored in Lance format. Searches use WASM SIMD for acceleration and IVF-PQ indexes when available.
+QueryMode supports vector similarity search on embedding columns stored in Lance format. Searches use WASM SIMD for acceleration and HNSW or IVF-PQ indexes when available.
 
 ## DataFrame API
 
@@ -60,14 +60,14 @@ The `NEAR` operator performs vector similarity search. `TOPK` limits results to
 
 Without an index, QueryMode performs brute-force SIMD-accelerated distance computation across all vectors. Fast for datasets under ~100K vectors.
 
-### IVF-PQ
+### IVF-PQ (load pre-built)
 
-For larger datasets, create an IVF-PQ (Inverted File with Product Quantization) index:
+QueryMode can load pre-built IVF-PQ (Inverted File with Product Quantization) indexes from R2 and search them:
 
 - **IVF** partitions vectors into clusters. At query time, only `nprobe` clusters are searched.
 - **PQ** compresses vectors into compact codes, reducing memory and I/O.
 
-IVF-PQ indexes are stored alongside data in R2 and loaded on first query.
+IVF-PQ indexes must be built externally (e.g. with LanceDB or FAISS) and stored in R2 alongside the data. QueryMode loads and searches them via the WASM engine. For indexes you can build directly in QueryMode, use HNSW.
 
 ### HNSW
 
@@ -128,7 +128,8 @@ const results = restored.search(queryVec, 10)
 | **Speed** | Fast (quantized distances) | Fast (graph traversal) |
 | **Memory** | Low (compressed codes) | High (full vectors + graph) |
 | **Recall** | Good with enough probes | Excellent |
-| **Build time** | Requires training (k-means) | Incremental (add one at a time) |
+| **Build time** | External (k-means training) | Incremental (add one at a time) |
+| **Build in QueryMode** | No (load pre-built) | Yes (`HnswIndex`) |
 | **Best for** | Large datasets (&gt;1M vectors) | Medium datasets (&lt;1M vectors) |
 
 ## Combining with filters