## Problem
On CPU-only machines (4 vCPU, 8GB RAM, shared Intel Xeon, no GPU),
`qmd query` is unusable because the reranking step takes 120s+ for 20 chunks.

The 1.0.8 query document format solves the expansion problem: `vec:` skips the 1.7B
expansion LLM, and embedding with the 300M model takes only 2-4s. But `structuredSearch()`
always reranks, and the 0.6B reranker running 20+ inference passes on
shared vCPUs takes >120s, so `qmd query` times out every time.

`vsearch` works (~6s) because it skips reranking, but it still forces
query expansion through the 1.7B model and doesn't support the new
query document syntax.
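For scale, here is the back-of-envelope arithmetic behind the >120s figure. The per-pass time is inferred from the totals in the benchmarks below, not measured directly:

```js
// Rough cost model for the rerank step on this machine.
// Assumption: one 0.6B cross-encoder pass per chunk, ~6s per pass
// (inferred from >120s total / 20 chunks), vs ~2s for the 300M embedding.
const chunks = 20;
const secondsPerRerankPass = 6; // estimate, not a per-pass measurement
const embedSeconds = 2;

console.log(`embed ~${embedSeconds}s, rerank ~${chunks * secondsPerRerankPass}s`);
// => embed ~2s, rerank ~120s: reranking dominates by roughly 60x on CPU-only hardware
```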
## Benchmarks (4 vCPU, 8GB, gpu:false patch per #194)

| Command | Time | Notes |
|---|---|---|
| `qmd search "query"` | 1.7s | BM25 only, fast |
| `qmd vsearch "query"` | 6s | Expand + embed, no rerank |
| `qmd query "vec: query"` | >120s | Embed 2s + rerank 20 chunks, times out |
| `qmd query "lex: a\nvec: b"` | >60s | Embed 4s + rerank 38 chunks, times out |
## Proposal
Add a `--no-rerank` flag to `qmd query` that returns the RRF-fused results directly,
skipping the chunked reranking step (lines 2602-2606 in `store.js`).
This would give CPU-only users the best of both worlds:
- Query document syntax (lex/vec/hyde combinations)
- RRF fusion across multiple result sets
- No 120s+ reranking penalty
Expected time with `--no-rerank`: ~4-6s (lex instant + embed 2-4s + RRF instant).
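A minimal sketch of what the flag would return. This is illustrative only: `rrfFuse`, the result shape, and the option name are my assumptions, not the actual `store.js` internals.

```js
// Sketch only: standard Reciprocal Rank Fusion, score(doc) = sum over lists of 1 / (k + rank).
// Hypothetical names; not the real qmd implementation.
function rrfFuse(resultLists, k = 60) {
  const byId = new Map();
  for (const list of resultLists) {
    list.forEach((hit, i) => {
      const entry = byId.get(hit.id) ?? { ...hit, score: 0 };
      entry.score += 1 / (k + i + 1); // 1-based rank, conventional k = 60
      byId.set(hit.id, entry);
    });
  }
  return [...byId.values()].sort((a, b) => b.score - a.score);
}

// With --no-rerank, structuredSearch() could return the fused list as-is
// instead of handing it to the 0.6B reranker:
//   if (options.noRerank) return rrfFuse([lexHits, vecHits]).slice(0, limit);
```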
Alternatively, making `vsearch` accept the query document syntax (so `vec:` queries
skip expansion) would also solve the problem.
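Either way, the CPU-only workflow would look something like this (the flag and the extended `vsearch` syntax are the proposal, not existing behavior; the query strings are just examples):

```sh
# Proposed: query document syntax with RRF fusion but no rerank (~4-6s expected)
qmd query --no-rerank "lex: payment retry\nvec: how failed payments are retried"

# Alternative: vsearch accepting the 1.0.8 query document syntax directly
qmd vsearch "vec: how failed payments are retried"
```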
## Environment
- qmd 1.0.8 (commit 6ac7c68), `dist/llm.js` patched to `gpu: false` per #194
- Linux x86_64, 4 shared vCPU, 8GB RAM, no GPU
- node-llama-cpp CPU-only (Vulkan prebuilt incompatible, no CUDA)
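For reference, the #194 workaround as applied here is roughly the following; the exact call site in `dist/llm.js` may differ:

```js
// Approximate shape of the gpu:false patch (per #194); not the exact qmd code.
import { getLlama } from "node-llama-cpp";

// Disable GPU support so node-llama-cpp skips Vulkan/CUDA detection on CPU-only hosts.
const llama = await getLlama({ gpu: false });
```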
## Related
- #194 Vulkan probe overhead (workaround applied)
- #170 maxThreads for resource-constrained systems
- #229 / #114 Cloud model providers (alternative approach)