
feat: add RemoteLLM for GPU-accelerated embedding #135

Open

NathanSkene wants to merge 2 commits into tobi:main from NathanSkene:feat/remote-llm-clean

Conversation

@NathanSkene

Summary

  • Adds a RemoteLLM class that calls a remote llama-server via HTTP for GPU-accelerated embedding
  • Enabled by setting the QMD_REMOTE_URL=http://host:port environment variable
  • Reduces embedding time from hours to minutes for large document collections (~12k files in ~20 min on an A10 GPU)

How it works

When QMD_REMOTE_URL is set, getDefaultLlamaCpp() returns a RemoteLLM instance instead of a local LlamaCpp. Only embed/embedBatch calls go to the remote server — tokenization for chunking uses a local ~4 chars/token approximation to avoid thousands of serial HTTP round-trips.
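A rough sketch of these two pieces (illustrative only, not the exact code in src/llm.ts; the /embedding request and response shapes depend on the llama-server version):

```typescript
// Illustrative sketch: method names come from the PR text; types, payload
// shapes, and error handling are assumptions, not the merged implementation.

// Chunking uses a cheap local approximation instead of remote tokenize calls.
const CHARS_PER_TOKEN = 4;

export function approxTokenCount(text: string): number {
  return Math.ceil(text.length / CHARS_PER_TOKEN);
}

export class RemoteLLM {
  constructor(private baseUrl: string) {}

  // Only embedding goes over the network, to llama-server's /embedding route.
  async embed(text: string): Promise<number[]> {
    const res = await fetch(`${this.baseUrl}/embedding`, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ content: text }),
    });
    if (!res.ok) throw new Error(`embedding request failed: HTTP ${res.status}`);
    const data = await res.json();
    // The response shape differs between llama.cpp versions; handle both.
    return Array.isArray(data) ? data[0].embedding : data.embedding;
  }

  // Simplification: the real implementation sends batched requests rather
  // than one call per text.
  async embedBatch(texts: string[]): Promise<number[][]> {
    return Promise.all(texts.map((t) => this.embed(t)));
  }
}

// Factory: prefer the remote backend whenever QMD_REMOTE_URL is set.
// (LlamaCpp is the existing local backend class in src/llm.ts.)
export function getDefaultLlamaCpp(): RemoteLLM | LlamaCpp {
  const url = process.env.QMD_REMOTE_URL;
  return url ? new RemoteLLM(url) : new LlamaCpp();
}
```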

Usage

# Start llama-server on GPU machine (Lambda Labs, Vast.ai, etc.)
llama-server -m embeddinggemma-300M-Q8_0.gguf --embedding --port 8080 -c 8192 -b 2048 -ub 2048

# SSH tunnel from local machine
ssh -L 8080:localhost:8080 ubuntu@gpu-host -N &

# Run QMD embed with remote backend
QMD_REMOTE_URL=http://localhost:8080 qmd embed

Design decisions

| Decision | Rationale |
| --- | --- |
| Local tokenization (~4 chars/token) | Remote tokenize/detokenize made thousands of serial HTTP calls during chunking — reduced from ~40 min to near-instant |
| fetchWithRetry with exponential backoff | Long-running jobs over SSH tunnels need resilience to transient network failures |
| Stubbed generate/rerank/expandQuery | Only embedding is needed for the qmd embed workflow |
| Single env var activation | Zero config changes needed — just set QMD_REMOTE_URL |
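
For reference, fetchWithRetry follows the usual exponential-backoff pattern; the sketch below assumes a retry count and base delay that may differ from the actual diff:

```typescript
// Sketch of a retry helper with exponential backoff; signature and defaults
// are assumptions, not the code in this PR.
async function fetchWithRetry(
  url: string,
  init: RequestInit,
  maxRetries = 5,
  baseDelayMs = 500,
): Promise<Response> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      const res = await fetch(url, init);
      // Treat 5xx responses like transient network errors and retry them.
      if (res.status >= 500) throw new Error(`server error: HTTP ${res.status}`);
      return res;
    } catch (err) {
      lastError = err;
      if (attempt === maxRetries) break;
      const delay = baseDelayMs * 2 ** attempt; // 500ms, 1s, 2s, 4s, ...
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
  throw lastError;
}
```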

Testing

Tested on 12,599 files / 21,970 chunks with a 97.5% success rate. The ~2.5% of failures are oversized chunks exceeding the server's 8192-token context window, which is expected and tolerable.

Changes

  • src/llm.ts: +153 lines (RemoteLLM class + factory function update)

🤖 Generated with Claude Code

NathanSkene and others added 2 commits February 8, 2026 12:45
Adds a RemoteLLM class that calls a remote llama-server via HTTP,
enabled by setting the QMD_REMOTE_URL environment variable. This
allows running `qmd embed` against a GPU server (e.g. Lambda Labs,
Vast.ai) instead of the local CPU, reducing embedding time from
hours to minutes for large document collections.

Key design decisions:

- Local tokenization: tokenize/detokenize use a ~4 chars/token
  character-based approximation instead of calling the remote server.
  This avoids thousands of serial HTTP round-trips during document
  chunking, reducing chunking time from ~40 min to near-instant.

- Retry with backoff: fetchWithRetry helper retries transient
  failures (network drops, timeouts) with exponential backoff,
  important for long-running embedding jobs over SSH tunnels.

- Minimal surface area: only embed/embedBatch go to the remote
  server. generate, rerank, expandQuery are stubbed since only
  embedding is needed for `qmd embed`.

Tested on 12,599 files (21,970 chunks) with 97.5% success rate.
The ~2.5% failures are oversized chunks exceeding the server's
context window — expected and tolerable.

Usage:
  ssh -L 8080:localhost:8080 ubuntu@gpu-host -N &
  QMD_REMOTE_URL=http://localhost:8080 qmd embed
…remote embedding

Three fixes for remote embedding reliability:

1. Sanitize lone UTF-16 surrogates (\uD800-\uDFFF) before JSON.stringify in
   RemoteLLM embed/embedBatch — ChatGPT export files contain invalid surrogates
   that break llama-server's JSON parser, causing entire 32-chunk batches to fail.

2. Use conservative 2 chars/token ratio (was 4) for remote tokenization
   approximation — code-heavy and Unicode-rich content tokenizes at ~2 chars/token,
   causing chunks to exceed the model's 2048-token context window.

3. Fix vec0 virtual table INSERT — sqlite-vec's vec0 doesn't support INSERT OR
   REPLACE, so use DELETE + INSERT pattern in insertEmbedding().

Also: add batch→individual fallback in embedBatch() to isolate failures, and
clamp progress bar percent to prevent RangeError on String.prototype.repeat.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
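
The surrogate sanitization and the vec0 DELETE + INSERT pattern described in this commit would look roughly like the sketch below (table/column names and the DB binding are assumptions, not the code in this PR):

```typescript
// Sketch of the two fixes; the real insertEmbedding() and its schema may differ.

// 1. Drop lone UTF-16 surrogates so JSON.stringify output stays valid for
//    llama-server's JSON parser. Properly paired surrogates are left intact.
function sanitizeSurrogates(text: string): string {
  return text.replace(
    /[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]/g,
    "",
  );
}

// 2. sqlite-vec's vec0 virtual table rejects INSERT OR REPLACE, so replacing a
//    row is an explicit DELETE followed by a plain INSERT.
type Db = { query(sql: string): { run(...params: unknown[]): void } };

function insertEmbedding(db: Db, rowid: number, embedding: Float32Array): void {
  // Bind the vector as a raw float32 blob (one common sqlite-vec convention).
  const blob = new Uint8Array(embedding.buffer, embedding.byteOffset, embedding.byteLength);
  db.query("DELETE FROM embeddings WHERE rowid = ?").run(rowid);
  db.query("INSERT INTO embeddings (rowid, embedding) VALUES (?, ?)").run(rowid, blob);
}
```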
@brettdavies

Hey @NathanSkene -- nice work on the remote embedding support! I just opened #153 which adds a file size limit to qmd embed (skips files over 5MB before tokenization). It touches src/qmd.ts only, so there should be zero conflict with your src/llm.ts changes.

One thing that might be relevant: your PR mentions ~2.5% of chunks fail due to exceeding the 8192-token context window. The size limit in #153 would reduce that by filtering out the largest files upfront. Could be complementary if both land.

I'm also planning a small follow-up PR to add gpuLayers: "max" to loadModel() calls for local GPU acceleration. Wanted to flag it early since it also touches src/llm.ts -- happy to rebase on top of yours if it merges first.
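
For context, assuming the local backend wraps node-llama-cpp v3 (an assumption, not confirmed in this PR), that follow-up would look roughly like:

```typescript
// Sketch only: assumes node-llama-cpp v3's getLlama()/loadModel() API.
import { getLlama } from "node-llama-cpp";

const llama = await getLlama();
const model = await llama.loadModel({
  modelPath: "embeddinggemma-300M-Q8_0.gguf",
  gpuLayers: "max", // offload all layers to the GPU when one is available
});
```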
