
feat: add RemoteLLM for GPU-accelerated embedding #135

Open

NathanSkene wants to merge 2 commits into tobi:main from NathanSkene:feat/remote-llm-clean

Conversation

@NathanSkene

Summary

  • Adds a RemoteLLM class that calls a remote llama-server via HTTP for GPU-accelerated embedding
  • Enabled by setting the QMD_REMOTE_URL=http://host:port environment variable
  • Reduces embedding time from hours to minutes for large document collections (~12k files in ~20 min on an A10 GPU)

How it works

When QMD_REMOTE_URL is set, getDefaultLlamaCpp() returns a RemoteLLM instance instead of a local LlamaCpp. Only embed/embedBatch calls go to the remote server — tokenization for chunking uses a local ~4 chars/token approximation to avoid thousands of serial HTTP round-trips.
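A rough sketch of these two pieces (illustrative only, not the exact code in src/llm.ts; the /embedding request and response shapes depend on the llama-server version):

```typescript
// Illustrative sketch: method names come from the PR text; types, payload
// shapes, and error handling are assumptions, not the merged implementation.

// Chunking uses a cheap local approximation instead of remote tokenize calls.
const CHARS_PER_TOKEN = 4;

export function approxTokenCount(text: string): number {
  return Math.ceil(text.length / CHARS_PER_TOKEN);
}

export class RemoteLLM {
  constructor(private baseUrl: string) {}

  // Only embedding goes over the network, to llama-server's /embedding route.
  async embed(text: string): Promise<number[]> {
    const res = await fetch(`${this.baseUrl}/embedding`, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ content: text }),
    });
    if (!res.ok) throw new Error(`embedding request failed: HTTP ${res.status}`);
    const data = await res.json();
    // The response shape differs between llama.cpp versions; handle both.
    return Array.isArray(data) ? data[0].embedding : data.embedding;
  }

  // Simplification: the real implementation sends batched requests rather
  // than one call per text.
  async embedBatch(texts: string[]): Promise<number[][]> {
    return Promise.all(texts.map((t) => this.embed(t)));
  }
}

// Factory: prefer the remote backend whenever QMD_REMOTE_URL is set.
// (LlamaCpp is the existing local backend class in src/llm.ts.)
export function getDefaultLlamaCpp(): RemoteLLM | LlamaCpp {
  const url = process.env.QMD_REMOTE_URL;
  return url ? new RemoteLLM(url) : new LlamaCpp();
}
```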

Usage

# Start llama-server on GPU machine (Lambda Labs, Vast.ai, etc.)
llama-server -m embeddinggemma-300M-Q8_0.gguf --embedding --port 8080 -c 8192 -b 2048 -ub 2048

# SSH tunnel from local machine
ssh -L 8080:localhost:8080 ubuntu@gpu-host -N &

# Run QMD embed with remote backend
QMD_REMOTE_URL=http://localhost:8080 qmd embed

Design decisions

| Decision | Rationale |
| --- | --- |
| Local tokenization (~4 chars/token) | Remote tokenize/detokenize made thousands of serial HTTP calls during chunking — reduced from ~40 min to near-instant |
| fetchWithRetry with exponential backoff | Long-running jobs over SSH tunnels need resilience to transient network failures |
| Stubbed generate/rerank/expandQuery | Only embedding is needed for the qmd embed workflow |
| Single env var activation | Zero config changes needed — just set QMD_REMOTE_URL |
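
For reference, fetchWithRetry follows the usual exponential-backoff pattern; the sketch below assumes a retry count and base delay that may differ from the actual diff:

```typescript
// Sketch of a retry helper with exponential backoff; signature and defaults
// are assumptions, not the code in this PR.
async function fetchWithRetry(
  url: string,
  init: RequestInit,
  maxRetries = 5,
  baseDelayMs = 500,
): Promise<Response> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      const res = await fetch(url, init);
      // Treat 5xx responses like transient network errors and retry them.
      if (res.status >= 500) throw new Error(`server error: HTTP ${res.status}`);
      return res;
    } catch (err) {
      lastError = err;
      if (attempt === maxRetries) break;
      const delay = baseDelayMs * 2 ** attempt; // 500ms, 1s, 2s, 4s, ...
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
  throw lastError;
}
```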

Testing

Tested on 12,599 files / 21,970 chunks with a 97.5% success rate. The ~2.5% of failures are oversized chunks exceeding the server's 8192-token context window, which is expected and tolerable.

Changes

  • src/llm.ts: +153 lines (RemoteLLM class + factory function update)

🤖 Generated with Claude Code

NathanSkene and others added 2 commits February 8, 2026 12:45
Adds a RemoteLLM class that calls a remote llama-server via HTTP,
enabled by setting the QMD_REMOTE_URL environment variable. This
allows running `qmd embed` against a GPU server (e.g. Lambda Labs,
Vast.ai) instead of the local CPU, reducing embedding time from
hours to minutes for large document collections.

Key design decisions:

- Local tokenization: tokenize/detokenize use a ~4 chars/token
  character-based approximation instead of calling the remote server.
  This avoids thousands of serial HTTP round-trips during document
  chunking, reducing chunking time from ~40 min to near-instant.

- Retry with backoff: fetchWithRetry helper retries transient
  failures (network drops, timeouts) with exponential backoff,
  important for long-running embedding jobs over SSH tunnels.

- Minimal surface area: only embed/embedBatch go to the remote
  server. generate, rerank, expandQuery are stubbed since only
  embedding is needed for `qmd embed`.

Tested on 12,599 files (21,970 chunks) with 97.5% success rate.
The ~2.5% failures are oversized chunks exceeding the server's
context window — expected and tolerable.

Usage:
  ssh -L 8080:localhost:8080 ubuntu@gpu-host -N &
  QMD_REMOTE_URL=http://localhost:8080 qmd embed
…remote embedding

Three fixes for remote embedding reliability:

1. Sanitize lone UTF-16 surrogates (\uD800-\uDFFF) before JSON.stringify in
   RemoteLLM embed/embedBatch — ChatGPT export files contain invalid surrogates
   that break llama-server's JSON parser, causing entire 32-chunk batches to fail.

2. Use conservative 2 chars/token ratio (was 4) for remote tokenization
   approximation — code-heavy and Unicode-rich content tokenizes at ~2 chars/token,
   causing chunks to exceed the model's 2048-token context window.

3. Fix vec0 virtual table INSERT — sqlite-vec's vec0 doesn't support INSERT OR
   REPLACE, so use DELETE + INSERT pattern in insertEmbedding().

Also: add batch→individual fallback in embedBatch() to isolate failures, and
clamp progress bar percent to prevent RangeError on String.prototype.repeat.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
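
The surrogate sanitization and the vec0 DELETE + INSERT pattern described in this commit would look roughly like the sketch below (table/column names and the DB binding are assumptions, not the code in this PR):

```typescript
// Sketch of the two fixes; the real insertEmbedding() and its schema may differ.

// 1. Drop lone UTF-16 surrogates so JSON.stringify output stays valid for
//    llama-server's JSON parser. Properly paired surrogates are left intact.
function sanitizeSurrogates(text: string): string {
  return text.replace(
    /[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]/g,
    "",
  );
}

// 2. sqlite-vec's vec0 virtual table rejects INSERT OR REPLACE, so replacing a
//    row is an explicit DELETE followed by a plain INSERT.
type Db = { query(sql: string): { run(...params: unknown[]): void } };

function insertEmbedding(db: Db, rowid: number, embedding: Float32Array): void {
  // Bind the vector as a raw float32 blob (one common sqlite-vec convention).
  const blob = new Uint8Array(embedding.buffer, embedding.byteOffset, embedding.byteLength);
  db.query("DELETE FROM embeddings WHERE rowid = ?").run(rowid);
  db.query("INSERT INTO embeddings (rowid, embedding) VALUES (?, ?)").run(rowid, blob);
}
```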
@brettdavies

Hey @NathanSkene -- nice work on the remote embedding support! I just opened #153 which adds a file size limit to qmd embed (skips files over 5MB before tokenization). It touches src/qmd.ts only, so there should be zero conflict with your src/llm.ts changes.

One thing that might be relevant: your PR mentions ~2.5% of chunks fail due to exceeding the 8192-token context window. The size limit in #153 would reduce that by filtering out the largest files upfront. Could be complementary if both land.

I'm also planning a small follow-up PR to add gpuLayers: "max" to loadModel() calls for local GPU acceleration. Wanted to flag it early since it also touches src/llm.ts -- happy to rebase on top of yours if it merges first.
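
For context, assuming the local backend wraps node-llama-cpp v3 (an assumption, not confirmed in this PR), that follow-up would look roughly like:

```typescript
// Sketch only: assumes node-llama-cpp v3's getLlama()/loadModel() API.
import { getLlama } from "node-llama-cpp";

const llama = await getLlama();
const model = await llama.loadModel({
  modelPath: "embeddinggemma-300M-Q8_0.gguf",
  gpuLayers: "max", // offload all layers to the GPU when one is available
});
```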
