feat: add RemoteLLM for GPU-accelerated embedding #135

Open

NathanSkene wants to merge 2 commits into tobi:main
Conversation
Adds a `RemoteLLM` class that calls a remote llama-server via HTTP, enabled by setting the `QMD_REMOTE_URL` environment variable. This allows running `qmd embed` against a GPU server (e.g. Lambda Labs, Vast.ai) instead of the local CPU, reducing embedding time from hours to minutes for large document collections.

Key design decisions:

- Local tokenization: tokenize/detokenize use a ~4 chars/token character-based approximation instead of calling the remote server. This avoids thousands of serial HTTP round-trips during document chunking, reducing chunking time from ~40 min to near-instant.
- Retry with backoff: a `fetchWithRetry` helper retries transient failures (network drops, timeouts) with exponential backoff, which matters for long-running embedding jobs over SSH tunnels (see the sketch after this description).
- Minimal surface area: only `embed`/`embedBatch` go to the remote server. `generate`, `rerank`, and `expandQuery` are stubbed, since only embedding is needed for `qmd embed`.

Tested on 12,599 files (21,970 chunks) with a 97.5% success rate. The ~2.5% failures are oversized chunks exceeding the server's context window, which is expected and tolerable.

Usage:

```sh
ssh -L 8080:localhost:8080 ubuntu@gpu-host -N &
QMD_REMOTE_URL=http://localhost:8080 qmd embed
```
…remote embedding

Three fixes for remote embedding reliability:

1. Sanitize lone UTF-16 surrogates (`\uD800`-`\uDFFF`) before `JSON.stringify` in `RemoteLLM` `embed`/`embedBatch`. ChatGPT export files contain invalid surrogates that break llama-server's JSON parser, causing entire 32-chunk batches to fail (first sketch below).
2. Use a conservative 2 chars/token ratio (was 4) for the remote tokenization approximation. Code-heavy and Unicode-rich content tokenizes at ~2 chars/token, causing chunks to exceed the model's 2048-token context window.
3. Fix the vec0 virtual table INSERT: sqlite-vec's vec0 doesn't support `INSERT OR REPLACE`, so use a DELETE + INSERT pattern in `insertEmbedding()` (second sketch below).

Also: add a batch→individual fallback in `embedBatch()` to isolate failures, and clamp the progress bar percent to prevent a RangeError in `String.prototype.repeat`.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Hey @NathanSkene -- nice work on the remote embedding support! I just opened #153, which adds a file size limit. One thing that might be relevant: your PR mentions ~2.5% of chunks fail due to exceeding the 8192-token context window. The size limit in #153 would reduce that by filtering out the largest files upfront. Could be complementary if both land. I'm also planning a small follow-up PR.
Summary
- Adds a `RemoteLLM` class that calls a remote llama-server via HTTP for GPU-accelerated embedding
- Enabled by setting the `QMD_REMOTE_URL=http://host:port` environment variable

How it works
When `QMD_REMOTE_URL` is set, `getDefaultLlamaCpp()` returns a `RemoteLLM` instance instead of a local `LlamaCpp`. Only `embed`/`embedBatch` calls go to the remote server; tokenization for chunking uses a local ~4 chars/token approximation to avoid thousands of serial HTTP round-trips.

Usage
Design decisions

- Local ~4 chars/token tokenization approximation keeps chunking fast by avoiding serial HTTP round-trips (sketched below)
- `fetchWithRetry` retries transient failures with exponential backoff
- Only `embed`/`embedBatch` are remoted, since that is all the `qmd embed` workflow needs
- Fully opt-in: behavior is unchanged unless `QMD_REMOTE_URL` is set
Testing

Tested on 12,599 files / 21,970 chunks with a 97.5% success rate. The ~2.5% failures are oversized chunks exceeding the server's 8192-token context window; expected and tolerable.
Changes
- `src/llm.ts`: +153 lines (`RemoteLLM` class + factory function update)

🤖 Generated with Claude Code