
feat(llm): qmd query speedup - add environment variable overrides for faster model selection #112

Open
dgilperez wants to merge 1 commit into tobi:main from Balneario-de-Cofrentes:feat/jina-reranker-tiny

Conversation

@dgilperez
Contributor

Problem

When using qmd query for latency-sensitive applications (like AI agent memory search with 5-second timeouts), the default reranker becomes a bottleneck. For instance, on my Mac Mini M4 16GB, the full qmd query process takes ~15s.

Solution

This PR makes it possible to select faster models at runtime, without code changes, by adding environment variables for model configuration:

  • QMD_EMBED_MODEL - override embedding model
  • QMD_GENERATE_MODEL - override query expansion model
  • QMD_RERANK_MODEL - override reranker model
  • QMD_MODEL_CACHE_DIR - override model cache directory

Priority: config object > environment variable > default
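
To make the priority order concrete, here is a minimal sketch of how such a lookup could be written. It is not the code from this PR: `ModelConfig`, `resolveModel`, `resolveRerankModel`, and the short default string are hypothetical names chosen for illustration, and treating an empty string as "unset" is an assumption.

```typescript
// Hypothetical sketch of "config object > environment variable > default".
// None of these names are taken from the qmd source.

type Env = Record<string, string | undefined>;

interface ModelConfig {
  rerankModel?: string;
}

// Placeholder value; the PR only states that the default stays qwen3-reranker-0.6b.
const DEFAULT_RERANK_MODEL = "qwen3-reranker-0.6b";

function resolveModel(
  configValue: string | undefined,
  envValue: string | undefined,
  fallback: string,
): string {
  // An explicit config value wins outright.
  if (configValue !== undefined && configValue !== "") return configValue;
  // Otherwise use the environment variable (assumption: empty string counts as unset).
  if (envValue !== undefined && envValue !== "") return envValue;
  // Nothing was provided, so fall back to the built-in default.
  return fallback;
}

function resolveRerankModel(config: ModelConfig, env: Env): string {
  return resolveModel(config.rerankModel, env["QMD_RERANK_MODEL"], DEFAULT_RERANK_MODEL);
}
```

Passing the environment in as a parameter (for example `resolveRerankModel({}, process.env)` under Node or Bun) keeps the function pure, which is one way to make the three cases easy to unit test.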

I managed to get great results with Jina Reranker v1-tiny - a 33M parameter distilled model optimized for speed, without changing defaults or rebuilding.

Example:
export QMD_RERANK_MODEL='hf:gpustack/jina-reranker-v1-tiny-en-GGUF/jina-reranker-v1-tiny-en-FP16.gguf'
qmd query "your search"

Benchmarks on Mac Mini M4 16GB:

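| Scenario            | qwen3 (default) | jina-tiny | Speedup |
|---------------------|-----------------|-----------|---------|
| Cold start (8 docs) | ~15,000ms       | ~80ms     | 185x    |
| Warm cache (8 docs) | 400-440ms       | 40-65ms   | 6-10x   |
| Full pipeline cold  | ~20s            | ~7s       | 3x      |
| Full pipeline warm  | ~15s            | ~5.7s     | 2.6x    |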

Quality: ~78% top-3 ranking agreement (same relevant docs, order differs slightly).
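
The PR does not say how the agreement figure was computed. One plausible reading is the average overlap between the two rerankers' top-3 result sets across a set of queries, ignoring order. The sketch below shows that interpretation only; `topKOverlap` and `meanTopKOverlap` are hypothetical helpers, and it assumes document IDs are unique within each ranking.

```typescript
// Hypothetical metric: share of model A's top-k documents that also appear
// in model B's top-k, regardless of position.
function topKOverlap(rankingA: string[], rankingB: string[], k: number): number {
  const topA = rankingA.slice(0, k);
  const topB = rankingB.slice(0, k);
  if (topA.length === 0) return 0;
  let shared = 0;
  for (const id of topA) {
    if (topB.indexOf(id) !== -1) shared++;
  }
  return shared / topA.length;
}

// Average the per-query overlap over all (rankingA, rankingB) pairs.
function meanTopKOverlap(pairs: Array<[string[], string[]]>, k: number): number {
  if (pairs.length === 0) return 0;
  let total = 0;
  for (const [a, b] of pairs) {
    total += topKOverlap(a, b, k);
  }
  return total / pairs.length;
}
```

Under this definition, two rankings that return the same three documents in a different order score 1.0, which fits the "same relevant docs, order differs slightly" remark. A stricter position-by-position comparison would give a lower number.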

Includes:

  • Unit tests for env var configuration
  • README documentation with model override examples

No breaking changes. Default remains qwen3-reranker-0.6b.

dgilperez added a commit to Balneario-de-Cofrentes/qmd that referenced this pull request Feb 4, 2026
@busbyjon

👍 for this PR. I'm struggling to run this model on a CPU-constrained system - and this would solve it!

neomatrixyzy pushed a commit to neomatrixyzy/qmd that referenced this pull request Feb 19, 2026
…arch

- Model paths: QMD_EMBED_MODEL, QMD_RERANK_MODEL, QMD_GENERATE_MODEL (PR tobi#112 pattern)
- Embed format: QMD_EMBED_FORMAT=qwen3 for Qwen3 Instruct format
- Embed context: QMD_EMBED_CONTEXT_SIZE, QMD_EMBED_BATCH_SIZE, QMD_EMBED_PER_CONTEXT_MB
- Rerank context: QMD_RERANK_CONTEXT_SIZE
- Session timeout: QMD_SESSION_MAX_DURATION_MS
- Model identifiers: QMD_EMBED_MODEL_ID, QMD_RERANK_MODEL_ID
- Strong signal: QMD_STRONG_SIGNAL_MIN_GAP
- CJK bigram splitting for FTS5 search (unicode61 tokenizer workaround)

All env vars are optional — without them, behavior is identical to upstream.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
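
For context on the CJK item in the commit message above: SQLite's unicode61 tokenizer treats a contiguous run of CJK characters as a single token, so searching for a word inside a longer run finds nothing. A common workaround is to split each run into overlapping two-character pieces before indexing. The sketch below illustrates that general idea and is not the code from the referenced commit. `isCjk` and `splitCjkBigrams` are hypothetical names, the character ranges are a deliberately incomplete assumption (BMP only), and whitespace is collapsed as a simplification.

```typescript
// Rough CJK check on a UTF-16 code unit (assumed, incomplete set of ranges).
function isCjk(code: number): boolean {
  return (
    (code >= 0x4e00 && code <= 0x9fff) || // CJK Unified Ideographs
    (code >= 0x3040 && code <= 0x30ff) || // Hiragana and Katakana
    (code >= 0xac00 && code <= 0xd7af) // Hangul syllables
  );
}

// Rewrites each CJK run as space-separated overlapping bigrams and leaves
// other text as-is, so a whitespace-based tokenizer sees each bigram as a token.
function splitCjkBigrams(text: string): string {
  const pieces: string[] = [];
  let i = 0;
  while (i < text.length) {
    const cjk = isCjk(text.charCodeAt(i));
    // Find the end of the current run of same-kind characters.
    let j = i;
    while (j < text.length && isCjk(text.charCodeAt(j)) === cjk) j++;
    const run = text.slice(i, j);
    if (!cjk || run.length === 1) {
      pieces.push(run);
    } else {
      const grams: string[] = [];
      for (let k = 0; k + 1 < run.length; k++) grams.push(run.slice(k, k + 2));
      pieces.push(grams.join(" "));
    }
    i = j;
  }
  // Join runs with spaces and collapse whitespace (this drops original line breaks).
  return pieces.join(" ").replace(/\s+/g, " ").trim();
}
```

For this to work, the same transformation has to be applied to the search query as to the indexed text, so that both sides produce matching bigram tokens.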
Add environment variables for runtime model configuration without code changes:
- QMD_EMBED_MODEL - override embedding model
- QMD_GENERATE_MODEL - override query expansion model
- QMD_RERANK_MODEL - override reranker model
- QMD_MODEL_CACHE_DIR - override model cache directory

Priority: config object > environment variable > default

Use case: Switch to faster reranker (jina-reranker-v1-tiny) for latency-critical
applications like API timeouts, without changing defaults or rebuilding.

Example:
  export QMD_RERANK_MODEL='hf:gpustack/jina-reranker-v1-tiny-en-GGUF/jina-reranker-v1-tiny-en-FP16.gguf'
  qmd query "your search"

Benchmarks on Mac Mini M4 16GB:

| Scenario            | qwen3 (default) | jina-tiny | Speedup |
|---------------------|-----------------|-----------|---------|
| Cold start (8 docs) | ~15,000ms       | ~80ms     | 185x    |
| Warm cache (8 docs) | 400-440ms       | 40-65ms   | 6-10x   |
| Full pipeline cold  | ~20s            | ~7s       | 3x      |
| Full pipeline warm  | ~15s            | ~5.7s     | 2.6x    |

Quality: ~78% top-3 ranking agreement (same relevant docs, order differs slightly).

Includes:
- Unit tests for env var configuration
- README documentation with model override examples

No breaking changes. Default remains qwen3-reranker-0.6b.
@dgilperez force-pushed the feat/jina-reranker-tiny branch from a5656c0 to 993faf6 on February 20, 2026 00:58