Description
Problem
Two problems with embedding large files:
1. Resource exhaustion
A single large file can dominate embed time and memory. With 800-token chunks and 15% overlap, a file produces roughly 1 chunk per 2,720 characters of content:
| File size | Approximate chunks | Impact |
|---|---|---|
| 1MB | ~370 chunks | Noticeable slowdown |
| 5MB | ~1,840 chunks | Minutes of embed time |
| 50MB | ~18,380 chunks | Longer to embed than thousands of normal documents combined |
Each chunk requires a forward pass through the embedding model. A 50MB log file takes longer to embed than thousands of normal documents combined.
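The chunk estimates can be reproduced with a few lines of arithmetic. This is a rough sketch, not qmd's chunker: it assumes about 4 characters per token (the ratio implied by 2,720 characters per chunk), about 1 byte per character, and decimal megabytes. `estimateChunks` is a hypothetical helper name.

```typescript
// Rough chunk-count estimate for a file of a given size.
// Assumption: ~4 characters per token. The issue does not state this figure;
// it is the ratio implied by "1 chunk per 2,720 characters".
function estimateChunks(
  sizeInChars: number,
  chunkTokens = 800,
  overlap = 0.15,
  charsPerToken = 4,
): number {
  // Each new chunk advances by the non-overlapping part of the window.
  const strideChars = chunkTokens * charsPerToken * (1 - overlap); // 2,720 with the defaults
  return Math.ceil(sizeInChars / strideChars);
}

const MB = 1_000_000; // decimal megabytes, assuming ~1 byte per character
for (const size of [1 * MB, 5 * MB, 50 * MB]) {
  console.log(`${size / MB}MB -> ~${estimateChunks(size)} chunks`);
}
```

Since every chunk costs one forward pass through the embedding model, embed time grows roughly linearly with these counts.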
2. Search quality degradation
Large files — minified JS, CSV exports, log files, database dumps — produce thousands of low-value chunks that pollute search results. When 80% of your vector index is chunks from a single 50MB log file, BM25 and vector search both suffer.
Prior work identifying this issue
The security audit in #69 identified this as H3: Resource exhaustion via large files and listed "Add file size limits" as a pre-v1.0 action item.
Issue #91 documents a related downstream symptom: large files (2MB+) cause reranker context size overflow during qmd query.
Current Behavior
- `qmd embed` processes every non-empty file in the index, regardless of size
- No warning, no skip, no configurable limit
- The only way to exclude large files is `--mask` at collection creation time, which filters by glob pattern (file extension), not by size
- `qmd status` shows no indication of how many files are oversized
Contrast with multi-get, which already has a --max-bytes flag (default 10KB) and skips oversized files with a clear message. The embed command has no equivalent.
Proposed Solution
Add a configurable file size limit to qmd embed with a sensible default. Three controls:
```bash
# Default: skip files > 5MB (no change needed for most users)
qmd embed

# Override via environment variable (in bytes)
QMD_MAX_EMBED_FILE_BYTES=10485760 qmd embed  # 10MB limit
QMD_MAX_EMBED_FILE_BYTES=1048576 qmd embed   # 1MB limit

# Bypass entirely
qmd embed --no-size-limit
```
Behavior
- During `qmd embed`: files exceeding the limit are skipped with a warning to stderr:
  ```
  Skipping data/logs/app.log (12.3MB exceeds 5.0MB limit)
  ```
  A summary is shown at the end:
  ```
  3 file(s) skipped (exceeded 5.0MB file size limit). Use --no-size-limit to include all files.
  ```
- During `qmd status`: the breakdown shows skipped files:
  ```
  Documents
    Total: 150 files indexed
    Vectors: 120 embedded
    Pending: 15 need embedding (run 'qmd embed')
    Skipped: 15 exceed 5.0MB size limit
  ```
- Default is safe: 5MB covers virtually all source code, documentation, and notes. Files above 5MB are typically logs, data dumps, minified bundles, or binary-ish content that doesn't embed well anyway.
- No behavior change for existing users unless they have files > 5MB indexed. Even then, `--no-size-limit` restores the old behavior.
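The skip behavior described above could look roughly like the sketch below. It is illustrative only and not the code from the PR: the `Doc` shape, `partitionBySize`, and `formatMB` are hypothetical names, and the default assumes "5MB" means 5 × 1024 × 1024 bytes (consistent with `10485760` being called 10MB).

```typescript
import { Buffer } from "node:buffer";

// Hypothetical document shape; the real qmd store types may differ.
interface Doc {
  path: string;
  body: string;
}

const DEFAULT_MAX_BYTES = 5 * 1024 * 1024; // assumption: "5MB" means 5 MiB

// Format a byte count in the style of the proposed warnings, e.g. "5.0MB".
function formatMB(bytes: number): string {
  return `${(bytes / (1024 * 1024)).toFixed(1)}MB`;
}

// Split documents into those to embed and those to skip.
// Passing maxBytes = null models --no-size-limit.
function partitionBySize(docs: Doc[], maxBytes: number | null = DEFAULT_MAX_BYTES) {
  const toEmbed: Doc[] = [];
  const warnings: string[] = [];
  for (const doc of docs) {
    const size = Buffer.byteLength(doc.body, "utf8");
    if (maxBytes !== null && size > maxBytes) {
      warnings.push(
        `Skipping ${doc.path} (${formatMB(size)} exceeds ${formatMB(maxBytes)} limit)`,
      );
    } else {
      toEmbed.push(doc);
    }
  }
  // The summary line is only produced when at least one file was skipped.
  const summary =
    maxBytes !== null && warnings.length > 0
      ? `${warnings.length} file(s) skipped (exceeded ${formatMB(maxBytes)} file size limit). ` +
        `Use --no-size-limit to include all files.`
      : null;
  return { toEmbed, warnings, summary };
}
```

A caller would print `warnings` and `summary` to stderr and pass only `toEmbed` on to the embedding step.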
Design Decisions
| Decision | Choice | Rationale |
|---|---|---|
| Default limit | 5MB | Covers ~99% of source/docs; files above are typically low-value for semantic search |
| Unit | Bytes (env var) | Consistent with how file sizes are reported in warnings; no ambiguity between MB/MiB |
| Override mechanism | Env var + CLI flag | Env var for persistent config, --no-size-limit for one-off runs |
| Scope | Embed only | multi-get already has its own --max-bytes; search results reference docs, not content |
| Size check method | Buffer.byteLength(body, 'utf8') | Counts UTF-8 bytes rather than string length, so multi-byte characters are measured accurately |
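A short illustration of why the size check uses `Buffer.byteLength` rather than `String.prototype.length`: the two disagree as soon as the text contains non-ASCII characters. The sample strings are arbitrary.

```typescript
import { Buffer } from "node:buffer";

// String length counts UTF-16 code units; byteLength counts encoded UTF-8 bytes.
const samples = ["hello", "h\u00e9llo", "\u65e5\u672c\u8a9e", "\u{1F600}"];
for (const s of samples) {
  console.log(`${s}: length=${s.length}, utf8 bytes=${Buffer.byteLength(s, "utf8")}`);
}
```

For files dominated by CJK text or emoji, a limit based on `length` would under-count the byte size by a factor of two to three.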
Implementation
The implementation follows the established multi-get skip pattern and the env var parsing pattern from other qmd configuration:
- `getMaxEmbedFileBytes()` function in `qmd.ts`: parses `QMD_MAX_EMBED_FILE_BYTES` with validation and fallback (mirrors the pattern used elsewhere in the codebase)
- `getEmbedBreakdown()` function in `store.ts`: SQL query for status display showing pending vs. too-large counts
- `vectorIndex()` updated with `noSizeLimit` option parameter
- `--no-size-limit` CLI flag in `parseCLI()`
- Help text updated
~160 lines across qmd.ts, store.ts, and test files.
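For reviewers who want a concrete picture of "validation and fallback", here is a minimal sketch of how the env var could be parsed. It is not the code from the PR: `parseMaxEmbedFileBytes` is a hypothetical pure helper, the 5 MiB default is an assumption, and the choice to fall back silently on invalid input (rather than erroring) is a guess at the intended behavior.

```typescript
// Assumed default: 5MB interpreted as 5 * 1024 * 1024 bytes.
const DEFAULT_MAX_EMBED_FILE_BYTES = 5 * 1024 * 1024;

// Parse a raw QMD_MAX_EMBED_FILE_BYTES value.
// Missing or invalid values fall back to the default.
function parseMaxEmbedFileBytes(
  raw: string | undefined,
  fallback: number = DEFAULT_MAX_EMBED_FILE_BYTES,
): number {
  if (raw === undefined) return fallback;
  const trimmed = raw.trim();
  // Accept only plain decimal integers: no units, signs, or fractions.
  if (!/^\d+$/.test(trimmed)) return fallback;
  const value = Number(trimmed);
  // Reject zero and values too large to represent exactly.
  if (!Number.isSafeInteger(value) || value <= 0) return fallback;
  return value;
}

// In the CLI this would be called with process.env.QMD_MAX_EMBED_FILE_BYTES.
console.log(parseMaxEmbedFileBytes("10485760")); // 10MB limit from the example above
console.log(parseMaxEmbedFileBytes("10MB"));     // invalid: falls back to the default
```

Keeping the parser a pure function of the raw string makes the env var cases easy to cover with unit tests.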
Related
- #69: security audit, H3 "Resource exhaustion via large files" (pre-v1.0 action item)
- #91: `qmd query fails with large files: reranker context size exceeded` (downstream symptom)
- #124: `qmd embed fails after 30 minutes on large collections` (related: large corpus timeout)
- #151: `Incremental embedding: skip unchanged chunks` (complementary optimization)
- #153: PR implementing this feature (ready for review)
I have a working implementation with test coverage (10 unit tests for env var parsing, 5 unit tests for the SQL breakdown query, 5 CLI integration tests) and can update the PR if the approach looks reasonable.