qmd embed has no file size guard — large files cause resource exhaustion and pollute search results #156

@brettdavies

Problem

Two problems with embedding large files:

1. Resource exhaustion

A single large file can dominate embed time and memory. With 800-token chunks and 15% overlap, each chunk advances by about 680 new tokens, so (assuming roughly 4 characters per token) a file produces about 1 chunk per 2,720 characters of content:

| File size | Approximate chunks | Impact |
| --- | --- | --- |
| 1MB | ~385 chunks | Noticeable slowdown |
| 5MB | ~1,930 chunks | Minutes of embed time |
| 50MB | ~19,275 chunks | Longer to embed than thousands of normal documents combined |

Each chunk requires a forward pass through the embedding model, so embed time grows with chunk count and a single oversized file can outweigh the rest of the collection.
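The chunk estimates can be reproduced with a short calculation. This is a minimal sketch, not qmd's actual chunker: `estimateChunks` is a hypothetical helper, and it assumes roughly 4 characters per token, about 1 byte per character (mostly-ASCII text), and 1MB = 1,048,576 bytes.

```typescript
// Assumption: ~4 characters per token and ~1 byte per character.
const CHARS_PER_TOKEN = 4;
const MB = 1024 * 1024;

// Hypothetical helper: rough chunk count for a file of the given size.
function estimateChunks(
  fileBytes: number,
  chunkTokens = 800,
  overlapPercent = 15,
): number {
  // Each chunk advances by the non-overlapping part of the window.
  const strideTokens = (chunkTokens * (100 - overlapPercent)) / 100; // 680
  const strideChars = strideTokens * CHARS_PER_TOKEN; // 2,720
  return Math.ceil(fileBytes / strideChars);
}

console.log(estimateChunks(1 * MB));  // 386
console.log(estimateChunks(5 * MB));  // 1928
console.log(estimateChunks(50 * MB)); // 19276
```

Real counts will vary with the tokenizer and the content, but the growth is linear in file size either way.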

2. Search quality degradation

Large files — minified JS, CSV exports, log files, database dumps — produce thousands of low-value chunks that pollute search results. When 80% of your vector index is chunks from a single 50MB log file, BM25 and vector search both suffer.

Prior work identifying this issue

The security audit in #69 identified this as "H3: Resource exhaustion via large files" and listed "Add file size limits" as a pre-v1.0 action item.

Issue #91 documents a related downstream symptom: large files (2MB+) cause reranker context size overflow during qmd query.

Current Behavior

  • qmd embed processes every non-empty file in the index, regardless of size
  • No warning, no skip, no configurable limit
  • The only way to exclude large files is --mask at collection creation time, which filters by glob pattern (file extension), not by size
  • qmd status shows no indication of how many files are oversized

Contrast with multi-get, which already has a --max-bytes flag (default 10KB) and skips oversized files with a clear message. The embed command has no equivalent.

Proposed Solution

Add a configurable file size limit to qmd embed with a sensible default. Three controls:

# Default: skip files > 5MB (no change needed for most users)
qmd embed

# Override via environment variable (in bytes)
QMD_MAX_EMBED_FILE_BYTES=10485760 qmd embed    # 10MB limit
QMD_MAX_EMBED_FILE_BYTES=1048576 qmd embed     # 1MB limit

# Bypass entirely
qmd embed --no-size-limit
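
As a rough illustration of how the three controls could resolve into one effective limit, here is a sketch. `resolveEmbedLimit` is a hypothetical name, the environment is passed in as a plain object rather than read from `process.env`, and the validation rules (positive integers only, anything else falls back to the default) are an assumption; the proposal only commits to "validation and fallback".

```typescript
const DEFAULT_MAX_EMBED_FILE_BYTES = 5 * 1024 * 1024; // 5MB default from the proposal

type Env = Record<string, string | undefined>;

// Hypothetical helper: returns the limit in bytes, or null when the limit is bypassed.
function resolveEmbedLimit(env: Env, noSizeLimit = false): number | null {
  if (noSizeLimit) return null; // --no-size-limit overrides the env var and the default

  const raw = (env["QMD_MAX_EMBED_FILE_BYTES"] ?? "").trim();
  if (raw === "") return DEFAULT_MAX_EMBED_FILE_BYTES;

  // Assumed validation: accept only plain positive integers, otherwise fall back.
  if (!/^\d+$/.test(raw)) return DEFAULT_MAX_EMBED_FILE_BYTES;
  const parsed = Number(raw);
  return Number.isSafeInteger(parsed) && parsed > 0
    ? parsed
    : DEFAULT_MAX_EMBED_FILE_BYTES;
}

console.log(resolveEmbedLimit({}));                                       // 5242880
console.log(resolveEmbedLimit({ QMD_MAX_EMBED_FILE_BYTES: "10485760" })); // 10485760
console.log(resolveEmbedLimit({ QMD_MAX_EMBED_FILE_BYTES: "10MB" }));     // 5242880
console.log(resolveEmbedLimit({}, true));                                 // null
```

Falling back silently on a malformed value keeps `qmd embed` usable with a bad environment; warning on stderr in that case would be a reasonable alternative.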

Behavior

  1. During qmd embed: Files exceeding the limit are skipped with a warning to stderr:

    Skipping data/logs/app.log (12.3MB exceeds 5.0MB limit)
    

    A summary is shown at the end:

    3 file(s) skipped (exceeded 5.0MB file size limit). Use --no-size-limit to include all files.
    
  2. During qmd status: The breakdown shows skipped files:

    Documents
      Total:    150 files indexed
      Vectors:  120 embedded
      Pending:  15 need embedding (run 'qmd embed')
      Skipped:  15 exceed 5.0MB size limit
    
  3. Default is safe: 5MB covers virtually all source code, documentation, and notes. Files above 5MB are typically logs, data dumps, minified bundles, or binary-ish content that doesn't embed well anyway.

  4. No behavior change for existing users unless they have files > 5MB indexed. Even then, --no-size-limit restores the old behavior.
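
The skip-and-summarize behavior during `qmd embed` can be sketched as follows. `planEmbed`, `Doc`, and `formatMB` are hypothetical names for illustration, not the actual implementation; the message strings are copied from the examples above, and a `null` limit stands in for `--no-size-limit`.

```typescript
interface Doc {
  path: string;
  bytes: number; // size of the document body in bytes
}

const BYTES_PER_MB = 1024 * 1024;
const formatMB = (bytes: number): string =>
  `${(bytes / BYTES_PER_MB).toFixed(1)}MB`;

// Hypothetical helper: split documents into those to embed and the stderr messages.
function planEmbed(docs: Doc[], limitBytes: number | null) {
  const toEmbed: Doc[] = [];
  const messages: string[] = [];

  for (const doc of docs) {
    if (limitBytes !== null && doc.bytes > limitBytes) {
      messages.push(
        `Skipping ${doc.path} (${formatMB(doc.bytes)} exceeds ${formatMB(limitBytes)} limit)`,
      );
    } else {
      toEmbed.push(doc);
    }
  }

  const skipped = docs.length - toEmbed.length;
  if (limitBytes !== null && skipped > 0) {
    messages.push(
      `${skipped} file(s) skipped (exceeded ${formatMB(limitBytes)} file size limit). ` +
        `Use --no-size-limit to include all files.`,
    );
  }
  return { toEmbed, messages };
}

const plan = planEmbed(
  [
    { path: "data/logs/app.log", bytes: 12_897_485 },
    { path: "README.md", bytes: 2_048 },
  ],
  5 * BYTES_PER_MB,
);
plan.messages.forEach((m) => console.error(m));
```

A file exactly at the limit is embedded here; only files strictly larger are skipped, matching the "skip files > 5MB" wording.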

Design Decisions

| Decision | Choice | Rationale |
| --- | --- | --- |
| Default limit | 5MB | Covers ~99% of source/docs; files above are typically low-value for semantic search |
| Unit | Bytes (env var) | Exact and unambiguous (no MB vs. MiB confusion) and in line with multi-get's --max-bytes; warnings display the value in MB for readability |
| Override mechanism | Env var + CLI flag | Env var for persistent config, --no-size-limit for one-off runs |
| Scope | Embed only | multi-get already has its own --max-bytes; search results reference docs, not content |
| Size check method | Buffer.byteLength(body, 'utf8') | Counts UTF-8 bytes rather than string length, so multi-byte characters are measured correctly |
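
To illustrate why the size check uses byte length rather than string length, here is a small Node.js example. The sample string is made up for the illustration; `Buffer.byteLength` is the standard Node API named in the table.

```typescript
// "naïve café ☕" written with escapes so the source stays ASCII.
const body = "na\u00efve caf\u00e9 \u2615";

// String length counts UTF-16 code units, not bytes.
const codeUnits = body.length;

// Byte length counts what the UTF-8 encoded body actually occupies.
const utf8Bytes = Buffer.byteLength(body, "utf8");

console.log(codeUnits); // 12
console.log(utf8Bytes); // 16
```

For mostly-ASCII files the two numbers are nearly identical, but for non-Latin text the byte count can be two to three times the string length, which matters when the limit is expressed in bytes.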

Implementation

The implementation follows the established multi-get skip pattern and the env var parsing pattern from other qmd configuration:

  • getMaxEmbedFileBytes() function in qmd.ts — parses QMD_MAX_EMBED_FILE_BYTES with validation and fallback (mirrors the pattern used elsewhere in the codebase)
  • getEmbedBreakdown() function in store.ts — SQL query for status display showing pending vs. too-large counts
  • vectorIndex() updated with noSizeLimit option parameter
  • --no-size-limit CLI flag in parseCLI()
  • Help text updated
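
The pending vs. too-large split that `getEmbedBreakdown()` reports can be sketched in plain TypeScript. This is only an illustration of the counting logic: the real function is a SQL query against a schema not shown here, and `IndexedDoc`, `EmbedBreakdown`, and `summarizeEmbedState` are hypothetical names. It also assumes that a file already embedded (for example with `--no-size-limit`) keeps counting as embedded even if it is over the limit.

```typescript
interface IndexedDoc {
  bytes: number;     // UTF-8 size of the document body
  embedded: boolean; // true once vectors exist for this document
}

interface EmbedBreakdown {
  total: number;
  embedded: number;
  pending: number;
  tooLarge: number;
}

// Hypothetical helper: classify each indexed document for the status display.
function summarizeEmbedState(
  docs: IndexedDoc[],
  limitBytes: number | null,
): EmbedBreakdown {
  const result: EmbedBreakdown = { total: docs.length, embedded: 0, pending: 0, tooLarge: 0 };
  for (const doc of docs) {
    if (doc.embedded) {
      result.embedded += 1;
    } else if (limitBytes !== null && doc.bytes > limitBytes) {
      result.tooLarge += 1; // would be skipped by `qmd embed`
    } else {
      result.pending += 1; // would be embedded on the next run
    }
  }
  return result;
}
```

With 120 embedded documents, 15 small unembedded ones, and 15 unembedded ones over 5MB, this yields the 150 / 120 / 15 / 15 breakdown shown in the `qmd status` example.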

~160 lines across qmd.ts, store.ts, and test files.

Related


I have a working implementation with test coverage (10 unit tests for env var parsing, 5 unit tests for the SQL breakdown query, 5 CLI integration tests) and can update the PR if the approach looks reasonable.
