qmd embed has no file size guard — large files cause resource exhaustion and pollute search results

## Problem

Two problems with embedding large files:

### 1. Resource exhaustion

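A single large file can dominate embed time and memory. With 800-token chunks and 15% overlap, each chunk advances about 800 × 0.85 = 680 tokens, or roughly 2,720 characters at an assumed 4 characters per token, so a file produces about 1 chunk per 2,720 characters of content: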

| File size | Approximate chunks | Impact |
|-----------|--------------------|--------|
| 1MB | ~385 chunks | Noticeable slowdown |
| 5MB | ~1,930 chunks | Minutes of embed time |
| 50MB | ~19,275 chunks | Longer to embed than thousands of normal documents combined |

Each chunk requires a forward pass through the embedding model, so embed time grows roughly in proportion to chunk count.
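
The estimate above can be reproduced with a few lines of arithmetic. This is a minimal sketch, not qmd's chunker: the 4-characters-per-token ratio and the reading of 1MB as 1,048,576 bytes of single-byte text are assumptions, and `charsPerChunk` and `estimateChunks` are hypothetical names.

```typescript
// Assumptions (not confirmed by the qmd source): about 4 characters per token,
// and 1MB = 1,048,576 bytes of single-byte (ASCII) text.
const CHUNK_TOKENS = 800;
const OVERLAP_RATIO = 0.15;
const CHARS_PER_TOKEN = 4; // rough heuristic

// Characters by which the chunk window advances from one chunk to the next.
function charsPerChunk(): number {
  return Math.round(CHUNK_TOKENS * CHARS_PER_TOKEN * (1 - OVERLAP_RATIO));
}

// Approximate number of chunks for a file with the given character count.
function estimateChunks(charCount: number): number {
  return Math.ceil(charCount / charsPerChunk());
}

const MB = 1024 * 1024;
for (const sizeMb of [1, 5, 50]) {
  console.log(`${sizeMb}MB: ~${estimateChunks(sizeMb * MB)} chunks`);
}
```

Exact counts depend on the tokenizer, the content, and whether 1MB is read as 10^6 or 2^20 bytes, so the figures are order-of-magnitude guides only.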

### 2. Search quality degradation

Large files — minified JS, CSV exports, log files, database dumps — produce thousands of low-value chunks that pollute search results. When 80% of your vector index is chunks from a single 50MB log file, BM25 and vector search both suffer.

### Prior work identifying this issue

The security audit in #69 identified this as **H3: Resource exhaustion via large files** and listed "Add file size limits" as a pre-v1.0 action item.

Issue #91 documents a related downstream symptom: large files (2MB+) cause reranker context size overflow during `qmd query`.

## Current Behavior

- `qmd embed` processes every non-empty file in the index, regardless of size
- No warning, no skip, no configurable limit
- The only way to exclude large files is `--mask` at collection creation time, which filters by glob pattern (file extension), not by size
- `qmd status` shows no indication of how many files are oversized

Contrast with `multi-get`, which already has a `--max-bytes` flag (default 10KB) and skips oversized files with a clear message. The embed command has no equivalent.

## Proposed Solution

Add a configurable file size limit to `qmd embed` with a sensible default. Three controls:

```bash
# Default: skip files > 5MB (no change needed for most users)
qmd embed

# Override via environment variable (in bytes)
QMD_MAX_EMBED_FILE_BYTES=10485760 qmd embed    # 10MB limit
QMD_MAX_EMBED_FILE_BYTES=1048576 qmd embed     # 1MB limit

# Bypass entirely
qmd embed --no-size-limit
```
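
One way the environment variable could be parsed is sketched below. The validation rules (digits only, positive, fall back to the default otherwise) are an assumption rather than the behavior of the PR, and `parseMaxEmbedFileBytes` is a hypothetical helper that takes the raw value as an argument so it can be tested without touching the environment.

```typescript
// 5MB default from this proposal, expressed in bytes.
const DEFAULT_MAX_EMBED_FILE_BYTES = 5 * 1024 * 1024;

// Parse the raw env value. Anything that is not a positive integer
// falls back to the default (assumed policy, not confirmed by the PR).
function parseMaxEmbedFileBytes(raw: string | undefined): number {
  if (raw === undefined) return DEFAULT_MAX_EMBED_FILE_BYTES;
  const trimmed = raw.trim();
  // Digits only: rejects "", "abc", "-5", "1e6", and "5MB".
  if (!/^\d+$/.test(trimmed)) return DEFAULT_MAX_EMBED_FILE_BYTES;
  const value = Number(trimmed);
  if (!Number.isSafeInteger(value) || value <= 0) return DEFAULT_MAX_EMBED_FILE_BYTES;
  return value;
}
```

In a Node-based CLI this would typically be called as `parseMaxEmbedFileBytes(process.env.QMD_MAX_EMBED_FILE_BYTES)`. Falling back silently keeps a typo from breaking `qmd embed`; printing a warning on invalid input would be a reasonable alternative.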

### Behavior

1. **During `qmd embed`**: Files exceeding the limit are skipped with a warning to stderr:

   ```
   Skipping data/logs/app.log (12.3MB exceeds 5.0MB limit)
   ```

   A summary is shown at the end:

   ```
   3 file(s) skipped (exceeded 5.0MB file size limit). Use --no-size-limit to include all files.
   ```

2. **During `qmd status`**: The breakdown shows skipped files:

   ```
   Documents
     Total:    150 files indexed
     Vectors:  120 embedded
     Pending:  15 need embedding (run 'qmd embed')
     Skipped:  15 exceed 5.0MB size limit
   ```

3. **Default is safe**: 5MB covers virtually all source code, documentation, and notes. Files above 5MB are typically logs, data dumps, minified bundles, or binary-ish content that doesn't embed well anyway.

4. **No behavior change for existing users** unless they have files > 5MB indexed. Even then, `--no-size-limit` restores the old behavior.
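
The skip behavior described above could look roughly like the following. This is a sketch under stated assumptions: the `Doc` shape, `formatMb`, and `partitionBySize` are hypothetical and not taken from the qmd codebase; only the warning format and `Buffer.byteLength` come from this proposal.

```typescript
import { Buffer } from "node:buffer";

// Hypothetical document shape; the real qmd type may differ.
interface Doc {
  path: string;
  body: string;
}

const DEFAULT_MAX_EMBED_FILE_BYTES = 5 * 1024 * 1024; // 5MB default from this proposal

// Format a byte count the way the example warnings do, e.g. "12.3MB".
function formatMb(bytes: number): string {
  return `${(bytes / (1024 * 1024)).toFixed(1)}MB`;
}

// Split documents into those to embed and warnings for those skipped.
function partitionBySize(
  docs: Doc[],
  maxBytes: number = DEFAULT_MAX_EMBED_FILE_BYTES,
  noSizeLimit = false,
): { toEmbed: Doc[]; warnings: string[] } {
  const toEmbed: Doc[] = [];
  const warnings: string[] = [];
  for (const doc of docs) {
    const size = Buffer.byteLength(doc.body, "utf8");
    if (!noSizeLimit && size > maxBytes) {
      warnings.push(
        `Skipping ${doc.path} (${formatMb(size)} exceeds ${formatMb(maxBytes)} limit)`,
      );
    } else {
      toEmbed.push(doc);
    }
  }
  return { toEmbed, warnings };
}
```

A caller would write each warning to stderr and, if `warnings.length > 0`, print the summary line with the skipped count at the end of the run.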

### Design Decisions

| Decision | Choice | Rationale |
|----------|--------|-----------|
| Default limit | 5MB | Covers ~99% of source/docs; files above are typically low-value for semantic search |
| Unit | Bytes (env var) | Exact and unambiguous (no MB/MiB confusion); matches `multi-get`'s `--max-bytes`. Warnings still display sizes in MB for readability |
| Override mechanism | Env var + CLI flag | Env var for persistent config, `--no-size-limit` for one-off runs |
| Scope | Embed only | `multi-get` already has its own `--max-bytes`; search results reference docs, not content |
| Size check method | `Buffer.byteLength(body, 'utf8')` | Counts UTF-8 bytes rather than UTF-16 code units, so multi-byte characters are measured correctly |
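
To illustrate the size check choice, here is a small comparison of `String.prototype.length` (UTF-16 code units) with `Buffer.byteLength` (encoded bytes) in Node. The sample strings are illustrative only.

```typescript
import { Buffer } from "node:buffer";

const ascii = "hello";
const accented = "h\u00e9llo"; // e-acute takes 2 bytes in UTF-8
const emoji = "\u{1F44D}"; // thumbs up: 2 UTF-16 code units, 4 UTF-8 bytes

for (const s of [ascii, accented, emoji]) {
  console.log(
    JSON.stringify(s),
    "length:", s.length,
    "bytes:", Buffer.byteLength(s, "utf8"),
  );
}
```

For ASCII-heavy source code the two measures agree, but for text with many non-ASCII characters `length` undercounts the bytes, which is why a byte-based limit should not rely on it.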

### Implementation

The implementation follows the established `multi-get` skip pattern and the env var parsing pattern from other qmd configuration:

- `getMaxEmbedFileBytes()` function in `qmd.ts` — parses `QMD_MAX_EMBED_FILE_BYTES` with validation and fallback (mirrors the pattern used elsewhere in the codebase)
- `getEmbedBreakdown()` function in `store.ts` — SQL query for status display showing pending vs. too-large counts
- `vectorIndex()` updated with `noSizeLimit` option parameter
- `--no-size-limit` CLI flag in `parseCLI()`
- Help text updated

~160 lines across `qmd.ts`, `store.ts`, and test files.
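
The status breakdown logic can be illustrated without a database. The sketch below is an in-memory stand-in for what `getEmbedBreakdown()` computes in SQL. The `DocRow` shape and the rule that an already-embedded file counts as embedded even if it exceeds the limit are assumptions, not details from the PR.

```typescript
// Hypothetical in-memory row; the proposed getEmbedBreakdown() uses SQL instead.
interface DocRow {
  bytes: number;
  embedded: boolean;
}

interface EmbedBreakdown {
  total: number;
  embedded: number;
  pending: number;
  tooLarge: number;
}

// Classify each document as embedded, pending, or too large to embed.
function embedBreakdown(rows: DocRow[], maxBytes: number): EmbedBreakdown {
  const result: EmbedBreakdown = { total: rows.length, embedded: 0, pending: 0, tooLarge: 0 };
  for (const row of rows) {
    if (row.embedded) result.embedded++; // assumed: existing vectors still count
    else if (row.bytes > maxBytes) result.tooLarge++;
    else result.pending++;
  }
  return result;
}
```

With 120 embedded rows, 15 small unembedded rows, and 15 unembedded rows over the limit, this yields the same 150 / 120 / 15 / 15 split as the `qmd status` example above.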

## Related

- #69 — Security audit: H3 "Resource exhaustion via large files" (pre-v1.0 action item)
- #91 — `qmd query fails with large files: reranker context size exceeded` (downstream symptom)
- #124 — `qmd embed fails after 30 minutes on large collections` (related: large corpus timeout)
- #151 — `Incremental embedding: skip unchanged chunks` (complementary optimization)
- #153 — PR implementing this feature (ready for review)

---

I have a working implementation with test coverage (10 unit tests for env var parsing, 5 unit tests for the SQL breakdown query, 5 CLI integration tests) and can update the PR if the approach looks reasonable.

