Design: Hybrid enrichment model (document-level canonical + chunk-level local signals) #86

## Summary
Enrichment is currently queued per chunk, but document-summary generation is triggered only by the task that processes the last chunk. This couples document-level quality to queue and task mechanics, which is fragile.

We should formalize a hybrid model:
- **Document-level metadata/summary** as canonical truth
- **Chunk-level metadata/summary** as optional retrieval/ranking signal

## Current Status Quo (as implemented)
- Enrichment is queued and tracked **per chunk** (`task_queue` + `chunks.enrichment_status`).
- The worker runs tier-2 for every chunk, but runs tier-3 document extraction only when `chunkIndex == totalChunks - 1`.
- If `allChunks` is missing, tier-3 may run with only last-chunk text.
- API strips summary fields from chunk `tier3_meta` and stores summary variants on `documents`.
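
The trigger condition and fallback described above can be sketched as follows. This is an illustration under stated assumptions, not the worker's actual code: the function names `should_run_tier3` and `tier3_input_text` are hypothetical, and `chunkIndex`, `totalChunks`, and `allChunks` are rendered in snake_case.

```python
from typing import List, Optional, Sequence


def should_run_tier3(chunk_index: int, total_chunks: int) -> bool:
    """Document extraction fires only in the task handling the last chunk."""
    return chunk_index == total_chunks - 1


def tier3_input_text(chunk_text: str, all_chunks: Optional[Sequence[str]]) -> str:
    """Return the text that tier-3 would see (hypothetical helper).

    When all_chunks is missing, the fallback is the current chunk only,
    which for the triggering task is the last chunk.
    """
    if all_chunks:
        return "\n".join(all_chunks)
    return chunk_text


chunks: List[str] = ["intro", "body", "conclusion"]

# Only the task for the last chunk triggers document-level extraction.
triggers = [should_run_tier3(i, len(chunks)) for i in range(len(chunks))]

full_input = tier3_input_text(chunks[-1], chunks)    # all chunks available
degraded_input = tier3_input_text(chunks[-1], None)  # allChunks missing
```

In this sketch, `degraded_input` contains only `"conclusion"`, which is the last-chunk-only path that the rest of this proposal aims to remove.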

## Design Flaws / Risks
1. **Last-chunk coupling**
   - Document-level output depends on a specific chunk task being claimed and completed.
2. **Input completeness risk**
   - Missing `allChunks` can degrade document summaries (last-chunk-only fallback).
3. **Operational confusion**
   - Users think in documents/URLs, but the queue and the force/clear operations work on chunk tasks.
4. **No localized semantic signal**
   - Retrieval quality for long docs can improve with chunk-local summaries/keywords, but we currently only keep doc-level summaries.

## Proposed Target Design
### 1) Keep document-level canonical outputs
- Keep `documents.summary_short|summary_medium|summary_long|summary` as source of truth.
- Keep global entities/relationships at document graph level.

### 2) Add lightweight chunk-level semantic signals
- Add optional chunk-local fields (e.g. in `chunks.tier3_meta`):
  - `chunk_summary_short` (1 sentence)
  - `chunk_keywords` (small list)
- Do **not** duplicate full document summary into every chunk.
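
A minimal sketch of how the chunk-local fields could be built and merged into `tier3_meta`. The helper names, the size caps (`MAX_KEYWORDS`, `MAX_SUMMARY_CHARS`), and the keyword normalization are assumptions for illustration; only `chunk_summary_short`, `chunk_keywords`, and the document summary field names come from this proposal.

```python
from typing import Any, Dict, List

MAX_KEYWORDS = 8          # assumed cap; the design only says "small list"
MAX_SUMMARY_CHARS = 240   # assumed cap for a one-sentence summary

# Summary fields that live on `documents` and must not be copied to chunks.
DOC_SUMMARY_FIELDS = {"summary", "summary_short", "summary_medium", "summary_long"}


def build_chunk_signals(summary: str, keywords: List[str]) -> Dict[str, Any]:
    """Build the optional chunk-local part of tier3_meta (hypothetical helper)."""
    seen = set()
    unique: List[str] = []
    for kw in keywords:
        norm = kw.strip().lower()
        if norm and norm not in seen:  # drop empties and duplicates
            seen.add(norm)
            unique.append(norm)
    return {
        "chunk_summary_short": summary.strip()[:MAX_SUMMARY_CHARS],
        "chunk_keywords": unique[:MAX_KEYWORDS],
    }


def merge_into_tier3_meta(tier3_meta: Dict[str, Any],
                          signals: Dict[str, Any]) -> Dict[str, Any]:
    """Merge chunk signals while dropping document-level summary fields."""
    merged = {k: v for k, v in tier3_meta.items() if k not in DOC_SUMMARY_FIELDS}
    merged.update(signals)
    return merged


signals = build_chunk_signals(" Explains retry handling. ", ["Retry", "retry ", "Queue"])
meta = merge_into_tier3_meta({"summary_long": "full document text", "topics": ["queues"]}, signals)
```

Here `meta` keeps the existing `topics` entry, gains the two chunk-local fields, and does not carry `summary_long`.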

### 3) Decouple document-level extraction from "last chunk task"
- Preferred: explicit document-level aggregation/extraction step that reads all chunks from DB.
- Alternative (interim): guarantee that `allChunks` is present and add an idempotent guard, so that document-level extraction runs once per document and is retry-safe.
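
The preferred option could look roughly like the sketch below, which uses SQLite only to stay self-contained. Everything beyond `chunks.enrichment_status` and `documents.summary` is an assumption: the columns `document_id`, `chunk_index`, and `text`, the status value `'done'`, the function `enrich_document`, and the `summarize` callable are hypothetical, and only the single `summary` column is written for brevity.

```python
import sqlite3
from typing import Callable, List


def enrich_document(conn: sqlite3.Connection, doc_id: int,
                    summarize: Callable[[List[str]], str]) -> bool:
    """Run document-level extraction once; return True if a summary was written."""
    row = conn.execute(
        "SELECT summary FROM documents WHERE id = ?", (doc_id,)
    ).fetchone()
    if row is None or row[0] is not None:
        return False  # unknown document, or already enriched: nothing to do

    pending = conn.execute(
        "SELECT COUNT(*) FROM chunks "
        "WHERE document_id = ? AND enrichment_status != 'done'",
        (doc_id,),
    ).fetchone()[0]
    if pending:
        return False  # some chunks are not enriched yet; safe to retry later

    # Read every chunk from the DB in order, independent of any task payload.
    rows = conn.execute(
        "SELECT text FROM chunks WHERE document_id = ? ORDER BY chunk_index",
        (doc_id,),
    ).fetchall()
    if not rows:
        return False

    summary = summarize([r[0] for r in rows])
    # Guard against concurrent workers: only the first writer changes the row.
    cur = conn.execute(
        "UPDATE documents SET summary = ? WHERE id = ? AND summary IS NULL",
        (summary, doc_id),
    )
    conn.commit()
    return cur.rowcount == 1


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE documents (id INTEGER PRIMARY KEY, summary TEXT);
CREATE TABLE chunks (document_id INTEGER, chunk_index INTEGER,
                     text TEXT, enrichment_status TEXT);
INSERT INTO documents (id) VALUES (1);
INSERT INTO chunks VALUES (1, 1, 'second', 'done'), (1, 0, 'first', 'done');
""")

calls: List[List[str]] = []


def fake_summarize(texts: List[str]) -> str:
    calls.append(texts)  # record invocations to show the step runs once
    return " / ".join(texts)


first = enrich_document(conn, 1, fake_summarize)
second = enrich_document(conn, 1, fake_summarize)  # retry is a no-op
stored = conn.execute("SELECT summary FROM documents WHERE id = 1").fetchone()[0]
```

The step does not care which chunk task finished last: it can be invoked after any chunk completes, or by a periodic sweep, and a retry after success does nothing.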

### 4) Clarify CLI semantics
- Make the enrich output state explicitly that enqueue/clear counts refer to **chunk tasks**, not documents.
- Add optional doc-level stats views (`docs_total`, `docs_enriched`, `docs_partial`).
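
One way the doc-level stats could be derived from per-chunk status rows is sketched below. The definitions are assumptions, since the proposal only names the counters: a document counts as enriched when all of its chunks have the (hypothetical) status `"done"`, and as partial when some but not all do. The function name `doc_level_stats` is also hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def doc_level_stats(chunk_rows: Iterable[Tuple[str, str]],
                    done_status: str = "done") -> Dict[str, int]:
    """Aggregate (document_id, enrichment_status) rows into document counts."""
    per_doc: Dict[str, List[int]] = defaultdict(lambda: [0, 0])  # [done, total]
    for doc_id, status in chunk_rows:
        per_doc[doc_id][1] += 1
        if status == done_status:
            per_doc[doc_id][0] += 1
    return {
        "docs_total": len(per_doc),
        "docs_enriched": sum(1 for done, total in per_doc.values() if done == total),
        "docs_partial": sum(1 for done, total in per_doc.values() if 0 < done < total),
    }


rows = [
    ("a", "done"), ("a", "done"),     # fully enriched
    ("b", "done"), ("b", "pending"),  # partially enriched
    ("c", "pending"),                 # not started
]
stats = doc_level_stats(rows)
```

For these five chunk tasks the sketch reports three documents, one enriched and one partial, which is the distinction the CLI output should make visible.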

## Rollout Plan
### Phase 1 (safe MVP)
- Add chunk-local summary/keywords fields (optional, small payload).
- Keep current doc-level summary writes unchanged.
- Update CLI/help text to explicitly call out chunk-task semantics.

### Phase 2
- Introduce document-level enrichment step decoupled from last chunk.
- Make it idempotent and retry-safe.

### Phase 3
- Use chunk-local signals in ranking/snippet generation behind a feature flag.
- Measure retrieval quality impact before default-on.
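
As a rough illustration of Phase 3, chunk-local keywords could contribute a small boost on top of an existing retrieval score when a flag is enabled. The scoring formula, the `boost` weight, the candidate shape, and the names `keyword_overlap` and `rank_chunks` are all assumptions; the actual ranking pipeline is not described in this issue.

```python
from typing import Any, Dict, List


def keyword_overlap(query: str, keywords: List[str]) -> float:
    """Fraction of query terms that appear in the chunk's keywords."""
    terms = set(query.lower().split())
    if not terms:
        return 0.0
    return len(terms & {k.lower() for k in keywords}) / len(terms)


def rank_chunks(query: str, candidates: List[Dict[str, Any]],
                use_chunk_signals: bool = False,
                boost: float = 0.2) -> List[Dict[str, Any]]:
    """Sort candidates by base score, optionally boosted by chunk keywords."""
    def score(candidate: Dict[str, Any]) -> float:
        total = candidate["score"]
        if use_chunk_signals:  # feature flag: off by default
            total += boost * keyword_overlap(query, candidate.get("chunk_keywords", []))
        return total

    return sorted(candidates, key=score, reverse=True)


candidates = [
    {"id": "c1", "score": 0.80, "chunk_keywords": ["billing"]},
    {"id": "c2", "score": 0.75, "chunk_keywords": ["retry", "queue"]},
]
flag_off = [c["id"] for c in rank_chunks("retry queue", candidates)]
flag_on = [c["id"] for c in rank_chunks("retry queue", candidates, use_chunk_signals=True)]
```

With the flag off the order is unchanged (`c1`, `c2`); with it on, `c2` moves ahead because both query terms match its keywords. Chunks without `chunk_keywords` are unaffected, so the fields can stay optional.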

## Acceptance Criteria
- Document summary generation no longer depends on "last chunk" execution order.
- No last-chunk-only degradation path when `allChunks` is missing.
- CLI clearly communicates task granularity and doc-level progress.
- Query quality benchmark for long documents shows no regression; target improvement in top-k relevance/snippet usefulness.
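
The benchmark criterion could be checked with a standard top-k metric such as precision@k, compared between the baseline ranking and the ranking with chunk signals enabled. This is a sketch only: the choice of metric, the `tolerance` parameter, the function names, and the sample IDs are assumptions, not part of the proposal.

```python
from typing import List, Set


def precision_at_k(ranked_ids: List[str], relevant_ids: Set[str], k: int) -> float:
    """Share of the top-k results that are labeled relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids) / k


def no_regression(baseline: float, candidate: float, tolerance: float = 0.0) -> bool:
    """True when the candidate metric is not worse than the baseline."""
    return candidate >= baseline - tolerance


relevant = {"c2", "c4"}  # hypothetical relevance labels for one query
baseline_p = precision_at_k(["c1", "c3", "c2"], relevant, k=2)
candidate_p = precision_at_k(["c2", "c1", "c4"], relevant, k=2)
ok = no_regression(baseline_p, candidate_p)
```

In practice the metric would be averaged over a query set focused on long documents before deciding whether to enable the flag by default.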

## Non-goals
- No full metadata duplication from document to all chunks.
- No breaking API changes for existing query/ingest clients.

## Why now
Recent bookmark ingestion/enrichment runs highlighted user confusion (URLs vs chunks) and surfaced the architectural coupling around last-chunk document extraction. Addressing this now reduces correctness risk and improves operational clarity.

