Summary
Current enrichment is chunk-queued but document-summary generation is triggered only in the task that processes the last chunk. This creates a fragile coupling between queue/task mechanics and document-level quality.
We should formalize a hybrid model:
- Document-level metadata/summary as canonical truth
- Chunk-level metadata/summary as optional retrieval/ranking signal
Current Status Quo (as implemented)
- Enrichment is queued and tracked per chunk (task_queue + chunks.enrichment_status).
- Worker runs tier-2 per chunk, but tier-3 document extraction only when chunkIndex == totalChunks - 1.
- If allChunks is missing, tier-3 may run with only last-chunk text.
- API strips summary fields from chunk tier3_meta and stores summary variants on documents.
Design Flaws / Risks
- Last-chunk coupling
  - Document-level output depends on a specific chunk task being claimed and completed.
- Input completeness risk
  - Missing allChunks can degrade document summaries (last-chunk-only fallback).
- Operational confusion
  - Users think in documents/URLs, but queue and force/clear operate in chunk tasks.
- No localized semantic signal
  - Retrieval quality for long docs can improve with chunk-local summaries/keywords, but we currently only keep doc-level summaries.
Proposed Target Design
1) Keep document-level canonical outputs
- Keep documents.summary_short|summary_medium|summary_long|summary as source of truth.
- Keep global entities/relationships at document graph level.
2) Add lightweight chunk-level semantic signals
- Add optional chunk-local fields (e.g. in chunks.tier3_meta):
  - chunk_summary_short (1 sentence)
  - chunk_keywords (small list)
- Do not duplicate full document summary into every chunk.
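A minimal sketch of how the chunk-local payload could be kept small, assuming the two field names proposed above. The function name `toChunkLocalMeta` and both size caps are hypothetical, chosen only for illustration. Because the function copies an explicit allow-list of fields, document-level summary variants never end up in chunk metadata.

```typescript
// Hypothetical shape for the optional chunk-local signals. Only the two
// field names come from the proposal; everything else is an assumption.
interface ChunkLocalMeta {
  chunk_summary_short?: string;
  chunk_keywords?: string[];
}

const MAX_SUMMARY_CHARS = 240; // assumed cap for "1 sentence"
const MAX_KEYWORDS = 8; // assumed cap for "small list"

// Build chunk-local metadata from raw tier-3 output using an allow-list,
// so doc-level fields (summary_short, summary_long, ...) are dropped.
function toChunkLocalMeta(raw: Record<string, unknown>): ChunkLocalMeta {
  const meta: ChunkLocalMeta = {};

  const summary = raw["chunk_summary_short"];
  if (typeof summary === "string" && summary.trim() !== "") {
    meta.chunk_summary_short = summary.trim().slice(0, MAX_SUMMARY_CHARS);
  }

  const keywords = raw["chunk_keywords"];
  if (Array.isArray(keywords)) {
    const cleaned = keywords
      .filter((k): k is string => typeof k === "string")
      .map((k) => k.trim().toLowerCase())
      .filter((k) => k !== "");
    // De-duplicate while preserving first-seen order, then cap the list.
    const unique = cleaned
      .filter((k, i) => cleaned.indexOf(k) === i)
      .slice(0, MAX_KEYWORDS);
    if (unique.length > 0) meta.chunk_keywords = unique;
  }

  return meta;
}
```

Both fields stay optional, so chunks enriched before this change remain valid without a backfill.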
3) Decouple document-level extraction from "last chunk task"
- Preferred: explicit document-level aggregation/extraction step that reads all chunks from DB.
- Alternative (interim): guarantee allChunks presence + idempotent guard so document-level extraction runs once per doc and is retry-safe.
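The preferred option could look roughly like the sketch below. It is an in-memory illustration only: `DocChunkRow`, `DocRow`, `summarized_fingerprint`, `enrichDocument`, and the status value `"done"` are hypothetical stand-ins for whatever the real schema uses, and a real version would load the chunks from the DB. Waiting until every chunk is enriched is also an assumption here. The step never falls back to partial input, and a fingerprint of the ordered chunk texts makes a retry with unchanged input a no-op.

```typescript
// Hypothetical row shapes; the real schema may differ.
type DocChunkRow = {
  docId: string;
  chunkIndex: number;
  text: string;
  enrichment_status: "pending" | "done";
};
type DocRow = { id: string; summary?: string; summarized_fingerprint?: string };
type Outcome = "skipped_incomplete" | "skipped_unchanged" | "summarized";

// Cheap deterministic fingerprint of the ordered chunk texts
// (illustrative only, not cryptographic).
function fingerprint(texts: string[]): string {
  let h = 5381;
  for (const t of texts) {
    for (let i = 0; i < t.length; i++) h = ((h * 33) ^ t.charCodeAt(i)) >>> 0;
    h = ((h * 33) ^ 0x1f) >>> 0; // separator so chunk boundaries affect the hash
  }
  return h.toString(16);
}

// Document-level step: independent of which chunk task finished last.
function enrichDocument(
  doc: DocRow,
  chunks: DocChunkRow[],
  summarize: (fullText: string) => string,
): Outcome {
  const own = chunks
    .filter((c) => c.docId === doc.id)
    .sort((a, b) => a.chunkIndex - b.chunkIndex);

  // Assumption: only summarize once every chunk is enriched; never use partial input.
  if (own.length === 0 || own.some((c) => c.enrichment_status !== "done")) {
    return "skipped_incomplete";
  }

  const texts = own.map((c) => c.text);
  const fp = fingerprint(texts);

  // Idempotent guard: same input as the last successful run means nothing to do.
  if (doc.summarized_fingerprint === fp) return "skipped_unchanged";

  doc.summary = summarize(texts.join("\n"));
  doc.summarized_fingerprint = fp;
  return "summarized";
}
```

Since the guard keys on content rather than on task identity, the step can be triggered from any chunk completion, from a periodic sweep, or manually, without producing duplicate work.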
4) Clarify CLI semantics
- Make the enrich output state explicitly that enqueue/clear counts refer to chunk tasks.
- Add optional doc-level stats views (docs_total, docs_enriched, docs_partial).
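One way the doc-level stats could be derived from per-chunk status is sketched below. The definitions are assumptions, not settled semantics: a document counts as enriched when all of its chunks are done, and as partial when some but not all are done. `ChunkStatusRow`, `docStats`, `chunk_tasks_total`, and the `"done"` status value are hypothetical names.

```typescript
// Hypothetical row shape: one entry per chunk task.
type ChunkStatusRow = { docId: string; enrichment_status: string };

type DocStats = {
  docs_total: number;
  docs_enriched: number; // assumed: every chunk of the doc is done
  docs_partial: number; // assumed: at least one, but not all, chunks done
  chunk_tasks_total: number; // reported alongside so granularity is explicit
};

function docStats(chunks: ChunkStatusRow[], doneStatus = "done"): DocStats {
  // Count total and done chunks per document.
  const perDoc = new Map<string, { total: number; done: number }>();
  for (const c of chunks) {
    let entry = perDoc.get(c.docId);
    if (!entry) {
      entry = { total: 0, done: 0 };
      perDoc.set(c.docId, entry);
    }
    entry.total++;
    if (c.enrichment_status === doneStatus) entry.done++;
  }

  const stats: DocStats = {
    docs_total: perDoc.size,
    docs_enriched: 0,
    docs_partial: 0,
    chunk_tasks_total: chunks.length,
  };
  perDoc.forEach((entry) => {
    if (entry.done === entry.total) stats.docs_enriched++;
    else if (entry.done > 0) stats.docs_partial++;
  });
  return stats;
}
```

Printing chunk and document counts side by side lets the CLI show both the task granularity and the doc-level progress users expect.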
Rollout Plan
Phase 1 (safe MVP)
- Add chunk-local summary/keywords fields (optional, small payload).
- Keep current doc-level summary writes unchanged.
- Update CLI/help text to explicitly call out chunk-task semantics.
Phase 2
- Introduce document-level enrichment step decoupled from last chunk.
- Make it idempotent and retry-safe.
Phase 3
- Use chunk-local signals in ranking/snippet generation behind a feature flag.
- Measure retrieval quality impact before default-on.
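A rough sketch of what flag-gated use of chunk-local signals in ranking might look like. The scoring rule (a fixed boost per query term that matches a chunk keyword), the default boost value, and the names `RankedHit`, `rerank`, and `useChunkSignals` are all assumptions for illustration, not a proposed final formula. With the flag off, the base order is returned unchanged, which keeps before/after benchmarking straightforward.

```typescript
// Hypothetical hit shape: baseScore comes from the existing ranker.
type RankedHit = { id: string; baseScore: number; chunk_keywords?: string[] };

function rerank(
  query: string,
  hits: RankedHit[],
  opts: { useChunkSignals: boolean; boost?: number },
): RankedHit[] {
  const boost = opts.boost ?? 0.1; // assumed default, to be tuned by benchmark
  const terms = query
    .toLowerCase()
    .split(/\s+/)
    .filter((t) => t !== "");

  const score = (h: RankedHit): number => {
    // Flag off or no chunk keywords: fall back to the existing score.
    if (!opts.useChunkSignals || !h.chunk_keywords) return h.baseScore;
    const kws = h.chunk_keywords.map((k) => k.toLowerCase());
    const overlap = terms.filter((t) => kws.indexOf(t) !== -1).length;
    return h.baseScore + boost * overlap;
  };

  // Sort by score descending; ties keep the original order.
  return hits
    .map((h, i) => ({ h, i, s: score(h) }))
    .sort((a, b) => b.s - a.s || a.i - b.i)
    .map((x) => x.h);
}
```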
Acceptance Criteria
- Document summary generation no longer depends on "last chunk" execution order.
- No last-chunk-only degradation path when allChunks is missing.
- CLI clearly communicates task granularity and doc-level progress.
- Query quality benchmark for long documents shows no regression; target improvement in top-k relevance/snippet usefulness.
Non-goals
- No full metadata duplication from document to all chunks.
- No breaking API changes for existing query/ingest clients.
Why now
Recent bookmark ingestion/enrichment runs highlighted user confusion (URLs vs chunks) and surfaced the architectural coupling around last-chunk document extraction. Addressing this now reduces correctness risk and improves operational clarity.