Design: Hybrid enrichment model (document-level canonical + chunk-level local signals) #86

@mfittko

Summary

Enrichment is currently queued per chunk, but document-summary generation is triggered only by the task that processes the last chunk. This ties document-level quality to queue and task mechanics, which makes it fragile.

We should formalize a hybrid model:

  • Document-level metadata/summary as canonical truth
  • Chunk-level metadata/summary as optional retrieval/ranking signal

Current Status Quo (as implemented)

  • Enrichment is queued and tracked per chunk (task_queue + chunks.enrichment_status).
  • The worker runs tier-2 for every chunk, but runs tier-3 document extraction only when chunkIndex == totalChunks - 1.
  • If allChunks is missing, tier-3 may run with only last-chunk text.
  • API strips summary fields from chunk tier3_meta and stores summary variants on documents.
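
To make the coupling concrete, the gating described above can be sketched roughly as follows. This is a simplified illustration, not the actual worker code: the ChunkTask shape and the function names are hypothetical, and only chunkIndex, totalChunks, and allChunks come from the description above.

```typescript
// Hypothetical task shape; everything except chunkIndex, totalChunks,
// and allChunks is an assumption made for illustration.
interface ChunkTask {
  chunkIndex: number;
  totalChunks: number;
  text: string;         // text of the chunk this task processes
  allChunks?: string[]; // texts of all chunks in the document; may be absent
}

// Tier-3 runs only in the task that holds the last chunk.
function shouldRunTier3(task: ChunkTask): boolean {
  return task.chunkIndex === task.totalChunks - 1;
}

// When allChunks is missing, the tier-3 input degrades to the last chunk's text.
function tier3Input(task: ChunkTask): string {
  return task.allChunks && task.allChunks.length > 0
    ? task.allChunks.join("\n")
    : task.text;
}
```

Both functions depend on a single task: if that task is never claimed, or arrives without allChunks, the document-level output is missing or degraded.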

Design Flaws / Risks

  1. Last-chunk coupling
    • Document-level output depends on a specific chunk task being claimed and completed.
  2. Input completeness risk
    • Missing allChunks can degrade document summaries (last-chunk-only fallback).
  3. Operational confusion
    • Users think in documents/URLs, but the queue and the force/clear operations work on chunk tasks.
  4. No localized semantic signal
    • Retrieval quality for long docs can improve with chunk-local summaries/keywords, but we currently only keep doc-level summaries.

Proposed Target Design

1) Keep document-level canonical outputs

  • Keep documents.summary_short|summary_medium|summary_long|summary as source of truth.
  • Keep global entities/relationships at document graph level.

2) Add lightweight chunk-level semantic signals

  • Add optional chunk-local fields (e.g. in chunks.tier3_meta):
    • chunk_summary_short (1 sentence)
    • chunk_keywords (small list)
  • Do not duplicate full document summary into every chunk.
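
A minimal sketch of what the chunk-local payload could look like. The field names chunk_summary_short and chunk_keywords come from the proposal; the ChunkTier3Meta type, the toChunkMeta helper, and the keyword cap of 8 are assumptions for illustration only.

```typescript
// Possible shape for chunks.tier3_meta; other keys are passed through untouched.
interface ChunkTier3Meta {
  chunk_summary_short?: string; // one sentence
  chunk_keywords?: string[];    // small list
  [key: string]: unknown;
}

// Document-level summary fields that must not be copied into chunks.
const DOC_SUMMARY_KEYS = ["summary", "summary_short", "summary_medium", "summary_long"];

// Assumed cap to keep the payload small; the proposal only says "small list".
const MAX_KEYWORDS = 8;

// Build chunk metadata from a raw extraction result: drop document-level
// summaries, keep chunk-local signals, and trim the keyword list.
function toChunkMeta(raw: Record<string, unknown>): ChunkTier3Meta {
  const meta: ChunkTier3Meta = {};
  for (const key of Object.keys(raw)) {
    if (DOC_SUMMARY_KEYS.indexOf(key) === -1) {
      meta[key] = raw[key];
    }
  }
  if (Array.isArray(meta.chunk_keywords)) {
    meta.chunk_keywords = meta.chunk_keywords.slice(0, MAX_KEYWORDS);
  }
  return meta;
}
```

Keeping both fields optional means existing chunks without these signals remain valid.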

3) Decouple document-level extraction from "last chunk task"

  • Preferred: explicit document-level aggregation/extraction step that reads all chunks from DB.
  • Alternative (interim): guarantee allChunks presence + idempotent guard so document-level extraction runs once per doc and is retry-safe.
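
The preferred option could be sketched as below, under stated assumptions: the row shapes, the totalChunks and summaryInputHash columns, and the enrichDocument name are all hypothetical, an in-memory array stands in for the DB, and the summarize callback stands in for the tier-3 extraction call. The idempotency guard shown here (a hash of the input text) is one possible choice, not necessarily the one the project will adopt.

```typescript
interface ChunkRow {
  documentId: string;
  chunkIndex: number;
  text: string;
}

interface DocumentRow {
  id: string;
  totalChunks: number;       // assumed to be known at ingest time
  summary?: string;
  summaryInputHash?: string; // hypothetical idempotency marker
}

type Outcome = "extracted" | "skipped_incomplete" | "skipped_unchanged";

// Small deterministic hash (djb2 variant) used to detect unchanged input.
function hashText(text: string): string {
  let h = 5381;
  for (let i = 0; i < text.length; i++) {
    h = ((h * 33) ^ text.charCodeAt(i)) >>> 0;
  }
  return h.toString(16);
}

// Document-level step: reads every chunk of the document from the store,
// independent of which chunk task finished last.
function enrichDocument(
  doc: DocumentRow,
  store: ChunkRow[],
  summarize: (fullText: string) => string,
): Outcome {
  const own = store
    .filter((c) => c.documentId === doc.id)
    .sort((a, b) => a.chunkIndex - b.chunkIndex);

  // Never fall back to partial input.
  if (own.length !== doc.totalChunks) return "skipped_incomplete";

  const fullText = own.map((c) => c.text).join("\n");
  const inputHash = hashText(fullText);

  // Retry-safe: a second run over the same input does nothing.
  if (doc.summaryInputHash === inputHash) return "skipped_unchanged";

  doc.summary = summarize(fullText);
  doc.summaryInputHash = inputHash;
  return "extracted";
}
```

Because the step takes only a document and the chunk store, it can be triggered by any task, a scheduler, or a manual retry without changing the result.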

4) Clarify CLI semantics

  • Make the enrich output state explicitly that enqueue/clear counts refer to chunk tasks.
  • Add optional doc-level stats views (docs_total, docs_enriched, docs_partial).
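
One way the doc-level stats could be derived from per-chunk status is sketched below. The names docs_total, docs_enriched, and docs_partial come from the proposal; the row shape, the status value "enriched", the definition of "partial" (some but not all chunks enriched), and the output wording are assumptions.

```typescript
interface ChunkStatusRow {
  documentId: string;
  enrichment_status: string; // "enriched" is an assumed status value
}

interface DocStats {
  docs_total: number;
  docs_enriched: number; // every chunk enriched
  docs_partial: number;  // at least one, but not all, chunks enriched
}

// Roll chunk-level status up to document-level counts.
function docStats(rows: ChunkStatusRow[]): DocStats {
  const perDoc: Record<string, { total: number; enriched: number }> = {};
  for (const row of rows) {
    const entry = perDoc[row.documentId] ?? { total: 0, enriched: 0 };
    entry.total += 1;
    if (row.enrichment_status === "enriched") entry.enriched += 1;
    perDoc[row.documentId] = entry;
  }

  const stats: DocStats = { docs_total: 0, docs_enriched: 0, docs_partial: 0 };
  for (const id of Object.keys(perDoc)) {
    const { total, enriched } = perDoc[id];
    stats.docs_total += 1;
    if (enriched === total) stats.docs_enriched += 1;
    else if (enriched > 0) stats.docs_partial += 1;
  }
  return stats;
}

// Hypothetical output line that names the unit of each count.
function formatEnrichOutput(enqueuedChunkTasks: number, stats: DocStats): string {
  return (
    `Enqueued ${enqueuedChunkTasks} chunk tasks ` +
    `(docs: ${stats.docs_total} total, ${stats.docs_enriched} enriched, ${stats.docs_partial} partial)`
  );
}
```

Documents with no enriched chunks count toward docs_total only, so the three numbers do not need to add up.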

Rollout Plan

Phase 1 (safe MVP)

  • Add chunk-local summary/keywords fields (optional, small payload).
  • Keep current doc-level summary writes unchanged.
  • Update CLI/help text to explicitly call out chunk-task semantics.

Phase 2

  • Introduce document-level enrichment step decoupled from last chunk.
  • Make it idempotent and retry-safe.

Phase 3

  • Use chunk-local signals in ranking/snippet generation behind a feature flag.
  • Measure the impact on retrieval quality before enabling the flag by default.
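
As an illustration of Phase 3, chunk-local keywords could adjust ranking behind a flag roughly as follows. This is only a sketch: the ScoredChunk and RankOptions shapes, the additive boost, and its default weight of 0.05 are assumptions, and the real scoring function may look quite different.

```typescript
interface ScoredChunk {
  id: string;
  score: number;             // base retrieval score, e.g. vector similarity
  chunk_keywords?: string[]; // chunk-local signal, may be absent
}

interface RankOptions {
  useChunkSignals: boolean;  // the feature flag
  keywordBoost?: number;     // assumed additive weight per matching keyword
}

// Rank chunks by base score; with the flag on, add a small boost for each
// query term that matches one of the chunk's keywords.
function rankChunks(query: string, chunks: ScoredChunk[], opts: RankOptions): ScoredChunk[] {
  const boost = opts.keywordBoost ?? 0.05;
  const terms = query.toLowerCase().split(/\s+/).filter((t) => t.length > 0);

  const adjusted = chunks.map((c) => {
    if (!opts.useChunkSignals || !c.chunk_keywords) return c;
    const keywords = c.chunk_keywords.map((k) => k.toLowerCase());
    const hits = terms.filter((t) => keywords.indexOf(t) !== -1).length;
    return { ...c, score: c.score + boost * hits };
  });

  // map() produced a new array, so sorting does not mutate the caller's input.
  return adjusted.sort((a, b) => b.score - a.score);
}
```

With the flag off the function reduces to a plain sort by base score, so the two modes can be compared on the same benchmark queries.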

Acceptance Criteria

  • Document summary generation no longer depends on "last chunk" execution order.
  • No last-chunk-only degradation path when allChunks is missing.
  • CLI clearly communicates task granularity and doc-level progress.
  • Query quality benchmark for long documents shows no regression; target improvement in top-k relevance/snippet usefulness.

Non-goals

  • No full metadata duplication from document to all chunks.
  • No breaking API changes for existing query/ingest clients.

Why now

Recent bookmark ingestion/enrichment runs highlighted user confusion (URLs vs chunks) and surfaced the architectural coupling around last-chunk document extraction. Addressing this now reduces correctness risk and improves operational clarity.
