
EPIC: Parent/chunk data model for YouTube videos (Option C) #78

@krisoye

Problem

The KB stores each YouTube transcript chunk as a separate ChromaDB document with its own source ID (e.g., youtube_IDRbItj4RGg_chunk_0 through _chunk_40), so a single 63-minute video becomes 41 separate "sources." This inflates stats (the 1,067 youtube_video "sources" are really ~50-70 unique videos) and clutters the admin dashboard, which shows 41 rows for that one video.
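To make the inflation concrete, here is a minimal sketch that groups chunk source IDs by video. It assumes only the `youtube_<video_id>_chunk_<n>` ID shape shown above; the function name `count_chunks_per_video` is hypothetical and is not part of the KB code.

```python
import re
from collections import Counter

# Assumed ID shape, taken from the example above: youtube_<video_id>_chunk_<n>.
# The greedy ".+" lets video IDs contain "_" or "-" themselves.
CHUNK_ID = re.compile(r"^youtube_(?P<video_id>.+)_chunk_(?P<index>\d+)$")


def count_chunks_per_video(source_ids):
    """Return a Counter mapping video ID -> number of chunk documents."""
    counts = Counter()
    for source_id in source_ids:
        match = CHUNK_ID.match(source_id)
        if match:
            counts[match.group("video_id")] += 1
    return counts


# The 63-minute video from the example: chunk_0 through chunk_40.
ids = [f"youtube_IDRbItj4RGg_chunk_{i}" for i in range(41)]
counts = count_chunks_per_video(ids)
print(len(ids), "chunk documents ->", len(counts), "unique video")
```

Run against the real list of youtube_video source IDs, the same grouping would give the true number of unique videos behind the 1,067 figure.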

Solution: Option C — Parent/Chunk Data Model

Introduce a parent source document per video (metadata only, no embedding), link its chunks via related_source_ids (as books already do with chapters), and update stats, listing, and display to count parents, not chunks.

Existing Pattern to Follow

BookExtractor already implements parent + chapter via extract_multi() — one parent ExtractionResult plus per-chapter results linked through related_source_ids. YouTube should follow this same pattern.
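The shape of that pattern applied to a video might look like the sketch below. It is an illustration under stated assumptions, not the BookExtractor or VectorKB code: the `SourceDoc` class, the `build_video_docs` helper, the parent ID scheme `youtube_<video_id>`, the `youtube_video_chunk` type name, and the two-way linking are all hypothetical. Only the `related_source_ids` field name and the chunk ID shape come from this issue.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class SourceDoc:
    """Hypothetical stand-in for one stored source document."""
    source_id: str
    source_type: str
    text: Optional[str] = None  # None means metadata only, nothing to embed
    related_source_ids: List[str] = field(default_factory=list)


def build_video_docs(video_id: str, chunk_texts: List[str]) -> Tuple[SourceDoc, List[SourceDoc]]:
    """Build one parent document plus one linked document per transcript chunk."""
    parent_id = f"youtube_{video_id}"  # assumed parent ID scheme
    chunks = [
        SourceDoc(
            source_id=f"{parent_id}_chunk_{i}",
            source_type="youtube_video_chunk",  # assumed type name for chunks
            text=text,
            related_source_ids=[parent_id],  # chunk points back at its parent
        )
        for i, text in enumerate(chunk_texts)
    ]
    parent = SourceDoc(
        source_id=parent_id,
        source_type="youtube_video",
        related_source_ids=[c.source_id for c in chunks],  # parent lists its chunks
    )
    return parent, chunks


parent, chunks = build_video_docs("IDRbItj4RGg", ["intro ...", "part two ..."])
```

Whether links should run in both directions, or only from parent to chunk as the book chapters may do, is a design choice for #79 to settle.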

Sub-Issues

Suggested Implementation Order

  1. KB Core: Add parent source document support to VectorKB #79 (VectorKB core changes — foundation for everything else)
  2. KB API: YouTube ingestion creates parent + linked chunks #80 (YouTube ingestion pipeline)
  3. KB API: Stats and list_sources respect parent/chunk model #81 (Stats and list_sources — enables accurate counts)
  4. MCP: Update tools for parent/chunk model #83 (MCP tool updates — expose new API surface to Claude)
  5. Data migration: Create parent documents for existing YouTube chunks #82 (Migration script: run against production after #79-#81 are deployed)
  6. krisoye/admin-dashboard#19 (Dashboard — run after KB API is deployed and migration complete)
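For step 5, one way to keep the migration safe to re-run is to derive each parent ID from its chunk IDs and skip any parent that already exists. The sketch below works on a plain dict standing in for the store; the `migrate` function, the dict layout, and the `youtube_<video_id>` parent ID scheme are assumptions for illustration, not the actual script for #82.

```python
import re

# A chunk ID is assumed to be its parent ID plus a "_chunk_<n>" suffix.
CHUNK_ID = re.compile(r"^(?P<parent_id>youtube_.+)_chunk_(?P<index>\d+)$")


def migrate(store):
    """Create a missing parent for each group of YouTube chunks.

    `store` maps source ID -> metadata dict. Returns the IDs of the
    parents created by this run, so a second run returns an empty list.
    """
    groups = {}
    for source_id in store:
        match = CHUNK_ID.match(source_id)
        if match:
            groups.setdefault(match.group("parent_id"), []).append(
                (int(match.group("index")), source_id)
            )

    created = []
    for parent_id, members in groups.items():
        if parent_id in store:
            continue  # parent already exists: leave it untouched
        store[parent_id] = {
            "source_type": "youtube_video",
            "related_source_ids": [sid for _, sid in sorted(members)],
        }
        created.append(parent_id)
    return created


store = {f"youtube_IDRbItj4RGg_chunk_{i}": {"source_type": "youtube_video"} for i in (2, 0, 1)}
first_run = migrate(store)
second_run = migrate(store)
```

Here `first_run` contains the one new parent ID and `second_run` is empty, which is the idempotency property the migration needs.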

Key Code Locations

  • src/vector_kb.py — ChromaDB interface, add_source() (~line 277), get_stats() (~line 994)
  • src/kb_server.py — YouTube ingest pipeline (lines 881-1180), chunk loop (lines 997-1064)
  • src/transcript_chunking.py — 2-min window chunking (lines 21-126)
  • src/kb_mcp_server.py — MCP tool wrappers
  • admin-dashboard/src/backends/knowledge_bank.py — KB API client
  • admin-dashboard/src/templates/kb/sources.html — list page
  • admin-dashboard/src/templates/kb/partials/source_row.html — row renderer

Acceptance Criteria

  • A 63-minute YouTube video produces 1 parent document + N chunk documents in ChromaDB
  • GET /stats reports ~50-70 youtube_video sources (not 1,067)
  • POST /list_sources returns parent documents by default (not chunks)
  • Admin dashboard shows one row per video with a chunk count badge
  • Migration script is idempotent and handles existing data
  • All existing tests pass; new tests cover parent/chunk round-trip
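The counting and listing criteria can be stated as a small executable check. This is a sketch over in-memory dicts, not the KB API: the `parent_id` metadata key, the `include_chunks` flag, and the function bodies are hypothetical, and the names `list_sources` and `get_stats` are borrowed from this issue only for readability.

```python
from collections import Counter


def list_sources(docs, include_chunks=False):
    """Return parents (and standalone sources) by default; chunks only on request."""
    return [d for d in docs if include_chunks or d.get("parent_id") is None]


def get_stats(docs):
    """Count sources per type, ignoring chunk documents."""
    return dict(Counter(d["source_type"] for d in list_sources(docs)))


def chunk_count(docs, parent_id):
    """Number of chunks linked to a parent, e.g. for a dashboard badge."""
    return sum(1 for d in docs if d.get("parent_id") == parent_id)


# One video with 41 chunks plus one unrelated source.
docs = [{"source_id": "youtube_IDRbItj4RGg", "source_type": "youtube_video"}]
docs += [
    {
        "source_id": f"youtube_IDRbItj4RGg_chunk_{i}",
        "source_type": "youtube_video",
        "parent_id": "youtube_IDRbItj4RGg",
    }
    for i in range(41)
]
docs.append({"source_id": "article_1", "source_type": "article"})

stats = get_stats(docs)
badge = chunk_count(docs, "youtube_IDRbItj4RGg")
```

With this data, `stats` reports one youtube_video source rather than 42, and `badge` is 41.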

Metadata

    Labels

    epic: Large multi-issue initiative
