Problem
The KB stores each YouTube transcript chunk as a separate ChromaDB document with its own source ID (e.g., youtube_IDRbItj4RGg_chunk_0 through youtube_IDRbItj4RGg_chunk_40). A single 63-minute video therefore becomes 41 separate "sources." This inflates stats (the 1,067 youtube_video "sources" are really ~50-70 unique videos) and clutters the admin-dashboard, which shows 41 rows per video.
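To make the inflation concrete, here is a minimal sketch that groups chunk source IDs back into videos. It assumes every chunk ID has the shape youtube_<video_id>_chunk_<n> shown in the example above; the helper names are hypothetical and not taken from the codebase.

```python
import re
from collections import Counter
from typing import Optional

# Assumed ID shape, based on the example ID: youtube_<video_id>_chunk_<n>.
# The greedy ".+" lets video IDs contain underscores or hyphens.
CHUNK_ID_RE = re.compile(r"^youtube_(?P<video_id>.+)_chunk_(?P<index>\d+)$")


def video_id_from_chunk_id(source_id: str) -> Optional[str]:
    """Return the video ID embedded in a chunk source ID, or None if it does not match."""
    match = CHUNK_ID_RE.match(source_id)
    return match.group("video_id") if match else None


def chunks_per_video(source_ids) -> Counter:
    """Count how many chunk documents belong to each video."""
    counts = Counter()
    for source_id in source_ids:
        video_id = video_id_from_chunk_id(source_id)
        if video_id is not None:
            counts[video_id] += 1
    return counts


# The 63-minute video from the example: chunk_0 through chunk_40.
ids = [f"youtube_IDRbItj4RGg_chunk_{i}" for i in range(41)]
counts = chunks_per_video(ids)
print(f"{len(ids)} chunk documents, {len(counts)} unique video(s)")
```

Running the same grouping over all youtube_video source IDs would give the real number of unique videos behind the 1,067 figure.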
Solution: Option C — Parent/Chunk Data Model
Introduce a parent source document per video (metadata only, no embedding), link chunks via related_source_ids (like books already do with chapters), and update stats/listing/display to count parents not chunks.
Existing Pattern to Follow
BookExtractor already implements parent + chapter via extract_multi() — one parent ExtractionResult plus per-chapter results linked through related_source_ids. YouTube should follow this same pattern.
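The parent/chunk linkage described above could look roughly like the sketch below. SourceRecord is a hypothetical stand-in, not the project's ExtractionResult, whose actual fields are not shown here. The parent_source_id metadata key, the chunk_count field, and linking in both directions are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SourceRecord:
    """Hypothetical stand-in for a KB source (not the real ExtractionResult)."""
    source_id: str
    source_type: str
    text: Optional[str] = None  # None for parents: metadata only, nothing to embed
    metadata: dict = field(default_factory=dict)
    related_source_ids: list = field(default_factory=list)


def build_video_records(video_id: str, title: str, chunk_texts: list):
    """Build one parent record plus one record per transcript chunk."""
    parent_id = f"youtube_{video_id}"
    chunks = [
        SourceRecord(
            source_id=f"{parent_id}_chunk_{i}",
            source_type="youtube_video",
            text=text,
            # parent_source_id is an assumed metadata key marking this record as a chunk.
            metadata={"chunk_index": i, "parent_source_id": parent_id},
            related_source_ids=[parent_id],
        )
        for i, text in enumerate(chunk_texts)
    ]
    parent = SourceRecord(
        source_id=parent_id,
        source_type="youtube_video",
        metadata={"title": title, "chunk_count": len(chunks)},
        related_source_ids=[chunk.source_id for chunk in chunks],
    )
    return parent, chunks


parent, chunks = build_video_records("IDRbItj4RGg", "Example talk", ["intro", "body", "outro"])
print(parent.source_id, "->", parent.related_source_ids)
```

Keeping the existing chunk IDs unchanged means a migration would only need to add parent documents and link metadata, though whether that matches the intended migration is not stated in the issue.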
Sub-Issues
- (src/vector_kb.py)
- (src/kb_server.py)
- (src/vector_kb.py, src/kb_server.py)
- (src/migrate_youtube_parents.py)
- (src/kb_mcp_server.py)
Suggested Implementation Order
Key Code Locations
- src/vector_kb.py: ChromaDB interface, add_source() (~line 277), get_stats() (~line 994)
- src/kb_server.py: YouTube ingest pipeline (lines 881-1180), chunk loop (lines 997-1064)
- src/transcript_chunking.py: 2-min window chunking (lines 21-126)
- src/kb_mcp_server.py: MCP tool wrappers
- admin-dashboard/src/backends/knowledge_bank.py: KB API client
- admin-dashboard/src/templates/kb/sources.html: list page
- admin-dashboard/src/templates/kb/partials/source_row.html: row renderer
Acceptance Criteria
- GET /stats reports ~50-70 youtube_video sources (not 1,067)
- POST /list_sources returns parent documents by default (not chunks)
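The two criteria above amount to filtering chunks out of counts and listings by default. The sketch below works on plain dicts rather than the real ChromaDB-backed code. The parent_source_id field and the include_chunks flag are assumptions for illustration, not existing parameters of get_stats() or /list_sources.

```python
from collections import Counter


def is_chunk(record: dict) -> bool:
    """Treat a record as a chunk if it names a parent (assumed metadata field)."""
    return record.get("parent_source_id") is not None


def list_sources(records, include_chunks: bool = False) -> list:
    """Return parent and standalone sources by default; chunks only on request."""
    if include_chunks:
        return list(records)
    return [r for r in records if not is_chunk(r)]


def stats_by_type(records) -> Counter:
    """Count sources per type, skipping chunk documents."""
    return Counter(r["source_type"] for r in records if not is_chunk(r))


# One parent video plus its 41 chunks, mirroring the 63-minute example.
parent_id = "youtube_IDRbItj4RGg"
records = [{"source_id": parent_id, "source_type": "youtube_video", "parent_source_id": None}]
records += [
    {
        "source_id": f"{parent_id}_chunk_{i}",
        "source_type": "youtube_video",
        "parent_source_id": parent_id,
    }
    for i in range(41)
]

print(stats_by_type(records)["youtube_video"], "video;", len(list_sources(records)), "row in the listing")
```

An opt-in flag keeps chunk-level access available for debugging while the default view matches what the dashboard should show.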