[EPIC] Knowledge Bank Tools - Extractors & Performance #2

@krisoye

Epic Overview

Complete the medium-priority code review items, add an audio transcription extractor, and implement performance enhancements for the Knowledge Bank platform.

Scope

  • Repository: knowledge-bank-tools
  • Status: v2.4.2 with YouTube ingestion, security hardening complete
  • Timeline: 5-7 days total effort

Current State

  • v2.4.2 operational with 53 sources ingested
  • Security fixes complete (CR-1, CR-2, CR-4, CR-5)
  • Medium priority items (CR-6 through CR-10) need implementation
  • LinkedIn personas integration complete (21 profiles)
  • VocabularyExtractor API operational (0.166s avg)

Objectives

  1. CR-6: Content length validation in extractors
  2. CR-7: Improved error context in VocabularyExtractor
  3. CR-8: Expanded stopwords list (60+ LinkedIn/resume noise words)
  4. CR-9: N-gram range validation with bounds checking
  5. CR-10: Server cleanup on shutdown (ChromaDB persistence)
  6. Audio Transcription Extractor: Process meeting audio files (.m4a, .mp3, .wav)
  7. Book Chapter-Aware Extraction: Split books by chapters with relationship modeling
  8. Performance Benchmarks: Throughput testing at scale (1K, 10K, 100K sources)
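
As a starting point for CR-6 and CR-9, the two validation checks could look roughly like the sketch below. It is not taken from the repository: the function names, the limits (`MAX_CONTENT_CHARS`, `MAX_NGRAM`), and the choice of `ValueError` are all assumptions for illustration, and the real extractors may need different bounds or exception types.

```python
# Hypothetical limits; the actual values should come from project configuration.
MIN_CONTENT_CHARS = 1
MAX_CONTENT_CHARS = 1_000_000
MAX_NGRAM = 5


def validate_content_length(text, min_chars=MIN_CONTENT_CHARS, max_chars=MAX_CONTENT_CHARS):
    """Return text unchanged if its stripped length is within bounds (CR-6 sketch)."""
    if not isinstance(text, str):
        raise TypeError(f"content must be str, got {type(text).__name__}")
    length = len(text.strip())
    if length < min_chars:
        raise ValueError(f"content too short: {length} chars (minimum {min_chars})")
    if length > max_chars:
        raise ValueError(f"content too long: {length} chars (maximum {max_chars})")
    return text


def validate_ngram_range(ngram_range, max_n=MAX_NGRAM):
    """Return (low, high) if 1 <= low <= high <= max_n (CR-9 sketch)."""
    try:
        low, high = ngram_range
    except (TypeError, ValueError) as exc:
        raise ValueError(f"ngram_range must be a pair, got {ngram_range!r}") from exc
    for n in (low, high):
        # bool is a subclass of int, so reject it explicitly.
        if not isinstance(n, int) or isinstance(n, bool):
            raise ValueError(f"ngram_range values must be integers, got {ngram_range!r}")
    if low < 1 or high > max_n or low > high:
        raise ValueError(
            f"ngram_range {ngram_range!r} out of bounds: need 1 <= low <= high <= {max_n}"
        )
    return (low, high)
```

Including the offending value and the permitted bounds in each message also fits the improved error context asked for in CR-7.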

Success Criteria

  • All CR-6 through CR-10 items complete with tests passing
  • Audio extractor handles meeting files in inbox
  • Performance benchmarks establish baseline metrics
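
One possible shape for the baseline benchmark is a small harness that times any per-source callable over N inputs. This is a sketch only: `measure_throughput` and `fake_extract` are hypothetical names, and the stand-in workload would need to be replaced with the real extractor to produce meaningful numbers.

```python
import time


def measure_throughput(process, sources):
    """Run process() over every source and report wall-clock throughput."""
    count = 0
    start = time.perf_counter()
    for source in sources:
        process(source)
        count += 1
    elapsed = time.perf_counter() - start
    rate = count / elapsed if elapsed > 0 else float("inf")
    return {"sources": count, "seconds": elapsed, "sources_per_second": rate}


def fake_extract(text):
    """Stand-in workload (hypothetical); swap in the real extractor call."""
    return len(text.split())


if __name__ == "__main__":
    # Scales taken from the objectives: 1K, 10K, 100K sources.
    for size in (1_000, 10_000, 100_000):
        docs = (f"synthetic source number {i}" for i in range(size))
        result = measure_throughput(fake_extract, docs)
        print(f"{size:>7} sources: {result['sources_per_second']:.0f} sources/s")
```

Recording the results per scale gives later optimization work a number to compare against.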

Related Issues

Will be linked as individual issues are created.

Reference

  • Architecture: knowledge-bank-tools/CLAUDE.md
  • Vocabulary API: knowledge-bank-tools/src/api/vocabulary_extraction.py
