CLAUDE.md

This file provides guidance to Claude Code when working with knowledge-bank-tools.


Repository Purpose

A ChromaDB-based vector database with a FastAPI server that provides semantic search across professional knowledge sources. It supports semantic similarity queries and credibility-weighted vocabulary extraction for resume generation.


Directory Structure

knowledge-bank-tools/
├── src/                       <- Source code
│   ├── kb_server.py           <- FastAPI server (port 8765)
│   ├── kb_mcp_server.py       <- MCP server wrapper (port 8767)
│   ├── kb_mcp_config.py       <- MCP server configuration (Pydantic Settings)
│   ├── vector_kb.py           <- ChromaDB interface
│   ├── batch_ingest.py        <- Batch ingestion processor
│   ├── kb_cli.py              <- CLI interface (click)
│   ├── cache.py               <- TTL/LRU caching (search + vocabulary)
│   ├── search_processing.py   <- Result deduplication and credibility ranking
│   ├── transcript_chunking.py <- YouTube transcript chunking utilities
│   ├── extractors/            <- Content extractors (one per source type)
│   │   ├── audio_analysis_extractor.py  <- Audio via audio-analysis-mcp
│   │   ├── blog_post_extractor.py       <- Blog posts
│   │   ├── book_extractor.py            <- Books (chapter-aware extraction)
│   │   ├── case_study_extractor.py      <- Case studies
│   │   ├── generic_pdf_extractor.py     <- PDFs via document-analysis-mcp
│   │   ├── github_extractor.py          <- GitHub repositories
│   │   ├── linkedin_extractor.py        <- LinkedIn profiles
│   │   ├── meeting_notes_extractor.py   <- Meeting notes
│   │   ├── personal_analysis_extractor.py <- Personal analysis documents
│   │   ├── research_paper_extractor.py  <- Research papers
│   │   ├── technical_article_extractor.py <- Technical articles
│   │   └── youtube_extractor.py         <- YouTube videos
│   ├── api/                   <- API modules (vocabulary extraction)
│   ├── client/                <- Client libraries
│   └── providers/             <- Provider interfaces
├── deploy/                    <- Deployment configuration
│   ├── knowledge-bank-tools.service  <- KB FastAPI systemd service
│   ├── knowledge-bank-mcp.service    <- MCP server systemd service
│   ├── .env.production               <- KB server env template
│   └── .env.mcp.production           <- MCP server env template
├── tests/                     <- Test suite
├── docs/                      <- Documentation
└── pyproject.toml             <- Package configuration

Environment Variables

Paths and server settings are configurable via environment variables, with the following defaults:

| Variable | Default | Description |
|---|---|---|
| KB_DATA_DIR | /var/lib/knowledge-bank-tools | Base directory for all KB data |
| KB_HOST | 127.0.0.1 | Server bind address |
| KB_PORT | 8765 | Server port |
| LOG_LEVEL | INFO | Logging verbosity |

Caching Variables (in .env):

| Variable | Default | Description |
|---|---|---|
| KB_CACHE_ENABLED | true | Enable/disable all caching |
| KB_CACHE_SEARCH_TTL | 60 | Search cache TTL in seconds |
| KB_CACHE_SEARCH_MAXSIZE | 64 | Max search cache entries |
| KB_CACHE_VOCAB_TTL | 600 | Vocabulary cache TTL in seconds (10 min) |
| KB_CACHE_VOCAB_MAXSIZE | 32 | Max vocabulary cache entries |

MCP Server Variables (in .env.mcp):

| Variable | Default | Description |
|---|---|---|
| KNOWLEDGE_BANK_MCP_HOST | 127.0.0.1 | MCP server bind address |
| KNOWLEDGE_BANK_MCP_PORT | 8767 | MCP server port |
| KB_API_URL | http://127.0.0.1:8765 | Upstream KB API URL |
| KB_MCP_REQUEST_TIMEOUT | 30 | Upstream request timeout (seconds) |
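
As a rough illustration of how the variables and defaults above fit together, the sketch below reads a subset of them with the standard library only. `KBSettings` and `load_settings` are hypothetical names introduced for this example, not the repository's code (the MCP server itself is configured through Pydantic Settings in kb_mcp_config.py), and the accepted spellings of a true boolean are an assumption.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class KBSettings:
    """Hypothetical container mirroring some of the documented variables."""
    data_dir: str
    host: str
    port: int
    log_level: str
    cache_enabled: bool
    cache_search_ttl: int
    cache_search_maxsize: int


def load_settings(env=None):
    """Read settings from a mapping, falling back to the documented defaults."""
    env = os.environ if env is None else env
    return KBSettings(
        data_dir=env.get("KB_DATA_DIR", "/var/lib/knowledge-bank-tools"),
        host=env.get("KB_HOST", "127.0.0.1"),
        port=int(env.get("KB_PORT", "8765")),
        log_level=env.get("LOG_LEVEL", "INFO"),
        # Assumption: "1", "true", and "yes" (any case) all enable caching.
        cache_enabled=env.get("KB_CACHE_ENABLED", "true").strip().lower()
        in {"1", "true", "yes"},
        cache_search_ttl=int(env.get("KB_CACHE_SEARCH_TTL", "60")),
        cache_search_maxsize=int(env.get("KB_CACHE_SEARCH_MAXSIZE", "64")),
    )


# Override one variable and keep the defaults for everything else.
settings = load_settings({"KB_PORT": "9000"})
print(settings.port, settings.data_dir)
```

Passing an explicit mapping instead of reading `os.environ` directly keeps the function easy to exercise in tests.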

Data directory structure (under $KB_DATA_DIR):

knowledge-bank-tools/        <- /var/lib/knowledge-bank-tools/
├── vector_db/               <- ChromaDB persistent storage
├── inbox/                   <- Files awaiting ingestion
└── archive/                 <- Processed files

Deployment (game-da-god)

Code: /home/deploy/prod/knowledge-bank-tools/ (read-only for agents)

Data: /var/lib/knowledge-bank-tools/ (read-write, owned by deploy via StateDirectory)

Services:

  • KB API: sudo systemctl status knowledge-bank-tools
  • MCP Server: sudo systemctl status knowledge-bank-mcp

Installation

# As deploy user
cd /home/deploy/prod/knowledge-bank-tools
python3 -m venv .venv
.venv/bin/pip install --upgrade pip
.venv/bin/pip install -e .

# Copy environment template
cp deploy/.env.production .env
# Edit .env with actual values if needed

# Create cache directories (required for ProtectHome=read-only)
# ML libraries (torch, huggingface, sentence-transformers, triton) write to ~/.cache/
# which is blocked by ProtectHome=read-only. These dirs redirect all cache writes.
sudo mkdir -p /var/cache/knowledge-bank-tools/{huggingface,torch,sentence-transformers,triton}
sudo chown -R deploy:deploy /var/cache/knowledge-bank-tools

# Data directory: /var/lib/knowledge-bank-tools/ is auto-created by
# StateDirectory=knowledge-bank-tools in the systemd service file, owned by deploy:deploy.
# For manual setup or first-time install, run deploy/install.sh.

# Install systemd service
sudo cp deploy/knowledge-bank-tools.service /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable knowledge-bank-tools
sudo systemctl start knowledge-bank-tools

MCP Server Installation

The MCP server is a lightweight wrapper that delegates to the KB HTTP API. It shares the same venv (all deps installed by pip install -e .) but has its own systemd service and env file.

# Copy MCP environment template
cp deploy/.env.mcp.production .env.mcp

# Install MCP systemd service
sudo cp deploy/knowledge-bank-mcp.service /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable knowledge-bank-mcp
sudo systemctl start knowledge-bank-mcp

# Register MCP client with Claude Code (one-time, runs as krisoye)
claude mcp add --transport http knowledge-bank http://localhost:8767/mcp -s user

Verification

# Check service status
sudo systemctl status knowledge-bank-tools
sudo systemctl status knowledge-bank-mcp

# Health checks
curl http://127.0.0.1:8765/health     # KB API
curl http://127.0.0.1:8767/health     # MCP server

# View logs
sudo journalctl -u knowledge-bank-tools -f
sudo journalctl -u knowledge-bank-mcp -f
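
The same health checks can be scripted from Python, for example in a deployment smoke test. This is a minimal sketch: it only checks for an HTTP 200 status and makes no assumption about the shape of the JSON body. `is_healthy` and `check_all` are hypothetical helper names, and the `opener` parameter exists only so the function can be tested without a running server.

```python
import urllib.error
import urllib.request

# Health URLs taken from the curl commands above.
ENDPOINTS = {
    "KB API": "http://127.0.0.1:8765/health",
    "MCP server": "http://127.0.0.1:8767/health",
}


def is_healthy(url, timeout=5.0, opener=urllib.request.urlopen):
    """Return True if GET <url> answers with HTTP 200, False on any error."""
    try:
        with opener(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Covers connection refused, timeouts, and non-2xx HTTP errors.
        return False


def check_all(endpoints=ENDPOINTS, opener=urllib.request.urlopen):
    """Return {service name: bool} for every configured endpoint."""
    return {name: is_healthy(url, opener=opener) for name, url in endpoints.items()}
```

Calling `check_all()` on the deployment host returns a dictionary such as `{"KB API": True, "MCP server": False}`, which points at the service whose logs need inspecting.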

Migrating from /home/krisoye/knowledge-bank

If upgrading from a pre-2026-02-06 deployment:

# Stop the service
sudo systemctl stop knowledge-bank-tools

# Copy data to new location
sudo cp -a /home/krisoye/knowledge-bank/* /var/lib/knowledge-bank-tools/
sudo chown -R deploy:deploy /var/lib/knowledge-bank-tools

# Verify and restart
sudo systemctl start knowledge-bank-tools
curl http://localhost:8765/health

Development

Agent Invocation

With an editable install, the CLI commands are available directly:

# CLI interface
.venv/bin/kb-cli search "query term"
.venv/bin/kb-cli list-sources
.venv/bin/kb-cli stats

# Batch ingestion
.venv/bin/kb-batch-ingest --inbox /path/to/data

# Server (for development)
.venv/bin/uvicorn src.kb_server:app --host 127.0.0.1 --port 8765

Development Workflow

# Create workspace
ws start knowledge-bank-tools <feature-name> --files "..." --description "..." --install

# Work in workspace
cd ~/wip/<session-id>/knowledge-bank-tools/
source .venv/bin/activate

# Run tests
ws test

Common Commands

| Command | Purpose |
|---|---|
| sudo systemctl start knowledge-bank-tools | Start KB API service |
| sudo systemctl start knowledge-bank-mcp | Start MCP server |
| sudo systemctl status knowledge-bank-tools | Check KB API status |
| sudo systemctl status knowledge-bank-mcp | Check MCP server status |
| curl http://127.0.0.1:8765/health | KB API health check |
| curl http://127.0.0.1:8765/cache_stats | Cache hit rates and sizes |
| curl http://127.0.0.1:8767/health | MCP server health check |
| python -m src.kb_mcp_server | Run MCP server locally |
| pytest tests/ | Run test suite |

Integration Points

Dependencies

  • ChromaDB - Vector database backend
  • sentence-transformers - Embedding models (CPU-only on game-da-god)
  • FastAPI - REST API server
  • torch - PyTorch backend for embeddings
  • FastMCP - MCP server framework (Streamable HTTP transport)
  • cachetools - TTL/LRU cache implementation for search and vocabulary results
  • audio-analysis-mcp (port 8420) - Audio transcription and diarization, consumed by AudioAnalysisExtractor and the /ingest_audio endpoint
  • document-analysis-mcp (port 8766) - PDF extraction, consumed by GenericPDFExtractor

Dependents

  • Claude Code - MCP tools accessible from any Claude Code session
  • career-navigator-tools - ResumeVocabularyProvider client
  • linkedin-jobs-automation - Profile ingestion
  • Librarian agent - Orchestrates knowledge ingestion workflows via MCP tools

API Endpoints (KB Server - port 8765)

  • GET /health - Health check with cache stats, initialization status, and source count
  • GET /cache_stats - Cache hit rates, sizes, and TTL configuration for search and vocabulary caches
  • POST /search - Semantic search with optional metadata filter, deduplication, and credibility ranking
  • POST /search_by_person - Person-clustered search (finds all sources by a specific person)
  • POST /search_by_domain - Domain-filtered search with credibility/staleness quality filters
  • POST /list_sources - Metadata-only enumeration using ChromaDB where clause (no semantic search)
  • POST /ingest - Add a new document to the knowledge base
  • POST /ingest_youtube - Ingest YouTube videos by URL (yt-dlp + transcript extraction pipeline)
  • POST /ingest_audio - Ingest audio files via audio-analysis-mcp (transcription + diarization pipeline)
  • GET /show/{source_id} - Retrieve full content and metadata for a specific source
  • GET /related/{source_id} - Find sources related through relationship graph traversal
  • DELETE /source/{source_id} - Delete a source from the database
  • PATCH /source/{source_id} - Update metadata fields for an existing source
  • GET /stats - Database statistics (counts by type, domain, credibility, staleness, person)
  • POST /api/vocabulary/extract - Role-specific vocabulary extraction for resume generation
  • POST /api/vocabulary/phrases - Context-specific phrase extraction (resume_bullet, cover_letter, linkedin_summary)
  • POST /api/vocabulary/compare - Vocabulary comparison between two roles for career transition analysis
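
For orientation, here is a hedged sketch of calling POST /search from Python using only the standard library. The body field names (`query`, `n_results`, `filter`) are assumptions inferred from the cache-key description in this file and may not match the actual request model in src/kb_server.py. `build_search_request` is a hypothetical helper, and the `source_type` filter key is illustrative only.

```python
import json
import urllib.request

KB_API_URL = "http://127.0.0.1:8765"  # default KB_HOST/KB_PORT from this file


def build_search_request(query, n_results=5, metadata_filter=None, base_url=KB_API_URL):
    """Build (but do not send) a POST /search request.

    Assumption: the JSON body uses "query", "n_results", and "filter" keys.
    """
    body = {"query": query, "n_results": n_results}
    if metadata_filter:
        body["filter"] = metadata_filter
    return urllib.request.Request(
        url=f"{base_url}/search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_search_request("vector database tuning", n_results=3)
print(request.get_method(), request.full_url)

# To send it against a running server:
#   with urllib.request.urlopen(request, timeout=30) as response:
#       results = json.load(response)
```

Separating request construction from sending makes it possible to inspect the payload before it reaches the server.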

MCP Tools (MCP Server - port 8767)

| Tool | Description |
|---|---|
| health_check | Server health and upstream KB API connectivity |
| search | Semantic search with optional source type filter |
| search_with_filter | Metadata-filtered search or enumeration (semantic + filter mode, or filter-only mode). Supports ChromaDB $eq, $ne, $in, $and, $or, $gt, $lt. Filter-only mode (no query) enumerates sources matching the where clause without semantic ranking. YouTube top-level filterable fields: channel_name, video_views, video_duration. |
| search_by_domain | Domain-filtered search with credibility/staleness filters |
| search_by_person | Person-clustered search (persona profiles) |
| extract_vocabulary | Role-specific vocabulary extraction for resumes |
| extract_phrases | Context-specific phrase extraction |
| compare_vocabulary | Vocabulary comparison between two roles |
| ingest_source | Add new knowledge sources to the database |
| update_source_metadata | Update metadata fields for an existing source (person, credibility, domains, topics, etc.) |
| ingest_youtube | Ingest YouTube videos by URL (max 20 per request; handles yt-dlp, transcript, credibility scoring) |
| ingest_audio | Ingest audio files by path (max 10 per request; delegates transcription to audio-analysis-mcp) |
| get_stats | Database statistics (counts by type, domain, credibility, staleness, person) |
| get_source | Retrieve full content and metadata for a specific source by ID |
| list_related_sources | Find related sources via relationship graph traversal (depth 1-3, inbound/outbound/both) |

  • MCP endpoint: http://localhost:8767/mcp
  • Health endpoint: http://localhost:8767/health
  • Transport: Streamable HTTP (stateless)
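
The where clauses accepted by search_with_filter follow ChromaDB's metadata filter syntax. The sketch below builds one from small helper functions. `compare` and `combine` are hypothetical names for this example, and the channel name and view count are made-up values. ChromaDB expects `$and` and `$or` to carry at least two operands, so a single clause is returned as-is.

```python
COMPARISON_OPERATORS = {"$eq", "$ne", "$gt", "$lt", "$in"}
LOGICAL_OPERATORS = {"$and", "$or"}


def compare(field, operator, value):
    """Build a single-field clause such as {"video_views": {"$gt": 10000}}."""
    if operator not in COMPARISON_OPERATORS:
        raise ValueError(f"unsupported comparison operator: {operator!r}")
    return {field: {operator: value}}


def combine(operator, *clauses):
    """Join clauses with $and or $or; a lone clause is returned unchanged."""
    if operator not in LOGICAL_OPERATORS:
        raise ValueError(f"unsupported logical operator: {operator!r}")
    if not clauses:
        raise ValueError("at least one clause is required")
    if len(clauses) == 1:
        return clauses[0]
    return {operator: list(clauses)}


# Videos from one channel with more than 10,000 views (illustrative values).
where = combine(
    "$and",
    compare("channel_name", "$eq", "Example Channel"),
    compare("video_views", "$gt", 10000),
)
print(where)
```

In filter-only mode the same dictionary is sent without a query, which enumerates matching sources instead of ranking them semantically.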


Recent Features

Query and Vocabulary Caching (PR #65)

Added TTL-based caching for search results and vocabulary extraction to reduce compute overhead:

  • Search cache: 60-second TTL, max 64 entries. Cache keys include query text, n_results, and filter dict. Invalidated automatically after any ingestion or deletion.
  • Vocabulary cache: 10-minute TTL, max 32 entries. Cache keys are normalized by role, industry, credibility, and n_gram_range.
  • Cache statistics are exposed via GET /cache_stats and included in the GET /health response.
  • All cache settings are configurable via KB_CACHE_* environment variables (see the Environment Variables section).
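
To make the caching behavior concrete, here is a minimal standard-library sketch of a TTL cache with LRU eviction and a normalized search key. It is a stand-in for illustration, not the code in src/cache.py, which builds on cachetools. `TTLCache` and `search_cache_key` as written here are hypothetical, as is the `source_type` filter key in the usage lines.

```python
import json
import time
from collections import OrderedDict


def search_cache_key(query, n_results, metadata_filter=None):
    """Build a hashable key from the query text, n_results, and filter dict."""
    # sort_keys makes equal filters with different key order share one entry.
    return (query, n_results, json.dumps(metadata_filter or {}, sort_keys=True))


class TTLCache:
    """Tiny TTL + LRU cache used only for this sketch."""

    def __init__(self, maxsize=64, ttl=60.0, clock=time.monotonic):
        self.maxsize, self.ttl, self.clock = maxsize, ttl, clock
        self._data = OrderedDict()  # key -> (expires_at, value)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self.clock() >= expires_at:
            del self._data[key]  # entry outlived its TTL
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return value

    def set(self, key, value):
        self._data[key] = (self.clock() + self.ttl, value)
        self._data.move_to_end(key)
        while len(self._data) > self.maxsize:
            self._data.popitem(last=False)  # evict the least recently used entry

    def clear(self):
        """Drop everything, e.g. after an ingestion or deletion."""
        self._data.clear()


cache = TTLCache(maxsize=64, ttl=60)
key = search_cache_key("python", 5, {"source_type": "book"})
cache.set(key, ["cached result"])
print(cache.get(key))
cache.clear()  # simulate invalidation after an ingest
print(cache.get(key))
```

Clearing the whole cache on every write is coarse but simple: with a 60-second TTL and at most 64 entries, recomputing a few searches costs less than tracking which entries a new document would affect.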

New Extractors: Blog Posts, Technical Articles, Case Studies (PR #68)

Added three new content-type extractors:

  • BlogPostExtractor - Extracts blog posts with author, publication date, tags, and reading time estimation
  • TechnicalArticleExtractor - Handles technical articles with code block detection and section-aware chunking
  • CaseStudyExtractor - Extracts case studies with problem/solution structure and outcome identification

All three follow the BaseExtractor interface and integrate with the ingest_source MCP tool via the source_type field.
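
The dispatch from a source_type value to an extractor class could look roughly like the sketch below. Everything in it is hypothetical: the real BaseExtractor interface and its method signatures are defined in the repository and are not reproduced here, the `"blog_post"` string is an assumed source_type value, and the 200 words-per-minute reading speed is an arbitrary choice.

```python
from abc import ABC, abstractmethod


class BaseExtractor(ABC):
    """Hypothetical stand-in for the repository's extractor interface."""

    source_type: str

    @abstractmethod
    def extract(self, text: str) -> dict:
        """Return extracted metadata for one document."""


class BlogPostExtractor(BaseExtractor):
    source_type = "blog_post"  # assumed value of the source_type field

    def extract(self, text):
        word_count = len(text.split())
        # Assumption: reading time estimated at 200 words per minute.
        return {
            "source_type": self.source_type,
            "reading_time_min": max(1, round(word_count / 200)),
        }


# Registry keyed by source_type; new extractors are added to the tuple.
EXTRACTORS = {cls.source_type: cls for cls in (BlogPostExtractor,)}


def run_extractor(source_type, text):
    """Look up the extractor for source_type and run it on text."""
    try:
        extractor_cls = EXTRACTORS[source_type]
    except KeyError:
        raise ValueError(f"unsupported source_type: {source_type!r}") from None
    return extractor_cls().extract(text)


print(run_extractor("blog_post", "word " * 400))
```

A registry keyed by source_type keeps the ingestion entry point unchanged when a new extractor is added.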

Book Chapter-Aware Extraction (PR #66)

BookExtractor now supports chapter-aware extraction via extract_multi():

  • Returns one parent ExtractionResult for the full book plus one per detected chapter
  • Chapter detection uses regex patterns for "Chapter N" and "Part N" headings and for standalone Roman numerals
  • Chapter relationships are stored in ChromaDB metadata for list_related_sources traversal
  • Use BookExtractor.extract_multi() (not extract()) to get per-chapter granularity
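
As an illustration of regex-based chapter detection, the sketch below splits a text on "Chapter N" and "Part N" headings and on standalone uppercase Roman numerals. The pattern is an assumption written for this example, not the one used by BookExtractor, and it is deliberately loose: any line consisting only of the letters I, V, X, L, C, D, M counts as a heading.

```python
import re

# Assumption: headings occupy their own line.
CHAPTER_HEADING = re.compile(
    r"^[ \t]*(?:"
    r"(?i:chapter|part)[ \t]+(?:\d+|[IVXLCDM]+)\b[^\n]*"  # "Chapter 3: Title", "Part II"
    r"|[IVXLCDM]+\.?[ \t]*"  # bare Roman numeral such as "IV."
    r")$",
    re.MULTILINE,
)


def split_chapters(text):
    """Return (heading, body) pairs, one per detected heading."""
    matches = list(CHAPTER_HEADING.finditer(text))
    chapters = []
    for index, match in enumerate(matches):
        # A chapter body runs until the next heading or the end of the text.
        end = matches[index + 1].start() if index + 1 < len(matches) else len(text)
        chapters.append((match.group().strip(), text[match.end():end].strip()))
    return chapters


book = (
    "Preface text\n"
    "Chapter 1: Origins\nFirst body.\n"
    "Part II\nSecond body.\n"
    "IV.\nThird body.\n"
)
for heading, body in split_chapters(book):
    print(heading, "->", body)
```

Text before the first heading (the preface here) is not returned as a chapter, which corresponds to keeping it only in the parent result for the full book.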

Troubleshooting

Server Connection Refused

sudo systemctl status knowledge-bank-tools
sudo journalctl -u knowledge-bank-tools --since "5 min ago"

Slow First Startup

The first startup downloads the sentence-transformers model, which takes about 100 seconds. Subsequent starts reuse the downloaded model and are faster.

Read-only File System Errors

If the service crashes with OSError: [Errno 30] Read-only file system: '/home/deploy/.cache/...':

# First, check the service logs for the exact error path
sudo journalctl -u knowledge-bank-tools --since "5 min ago" | grep "Read-only"

# Verify cache directories exist and are writable by deploy user
ls -la /var/cache/knowledge-bank-tools/
# Should show: huggingface, torch, sentence-transformers, triton

# If missing, recreate:
sudo mkdir -p /var/cache/knowledge-bank-tools/{huggingface,torch,sentence-transformers,triton}
sudo chown -R deploy:deploy /var/cache/knowledge-bank-tools

# Verify the service file has all cache Environment= directives:
grep -c 'Environment=' /etc/systemd/system/knowledge-bank-tools.service
# Should be 10+ (cache vars + venv + path)

# After fixing, reload and restart:
sudo systemctl daemon-reload
sudo systemctl restart knowledge-bank-tools
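
The same diagnosis can be scripted. The sketch below checks that each cache-related variable points at an existing, writable directory. The variable names come from the changelog entry on ProtectHome=read-only; `unwritable_cache_dirs` is a hypothetical helper, and it only gives a meaningful answer when run with the service's environment and user.

```python
import os

# Directory-valued cache variables named in the changelog.
CACHE_ENV_VARS = (
    "HF_HOME",
    "TORCH_HOME",
    "SENTENCE_TRANSFORMERS_HOME",
    "TRITON_CACHE_DIR",
    "XDG_CACHE_HOME",
    "TORCH_EXTENSIONS_DIR",
)


def unwritable_cache_dirs(env=None):
    """Return {variable: problem} for cache variables that would fail at runtime."""
    env = os.environ if env is None else env
    problems = {}
    for name in CACHE_ENV_VARS:
        path = env.get(name)
        if not path:
            problems[name] = "not set"
        elif not os.path.isdir(path):
            problems[name] = f"{path} is not a directory"
        elif not os.access(path, os.W_OK | os.X_OK):
            problems[name] = f"{path} is not writable"
    return problems


for name, problem in unwritable_cache_dirs().items():
    print(f"{name}: {problem}")
```

An empty result means every variable is set and writable; any variable reported as "not set" is a candidate for the missing Environment= directive.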

Embedding Model Not Found

# Re-download embedding model
.venv/bin/python -c "from sentence_transformers import SentenceTransformer; SentenceTransformer('all-MiniLM-L6-v2')"

Audio Ingestion Fails

The /ingest_audio endpoint delegates transcription to audio-analysis-mcp (port 8420). Verify it is running:

sudo systemctl status audio-analysis-mcp
curl http://127.0.0.1:8420/health

If audio-analysis-mcp is down, audio ingestion will fail with a connection error. The environment variable AUDIO_ANALYSIS_MCP_URL (default: http://127.0.0.1:8420) can be used to point to an alternate host.


Development Workflow

Authoritative Reference: See MULTI-AGENT-COLLABORATION.md

Current Priorities: See GitHub Issues


Meta

Template: CLAUDE-repo.template.md from project-tracker/templates/

Last validated: 2026-02-24


Changelog

| Date | Changes |
|---|---|
| 2026-02-24 | CLAUDE.md audit (issue #10): Added 4 missing MCP tools (update_source_metadata, ingest_youtube, ingest_audio, and list_related_sources, which was already present but undocumented in the old table). Added 7 missing API endpoints (/ingest_youtube, /ingest_audio, /show/{id}, /related/{id}, DELETE /source/{id}, /cache_stats, PATCH /source/{id}). Documented recent features: caching (PR #65), new extractors (PR #68), book chapter extraction (PR #66). Added KB_CACHE_* env vars. Expanded Dependencies section to include audio-analysis-mcp, document-analysis-mcp, cachetools, and Librarian agent. Updated extractors/ directory listing. Deleted deprecated kb-service script and knowledge-bank.service (Windows paths). |
| 2026-02-18 | Flattened packaging: moved [server], [mcp], [client] extras into base dependencies. pip install -e . now installs all runtime deps. Only [dev] extra remains. Matches pattern used by QRT/AAM/DAM repos. |
| 2026-02-07 | Added MCP server wrapper (kb_mcp_server.py, kb_mcp_config.py). Port 8767. Exposes 11 MCP tools wrapping the KB HTTP API. Includes systemd service, env template, registration instructions, and 41 tests. |
| 2026-02-06 | Migrated data directory from /home/krisoye/knowledge-bank to /var/lib/knowledge-bank-tools using systemd StateDirectory. Eliminates ownership conflicts between krisoye and deploy users. FHS-compliant. |
| 2026-02-06 | Added cache directory configuration for ProtectHome=read-only (HF_HOME, TORCH_HOME, SENTENCE_TRANSFORMERS_HOME, TRITON_CACHE_DIR, XDG_CACHE_HOME, TORCH_EXTENSIONS_DIR, TOKENIZERS_PARALLELISM). Added troubleshooting section for read-only filesystem errors. |
| 2026-02-06 | Replaced hardcoded paths with KB_DATA_DIR env var. Added deploy/ directory with systemd service and .env template. Updated for game-da-god deployment. |
| 2026-02-01 | Migrated to template structure. |
| 2026-01-26 | Updated workspace manager documentation from v2 to v3.1. |
| 2026-01-23 | Initial documentation. |