
Knowledge Bank — Semantic Search & Vector Database


A ChromaDB-powered vector database server for semantic search across diverse knowledge sources. Supports 17+ source types with credibility scoring, staleness tracking, and automated batch ingestion. Exposes a FastAPI REST server and an MCP interface for direct use from Claude and other MCP clients.


Features

  • Semantic search — Find content by meaning, not just keywords (sentence-transformers embeddings)
  • 17+ source types — Profiles, papers, videos, meeting notes, blog posts, and more
  • Multi-signal credibility scoring — Automatic authority weighting per source type (novice through expert)
  • Staleness-aware retrieval — Time-aware results with configurable staleness levels
  • Relationship modeling — Traverse connections between sources (depth 1–3, inbound/outbound)
  • Vocabulary extraction — Extract domain-specific terminology from source collections
  • Person clustering — Unified view of a person across LinkedIn, YouTube, GitHub, etc.
  • FastAPI REST server — Persistent server with automatic ChromaDB persistence and TTL caching
  • MCP integration — 15 tools exposed directly to Claude and other MCP clients
  • Automated batch ingestion — Intelligent routing across 12 content-type extractors

Architecture

The system has two server processes:

  • KB API (port 8765) — FastAPI server backed by ChromaDB. Handles all ingestion, search, and vocabulary operations. Sentence-transformers (all-MiniLM-L6-v2) generates embeddings on ingest and at query time. TTL/LRU caches reduce repeated embedding overhead.
  • MCP server (port 8767) — Lightweight FastMCP wrapper that proxies all operations to the KB API. Exposes MCP tools to Claude Code and other MCP clients over Streamable HTTP.

Content extractors (one per source type) parse raw files and produce structured metadata for ingestion. The batch processor auto-routes files in an inbox directory to the correct extractor.
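The TTL/LRU caching mentioned above can be sketched with a minimal cache wrapped around a stubbed embedding call. This is an illustrative toy using only the standard library; the real server uses cachetools, and the names here (TTLCache, embed) are not the server's actual internals:

```python
import time
from collections import OrderedDict

class TTLCache:
    """Minimal TTL + LRU cache (illustrative; the server uses cachetools)."""
    def __init__(self, maxsize=128, ttl=60.0):
        self.maxsize, self.ttl = maxsize, ttl
        self._data = OrderedDict()  # key -> (expires_at, value)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None or entry[0] < time.monotonic():
            self._data.pop(key, None)  # expired or missing
            return None
        self._data.move_to_end(key)  # LRU bump on hit
        return entry[1]

    def put(self, key, value):
        self._data[key] = (time.monotonic() + self.ttl, value)
        self._data.move_to_end(key)
        while len(self._data) > self.maxsize:
            self._data.popitem(last=False)  # evict least recently used

calls = 0
cache = TTLCache(ttl=60.0)

def embed(text):
    """Stand-in for the sentence-transformers encode call."""
    global calls
    cached = cache.get(text)
    if cached is not None:
        return cached
    calls += 1  # count how often the "expensive" path runs
    vec = [float(ord(c)) for c in text[:4]]  # fake embedding
    cache.put(text, vec)
    return vec

v1 = embed("hello")
v2 = embed("hello")  # served from cache; no second embedding call
```

Repeated queries within the TTL window skip the embedding model entirely, which is what keeps warm query latency low.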


MCP Tools

The MCP server exposes the following tools (registered at http://localhost:8767/mcp):

Tool Description
health_check Server health and upstream KB API connectivity
search Semantic search with optional source type filter
search_with_filter Metadata-filtered search or enumeration. Supports ChromaDB $eq, $ne, $in, $and, $or, $gt, $lt
search_by_domain Domain-filtered search with credibility and staleness quality filters
search_by_person Person-clustered search across all sources for a given person
extract_vocabulary Role-specific vocabulary extraction from the knowledge base
extract_phrases Context-specific phrase extraction (summaries, domain terminology, etc.)
compare_vocabulary Vocabulary comparison between two roles
ingest_source Add a new knowledge source to the database
update_source_metadata Update metadata fields for an existing source
ingest_youtube Ingest YouTube videos by URL (up to 20 per request)
ingest_audio Ingest audio files by path (delegates to audio-analysis-mcp)
get_stats Database statistics by type, domain, credibility, staleness, and person
get_source Retrieve full content and metadata for a source by ID
list_related_sources Traverse relationship graph to find related sources
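The operators accepted by search_with_filter follow ChromaDB's where-filter syntax. As a sketch of how such filters compose, the evaluator below implements the same subset of operators against plain dict metadata; it is a toy for illustration, not ChromaDB's implementation:

```python
def matches(where, meta):
    """Evaluate a ChromaDB-style where filter ($eq, $ne, $in, $gt, $lt,
    $and, $or) against a metadata dict. Illustrative only."""
    if "$and" in where:
        return all(matches(clause, meta) for clause in where["$and"])
    if "$or" in where:
        return any(matches(clause, meta) for clause in where["$or"])
    (field, cond), = where.items()          # single {field: {op: ref}} leaf
    op, ref = next(iter(cond.items()))
    val = meta.get(field)
    if op == "$eq":
        return val == ref
    if op == "$ne":
        return val != ref
    if op == "$in":
        return val in ref
    if op == "$gt":
        return val is not None and val > ref
    if op == "$lt":
        return val is not None and val < ref
    raise ValueError(f"unsupported operator: {op}")

# Example: recent research papers only
where = {"$and": [
    {"source_type": {"$eq": "research_paper"}},
    {"year": {"$gt": 2020}},
]}
recent = matches(where, {"source_type": "research_paper", "year": 2023})
stale = matches(where, {"source_type": "research_paper", "year": 2019})
```

A filter of this shape can be passed to search_with_filter to combine semantic ranking with hard metadata constraints.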

Quick Start

Prerequisites

  • Python 3.10+

Installation

git clone https://github.com/krisoye/knowledge-bank-tools.git
cd knowledge-bank-tools
pip install -e ".[dev]"

Start the server

# Start the KB API server (port 8765)
uvicorn src.kb_server:app --host 127.0.0.1 --port 8765

# In a separate terminal, start the MCP server (port 8767)
python -m src.kb_mcp_server

The KB API server will download the sentence-transformers embedding model on first startup (roughly 100 seconds). Subsequent starts are fast.

Register with Claude Code

claude mcp add --transport http knowledge-bank http://localhost:8767/mcp -s user

Production (systemd)

# Install KB API service
sudo cp deploy/knowledge-bank-tools.service /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable --now knowledge-bank-tools

# Install MCP service
sudo cp deploy/knowledge-bank-mcp.service /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable --now knowledge-bank-mcp

See deploy/.env.production and deploy/.env.mcp.production for environment variable templates.


REST API

The KB API server runs on port 8765 and provides the following key endpoints:

Health

curl http://localhost:8765/health

Semantic Search

curl -X POST http://localhost:8765/search \
  -H "Content-Type: application/json" \
  -d '{
    "query": "machine learning time series forecasting",
    "n_results": 10,
    "source_type": "research_paper"
  }'
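The same search can be issued from Python with the standard library alone. The helper below is a sketch assuming the server from the Quick Start is running locally; the request body mirrors the curl example above:

```python
import json
from urllib import request

def build_search_request(query, n_results=10, source_type=None):
    """Build the JSON body for POST /search."""
    payload = {"query": query, "n_results": n_results}
    if source_type is not None:
        payload["source_type"] = source_type
    return payload

def search(query, n_results=10, source_type=None,
           base="http://localhost:8765"):
    """POST a semantic-search request to a locally running KB API."""
    body = build_search_request(query, n_results, source_type)
    req = request.Request(
        f"{base}/search",
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)

body = build_search_request(
    "machine learning time series forecasting",
    n_results=10,
    source_type="research_paper",
)
```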

Vocabulary Extraction

curl -X POST http://localhost:8765/api/vocabulary/extract \
  -H "Content-Type: application/json" \
  -d '{
    "role": "Senior Data Scientist",
    "industry": "Fintech",
    "min_credibility": "practitioner",
    "top_n": 50
  }'

Ingest a Source

curl -X POST http://localhost:8765/ingest \
  -H "Content-Type: application/json" \
  -d '{
    "source_type": "blog_post",
    "content": "Full text of the article...",
    "metadata": {
      "title": "Introduction to Vector Databases",
      "author": "Jane Smith",
      "url": "https://example.com/vector-databases"
    }
  }'

Batch Ingestion

# Place files in the inbox directory, then run:
kb-batch-ingest --inbox /path/to/inbox/

Full API Endpoint Reference

Method Path Description
GET /health Health check with cache stats and source count
GET /cache_stats Cache hit rates, sizes, and TTL configuration
GET /stats Database statistics by type, domain, credibility
POST /search Semantic search with optional metadata filter
POST /search_by_person Person-clustered search across all their sources
POST /search_by_domain Domain-filtered search with quality filters
POST /list_sources Metadata enumeration without semantic ranking
POST /ingest Add a document to the knowledge base
POST /ingest_youtube Ingest YouTube videos by URL
POST /ingest_audio Ingest audio files via audio-analysis-mcp
GET /show/{source_id} Full content and metadata for a specific source
GET /related/{source_id} Related sources via relationship graph traversal
PATCH /source/{source_id} Update metadata fields for an existing source
DELETE /source/{source_id} Remove a source from the database
POST /api/vocabulary/extract Role-specific vocabulary extraction
POST /api/vocabulary/phrases Context-specific phrase extraction
POST /api/vocabulary/compare Vocabulary comparison between two roles

Supported Source Types

Profiles

  • linkedin_profile — Professional credentials and work history
  • github_profile — Code contributions and repository analysis
  • twitter_profile — Social media presence

Video

  • youtube_video — Transcripts with engagement metrics and credibility scoring

Documents

  • research_paper — arXiv, SSRN, journals with venue tier and citation tracking
  • book — Chapter-aware extraction with per-chapter relationship storage
  • case_study — Problem/solution structure with outcome identification
  • whitepaper — Technical documents

Conversations

  • meeting_notes — Participants, outcomes, and temporal context
  • meeting_audio — Transcribed audio (via audio-analysis-mcp)
  • interview_transcript — Formal interview transcripts

Social & Articles

  • reddit_discussion — Posts and threads
  • blog_post — Articles with author, publication date, and reading time
  • technical_article — Technical writing with code block detection

Personal

  • personal_analysis — Self-generated analysis documents
  • self_note — Personal notes

Automated Extractors

The batch ingestion system includes 12 content-type extractors. Each follows a common BaseExtractor interface and produces structured metadata for ingestion:

  1. LinkedIn extractor — PDF profiles with seniority, company tier, and credibility scoring
  2. Meeting notes extractor — Markdown meeting files with participant and outcome extraction
  3. Personal analysis extractor — Analysis documents with subject and purpose inference
  4. YouTube extractor — Credibility scoring from engagement, domain classification
  5. Research paper extractor — arXiv/SSRN support, venue tier, citation tracking
  6. GitHub extractor — Star counts, language detection, repository analysis
  7. Blog post extractor — Author, tags, and reading time estimation
  8. Technical article extractor — Section-aware chunking with code block detection
  9. Case study extractor — Problem/solution structure recognition
  10. Book extractor — Chapter-aware extraction (extract_multi() for per-chapter granularity)
  11. Audio analysis extractor — Delegates to audio-analysis-mcp for transcription and diarization
  12. Generic PDF extractor — Delegates to document-analysis-mcp for general PDF extraction
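The shared interface can be sketched as an abstract base class with one extract method per source type. The shape below is hypothetical (the real BaseExtractor's method names and signatures may differ); the reading-time heuristic mirrors what the blog post extractor is described as doing:

```python
from abc import ABC, abstractmethod

class BaseExtractor(ABC):
    """Illustrative sketch of the common extractor interface."""
    source_type: str

    @abstractmethod
    def extract(self, raw: str) -> dict:
        """Return content plus structured metadata ready for /ingest."""

class BlogPostExtractor(BaseExtractor):
    source_type = "blog_post"

    def extract(self, raw: str) -> dict:
        lines = raw.strip().splitlines()
        title = lines[0].lstrip("# ").strip()      # first line as title
        body = "\n".join(lines[1:]).strip()
        words = len(body.split())
        return {
            "source_type": self.source_type,
            "content": body,
            "metadata": {
                "title": title,
                "reading_time_min": max(1, round(words / 200)),  # ~200 wpm
            },
        }

doc = BlogPostExtractor().extract("# Hello Vectors\n\nShort body text here.")
```

The batch processor's job is then just routing: pick the extractor whose source_type matches the inbox file, call extract, and POST the result to /ingest.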

Configuration

Key environment variables (set in .env or systemd service):

Variable Default Description
KB_HOST 127.0.0.1 Server bind address
KB_PORT 8765 Server port
KB_DATA_DIR /var/lib/knowledge-bank-tools ChromaDB and inbox storage
KB_CACHE_ENABLED true Enable TTL/LRU caching
KB_CACHE_SEARCH_TTL 60 Search cache TTL in seconds
KB_CACHE_VOCAB_TTL 600 Vocabulary cache TTL in seconds
KNOWLEDGE_BANK_MCP_PORT 8767 MCP server port
KB_API_URL http://127.0.0.1:8765 KB API URL (used by MCP server)
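The real server loads these settings via pydantic-settings; the stdlib sketch below mirrors the documented variables and defaults, just to show how the table maps onto a config object:

```python
import os

def load_config(env=os.environ):
    """Read the KB settings with their documented defaults
    (illustrative; the server itself uses pydantic-settings)."""
    return {
        "host": env.get("KB_HOST", "127.0.0.1"),
        "port": int(env.get("KB_PORT", "8765")),
        "data_dir": env.get("KB_DATA_DIR", "/var/lib/knowledge-bank-tools"),
        "cache_enabled": env.get("KB_CACHE_ENABLED", "true").lower() == "true",
        "search_ttl": int(env.get("KB_CACHE_SEARCH_TTL", "60")),
        "vocab_ttl": int(env.get("KB_CACHE_VOCAB_TTL", "600")),
    }

defaults = load_config({})                      # nothing set: all defaults
custom = load_config({"KB_PORT": "9000", "KB_CACHE_ENABLED": "false"})
```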

Development

# Run tests
pytest tests/

# Lint
ruff check src/

# Format
ruff format src/

# Type check
mypy src/

Tech Stack

Component Library
Vector database ChromaDB
Embeddings sentence-transformers (all-MiniLM-L6-v2)
REST server FastAPI + uvicorn
MCP server FastMCP (Streamable HTTP)
PDF extraction pdfplumber, PyPDF2
YouTube ingestion yt-dlp, youtube-transcript-api
Data validation Pydantic + pydantic-settings
Caching cachetools (TTL/LRU)
CLI Click

License

MIT — see LICENSE.
