
Knowledge Bank — Semantic Search & Vector Database


A ChromaDB-powered vector database server for semantic search across diverse knowledge sources. Supports 17+ source types with credibility scoring, staleness tracking, and automated batch ingestion. Exposes a FastAPI REST server and an MCP interface for direct use from Claude and other MCP clients.


Features

  • Semantic search — Find content by meaning, not just keywords (sentence-transformers embeddings)
  • 17+ source types — Profiles, papers, videos, meeting notes, blog posts, and more
  • Multi-signal credibility scoring — Automatic authority weighting per source type (novice through expert)
  • Staleness-aware retrieval — Time-aware results with configurable staleness levels
  • Relationship modeling — Traverse connections between sources (depth 1–3, inbound/outbound)
  • Vocabulary extraction — Extract domain-specific terminology from source collections
  • Person clustering — Unified view of a person across LinkedIn, YouTube, GitHub, etc.
  • FastAPI REST server — Persistent server with automatic ChromaDB persistence and TTL caching
  • MCP integration — 15 tools exposed directly to Claude and other MCP clients
  • Automated batch ingestion — Intelligent routing across 12 content-type extractors

Architecture

The system has two server processes:

  • KB API (port 8765) — FastAPI server backed by ChromaDB. Handles all ingestion, search, and vocabulary operations. Sentence-transformers (all-MiniLM-L6-v2) generates embeddings on ingest and at query time. TTL/LRU caches reduce repeated embedding overhead.
  • MCP server (port 8767) — Lightweight FastMCP wrapper that proxies all operations to the KB API. Exposes MCP tools to Claude Code and other MCP clients over Streamable HTTP.

Content extractors (one per source type) parse raw files and produce structured metadata for ingestion. The batch processor auto-routes files in an inbox directory to the correct extractor.
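The TTL/LRU caching mentioned above can be sketched with a minimal cache wrapped around a stubbed embedding call. This is an illustrative toy using only the standard library; the real server uses cachetools, and the names here (TTLCache, embed) are not the server's actual internals:

```python
import time
from collections import OrderedDict

class TTLCache:
    """Minimal TTL + LRU cache (illustrative; the server uses cachetools)."""
    def __init__(self, maxsize=128, ttl=60.0):
        self.maxsize, self.ttl = maxsize, ttl
        self._data = OrderedDict()  # key -> (expires_at, value)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None or entry[0] < time.monotonic():
            self._data.pop(key, None)  # expired or missing
            return None
        self._data.move_to_end(key)  # LRU bump on hit
        return entry[1]

    def put(self, key, value):
        self._data[key] = (time.monotonic() + self.ttl, value)
        self._data.move_to_end(key)
        while len(self._data) > self.maxsize:
            self._data.popitem(last=False)  # evict least recently used

calls = 0
cache = TTLCache(ttl=60.0)

def embed(text):
    """Stand-in for the sentence-transformers encode call."""
    global calls
    cached = cache.get(text)
    if cached is not None:
        return cached
    calls += 1  # count how often the "expensive" path runs
    vec = [float(ord(c)) for c in text[:4]]  # fake embedding
    cache.put(text, vec)
    return vec

v1 = embed("hello")
v2 = embed("hello")  # served from cache; no second embedding call
```

Repeated queries within the TTL window skip the embedding model entirely, which is what keeps warm query latency low.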


MCP Tools

The MCP server exposes the following tools (registered at http://localhost:8767/mcp):

Tool Description
health_check Server health and upstream KB API connectivity
search Semantic search with optional source type filter
search_with_filter Metadata-filtered search or enumeration. Supports ChromaDB $eq, $ne, $in, $and, $or, $gt, $lt
search_by_domain Domain-filtered search with credibility and staleness quality filters
search_by_person Person-clustered search across all sources for a given person
extract_vocabulary Role-specific vocabulary extraction from the knowledge base
extract_phrases Context-specific phrase extraction (summaries, domain terminology, etc.)
compare_vocabulary Vocabulary comparison between two roles
ingest_source Add a new knowledge source to the database
update_source_metadata Update metadata fields for an existing source
ingest_youtube Ingest YouTube videos by URL (up to 20 per request)
ingest_audio Ingest audio files by path (delegates to audio-analysis-mcp)
get_stats Database statistics by type, domain, credibility, staleness, and person
get_source Retrieve full content and metadata for a source by ID
list_related_sources Traverse relationship graph to find related sources
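The operators accepted by search_with_filter follow ChromaDB's where-filter syntax. As a sketch of how such filters compose, the evaluator below implements the same subset of operators against plain dict metadata; it is a toy for illustration, not ChromaDB's implementation:

```python
def matches(where, meta):
    """Evaluate a ChromaDB-style where filter ($eq, $ne, $in, $gt, $lt,
    $and, $or) against a metadata dict. Illustrative only."""
    if "$and" in where:
        return all(matches(clause, meta) for clause in where["$and"])
    if "$or" in where:
        return any(matches(clause, meta) for clause in where["$or"])
    (field, cond), = where.items()          # single {field: {op: ref}} leaf
    op, ref = next(iter(cond.items()))
    val = meta.get(field)
    if op == "$eq":
        return val == ref
    if op == "$ne":
        return val != ref
    if op == "$in":
        return val in ref
    if op == "$gt":
        return val is not None and val > ref
    if op == "$lt":
        return val is not None and val < ref
    raise ValueError(f"unsupported operator: {op}")

# Example: recent research papers only
where = {"$and": [
    {"source_type": {"$eq": "research_paper"}},
    {"year": {"$gt": 2020}},
]}
recent = matches(where, {"source_type": "research_paper", "year": 2023})
stale = matches(where, {"source_type": "research_paper", "year": 2019})
```

A filter of this shape can be passed to search_with_filter to combine semantic ranking with hard metadata constraints.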

Quick Start

Prerequisites

  • Python 3.10+

Installation

git clone https://github.com/krisoye/knowledge-bank-tools.git
cd knowledge-bank-tools
pip install -e ".[dev]"

Start the server

# Start the KB API server (port 8765)
uvicorn src.kb_server:app --host 127.0.0.1 --port 8765

# In a separate terminal, start the MCP server (port 8767)
python -m src.kb_mcp_server

The KB API server will download the sentence-transformers embedding model on first startup (roughly 100 seconds). Subsequent starts are fast.

Register with Claude Code

claude mcp add --transport http knowledge-bank http://localhost:8767/mcp -s user

Production (systemd)

# Install KB API service
sudo cp deploy/knowledge-bank-tools.service /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable --now knowledge-bank-tools

# Install MCP service
sudo cp deploy/knowledge-bank-mcp.service /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable --now knowledge-bank-mcp

See deploy/.env.production and deploy/.env.mcp.production for environment variable templates.


REST API

The KB API server runs on port 8765 and provides the following key endpoints:

Health

curl http://localhost:8765/health

Semantic Search

curl -X POST http://localhost:8765/search \
  -H "Content-Type: application/json" \
  -d '{
    "query": "machine learning time series forecasting",
    "n_results": 10,
    "source_type": "research_paper"
  }'
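The same search can be issued from Python with the standard library alone. The helper below is a sketch assuming the server from the Quick Start is running locally; the request body mirrors the curl example above:

```python
import json
from urllib import request

def build_search_request(query, n_results=10, source_type=None):
    """Build the JSON body for POST /search."""
    payload = {"query": query, "n_results": n_results}
    if source_type is not None:
        payload["source_type"] = source_type
    return payload

def search(query, n_results=10, source_type=None,
           base="http://localhost:8765"):
    """POST a semantic-search request to a locally running KB API."""
    body = build_search_request(query, n_results, source_type)
    req = request.Request(
        f"{base}/search",
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)

body = build_search_request(
    "machine learning time series forecasting",
    n_results=10,
    source_type="research_paper",
)
```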

Vocabulary Extraction

curl -X POST http://localhost:8765/api/vocabulary/extract \
  -H "Content-Type: application/json" \
  -d '{
    "role": "Senior Data Scientist",
    "industry": "Fintech",
    "min_credibility": "practitioner",
    "top_n": 50
  }'

Ingest a Source

curl -X POST http://localhost:8765/ingest \
  -H "Content-Type: application/json" \
  -d '{
    "source_type": "blog_post",
    "content": "Full text of the article...",
    "metadata": {
      "title": "Introduction to Vector Databases",
      "author": "Jane Smith",
      "url": "https://example.com/vector-databases"
    }
  }'

Batch Ingestion

# Place files in the inbox directory, then run:
kb-batch-ingest --inbox /path/to/inbox/

Full API Endpoint Reference

Method Path Description
GET /health Health check with cache stats and source count
GET /cache_stats Cache hit rates, sizes, and TTL configuration
GET /stats Database statistics by type, domain, credibility
POST /search Semantic search with optional metadata filter
POST /search_by_person Person-clustered search across all their sources
POST /search_by_domain Domain-filtered search with quality filters
POST /list_sources Metadata enumeration without semantic ranking
POST /ingest Add a document to the knowledge base
POST /ingest_youtube Ingest YouTube videos by URL
POST /ingest_audio Ingest audio files via audio-analysis-mcp
GET /show/{source_id} Full content and metadata for a specific source
GET /related/{source_id} Related sources via relationship graph traversal
PATCH /source/{source_id} Update metadata fields for an existing source
DELETE /source/{source_id} Remove a source from the database
POST /api/vocabulary/extract Role-specific vocabulary extraction
POST /api/vocabulary/phrases Context-specific phrase extraction
POST /api/vocabulary/compare Vocabulary comparison between two roles

Supported Source Types

Profiles

  • linkedin_profile — Professional credentials and work history
  • github_profile — Code contributions and repository analysis
  • twitter_profile — Social media presence

Video

  • youtube_video — Transcripts with engagement metrics and credibility scoring

Documents

  • research_paper — arXiv, SSRN, journals with venue tier and citation tracking
  • book — Chapter-aware extraction with per-chapter relationship storage
  • case_study — Problem/solution structure with outcome identification
  • whitepaper — Technical documents

Conversations

  • meeting_notes — Participants, outcomes, and temporal context
  • meeting_audio — Transcribed audio (via audio-analysis-mcp)
  • interview_transcript — Formal interview transcripts

Social & Articles

  • reddit_discussion — Posts and threads
  • blog_post — Articles with author, publication date, and reading time
  • technical_article — Technical writing with code block detection

Personal

  • personal_analysis — Self-generated analysis documents
  • self_note — Personal notes

Automated Extractors

The batch ingestion system includes 12 content-type extractors. Each follows a common BaseExtractor interface and produces structured metadata for ingestion:

  1. LinkedIn extractor — PDF profiles with seniority, company tier, and credibility scoring
  2. Meeting notes extractor — Markdown meeting files with participant and outcome extraction
  3. Personal analysis extractor — Analysis documents with subject and purpose inference
  4. YouTube extractor — Credibility scoring from engagement, domain classification
  5. Research paper extractor — arXiv/SSRN support, venue tier, citation tracking
  6. GitHub extractor — Star counts, language detection, repository analysis
  7. Blog post extractor — Author, tags, and reading time estimation
  8. Technical article extractor — Section-aware chunking with code block detection
  9. Case study extractor — Problem/solution structure recognition
  10. Book extractor — Chapter-aware extraction (extract_multi() for per-chapter granularity)
  11. Audio analysis extractor — Delegates to audio-analysis-mcp for transcription and diarization
  12. Generic PDF extractor — Delegates to document-analysis-mcp for general PDF extraction
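The shared interface can be sketched as an abstract base class with one extract method per source type. The shape below is hypothetical (the real BaseExtractor's method names and signatures may differ); the reading-time heuristic mirrors what the blog post extractor is described as doing:

```python
from abc import ABC, abstractmethod

class BaseExtractor(ABC):
    """Illustrative sketch of the common extractor interface."""
    source_type: str

    @abstractmethod
    def extract(self, raw: str) -> dict:
        """Return content plus structured metadata ready for /ingest."""

class BlogPostExtractor(BaseExtractor):
    source_type = "blog_post"

    def extract(self, raw: str) -> dict:
        lines = raw.strip().splitlines()
        title = lines[0].lstrip("# ").strip()      # first line as title
        body = "\n".join(lines[1:]).strip()
        words = len(body.split())
        return {
            "source_type": self.source_type,
            "content": body,
            "metadata": {
                "title": title,
                "reading_time_min": max(1, round(words / 200)),  # ~200 wpm
            },
        }

doc = BlogPostExtractor().extract("# Hello Vectors\n\nShort body text here.")
```

The batch processor's job is then just routing: pick the extractor whose source_type matches the inbox file, call extract, and POST the result to /ingest.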

Configuration

Key environment variables (set in .env or systemd service):

Variable Default Description
KB_HOST 127.0.0.1 Server bind address
KB_PORT 8765 Server port
KB_DATA_DIR /var/lib/knowledge-bank-tools ChromaDB and inbox storage
KB_CACHE_ENABLED true Enable TTL/LRU caching
KB_CACHE_SEARCH_TTL 60 Search cache TTL in seconds
KB_CACHE_VOCAB_TTL 600 Vocabulary cache TTL in seconds
KNOWLEDGE_BANK_MCP_PORT 8767 MCP server port
KB_API_URL http://127.0.0.1:8765 KB API URL (used by MCP server)
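The real server loads these settings via pydantic-settings; the stdlib sketch below mirrors the documented variables and defaults, just to show how the table maps onto a config object:

```python
import os

def load_config(env=os.environ):
    """Read the KB settings with their documented defaults
    (illustrative; the server itself uses pydantic-settings)."""
    return {
        "host": env.get("KB_HOST", "127.0.0.1"),
        "port": int(env.get("KB_PORT", "8765")),
        "data_dir": env.get("KB_DATA_DIR", "/var/lib/knowledge-bank-tools"),
        "cache_enabled": env.get("KB_CACHE_ENABLED", "true").lower() == "true",
        "search_ttl": int(env.get("KB_CACHE_SEARCH_TTL", "60")),
        "vocab_ttl": int(env.get("KB_CACHE_VOCAB_TTL", "600")),
    }

defaults = load_config({})                      # nothing set: all defaults
custom = load_config({"KB_PORT": "9000", "KB_CACHE_ENABLED": "false"})
```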

Development

# Run tests
pytest tests/

# Lint
ruff check src/

# Format
ruff format src/

# Type check
mypy src/

Tech Stack

Component Library
Vector database ChromaDB
Embeddings sentence-transformers (all-MiniLM-L6-v2)
REST server FastAPI + uvicorn
MCP server FastMCP (Streamable HTTP)
PDF extraction pdfplumber, PyPDF2
YouTube ingestion yt-dlp, youtube-transcript-api
Data validation Pydantic + pydantic-settings
Caching cachetools (TTL/LRU)
CLI Click

License

MIT — see LICENSE.
