
Codebase Contextifier 9000

A Docker-based Model Context Protocol (MCP) server for semantic code search with AST-aware chunking, relationship tracking via a Neo4j graph database, local LLM support, and incremental indexing.

Features

  • AST-Aware Chunking: Uses tree-sitter to respect function and class boundaries, maintaining semantic integrity
  • Relationship Tracking: Neo4j graph database tracks function calls, imports, inheritance, and dependencies across your codebase
  • External Dependency Mapping: Automatically creates placeholder nodes for external functions (WordPress, npm packages, etc.)
  • Job-Based Indexing: Background indexing with progress tracking for large codebases
  • On-Demand Container Spawning: Index any repository on your system without manual mounting
  • Multi-Repository Search: Index and search across multiple projects with a shared backend
  • Real-Time Updates: File system watcher automatically re-indexes changed files (optional)
  • Local-First: All processing happens locally using Ollama for embeddings (no data leaves your machine)
  • Polyglot Support: Supports 10 programming languages: TypeScript, JavaScript, Python, PHP, Go, Rust, Java, C, C++, and C#
  • Incremental Indexing: Merkle tree-based change detection with 80%+ cache hit rates
  • Production-Grade: Uses Qdrant vector database for sub-10ms search latency and Neo4j for relationship queries
  • Dependency Knowledge Base: Special collection for indexing WordPress plugins, Composer packages, and npm modules
  • Flexible Deployment: Per-project or centralized server deployment options
  • MCP Integration: Works with Claude Desktop, Cursor, VS Code, and other MCP-compatible tools

Architecture

┌─────────────────────────────────────────────────────────────────┐
│  MCP Client (Claude Code, Claude Desktop, Cursor, etc.)         │
└──────────────────────────────┬──────────────────────────────────┘
                               │ MCP Protocol (stdio)
                               │
┌──────────────────────────────▼──────────────────────────────────┐
│  MCP Server Container (codebase-mcp-server)                     │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │  FastMCP Server - Exposes MCP Tools:                      │  │
│  │  • index_repository (spawns indexer containers)           │  │
│  │  • search_code (semantic search across all repos)         │  │
│  │  • find_usages, find_dependencies (graph queries)         │  │
│  │  • detect_dependencies, index_dependencies                │  │
│  │  • get_job_status, list_indexing_jobs, cancel_job        │  │
│  │  • get_symbols, get_indexing_status, health_check        │  │
│  └──────────────┬───────────────────────────────────────────┘  │
│                 │                                                │
│                 │ Spawns via Docker Socket                       │
│                 ▼                                                │
│  ┌─────────────────────────────────────────────────────┐       │
│  │  On-Demand Indexer Containers (ephemeral)           │       │
│  │  • Mounts any host directory                        │       │
│  │  • AST-aware chunking with tree-sitter              │       │
│  │  • Extracts relationships (CALLS, IMPORTS, etc.)    │       │
│  │  • Generates embeddings via Ollama                  │       │
│  │  • Updates shared Qdrant & Neo4j databases          │       │
│  │  • Reports progress back to MCP server              │       │
│  └──────────────────────┬──────────────────────────────┘       │
└─────────────────────────┼──────────────────────────────────────┘
                          │
          ┌───────────────┴───────────────────┐
          │                                   │
   ┌──────▼──────┐              ┌────────────▼────────┐
   │   Qdrant    │              │      Neo4j          │
   │  Container  │              │    Container        │
   │  (Vectors)  │              │  (Relationships)    │
   └──────┬──────┘              └──────┬──────────────┘
          │                            │
   ┌──────▼────────────────────────────▼───────────┐
   │  Persistent Docker Volumes:                   │
   │  • qdrant_data (vector DB)                    │
   │  • neo4j_data (graph DB)                      │
   │  • index_data (merkle trees)                  │
   │  • cache_data (embeddings cache)              │
   └───────────────────────────────────────────────┘

   ┌────────────────────────┐
   │  Ollama (Host)         │
   │  Embedding Model       │
   └────────────────────────┘

Key Architectural Features

  • Dual Database Architecture: Qdrant for semantic vector search, Neo4j for relationship graph queries
  • Container Orchestration: MCP server spawns lightweight indexer containers on-demand via Docker socket
  • Multi-Repository Support: Each repository gets its own merkle tree state, but shares the vector & graph databases
  • Shared Backend: All projects use the same Qdrant & Neo4j instances, enabling cross-repository search and relationship tracking
  • Job-Based Processing: Background jobs with progress tracking for large codebases
  • Content-Addressable Caching: Embeddings are cached by content hash, shared across all repositories
  • Relationship Extraction: AST-based extraction of CALLS, IMPORTS, EXTENDS, and IMPLEMENTS relationships
  • External Dependency Tracking: Automatic creation of placeholder nodes for unresolved function calls
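
To make the container orchestration above concrete, here is a rough sketch of how the MCP server might launch an ephemeral indexer through the Docker socket using the Docker SDK for Python. The image name, command, and volume layout are illustrative assumptions, not the project's actual values:

import docker

client = docker.from_env()  # talks to the Docker socket mounted into the MCP server container

def spawn_indexer(host_path: str, repo_name: str) -> str:
    """Launch an ephemeral indexer container that mounts the target repository read-only."""
    container = client.containers.run(
        image="codebase-indexer:latest",                      # assumed image name
        command=["python", "-m", "src.indexer", repo_name],   # assumed entrypoint
        volumes={host_path: {"bind": "/workspace", "mode": "ro"}},
        detach=True,
        auto_remove=True,  # ephemeral: the container removes itself when indexing finishes
    )
    return container.id

In the real system the spawned container also needs to reach the shared Qdrant and Neo4j services, typically by joining the same Docker Compose network.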

Quick Start

See QUICK_START.md for detailed setup instructions.

Prerequisites

  1. Docker Desktop (or Docker + Docker Compose)
  2. Ollama running locally with an embedding model:
    # Install Ollama: https://ollama.ai
    
    # Recommended: Google's Gemma embedding model (best quality)
    ollama pull embeddinggemma:latest
    
    # Alternative: Nomic Embed (faster, smaller)
    ollama pull nomic-embed-text

Two Deployment Options

Option A: Centralized Server (Recommended)

Best for: Indexing from the MCP server, querying across all repositories

# 1. Start the backend
cd codebase-contextifier-9000
docker-compose up -d

# 2. Configure Claude Desktop (see below)

# 3. Index any repository
# In Claude: "Index the repository at /Users/me/projects/my-app"

Option B: Per-Project Setup

Best for: Each project manages its own indexing

# 1. Start shared backend (once)
cd codebase-contextifier-9000
docker-compose up -d

# 2. Copy .mcp.json to each project
cp .mcp.json.template ~/projects/my-app/.mcp.json

# 3. Open project in Claude Code
cd ~/projects/my-app
claude-code .

See MULTI_PROJECT_SETUP.md for details.

Claude Desktop Configuration

For Centralized Server (Option A):

Add to ~/Library/Application Support/Claude/claude_desktop_config.json (macOS) or %APPDATA%\Claude\claude_desktop_config.json (Windows):

{
  "mcpServers": {
    "codebase-contextifier": {
      "command": "docker",
      "args": [
        "exec",
        "-i",
        "codebase-mcp-server",
        "python",
        "-m",
        "src.server"
      ]
    }
  }
}

For Per-Project Setup (Option B):

Just copy .mcp.json.template into your project directory as .mcp.json (as in step 2 of Option B above); no manual configuration is needed.

Usage

Once configured, you can use these tools in Claude Desktop or Claude Code:

Index any repository on your system:

Claude, index the repository at /Users/me/projects/my-app

The system spawns a container, indexes the repository in the background, and reports progress.

Monitor indexing progress:

Claude, show me the status of job abc123

Search for code across all indexed repositories:

Claude, search for "authentication logic" in the codebase

Search with filters:

Claude, search for "error handling" filtering by language=python and repo_name=my-api

Extract symbols from a file:

Claude, get all functions from /workspace/src/utils.py

Find all usages of a function (graph query):

Claude, find all places where authenticate_user is called

Find dependencies of a function (graph query):

Claude, show me all functions that processPayment depends on

Detect and index external dependencies:

Claude, detect available WordPress plugins in this project
Claude, index the woocommerce plugin into the knowledge base

Check system status:

Claude, show me the indexing status and list all jobs

MCP Tools

Indexing Tools

index_repository

Index a repository from any directory on your host machine by spawning a lightweight indexer container.

Parameters:

  • host_path (string, required): Absolute path on host machine to repository (e.g., /Users/me/projects/my-app)
  • repo_name (string, optional): Unique identifier for this repository (defaults to directory name)
  • incremental (bool): Use incremental indexing to only re-index changed files (default: true)
  • exclude_patterns (string, optional): Comma-separated glob patterns to exclude (e.g., "node_modules/*,dist/*")

Returns:

{
  "success": true,
  "job_id": "abc123def456",
  "repo_name": "my-app",
  "status": "queued",
  "message": "Background indexing started for 'my-app'"
}

Example:

# Index a WordPress site, excluding plugins and uploads
await index_repository(
    host_path="/Users/me/sites/my-wordpress",
    repo_name="my-wordpress",
    exclude_patterns="wp-content/plugins/*,wp-content/uploads/*,wp-includes/*"
)

get_job_status

Get the status and progress of an indexing job.

Parameters:

  • job_id (string, required): Job identifier returned from index_repository

Returns:

{
  "success": true,
  "job_id": "abc123def456",
  "repo_name": "my-app",
  "repo_path": "/Users/me/projects/my-app",
  "status": "running",
  "created_at": 1698765432.123,
  "started_at": 1698765433.456,
  "elapsed_seconds": 45.2,
  "progress": {
    "current_file": 45,
    "total_files": 100,
    "progress_pct": 45.0,
    "current_file_path": "/workspace/src/api/auth.py",
    "chunks_indexed": 234,
    "failed_files_count": 2,
    "cache_hit_rate": "35.50%"
  }
}

Status values: "queued", "running", "completed", "failed", "cancelled"
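
For example, a caller can poll get_job_status until the job reaches one of the terminal states above (a minimal sketch, written in the same call style as the index_repository example):

import asyncio

async def wait_for_job(job_id: str, poll_seconds: float = 5.0) -> dict:
    """Poll an indexing job until it completes, fails, or is cancelled."""
    while True:
        status = await get_job_status(job_id=job_id)
        progress = status.get("progress", {})
        print(f"{status['status']}: {progress.get('progress_pct', 0.0):.1f}%")
        if status["status"] in ("completed", "failed", "cancelled"):
            return status
        await asyncio.sleep(poll_seconds)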

list_indexing_jobs

List all indexing jobs (past and present).

Returns:

{
  "success": true,
  "total_jobs": 3,
  "jobs": [
    {
      "job_id": "abc123",
      "repo_name": "my-api",
      "status": "completed",
      "progress": { "progress_pct": 100.0, ... }
    },
    {
      "job_id": "def456",
      "repo_name": "frontend",
      "status": "running",
      "progress": { "progress_pct": 67.5, ... }
    }
  ]
}

cancel_indexing_job

Cancel a running indexing job.

Parameters:

  • job_id (string, required): Job identifier to cancel

Returns:

{
  "success": true,
  "message": "Job abc123 cancelled successfully"
}

Search Tools

search_code

Search code using natural language queries with semantic understanding across all indexed repositories.

Parameters:

  • query (string, required): Natural language search query (e.g., "authentication logic", "error handling")
  • limit (int): Maximum number of results to return (default: 10)
  • repo_name (string, optional): Filter by repository name (searches all repos if not specified)
  • language (string, optional): Filter by programming language (e.g., "python", "typescript", "php")
  • file_path_filter (string, optional): Filter by file path pattern (e.g., "src/components")
  • chunk_type (string, optional): Filter by chunk type (e.g., "function", "class", "method")

Returns:

{
  "success": true,
  "query": "authentication logic",
  "total_results": 5,
  "results": [
    {
      "rank": 1,
      "score": 0.8234,
      "repo_name": "backend-api",
      "file": "/workspace/src/auth/login.ts",
      "lines": "42-68",
      "language": "typescript",
      "type": "function",
      "context": "class:AuthService",
      "code": "async function authenticateUser(username, password) { ... }"
    }
  ]
}
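
An illustrative call combining several of the filters above:

results = await search_code(
    query="error handling",
    repo_name="my-api",       # optional: restrict to one repository
    language="python",        # optional: filter by language
    chunk_type="function",    # optional: only function-level chunks
    limit=5,
)
for hit in results["results"]:
    print(hit["score"], hit["file"], hit["lines"])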

get_symbols

Extract symbols from a file using AST parsing.

Parameters:

  • file_path (string): Path to source file
  • symbol_type (string, optional): Filter by type (e.g., "function", "class")

Returns:

{
  "success": true,
  "file_path": "/workspace/src/utils.py",
  "total_symbols": 15,
  "symbols": [
    {
      "name": "format_date",
      "type": "function_definition",
      "start_line": 42,
      "end_line": 58,
      "context": "N/A",
      "language": "python"
    }
  ]
}

Graph Query Tools

find_usages

Find all places where a function, class, or symbol is used across the codebase using the graph database.

Parameters:

  • symbol_name (string, required): Name of the function/class to find usages for
  • repo_name (string, optional): Filter by repository name

Returns:

{
  "success": true,
  "symbol_name": "authenticate_user",
  "total_usages": 12,
  "usages": [
    {
      "caller": "LoginController.handleLogin",
      "caller_file": "/workspace/src/controllers/login.ts",
      "line_number": 42,
      "relationship_type": "CALLS"
    }
  ]
}

find_dependencies

Find all functions, classes, or imports that a symbol depends on using the graph database.

Parameters:

  • symbol_name (string, required): Name of the function/class to analyze
  • repo_name (string, optional): Filter by repository name

Returns:

{
  "success": true,
  "symbol_name": "processPayment",
  "total_dependencies": 8,
  "dependencies": [
    {
      "target": "validateCard",
      "target_file": "/workspace/src/utils/validation.ts",
      "relationship_type": "CALLS",
      "is_external": false
    },
    {
      "target": "stripe.charges.create",
      "relationship_type": "CALLS",
      "is_external": true
    }
  ]
}
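
An illustrative pair of calls to the two graph tools above, separating internal from external dependencies:

usages = await find_usages(symbol_name="authenticate_user", repo_name="backend-api")
deps = await find_dependencies(symbol_name="processPayment")

external = [d for d in deps["dependencies"] if d["is_external"]]
print(f"{deps['symbol_name']} calls {len(external)} external functions")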

query_graph

Execute custom Cypher queries against the Neo4j graph database for advanced relationship analysis.

Parameters:

  • cypher_query (string, required): Cypher query to execute
  • limit (int, optional): Maximum number of results (default: 100)

Returns:

{
  "success": true,
  "query": "MATCH (f:Function)-[:CALLS]->(ext:ExternalFunction) WHERE ext.name =~ 'wp_.*' RETURN f.name, ext.name",
  "results": [
    {"f.name": "enqueue_scripts", "ext.name": "wp_enqueue_script"},
    {"f.name": "setup_theme", "ext.name": "wp_register_nav_menu"}
  ],
  "total_results": 2
}
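
As another example, the node labels and relationship types shown above can be used to rank external functions by how many internal functions call them:

result = await query_graph(
    cypher_query=(
        "MATCH (f:Function)-[:CALLS]->(ext:ExternalFunction) "
        "RETURN ext.name AS external_fn, count(DISTINCT f) AS callers "
        "ORDER BY callers DESC"
    ),
    limit=10,
)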

Dependency Tools

detect_dependencies

Detect available dependencies in the workspace (WordPress plugins/themes, Composer packages, npm modules).

Parameters:

  • workspace_path (string, optional): Path to workspace (defaults to current workspace)

Returns:

{
  "success": true,
  "dependencies": {
    "wordpress_plugins": ["woocommerce", "advanced-custom-fields"],
    "wordpress_themes": ["twentytwentyfour"],
    "composer_packages": ["symfony/console", "guzzlehttp/guzzle"],
    "npm_packages": ["react", "typescript"]
  },
  "total_dependencies": 6
}

index_dependencies

Index specific dependencies into the knowledge base for better understanding of external APIs.

Parameters:

  • dependency_names (array, required): List of dependency names to index (e.g., ["woocommerce", "react"])
  • workspace_id (string, required): Unique identifier for the workspace/project
  • workspace_path (string, optional): Path to workspace

Returns:

{
  "success": true,
  "indexed_dependencies": ["woocommerce"],
  "total_chunks": 1247,
  "message": "Successfully indexed 1 dependencies with 1247 chunks"
}
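
A typical flow is to detect what is available and then index a subset (illustrative calls; the workspace_id value is hypothetical):

detected = await detect_dependencies()
print(detected["dependencies"]["wordpress_plugins"])

await index_dependencies(
    dependency_names=["woocommerce"],
    workspace_id="my-store",  # hypothetical workspace identifier
)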

list_indexed_dependencies

List all dependencies that have been indexed in the knowledge base.

Returns:

{
  "success": true,
  "dependencies": [
    {
      "name": "woocommerce",
      "version": "8.5.0",
      "type": "wordpress_plugin",
      "workspaces": ["my-store", "test-site"],
      "chunks_count": 1247,
      "indexed_at": "2024-01-15T10:30:00Z"
    }
  ],
  "total_dependencies": 1
}

Status Tools

get_indexing_status

Get statistics about the index, including vector DB, graph DB, and cache metrics.

Returns:

{
  "success": true,
  "code_db": {
    "total_chunks": 2450,
    "vectors_count": 2450,
    "status": "green"
  },
  "knowledge_db": {
    "total_chunks": 1247,
    "indexed_dependencies": ["woocommerce"]
  },
  "graph_db": {
    "enabled": true,
    "total_nodes": 2230,
    "total_relationships": 4407,
    "node_types": {
      "Function": 1459,
      "ExternalFunction": 771
    }
  },
  "index": {
    "indexed_files": 150,
    "total_chunks": 2450
  },
  "cache": {
    "enabled": true,
    "cached_embeddings": 2450,
    "total_size_mb": 18.5
  }
}

clear_index

Clear the entire index (useful for a fresh start).

get_watcher_status

Get status of the real-time file watcher.

Returns:

{
  "success": true,
  "enabled": true,
  "running": true,
  "watch_path": "/workspace",
  "debounce_seconds": 2.0
}

health_check

Check health status of all components (Ollama, Qdrant, Neo4j).

Supported Languages

Language     Extensions                 Support Level
Python       .py, .pyw                  Full
TypeScript   .ts, .tsx                  Full
JavaScript   .js, .jsx, .mjs, .cjs      Full
PHP          .php, .phtml               Full
Go           .go                        Full
Rust         .rs                        Full
Java         .java                      Full
C++          .cpp, .cc, .hpp, .hh       Full
C            .c, .h                     Full
C#           .cs                        Full

Configuration

Environment Variables

Variable                   Default                              Description
CODEBASE_PATH              ./sample_codebase                    Path to codebase to index
OLLAMA_HOST                http://host.docker.internal:11434    Ollama API endpoint
EMBEDDING_MODEL            embeddinggemma:latest                Ollama embedding model to use
QDRANT_HOST                qdrant                               Qdrant server hostname
QDRANT_PORT                6333                                 Qdrant server port
ENABLE_GRAPH_DB            false                                Enable Neo4j graph database
NEO4J_URI                  bolt://neo4j:7687                    Neo4j connection URI
NEO4J_USER                 neo4j                                Neo4j username
NEO4J_PASSWORD             password                             Neo4j password
INDEX_PATH                 /index                               Path for index metadata
CACHE_PATH                 /cache                               Path for embedding cache
WORKSPACE_PATH             /workspace                           Path to mounted codebase
MAX_CHUNK_SIZE             2048                                 Maximum chunk size in characters
BATCH_SIZE                 32                                   Embedding batch size
MAX_CONCURRENT_EMBEDDINGS  4                                    Concurrent embedding requests
ENABLE_FILE_WATCHER        true                                 Enable real-time file watching
WATCHER_DEBOUNCE_SECONDS   2.0                                  Delay before processing file changes
LOG_LEVEL                  INFO                                 Logging level
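
As a starting point, a .env for docker-compose might look like the following (values are the documented defaults from the table above, except ENABLE_GRAPH_DB, which is switched on here to enable the Neo4j features):

CODEBASE_PATH=./sample_codebase
OLLAMA_HOST=http://host.docker.internal:11434
EMBEDDING_MODEL=embeddinggemma:latest
QDRANT_HOST=qdrant
QDRANT_PORT=6333
ENABLE_GRAPH_DB=true
NEO4J_URI=bolt://neo4j:7687
NEO4J_USER=neo4j
NEO4J_PASSWORD=password
ENABLE_FILE_WATCHER=true
WATCHER_DEBOUNCE_SECONDS=2.0
LOG_LEVEL=INFO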

Recommended Embedding Models

  • embeddinggemma:latest (recommended - best quality)
  • nomic-embed-text (good balance of speed and quality)
  • mxbai-embed-large (higher accuracy, slower)
  • all-minilm (fastest, lower accuracy)

Performance

Indexing Performance

  • Medium codebase (5K-50K files): 2-10 minutes initial indexing
  • Incremental updates: 10-60 seconds for typical changes
  • Cache hit rate: 80-95% on subsequent runs
  • Embedding generation: ~100-500 chunks/minute (depends on Ollama performance)

Search Performance

  • Latency: Sub-second semantic search
  • Throughput: 10-50 queries/second
  • Accuracy: ~30% better than fixed-size chunking (per the cAST study, arXiv:2506.15655)

Troubleshooting

"Ollama health check failed"

  1. Make sure Ollama is running: ollama serve
  2. Pull the embedding model: ollama pull embeddinggemma:latest
  3. Check that containers can reach the host: run curl http://host.docker.internal:11434 from inside a container

"Qdrant connection failed"

  1. Check Qdrant container is running: docker-compose ps
  2. Check Qdrant logs: docker-compose logs qdrant
  3. Restart services: docker-compose restart

"Graph database not enabled"

  1. Set ENABLE_GRAPH_DB=true in your .env file or .mcp.json
  2. Ensure Neo4j environment variables are configured: NEO4J_URI, NEO4J_USER, NEO4J_PASSWORD
  3. Check Neo4j container is running: docker-compose ps
  4. Check Neo4j logs: docker-compose logs neo4j
  5. Test Neo4j connection: docker exec codebase-neo4j cypher-shell -u neo4j -p codebase123 "RETURN 1"

"No supported files found"

  1. Check CODEBASE_PATH is correct in .env
  2. Verify files have supported extensions
  3. Check .gitignore isn't excluding too much

Slow indexing

  1. Reduce BATCH_SIZE if running low on RAM
  2. Increase MAX_CONCURRENT_EMBEDDINGS if you have spare CPU
  3. Use incremental=true for re-indexing

Development

Running Locally (Without Docker)

# Install dependencies
pip install -r requirements.txt

# Set environment variables
export QDRANT_HOST=localhost
export OLLAMA_HOST=http://localhost:11434
export INDEX_PATH=./index
export CACHE_PATH=./cache
export WORKSPACE_PATH=/path/to/your/codebase

# Start Qdrant
docker run -p 6333:6333 qdrant/qdrant

# Run server
python -m src.server

Running Tests

pip install -e ".[dev]"
pytest

Code Quality

# Format code
black src/

# Lint code
ruff check src/

Architecture Details

AST-Aware Chunking

The system uses tree-sitter to parse code into Abstract Syntax Trees (ASTs), then extracts semantic chunks that respect:

  • Function boundaries
  • Class definitions
  • Method boundaries
  • Interface/trait definitions

This achieves 30% better accuracy than fixed-size chunking according to research (arXiv:2506.15655).
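
For illustration, a minimal version of the idea using the py-tree-sitter bindings (assumes py-tree-sitter >= 0.22 and the tree_sitter_python grammar package; the project's actual chunker handles many languages and nested scopes):

import tree_sitter_python as tspython
from tree_sitter import Language, Parser

parser = Parser(Language(tspython.language()))

def chunk_python(source: str) -> list[dict]:
    """Emit one chunk per top-level function or class instead of fixed-size windows."""
    tree = parser.parse(source.encode("utf-8"))
    chunks = []
    for node in tree.root_node.children:
        if node.type in ("function_definition", "class_definition", "decorated_definition"):
            chunks.append({
                "type": node.type,
                "start_line": node.start_point[0] + 1,
                "end_line": node.end_point[0] + 1,
                "code": node.text.decode("utf-8"),
            })
    return chunks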

Incremental Indexing

Uses Merkle tree-based change detection:

  1. Compute Blake3 hash of each file
  2. Compare with previous state
  3. Only re-index changed files
  4. Update vector database incrementally

Typical cache hit rates: 80-95%
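
The change-detection step boils down to something like the following sketch (using the blake3 package; the real implementation stores a Merkle tree over directories rather than a flat hash map, and covers all supported extensions):

import json
from pathlib import Path
from blake3 import blake3

STATE_FILE = Path("index_state.json")  # hypothetical location for the previous hashes

def changed_files(root: Path) -> list[Path]:
    """Return files whose Blake3 hash differs from the previously saved state."""
    previous = json.loads(STATE_FILE.read_text()) if STATE_FILE.exists() else {}
    current, changed = {}, []
    for path in root.rglob("*.py"):  # illustrative: only Python files
        digest = blake3(path.read_bytes()).hexdigest()
        current[str(path)] = digest
        if previous.get(str(path)) != digest:
            changed.append(path)
    STATE_FILE.write_text(json.dumps(current))
    return changed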

Content-Addressable Storage

Embeddings are cached using content hashing:

cache_key = blake3(model_name + file_content)

This enables:

  • Team sharing of cached embeddings
  • Fast re-indexing after git operations
  • Deterministic caching across machines
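
In code, the cache lookup amounts to roughly the following (embed_with_ollama is a hypothetical stand-in for whatever embedding call the server actually makes):

from blake3 import blake3

def cache_key(model_name: str, file_content: str) -> str:
    return blake3((model_name + file_content).encode("utf-8")).hexdigest()

def get_embedding(model_name: str, file_content: str, cache: dict) -> list[float]:
    key = cache_key(model_name, file_content)
    if key not in cache:  # miss: compute once, then every repo and teammate can reuse it
        cache[key] = embed_with_ollama(model_name, file_content)  # hypothetical helper
    return cache[key]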

Roadmap

  • Real-time file system watcher for instant updates
  • Multi-repo search with shared backend
  • Job-based background indexing with progress tracking
  • On-demand container spawning for flexible repository indexing
  • Neo4j integration for relationship tracking: function calls, imports, and inheritance, with placeholder nodes for external dependencies
  • Dependency knowledge base: index WordPress plugins, Composer packages, and npm modules
  • Reranking with cross-encoders for improved accuracy
  • Fine-tuned embeddings for domain-specific code
  • HTTP transport for remote MCP servers
  • Web UI for search and visualization
  • Graph-based code navigation UI (Neo4j Browser or custom visualization)

Research & References

Based on cutting-edge research in semantic code search:

  • cAST (arXiv:2506.15655): AST-aware chunking methodology
  • CodeRAG (arXiv:2504.10046): Graph-augmented retrieval
  • Model Context Protocol: Anthropic's standard for AI tool integration
  • Qdrant: High-performance vector database
  • tree-sitter: Incremental parsing library

License

MIT

Contributing

Contributions welcome! Please open an issue or PR.

Support

For issues, questions, or feature requests, please open a GitHub issue.
