A Docker-based Model Context Protocol (MCP) server for semantic code search with AST-aware chunking, relationship tracking via Neo4j graph database, local LLM support, and incremental indexing.
- 📚 Quick Start Guide - Get running in 5 minutes
- 🔧 Multi-Project Setup - Index multiple projects with shared backend
- ⚙️ Background Jobs - Job-based indexing for large codebases
- 👁️ File Watcher Guide - Real-time monitoring and auto-indexing
- 🔬 Research & Methodology - Deep dive into semantic code search
- 📖 Full Documentation - Complete docs directory
- Features
- Architecture
- Quick Start
- MCP Tools
  - Indexing Tools: `index_repository`, `get_job_status`, `list_indexing_jobs`, `cancel_indexing_job`
  - Search Tools: `search_code`, `get_symbols`
  - Graph Query Tools: `find_usages`, `find_dependencies`, `query_graph`
  - Dependency Tools: `detect_dependencies`, `index_dependencies`, `list_indexed_dependencies`
  - Status Tools: `get_indexing_status`, `clear_index`, `get_watcher_status`, `health_check`
- Supported Languages
- Configuration
- Performance
- Troubleshooting
- Development
- Architecture Details
- Roadmap
- Research & References
- License
- Contributing
- Support
- AST-Aware Chunking: Uses tree-sitter to respect function and class boundaries, maintaining semantic integrity
- Relationship Tracking: Neo4j graph database tracks function calls, imports, inheritance, and dependencies across your codebase
- External Dependency Mapping: Automatically creates placeholder nodes for external functions (WordPress, npm packages, etc.)
- Job-Based Indexing: Background indexing with progress tracking for large codebases
- On-Demand Container Spawning: Index any repository on your system without manual mounting
- Multi-Repository Search: Index and search across multiple projects with a shared backend
- Real-Time Updates: File system watcher automatically re-indexes changed files (optional)
- Local-First: All processing happens locally using Ollama for embeddings (no data leaves your machine)
- Polyglot Support: Supports 10+ programming languages including TypeScript, Python, PHP, Go, Rust, Java, C++, and more
- Incremental Indexing: Merkle tree-based change detection with 80%+ cache hit rates
- Production-Grade: Uses Qdrant vector database for sub-10ms search latency and Neo4j for relationship queries
- Dependency Knowledge Base: Special collection for indexing WordPress plugins, Composer packages, and npm modules
- Flexible Deployment: Per-project or centralized server deployment options
- MCP Integration: Works with Claude Desktop, Cursor, VS Code, and other MCP-compatible tools
┌─────────────────────────────────────────────────────────────────┐
│ MCP Client (Claude Code, Claude Desktop, Cursor, etc.) │
└──────────────────────────────┬──────────────────────────────────┘
│ MCP Protocol (stdio)
│
┌──────────────────────────────▼──────────────────────────────────┐
│ MCP Server Container (codebase-mcp-server) │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ FastMCP Server - Exposes MCP Tools: │ │
│ │ • index_repository (spawns indexer containers) │ │
│ │ • search_code (semantic search across all repos) │ │
│ │ • find_usages, find_dependencies (graph queries) │ │
│ │ • detect_dependencies, index_dependencies │ │
│ │ • get_job_status, list_indexing_jobs, cancel_job │ │
│ │ • get_symbols, get_indexing_status, health_check │ │
│ └──────────────┬───────────────────────────────────────────┘ │
│ │ │
│ │ Spawns via Docker Socket │
│ ▼ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ On-Demand Indexer Containers (ephemeral) │ │
│ │ • Mounts any host directory │ │
│ │ • AST-aware chunking with tree-sitter │ │
│ │ • Extracts relationships (CALLS, IMPORTS, etc.) │ │
│ │ • Generates embeddings via Ollama │ │
│ │ • Updates shared Qdrant & Neo4j databases │ │
│ │ • Reports progress back to MCP server │ │
│ └──────────────────────┬──────────────────────────────┘ │
└─────────────────────────┼──────────────────────────────────────┘
│
┌───────────────┴───────────────────┐
│ │
┌──────▼──────┐ ┌────────────▼────────┐
│ Qdrant │ │ Neo4j │
│ Container │ │ Container │
│ (Vectors) │ │ (Relationships) │
└──────┬──────┘ └──────┬──────────────┘
│ │
┌──────▼────────────────────────────▼───────────┐
│ Persistent Docker Volumes: │
│ • qdrant_data (vector DB) │
│ • neo4j_data (graph DB) │
│ • index_data (merkle trees) │
│ • cache_data (embeddings cache) │
└───────────────────────────────────────────────┘
┌────────────────────────┐
│ Ollama (Host) │
│ Embedding Model │
└────────────────────────┘
- Dual Database Architecture: Qdrant for semantic vector search, Neo4j for relationship graph queries
- Container Orchestration: MCP server spawns lightweight indexer containers on-demand via Docker socket
- Multi-Repository Support: Each repository gets its own merkle tree state, but shares the vector & graph databases
- Shared Backend: All projects use the same Qdrant & Neo4j instances, enabling cross-repository search and relationship tracking
- Job-Based Processing: Background jobs with progress tracking for large codebases
- Content-Addressable Caching: Embeddings are cached by content hash, shared across all repositories
- Relationship Extraction: AST-based extraction of CALLS, IMPORTS, EXTENDS, and IMPLEMENTS relationships
- External Dependency Tracking: Automatic creation of placeholder nodes for unresolved function calls
See QUICK_START.md for detailed setup instructions.
- Docker Desktop (or Docker + Docker Compose)
- Ollama running locally with an embedding model:
# Install Ollama: https://ollama.ai
# Recommended: Google's Gemma embedding model (best quality)
ollama pull embeddinggemma:latest
# Alternative: Nomic Embed (faster, smaller)
ollama pull nomic-embed-text
Option A (Centralized Server). Best for: indexing from the MCP server, querying across all repositories.
# 1. Start the backend
cd codebase-contextifier-9000
docker-compose up -d
# 2. Configure Claude Desktop (see below)
# 3. Index any repository
# In Claude: "Index the repository at /Users/me/projects/my-app"

Option B (Per-Project Setup). Best for: setups where each project manages its own indexing.
# 1. Start shared backend (once)
cd codebase-contextifier-9000
docker-compose up -d
# 2. Copy .mcp.json to each project
cp .mcp.json.template ~/projects/my-app/.mcp.json
# 3. Open project in Claude Code
cd ~/projects/my-app
claude-code .

See MULTI_PROJECT_SETUP.md for details.
For Centralized Server (Option A):
Add to ~/Library/Application Support/Claude/claude_desktop_config.json (macOS) or %APPDATA%\Claude\claude_desktop_config.json (Windows):
{
"mcpServers": {
"codebase-contextifier": {
"command": "docker",
"args": [
"exec",
"-i",
"codebase-mcp-server",
"python",
"-m",
"src.server"
]
}
}
}

For Per-Project Setup (Option B):
Just copy .mcp.json.template to your project directory - no manual configuration needed!
Once configured, you can use these tools in Claude Desktop or Claude Code:
Index any repository on your system:
Claude, index the repository at /Users/me/projects/my-app
The system spawns a container, indexes the repository in the background, and reports progress.
Monitor indexing progress:
Claude, show me the status of job abc123
Search for code across all indexed repositories:
Claude, search for "authentication logic" in the codebase
Search with filters:
Claude, search for "error handling" filtering by language=python and repo_name=my-api
Extract symbols from a file:
Claude, get all functions from /workspace/src/utils.py
Find all usages of a function (graph query):
Claude, find all places where authenticate_user is called
Find dependencies of a function (graph query):
Claude, show me all functions that processPayment depends on
Detect and index external dependencies:
Claude, detect available WordPress plugins in this project
Claude, index the woocommerce plugin into the knowledge base
Check system status:
Claude, show me the indexing status and list all jobs
Index a repository from any directory on your host machine by spawning a lightweight indexer container.
Parameters:
- `host_path` (string, required): Absolute path on host machine to repository (e.g., `/Users/me/projects/my-app`)
- `repo_name` (string, optional): Unique identifier for this repository (defaults to directory name)
- `incremental` (bool): Use incremental indexing to only re-index changed files (default: `true`)
- `exclude_patterns` (string, optional): Comma-separated glob patterns to exclude (e.g., `"node_modules/*,dist/*"`)
Returns:
{
"success": true,
"job_id": "abc123def456",
"repo_name": "my-app",
"status": "queued",
"message": "Background indexing started for 'my-app'"
}

Example:
# Index a WordPress site, excluding plugins and uploads
await index_repository(
host_path="/Users/me/sites/my-wordpress",
repo_name="my-wordpress",
exclude_patterns="wp-content/plugins/*,wp-content/uploads/*,wp-includes/*"
)

Get the status and progress of an indexing job.
Parameters:
- `job_id` (string, required): Job identifier returned from `index_repository`
Returns:
{
"success": true,
"job_id": "abc123def456",
"repo_name": "my-app",
"repo_path": "/Users/me/projects/my-app",
"status": "running",
"created_at": 1698765432.123,
"started_at": 1698765433.456,
"elapsed_seconds": 45.2,
"progress": {
"current_file": 45,
"total_files": 100,
"progress_pct": 45.0,
"current_file_path": "/workspace/src/api/auth.py",
"chunks_indexed": 234,
"failed_files_count": 2,
"cache_hit_rate": "35.50%"
}
}

Status values: `"queued"`, `"running"`, `"completed"`, `"failed"`, `"cancelled"`
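For long-running jobs, the status can be polled until it reaches a terminal state. A minimal sketch, reusing the calling convention of the `index_repository` example above (here `get_job_status` refers to the MCP tool, awaited as if it were a local async function):

```python
# Hypothetical polling helper; `get_job_status` is the MCP tool shown above,
# not a function defined in this repository.
import asyncio

async def wait_for_job(job_id: str, poll_seconds: float = 5.0) -> dict:
    while True:
        status = await get_job_status(job_id=job_id)
        progress = status.get("progress", {})
        print(f"{status['status']}: {progress.get('progress_pct', 0.0):.1f}% "
              f"({progress.get('chunks_indexed', 0)} chunks)")
        if status["status"] in ("completed", "failed", "cancelled"):
            return status
        await asyncio.sleep(poll_seconds)
```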
List all indexing jobs (past and present).
Returns:
{
"success": true,
"total_jobs": 3,
"jobs": [
{
"job_id": "abc123",
"repo_name": "my-api",
"status": "completed",
"progress": { "progress_pct": 100.0, ... }
},
{
"job_id": "def456",
"repo_name": "frontend",
"status": "running",
"progress": { "progress_pct": 67.5, ... }
}
]
}

Cancel a running indexing job.
Parameters:
- `job_id` (string, required): Job identifier to cancel
Returns:
{
"success": true,
"message": "Job abc123 cancelled successfully"
}

Search code using natural language queries with semantic understanding across all indexed repositories.
Parameters:
- `query` (string, required): Natural language search query (e.g., "authentication logic", "error handling")
- `limit` (int): Maximum number of results to return (default: 10)
- `repo_name` (string, optional): Filter by repository name (searches all repos if not specified)
- `language` (string, optional): Filter by programming language (e.g., "python", "typescript", "php")
- `file_path_filter` (string, optional): Filter by file path pattern (e.g., "src/components")
- `chunk_type` (string, optional): Filter by chunk type (e.g., "function", "class", "method")
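Example (a hedged sketch, following the same calling convention as the `index_repository` example above):

```python
# Search only Python functions in the "my-api" repository
await search_code(
    query="error handling",
    repo_name="my-api",
    language="python",
    chunk_type="function",
    limit=5,
)
```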
Returns:
{
"success": true,
"query": "authentication logic",
"total_results": 5,
"results": [
{
"rank": 1,
"score": 0.8234,
"repo_name": "backend-api",
"file": "/workspace/src/auth/login.ts",
"lines": "42-68",
"language": "typescript",
"type": "function",
"context": "class:AuthService",
"code": "async function authenticateUser(username, password) { ... }"
}
]
}

Extract symbols from a file using AST parsing.
Parameters:
- `file_path` (string): Path to source file
- `symbol_type` (string, optional): Filter by type (e.g., `"function"`, `"class"`)
Returns:
{
"success": true,
"file_path": "/workspace/src/utils.py",
"total_symbols": 15,
"symbols": [
{
"name": "format_date",
"type": "function_definition",
"start_line": 42,
"end_line": 58,
"context": "N/A",
"language": "python"
}
]
}

Find all places where a function, class, or symbol is used across the codebase using the graph database.
Parameters:
- `symbol_name` (string, required): Name of the function/class to find usages for
- `repo_name` (string, optional): Filter by repository name
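Example (same calling convention as above):

```python
# Find every place authenticate_user is called, limited to one repository
await find_usages(
    symbol_name="authenticate_user",
    repo_name="backend-api",
)
```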
Returns:
{
"success": true,
"symbol_name": "authenticate_user",
"total_usages": 12,
"usages": [
{
"caller": "LoginController.handleLogin",
"caller_file": "/workspace/src/controllers/login.ts",
"line_number": 42,
"relationship_type": "CALLS"
}
]
}

Find all functions, classes, or imports that a symbol depends on using the graph database.
Parameters:
- `symbol_name` (string, required): Name of the function/class to analyze
- `repo_name` (string, optional): Filter by repository name
Returns:
{
"success": true,
"symbol_name": "processPayment",
"total_dependencies": 8,
"dependencies": [
{
"target": "validateCard",
"target_file": "/workspace/src/utils/validation.ts",
"relationship_type": "CALLS",
"is_external": false
},
{
"target": "stripe.charges.create",
"relationship_type": "CALLS",
"is_external": true
}
]
}

Execute custom Cypher queries against the Neo4j graph database for advanced relationship analysis.
Parameters:
- `cypher_query` (string, required): Cypher query to execute
- `limit` (int, optional): Maximum number of results (default: 100)
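The node labels and relationship types used elsewhere in this document (`Function`, `ExternalFunction`, `CALLS`, `IMPORTS`, `EXTENDS`, `IMPLEMENTS`) can be combined into custom queries. A hedged sketch, using the same calling convention as the earlier examples:

```python
# Rank external (unresolved) functions by how many local functions call them
await query_graph(
    cypher_query=(
        "MATCH (f:Function)-[:CALLS]->(ext:ExternalFunction) "
        "RETURN ext.name AS external_function, count(f) AS callers "
        "ORDER BY callers DESC"
    ),
    limit=20,
)
```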
Returns:
{
"success": true,
"query": "MATCH (f:Function)-[:CALLS]->(ext:ExternalFunction) WHERE ext.name =~ 'wp_.*' RETURN f.name, ext.name",
"results": [
{"f.name": "enqueue_scripts", "ext.name": "wp_enqueue_script"},
{"f.name": "setup_theme", "ext.name": "wp_register_nav_menu"}
],
"total_results": 2
}

Detect available dependencies in the workspace (WordPress plugins/themes, Composer packages, npm modules).
Parameters:
- `workspace_path` (string, optional): Path to workspace (defaults to current workspace)
Returns:
{
"success": true,
"dependencies": {
"wordpress_plugins": ["woocommerce", "advanced-custom-fields"],
"wordpress_themes": ["twentytwentyfour"],
"composer_packages": ["symfony/console", "guzzlehttp/guzzle"],
"npm_packages": ["react", "typescript"]
},
"total_dependencies": 6
}

Index specific dependencies into the knowledge base for better understanding of external APIs.
Parameters:
- `dependency_names` (array, required): List of dependency names to index (e.g., `["woocommerce", "react"]`)
- `workspace_id` (string, required): Unique identifier for the workspace/project
- `workspace_path` (string, optional): Path to workspace
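Example (same calling convention as above; the workspace identifier is illustrative):

```python
# Index the WooCommerce plugin into the knowledge base for one workspace
await index_dependencies(
    dependency_names=["woocommerce"],
    workspace_id="my-store",
)
```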
Returns:
{
"success": true,
"indexed_dependencies": ["woocommerce"],
"total_chunks": 1247,
"message": "Successfully indexed 1 dependencies with 1247 chunks"
}

List all dependencies that have been indexed in the knowledge base.
Returns:
{
"success": true,
"dependencies": [
{
"name": "woocommerce",
"version": "8.5.0",
"type": "wordpress_plugin",
"workspaces": ["my-store", "test-site"],
"chunks_count": 1247,
"indexed_at": "2024-01-15T10:30:00Z"
}
],
"total_dependencies": 1
}

Get statistics about the index, including vector DB, graph DB, and cache metrics.
Returns:
{
"success": true,
"code_db": {
"total_chunks": 2450,
"vectors_count": 2450,
"status": "green"
},
"knowledge_db": {
"total_chunks": 1247,
"indexed_dependencies": ["woocommerce"]
},
"graph_db": {
"enabled": true,
"total_nodes": 2230,
"total_relationships": 4407,
"node_types": {
"Function": 1459,
"ExternalFunction": 771
}
},
"index": {
"indexed_files": 150,
"total_chunks": 2450
},
"cache": {
"enabled": true,
"cached_embeddings": 2450,
"total_size_mb": 18.5
}
}

Clear the entire index (useful for a fresh start).
Get status of the real-time file watcher.
Returns:
{
"success": true,
"enabled": true,
"running": true,
"watch_path": "/workspace",
"debounce_seconds": 2.0
}

Check health status of all components (Ollama, Qdrant, Neo4j).
| Language | Extensions | Support Level |
|---|---|---|
| Python | `.py`, `.pyw` | Full |
| TypeScript | `.ts`, `.tsx` | Full |
| JavaScript | `.js`, `.jsx`, `.mjs`, `.cjs` | Full |
| PHP | `.php`, `.phtml` | Full |
| Go | `.go` | Full |
| Rust | `.rs` | Full |
| Java | `.java` | Full |
| C++ | `.cpp`, `.cc`, `.hpp`, `.hh` | Full |
| C | `.c`, `.h` | Full |
| C# | `.cs` | Full |
| Variable | Default | Description |
|---|---|---|
| `CODEBASE_PATH` | `./sample_codebase` | Path to codebase to index |
| `OLLAMA_HOST` | `http://host.docker.internal:11434` | Ollama API endpoint |
| `EMBEDDING_MODEL` | `embeddinggemma:latest` | Ollama embedding model to use |
| `QDRANT_HOST` | `qdrant` | Qdrant server hostname |
| `QDRANT_PORT` | `6333` | Qdrant server port |
| `ENABLE_GRAPH_DB` | `false` | Enable Neo4j graph database |
| `NEO4J_URI` | `bolt://neo4j:7687` | Neo4j connection URI |
| `NEO4J_USER` | `neo4j` | Neo4j username |
| `NEO4J_PASSWORD` | `password` | Neo4j password |
| `INDEX_PATH` | `/index` | Path for index metadata |
| `CACHE_PATH` | `/cache` | Path for embedding cache |
| `WORKSPACE_PATH` | `/workspace` | Path to mounted codebase |
| `MAX_CHUNK_SIZE` | `2048` | Maximum chunk size in characters |
| `BATCH_SIZE` | `32` | Embedding batch size |
| `MAX_CONCURRENT_EMBEDDINGS` | `4` | Concurrent embedding requests |
| `ENABLE_FILE_WATCHER` | `true` | Enable real-time file watching |
| `WATCHER_DEBOUNCE_SECONDS` | `2.0` | Delay before processing file changes |
| `LOG_LEVEL` | `INFO` | Logging level |
- `embeddinggemma:latest` (recommended: best quality)
- `nomic-embed-text` (good balance of speed and quality)
- `mxbai-embed-large` (higher accuracy, slower)
- `all-minilm` (fastest, lower accuracy)
- Medium codebase (5K-50K files): 2-10 minutes initial indexing
- Incremental updates: 10-60 seconds for typical changes
- Cache hit rate: 80-95% on subsequent runs
- Embedding generation: ~100-500 chunks/minute (depends on Ollama performance)
- Latency: Sub-second semantic search
- Throughput: 10-50 queries/second
- Accuracy: 30% better than fixed-size chunking (from research)
- Make sure Ollama is running: `ollama serve`
- Pull the embedding model: `ollama pull embeddinggemma:latest`
- Check Docker can access the host: test with `curl http://host.docker.internal:11434`
- Check the Qdrant container is running: `docker-compose ps`
- Check Qdrant logs: `docker-compose logs qdrant`
- Restart services: `docker-compose restart`
- Set `ENABLE_GRAPH_DB=true` in your `.env` file or `.mcp.json`
- Ensure the Neo4j environment variables are configured: `NEO4J_URI`, `NEO4J_USER`, `NEO4J_PASSWORD`
- Check the Neo4j container is running: `docker-compose ps`
- Check Neo4j logs: `docker-compose logs neo4j`
- Test the Neo4j connection: `docker exec codebase-neo4j cypher-shell -u neo4j -p codebase123 "RETURN 1"`
- Check `CODEBASE_PATH` is correct in `.env`
- Verify files have supported extensions
- Check `.gitignore` isn't excluding too much
- Reduce `BATCH_SIZE` if running low on RAM
- Increase `MAX_CONCURRENT_EMBEDDINGS` if you have spare CPU
- Use `incremental=true` for re-indexing
# Install dependencies
pip install -r requirements.txt
# Set environment variables
export QDRANT_HOST=localhost
export OLLAMA_HOST=http://localhost:11434
export INDEX_PATH=./index
export CACHE_PATH=./cache
export WORKSPACE_PATH=/path/to/your/codebase
# Start Qdrant
docker run -p 6333:6333 qdrant/qdrant
# Run server
python -m src.server

pip install -e ".[dev]"
pytest

# Format code
black src/
# Lint code
ruff src/

The system uses tree-sitter to parse code into Abstract Syntax Trees (ASTs), then extracts semantic chunks that respect:
- Function boundaries
- Class definitions
- Method boundaries
- Interface/trait definitions
This achieves 30% better accuracy than fixed-size chunking according to research (arXiv:2506.15655).
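A minimal sketch of the idea, assuming the tree-sitter Python bindings via the `tree_sitter_languages` helper package (the actual indexer's chunking logic is more involved):

```python
# AST-aware chunking sketch: emit one chunk per top-level function/class
# instead of slicing the file into fixed-size windows.
# Assumes: pip install tree_sitter_languages
from tree_sitter_languages import get_parser

CHUNK_NODE_TYPES = {"function_definition", "class_definition"}  # Python node types

def chunk_python_source(source: bytes) -> list[dict]:
    parser = get_parser("python")
    tree = parser.parse(source)
    chunks = []
    for node in tree.root_node.children:
        if node.type in CHUNK_NODE_TYPES:
            chunks.append({
                "type": node.type,
                "start_line": node.start_point[0] + 1,  # tree-sitter rows are 0-based
                "end_line": node.end_point[0] + 1,
                "code": source[node.start_byte:node.end_byte].decode("utf-8"),
            })
    return chunks
```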
Uses Merkle tree-based change detection:
- Compute Blake3 hash of each file
- Compare with previous state
- Only re-index changed files
- Update vector database incrementally
Typical cache hit rates: 80-95%
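A simplified sketch of the change-detection step, assuming the `blake3` Python package (the real implementation also maintains a Merkle tree over directories, which this sketch omits):

```python
# Re-index only files whose Blake3 content hash changed since the last run.
# Assumes: pip install blake3
from pathlib import Path
from blake3 import blake3

def changed_files(root: Path, previous_hashes: dict[str, str]) -> dict[str, str]:
    current = {
        str(path): blake3(path.read_bytes()).hexdigest()
        for path in root.rglob("*") if path.is_file()
    }
    return {
        path: digest for path, digest in current.items()
        if previous_hashes.get(path) != digest
    }
```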
Embeddings are cached using content hashing:
cache_key = blake3(model_name + file_content)
This enables:
- Team sharing of cached embeddings
- Fast re-indexing after git operations
- Deterministic caching across machines
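A sketch of the cache lookup implied by that key (again assuming the `blake3` package; `embed_with_ollama` is a hypothetical helper, not part of this codebase's API):

```python
# Content-addressable cache: identical model + content always yields the same
# key, so re-indexing after a git checkout mostly hits the cache.
from blake3 import blake3

def embedding_cache_key(model_name: str, file_content: str) -> str:
    return blake3((model_name + file_content).encode("utf-8")).hexdigest()

def get_or_compute_embedding(cache: dict, model_name: str, content: str):
    key = embedding_cache_key(model_name, content)
    if key not in cache:
        cache[key] = embed_with_ollama(model_name, content)  # hypothetical helper
    return cache[key]
```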
- Real-time file system watcher for instant updates
- Multi-repo search with shared backend
- Job-based background indexing with progress tracking
- On-demand container spawning for flexible repository indexing
- Neo4j integration for relationship tracking - Track function calls, imports, inheritance, with external dependency placeholders
- Dependency knowledge base - Index WordPress plugins, Composer packages, npm modules
- Reranking with cross-encoders for improved accuracy
- Fine-tuned embeddings for domain-specific code
- HTTP transport for remote MCP servers
- Web UI for search and visualization
- Graph-based code navigation UI (Neo4j Browser or custom visualization)
Based on cutting-edge research in semantic code search:
- cAST (arXiv:2506.15655): AST-aware chunking methodology
- CodeRAG (arXiv:2504.10046): Graph-augmented retrieval
- Model Context Protocol: Anthropic's standard for AI tool integration
- Qdrant: High-performance vector database
- tree-sitter: Incremental parsing library
MIT
Contributions welcome! Please open an issue or PR.
For issues, questions, or feature requests, please open a GitHub issue.