This file provides guidance to Claude Code when working with document-analysis-mcp.
MCP (Model Context Protocol) server for general-purpose PDF document analysis. Provides text extraction, LLM-powered analysis, document classification, OCR for scanned documents, structure extraction, and knowledge-bank ingestion. Uses Anthropic's Claude API for intelligent document processing. Deployed on game-da-god for compute separation, API key isolation, and always-on availability.
document-analysis-mcp/
├── src/document_analysis_mcp/
│ ├── server.py <- MCP server entry point (Streamable HTTP)
│ ├── config.py <- Pydantic settings from env vars
│ ├── tools/ <- MCP tool implementations
│ │ ├── extract.py <- pdf_extract_full (text + LLM analysis)
│ │ ├── structure.py <- pdf_extract_structure (TOC, tables, headings)
│ │ ├── classify.py <- pdf_classify (content type detection)
│ │ ├── ocr.py <- pdf_ocr (scanned PDF handling via Tesseract)
│ │ └── kb_ingest.py <- pdf_kb_ingest (one-shot KB ingestion)
│ ├── processors/ <- Document processing backends
│ │ ├── text_extractor.py <- pdfplumber + pypdf fallback
│ │ ├── llm.py <- Claude API integration for analysis
│ │ └── chunker.py <- Multi-page chunking strategies
│ ├── models/ <- Pydantic data models
│ │ └── extraction.py <- ExtractionResult, PageContent
│ ├── cache/ <- Hash-based document deduplication cache
│ └── tracking/ <- API usage tracking and cost monitoring
├── tests/ <- Test suite
├── deploy/ <- Systemd service, env template
└── pyproject.toml
- Python 3.10+
- Tesseract OCR (sudo apt install tesseract-ocr)
- Anthropic API key
# Clone repository (via ws workflow)
ws start document-analysis-mcp <feature-name>
cd ~/wip/<session-id>/document-analysis-mcp/
# Install with dev dependencies
pip install -e ".[dev]"
See deploy/document-analysis-mcp.service for full systemd setup. Key steps:
# 1. Run install script to create cache directory at /var/cache/document-analysis-mcp
bash deploy/install.sh
# 2. Install systemd service
sudo cp deploy/document-analysis-mcp.service /etc/systemd/system/
sudo systemctl daemon-reload && sudo systemctl enable --now document-analysis-mcp
# 3. Register MCP client with Claude Code (one-time, runs as krisoye)
claude mcp add --transport http document-analysis http://localhost:8766/mcp -s user
Environment Variables (set in .env or systemd EnvironmentFile):
| Variable | Default | Purpose |
|---|---|---|
| ANTHROPIC_API_KEY | (required) | Anthropic API key for Claude access |
| DOC_ANALYSIS_HOST | 127.0.0.1 | Server bind address (use 0.0.0.0 for external) |
| DOC_ANALYSIS_PORT | 8766 | Server port |
| CACHE_DIR | /var/cache/document-analysis-mcp | Cache directory for extraction results |
| CACHE_TTL_DAYS | 30 | Cache expiration in days |
| DEFAULT_MODEL | claude-sonnet-4-20250514 | Default Claude model for extraction/analysis |
| CLASSIFICATION_MODEL | claude-3-5-haiku-20241022 | Cheaper/faster model for classification |
| MAX_TOKENS | 4096 | Maximum tokens for LLM responses |
| LOG_LEVEL | INFO | Logging verbosity (DEBUG, INFO, WARNING, ERROR) |
See src/document_analysis_mcp/config.py for the complete settings reference.
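As a rough illustration of how the documented defaults resolve, here is a stdlib-only sketch. The `Settings` dataclass and `load_settings` helper are hypothetical names chosen for this example; the actual config.py uses Pydantic settings and may differ in field names and validation.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    """Hypothetical mirror of the documented environment variables."""

    anthropic_api_key: str
    host: str = "127.0.0.1"
    port: int = 8766
    cache_dir: str = "/var/cache/document-analysis-mcp"
    cache_ttl_days: int = 30
    default_model: str = "claude-sonnet-4-20250514"
    classification_model: str = "claude-3-5-haiku-20241022"
    max_tokens: int = 4096
    log_level: str = "INFO"


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build Settings from an environment mapping, falling back to the defaults."""
    env = os.environ if env is None else env
    api_key = env.get("ANTHROPIC_API_KEY")
    if not api_key:
        # The only variable without a default in the table above.
        raise ValueError("ANTHROPIC_API_KEY is required")
    return Settings(
        anthropic_api_key=api_key,
        host=env.get("DOC_ANALYSIS_HOST", Settings.host),
        port=int(env.get("DOC_ANALYSIS_PORT", Settings.port)),
        cache_dir=env.get("CACHE_DIR", Settings.cache_dir),
        cache_ttl_days=int(env.get("CACHE_TTL_DAYS", Settings.cache_ttl_days)),
        default_model=env.get("DEFAULT_MODEL", Settings.default_model),
        classification_model=env.get(
            "CLASSIFICATION_MODEL", Settings.classification_model
        ),
        max_tokens=int(env.get("MAX_TOKENS", Settings.max_tokens)),
        log_level=env.get("LOG_LEVEL", Settings.log_level).upper(),
    )
```

Passing an explicit mapping instead of reading `os.environ` directly keeps the sketch easy to exercise in tests without touching the real environment.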
| Command | Purpose |
|---|---|
| python -m document_analysis_mcp.server | Run MCP server locally (testing) |
| pytest | Run test suite |
| mypy src/ | Run type checking |
| ruff check src/ | Run linting |
| ruff format src/ | Format code |
Deployment:
ws update-prod document-analysis-mcp # Human only, requires password
- fastmcp - FastMCP server framework (Streamable HTTP transport)
- anthropic - Claude API client for LLM analysis
- pdfplumber - Primary PDF text and table extraction
- pypdf - PDF fallback extraction and metadata
- pytesseract - OCR wrapper for Tesseract
- pydantic / pydantic-settings - Configuration and data models
- Claude Code - MCP tools accessible from any Claude Code session
- knowledge-bank-tools - PDF processing during KB ingestion via pdf_kb_ingest
| Tool | Description |
|---|---|
| health_check | Server health, API key status, cache stats |
| pdf_extract_full_tool | Full text extraction with optional LLM analysis (quick/comprehensive/deep) |
| pdf_classify_tool | Document type classification with analysis strategy recommendation |
| pdf_ocr_tool | OCR extraction for scanned PDFs via Tesseract |
| pdf_extract_structure_tool | Structure extraction (TOC, tables, headings) |
| pdf_kb_ingest_tool | One-shot extraction + classification + chunking for KB ingestion |
| cache_stats | Cache statistics and usage information |
| usage_summary | API usage tracking and cost breakdown |
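For orientation, the sketch below builds the JSON-RPC message that an MCP client sends to invoke one of the tools listed above. The `tools/call` method and the `name`/`arguments` params come from the MCP specification; the helper names `build_tool_call` and `build_http_request` are hypothetical. The request is only constructed, not sent, and whether the server accepts it without a prior `initialize` exchange is an assumption that depends on its stateless configuration.

```python
import json
import urllib.request
from typing import Any, Dict, Optional


def build_tool_call(
    name: str, arguments: Optional[Dict[str, Any]] = None, request_id: int = 1
) -> Dict[str, Any]:
    """Return a JSON-RPC 2.0 payload for the MCP tools/call method."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments or {}},
    }


def build_http_request(url: str, payload: Dict[str, Any]) -> urllib.request.Request:
    """Wrap the payload in a POST request suitable for Streamable HTTP."""
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Streamable HTTP clients advertise both response formats.
            "Accept": "application/json, text/event-stream",
        },
        method="POST",
    )


# health_check is used here because it needs no arguments.
payload = build_tool_call("health_check")
request = build_http_request("http://localhost:8766/mcp", payload)
```

In normal use Claude Code performs this exchange itself; the sketch is mainly useful when debugging the endpoint by hand.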
Symptom: systemctl status document-analysis-mcp shows exit code 226/NAMESPACE.
Root Cause: ProtectHome=tmpfs creates a tmpfs overlay that hides /home entirely. Since the service's WorkingDirectory, virtual environment, .env file, and cache directory all reside under /home/deploy/, systemd cannot start the process.
Solution: Change ProtectHome=tmpfs to ProtectHome=read-only in the service file. This was fixed in PR #17, matching the same fix applied to audio-analysis-mcp.
Symptom: Clients time out when connecting to port 8766.
Solution:
# Check service status
sudo systemctl status document-analysis-mcp
# Check if port is listening
ss -tlnp | grep 8766
# Check firewall
sudo ufw status | grep 8766
Symptom: Claude Code does not show document-analysis tools.
Solution: Register the MCP client:
# Check current registration
claude mcp get document-analysis
# Should show: Type: http, URL: http://localhost:8766/mcp
# Register if missing (Streamable HTTP transport)
claude mcp add --transport http document-analysis http://localhost:8766/mcp -s user
Symptom: After systemctl restart document-analysis-mcp, MCP tool calls fail with -32602: Invalid request parameters even though /health returns OK.
Background: This was caused by the deprecated SSE transport maintaining long-lived connections with session state. When the server restarted, the client kept using a stale session ID that the new server instance did not recognize. This was fixed by migrating from SSE to Streamable HTTP transport with stateless mode (matching audio-analysis-mcp).
Current transport: Streamable HTTP (stateless) at http://localhost:8766/mcp
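A quick way to confirm that something is accepting connections on that port from the client's side is a plain TCP connect. This is a minimal sketch; `port_is_open` is a hypothetical helper, and a successful connect only shows that the port is reachable, not that the MCP endpoint behind it is healthy.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

Calling `port_is_open("127.0.0.1", 8766)` on the server host should return `True` while the service is running.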
If you still see issues: The Claude Code MCP client configuration must use --transport http (not --transport sse). Verify with:
claude mcp get document-analysis
# Should show: Type: http, URL: http://localhost:8766/mcp
To reconfigure (if still using SSE):
claude mcp remove document-analysis -s user
claude mcp add --transport http document-analysis http://localhost:8766/mcp -s user
Symptom: pdf_ocr returns empty text or error
Solution:
# Verify Tesseract is installed
tesseract --version
# Install if missing
sudo apt install tesseract-ocr
Symptom: 429 errors from Anthropic API
Solution: The server has its own API key isolated from Claude Code. Check the Anthropic usage dashboard and consider increasing CACHE_TTL_DAYS to reduce duplicate API calls.
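To see why a longer TTL reduces duplicate API calls, consider a minimal sketch of a hash-keyed cache with expiry. Everything here is an assumption made for illustration: the function names, the SHA-256 key, and the one-JSON-file-per-document layout are hypothetical, and the actual cache/ module may key and store entries differently.

```python
import hashlib
import json
import time
from pathlib import Path
from typing import Any, Dict, Optional

SECONDS_PER_DAY = 86400


def document_key(pdf_bytes: bytes) -> str:
    """Identical file contents always map to the same cache key."""
    return hashlib.sha256(pdf_bytes).hexdigest()


def cache_put(
    cache_dir: Path, key: str, result: Dict[str, Any], now: Optional[float] = None
) -> None:
    """Store a result together with the time it was written."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    stored_at = time.time() if now is None else now
    entry = {"stored_at": stored_at, "result": result}
    (cache_dir / f"{key}.json").write_text(json.dumps(entry))


def cache_get(
    cache_dir: Path, key: str, ttl_days: int = 30, now: Optional[float] = None
) -> Optional[Dict[str, Any]]:
    """Return the cached result, or None if it is missing or older than the TTL."""
    path = cache_dir / f"{key}.json"
    if not path.exists():
        return None
    entry = json.loads(path.read_text())
    current = time.time() if now is None else now
    if current - entry["stored_at"] > ttl_days * SECONDS_PER_DAY:
        return None  # Expired: the caller would re-run the LLM analysis.
    return entry["result"]
```

Under this scheme, re-submitting the same PDF within the TTL window is served from disk, so raising the TTL widens the window in which no Anthropic request is made.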
Authoritative Reference: See MULTI-AGENT-COLLABORATION.md for:
- Workspace management (ws commands)
- Git workflow and PR process
- Production deployment
- Session and lock management
Remote Deployment: See REMOTE-DEPLOY-SETUP.md for game-da-god deployment architecture.
Current Priorities: See GitHub Issues for development priorities.
Feature Specification: See project-tracker/features/document-analysis-mcp.md for detailed architecture.
Template: CLAUDE-repo.template.md from project-tracker/templates/
Last validated: 2026-02-06
| Date | Changes |
|---|---|
| 2026-02-06 | Migrated cache from /home/deploy/.cache to /var/cache using CacheDirectory=. Added deploy/install.sh. Updated deployment instructions. |
| 2026-02-06 | Migrated from SSE to Streamable HTTP transport (stateless). Updated MCP registration commands, troubleshooting, and dependency descriptions. |
| 2026-02-06 | Major update: Accurate directory structure, MCP client registration in install steps, complete tool inventory, corrected env vars (model names, MAX_TOKENS), ProtectHome troubleshooting, REMOTE-DEPLOY-SETUP.md reference. |
| 2026-02-01 | Initial creation from template |