CLAUDE.md

This file provides guidance to Claude Code when working with document-analysis-mcp.


Repository Purpose

MCP (Model Context Protocol) server for general-purpose PDF document analysis. Provides text extraction, LLM-powered analysis, document classification, OCR for scanned documents, structure extraction, and knowledge-bank ingestion. Uses Anthropic's Claude API for intelligent document processing. Deployed on game-da-god for compute separation, API key isolation, and always-on availability.


Directory Structure

document-analysis-mcp/
├── src/document_analysis_mcp/
│   ├── server.py              <- MCP server entry point (Streamable HTTP)
│   ├── config.py              <- Pydantic settings from env vars
│   ├── tools/                 <- MCP tool implementations
│   │   ├── extract.py         <- pdf_extract_full (text + LLM analysis)
│   │   ├── structure.py       <- pdf_extract_structure (TOC, tables, headings)
│   │   ├── classify.py        <- pdf_classify (content type detection)
│   │   ├── ocr.py             <- pdf_ocr (scanned PDF handling via Tesseract)
│   │   └── kb_ingest.py       <- pdf_kb_ingest (one-shot KB ingestion)
│   ├── processors/            <- Document processing backends
│   │   ├── text_extractor.py  <- pdfplumber + pypdf fallback
│   │   ├── llm.py             <- Claude API integration for analysis
│   │   └── chunker.py         <- Multi-page chunking strategies
│   ├── models/                <- Pydantic data models
│   │   └── extraction.py      <- ExtractionResult, PageContent
│   ├── cache/                 <- Hash-based document deduplication cache
│   └── tracking/              <- API usage tracking and cost monitoring
├── tests/                     <- Test suite
├── deploy/                    <- Systemd service, env template
└── pyproject.toml
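
The tree above lists chunker.py as holding multi-page chunking strategies. As a rough illustration only, one such strategy could greedily pack consecutive pages under a character budget. The `Chunk` class, the `chunk_pages` function, and the 8000-character default are hypothetical names and values, not taken from the repository.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Chunk:
    """A run of consecutive pages grouped together (hypothetical model)."""
    start_page: int  # 1-based, inclusive
    end_page: int    # 1-based, inclusive
    text: str


def chunk_pages(pages: List[str], max_chars: int = 8000) -> List[Chunk]:
    """Greedily pack consecutive pages into chunks.

    The budget counts page text only (not the blank-line separators).
    A single page longer than max_chars becomes its own oversized chunk
    rather than being split mid-page.
    """
    chunks: List[Chunk] = []
    buffer: List[str] = []
    start = 1
    size = 0
    for number, text in enumerate(pages, start=1):
        # Flush the buffer if adding this page would exceed the budget.
        if buffer and size + len(text) > max_chars:
            chunks.append(Chunk(start, number - 1, "\n\n".join(buffer)))
            buffer, size, start = [], 0, number
        buffer.append(text)
        size += len(text)
    if buffer:
        chunks.append(Chunk(start, len(pages), "\n\n".join(buffer)))
    return chunks
```

Keeping page boundaries intact makes it easy to report which pages a chunk came from. A real strategy might instead split on headings or token counts.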

Installation & Setup

Prerequisites

  • Python 3.10+
  • Tesseract OCR (sudo apt install tesseract-ocr)
  • Anthropic API key

Development Setup

# Clone repository (via ws workflow)
ws start document-analysis-mcp <feature-name>
cd ~/wip/<session-id>/document-analysis-mcp/

# Install with dev dependencies
pip install -e ".[dev]"

Production Deployment

See deploy/document-analysis-mcp.service for full systemd setup. Key steps:

# 1. Run install script to create cache directory at /var/cache/document-analysis-mcp
bash deploy/install.sh

# 2. Install systemd service
sudo cp deploy/document-analysis-mcp.service /etc/systemd/system/
sudo systemctl daemon-reload && sudo systemctl enable --now document-analysis-mcp

# 3. Register MCP client with Claude Code (one-time, runs as krisoye)
claude mcp add --transport http document-analysis http://localhost:8766/mcp -s user

Configuration

Environment Variables (set in .env or systemd EnvironmentFile):

| Variable | Default | Purpose |
|----------|---------|---------|
| ANTHROPIC_API_KEY | (required) | Anthropic API key for Claude access |
| DOC_ANALYSIS_HOST | 127.0.0.1 | Server bind address (use 0.0.0.0 for external access) |
| DOC_ANALYSIS_PORT | 8766 | Server port |
| CACHE_DIR | /var/cache/document-analysis-mcp | Cache directory for extraction results |
| CACHE_TTL_DAYS | 30 | Cache expiration in days |
| DEFAULT_MODEL | claude-sonnet-4-20250514 | Default Claude model for extraction/analysis |
| CLASSIFICATION_MODEL | claude-3-5-haiku-20241022 | Cheaper/faster model for classification |
| MAX_TOKENS | 4096 | Maximum tokens for LLM responses |
| LOG_LEVEL | INFO | Logging verbosity (DEBUG, INFO, WARNING, ERROR) |

See src/document_analysis_mcp/config.py for the complete settings reference.
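
To make the variables and defaults concrete, here is a stdlib-only sketch of how they could be read. The actual config.py is described as using Pydantic settings, so this is not the project's implementation. The `Settings` and `load_settings` names and the field names are assumptions for illustration.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    """Hypothetical settings container mirroring the documented variables."""
    anthropic_api_key: str
    host: str
    port: int
    cache_dir: str
    cache_ttl_days: int
    default_model: str
    classification_model: str
    max_tokens: int
    log_level: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read settings from a mapping, applying the documented defaults."""
    api_key = env.get("ANTHROPIC_API_KEY", "")
    if not api_key:
        # The only variable documented as required.
        raise ValueError("ANTHROPIC_API_KEY is required")
    return Settings(
        anthropic_api_key=api_key,
        host=env.get("DOC_ANALYSIS_HOST", "127.0.0.1"),
        port=int(env.get("DOC_ANALYSIS_PORT", "8766")),
        cache_dir=env.get("CACHE_DIR", "/var/cache/document-analysis-mcp"),
        cache_ttl_days=int(env.get("CACHE_TTL_DAYS", "30")),
        default_model=env.get("DEFAULT_MODEL", "claude-sonnet-4-20250514"),
        classification_model=env.get(
            "CLASSIFICATION_MODEL", "claude-3-5-haiku-20241022"
        ),
        max_tokens=int(env.get("MAX_TOKENS", "4096")),
        log_level=env.get("LOG_LEVEL", "INFO"),
    )
```

Passing a plain dict instead of `os.environ` makes the loader easy to exercise in tests, for example `load_settings({"ANTHROPIC_API_KEY": "test", "DOC_ANALYSIS_PORT": "9000"})`.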


Common Commands

| Command | Purpose |
|---------|---------|
| python -m document_analysis_mcp.server | Run MCP server locally (testing) |
| pytest | Run test suite |
| mypy src/ | Run type checking |
| ruff check src/ | Run linting |
| ruff format src/ | Format code |

Deployment:

ws update-prod document-analysis-mcp   # Human only, requires password

Integration Points

Dependencies

  • fastmcp - FastMCP server framework (Streamable HTTP transport)
  • anthropic - Claude API client for LLM analysis
  • pdfplumber - Primary PDF text and table extraction
  • pypdf - PDF fallback extraction and metadata
  • pytesseract - OCR wrapper for Tesseract
  • pydantic / pydantic-settings - Configuration and data models

Dependents

  • Claude Code - MCP tools accessible from any Claude Code session
  • knowledge-bank-tools - PDF processing during KB ingestion via pdf_kb_ingest

MCP Tools

| Tool | Description |
|------|-------------|
| health_check | Server health, API key status, cache stats |
| pdf_extract_full_tool | Full text extraction with optional LLM analysis (quick/comprehensive/deep) |
| pdf_classify_tool | Document type classification with analysis strategy recommendation |
| pdf_ocr_tool | OCR extraction for scanned PDFs via Tesseract |
| pdf_extract_structure_tool | Structure extraction (TOC, tables, headings) |
| pdf_kb_ingest_tool | One-shot extraction + classification + chunking for KB ingestion |
| cache_stats | Cache statistics and usage information |
| usage_summary | API usage tracking and cost breakdown |
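
For a sense of what a tool invocation looks like on the wire, the sketch below builds a JSON-RPC `tools/call` request of the shape MCP uses, aimed at the endpoint documented here. The `build_tool_call` helper is hypothetical. Whether the server accepts the call without a prior `initialize` exchange depends on its stateless configuration, so treat this as a debugging aid rather than a full client.

```python
import json
import urllib.request

MCP_URL = "http://localhost:8766/mcp"  # endpoint from this document


def build_tool_call(
    name: str, arguments: dict, request_id: int = 1
) -> urllib.request.Request:
    """Build (but do not send) a JSON-RPC 2.0 tools/call POST request."""
    payload = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }
    return urllib.request.Request(
        MCP_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Streamable HTTP servers may answer with JSON or an SSE stream.
            "Accept": "application/json, text/event-stream",
        },
        method="POST",
    )
```

To send it against a running server, pass the request to `urllib.request.urlopen(build_tool_call("health_check", {}), timeout=10)` and read the response body.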

Troubleshooting

Service Crash-Loop (status=226/NAMESPACE)

Symptom: systemctl status document-analysis-mcp shows exit code 226/NAMESPACE.

Root Cause: ProtectHome=tmpfs creates a tmpfs overlay that hides /home entirely. Since the service's WorkingDirectory, virtual environment, .env file, and cache directory all reside under /home/deploy/, systemd cannot start the process.

Solution: Change ProtectHome=tmpfs to ProtectHome=read-only in the service file. This was fixed in PR #17, matching the same fix applied to audio-analysis-mcp.

MCP Server Not Responding

Symptom: Clients time out when connecting to port 8766.

Solution:

# Check service status
sudo systemctl status document-analysis-mcp

# Check if port is listening
ss -tlnp | grep 8766

# Check firewall
sudo ufw status | grep 8766
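
The same "is anything listening" check can be scripted, for example from a monitoring job. This is a minimal sketch; the `is_port_open` helper is a hypothetical name, and the default host and port are the ones documented above.

```python
import socket


def is_port_open(
    host: str = "127.0.0.1", port: int = 8766, timeout: float = 2.0
) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False
```

A `False` result only shows that the TCP connection failed. It does not distinguish a stopped service from a firewall rule, so the commands above are still needed to narrow down the cause.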

MCP Client Not Registered

Symptom: Claude Code does not show document-analysis tools.

Solution: Register the MCP client:

# Check current registration
claude mcp get document-analysis
# Should show: Type: http, URL: http://localhost:8766/mcp

# Register if missing (Streamable HTTP transport)
claude mcp add --transport http document-analysis http://localhost:8766/mcp -s user

MCP Connection Issues After Server Restart

Symptom: After systemctl restart document-analysis-mcp, MCP tool calls fail with -32602: Invalid request parameters even though /health returns OK.

Background: This was caused by the deprecated SSE transport maintaining long-lived connections with session state. When the server restarted, the client kept using a stale session ID that the new server instance did not recognize. This was fixed by migrating from SSE to Streamable HTTP transport with stateless mode (matching audio-analysis-mcp).

Current transport: Streamable HTTP (stateless) at http://localhost:8766/mcp

If you still see issues: The Claude Code MCP client configuration must use --transport http (not --transport sse). Verify with:

claude mcp get document-analysis
# Should show: Type: http, URL: http://localhost:8766/mcp

To reconfigure (if still using SSE):

claude mcp remove document-analysis -s user
claude mcp add --transport http document-analysis http://localhost:8766/mcp -s user

OCR Fails on Scanned PDFs

Symptom: pdf_ocr returns empty text or an error.

Solution:

# Verify Tesseract is installed
tesseract --version

# Install if missing
sudo apt install tesseract-ocr
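
The Tesseract check can also be done from Python, which helps when diagnosing the environment the service itself runs in. The `tesseract_version` helper below is a hypothetical sketch, not part of the repository.

```python
import shutil
import subprocess
from typing import Optional


def tesseract_version() -> Optional[str]:
    """Return the first line of `tesseract --version`, or None if unavailable."""
    path = shutil.which("tesseract")
    if path is None:
        return None  # binary is not on PATH
    result = subprocess.run(
        [path, "--version"], capture_output=True, text=True, check=False
    )
    # Fall back to stderr in case the version banner is printed there.
    output = result.stdout or result.stderr
    lines = output.splitlines()
    return lines[0] if lines else None
```

A `None` result means the binary was not found on the `PATH` of the calling process, or that it printed no version banner. The `PATH` can differ between an interactive shell and the systemd service.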

API Rate Limits

Symptom: 429 errors from Anthropic API

Solution: The server has its own API key isolated from Claude Code. Check the Anthropic usage dashboard and consider increasing CACHE_TTL_DAYS to reduce duplicate API calls.
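
To illustrate why a longer TTL reduces duplicate calls, here is a minimal in-memory sketch of a hash-keyed cache with expiry. It is an assumption-laden illustration, not the project's cache: the `document_key` function and `TTLCache` class are hypothetical, and the real cache stores results on disk under CACHE_DIR.

```python
import hashlib
import time
from typing import Dict, Optional, Tuple

SECONDS_PER_DAY = 86_400


def document_key(pdf_bytes: bytes) -> str:
    """Identical file contents always map to the same key."""
    return hashlib.sha256(pdf_bytes).hexdigest()


class TTLCache:
    """In-memory cache whose entries expire after ttl_days."""

    def __init__(self, ttl_days: int = 30) -> None:
        self.ttl_seconds = ttl_days * SECONDS_PER_DAY
        self._entries: Dict[str, Tuple[float, str]] = {}

    def put(self, key: str, value: str, now: Optional[float] = None) -> None:
        stored_at = time.time() if now is None else now
        self._entries[key] = (stored_at, value)

    def get(self, key: str, now: Optional[float] = None) -> Optional[str]:
        current = time.time() if now is None else now
        entry = self._entries.get(key)
        if entry is None:
            return None  # miss: the caller would need an API call
        stored_at, value = entry
        if current - stored_at > self.ttl_seconds:
            del self._entries[key]  # expired: treated as a miss
            return None
        return value
```

With a 30-day TTL, a document re-submitted on day 31 is a miss and triggers a fresh API call. Raising the TTL turns that into a hit, at the cost of serving older results.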


Development Workflow

Authoritative Reference: See MULTI-AGENT-COLLABORATION.md for:

  • Workspace management (ws commands)
  • Git workflow and PR process
  • Production deployment
  • Session and lock management

Remote Deployment: See REMOTE-DEPLOY-SETUP.md for game-da-god deployment architecture.

Current Priorities: See GitHub Issues for development priorities.

Feature Specification: See project-tracker/features/document-analysis-mcp.md for detailed architecture.


Meta

Template: CLAUDE-repo.template.md from project-tracker/templates/

Last validated: 2026-02-06


Changelog

| Date | Changes |
|------|---------|
| 2026-02-06 | Migrated cache from /home/deploy/.cache to /var/cache using CacheDirectory=. Added deploy/install.sh. Updated deployment instructions. |
| 2026-02-06 | Migrated from SSE to Streamable HTTP transport (stateless). Updated MCP registration commands, troubleshooting, and dependency descriptions. |
| 2026-02-06 | Major update: Accurate directory structure, MCP client registration in install steps, complete tool inventory, corrected env vars (model names, MAX_TOKENS), ProtectHome troubleshooting, REMOTE-DEPLOY-SETUP.md reference. |
| 2026-02-01 | Initial creation from template |