Technical Specification

Version note: This document describes the external tool contract, data model, operational behavior, and implementation-aligned semantics of jCodeMunch-MCP at a high level. It is intended for engineers, integrators, evaluators, and technical stakeholders who need a precise understanding of how the system behaves. For broader architectural context, see ARCHITECTURE.md. For end-user workflows and setup, see USER_GUIDE.md.

Overview
Core Operating Model
Tool Surface
Data Models
- Symbol
- CodeIndex
Repository Acquisition and File Discovery
Indexing Semantics
Watcher and Index Freshness
Search and Ranking Semantics
Retrieval Semantics
Response Envelope
Error Handling
Transport Modes and CLI
Environment Variables
Security and Safety Controls
Performance and Operational Notes
Token Savings Semantics
Compatibility and Evolution Notes

Overview

jCodeMunch-MCP pre-indexes repository source code using tree-sitter AST parsing and builds a structured catalog of symbols such as functions, classes, methods, constants, and types. Each symbol stores structured metadata including its signature, summary, location, and byte offsets into cached raw file content. Full source is then retrievable on demand through direct byte-offset access rather than repeated full-file reads.

The system is designed to support AI agents and MCP-compatible clients that need repository navigation, symbol lookup, targeted code retrieval, and bounded context assembly while minimizing unnecessary token consumption.

At a high level, the contract is:

index a repository or local folder
inspect structure through outlines or trees
search for relevant symbols or text
retrieve only the required code or contextual bundle
optionally verify freshness or compute relationships and impact

Core Operating Model

jCodeMunch operates as a local-first structured retrieval layer.

Its core behaviors are:

parse source files into a normalized symbol index
persist symbol metadata separately from raw file contents
retrieve precise source segments by byte offset
expose those capabilities through a stable MCP tool surface
report operational metadata through _meta envelopes

The design assumes that repeated code exploration should be driven by structured navigation and targeted retrieval rather than by repeatedly loading large files into model context.

Tool Surface

The tool surface is best described by capability domain rather than by a fixed historical count.

Indexing and Repository Management

`index_repo` — Index a GitHub repository

{
  "url": "owner/repo",
  "use_ai_summaries": true
}

Indexes a GitHub repository by enumerating source files, applying the security and filtering pipeline, parsing with tree-sitter, generating summaries, and persisting both the index and cached raw files.

Behavioral notes:

accepts repository identifiers such as owner/repo
can use AI-generated summaries when configured
stores repository metadata, language counts, symbol records, file hashes, and cached raw content
may short-circuit unchanged reindex runs when repository metadata indicates no relevant changes

`index_folder` — Index a local folder

{
  "path": "/path/to/project",
  "extra_ignore_patterns": ["*.generated.*"],
  "follow_symlinks": false
}

Indexes a local project folder using the same parsing and persistence model as remote indexing, with additional local-path safety controls.

Behavioral notes:

performs recursive discovery with path and symlink protections
respects .gitignore and additional ignore patterns
can auto-detect supported ecosystem tools and apply context-provider enrichment
may return context-enrichment statistics when providers are active

`index_file` — Re-index a single file

{
  "path": "/absolute/path/to/file.py",
  "use_ai_summaries": false,
  "context_providers": true
}

Re-indexes a single file without touching the rest of the index. Locates the owning index by scanning source_root of all indexed repos and selecting the most specific match. Exits early if the file's hash is unchanged.

Behavioral notes:

requires the file's parent folder to already be indexed via index_folder
validates security (path must be within a known source_root)
checks mtime and hash — skips parse/save if file is unchanged
parses with tree-sitter, runs context providers, and writes a surgical incremental update
faster than re-running index_folder for single-file edits

`embed_repo` — Precompute symbol embeddings for semantic search

{
  "repo": "owner/repo",
  "batch_size": 50,
  "force": false
}

Precomputes and caches all symbol embeddings in one pass. This is an optional warm-up step; search_symbols(semantic=true) also computes embeddings lazily on first use.

Behavioral notes:

embeddings are stored in a symbol_embeddings SQLite table in the per-repo .db file
serialized as float32 BLOBs via stdlib array module; no numpy required
force=true recomputes even when cached embeddings exist
requires an embedding provider (JCODEMUNCH_EMBED_MODEL, OPENAI_API_KEY + OPENAI_EMBED_MODEL, or GOOGLE_API_KEY + GOOGLE_EMBED_MODEL)
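
The float32 BLOB encoding mentioned above can be illustrated with the stdlib array module alone. This is a minimal sketch: the helper names encode_embedding and decode_embedding are hypothetical and are not taken from the project's code.

```python
from array import array


def encode_embedding(vector: list[float]) -> bytes:
    """Pack a vector into a float32 BLOB (array typecode 'f')."""
    return array("f", vector).tobytes()


def decode_embedding(blob: bytes) -> list[float]:
    """Unpack a float32 BLOB back into a list of Python floats."""
    values = array("f")
    values.frombytes(blob)
    return values.tolist()


blob = encode_embedding([0.5, -1.25, 2.0])
restored = decode_embedding(blob)
```

Each component occupies four bytes, so a BLOB's length divided by four gives the embedding dimension without any extra metadata.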

`invalidate_cache` — Delete index for a repository

{
  "repo": "owner/repo"
}

Deletes the persisted index and cached raw content associated with the specified repository identifier.

Behavioral notes:

removes both metadata and cached content
is typically used when the index is stale, corrupted, or intentionally reset

`list_repos` — List indexed repositories

No input required.

Returns all indexed repositories known to the local store, together with summary metadata such as symbol counts, file counts, languages, index version, and optional display metadata.

`resolve_repo` — Resolve a path to a repo identifier

{
  "path": "/absolute/path/to/project"
}

O(1) lookup that resolves a filesystem path to its indexed repo identifier. Accepts repo roots, worktrees, subdirectories, or file paths. Computes the deterministic repo ID from the path hash and checks index existence directly — far cheaper than list_repos when you only need one repo.

Returns indexed: true with full metadata (symbol count, file count, languages, etc.) if found, or indexed: false with the computed repo ID and a hint to call index_folder if not.

Behavioral notes:

Preferred over list_repos when the client knows its working directory
Tries the input path first, then walks up to the git root for subdirectory lookups
Returns ~200 tokens vs. potentially thousands from list_repos

Discovery and Repository Inspection

`get_file_tree` — Get file structure

{
  "repo": "owner/repo",
  "path_prefix": "src/"
}

Returns a nested directory tree with file-level annotations such as language and symbol count where available.

Behavioral notes:

useful for structural exploration before retrieving any source
path_prefix can be used to scope the returned subtree

`get_file_outline` — Get symbols in a file

{
  "repo": "owner/repo",
  "file_path": "src/main.py"
}

Returns a hierarchical symbol tree for a file. Parent-child relationships such as class-to-method are preserved where the language and parser support them.

Behavioral notes:

includes signatures and summaries
does not include full source
intended as a lightweight inspection tool before get_symbol_source

`get_file_content` — Get cached file content

{
  "repo": "owner/repo",
  "file_path": "src/main.py",
  "start_line": 10,
  "end_line": 30
}

Returns raw cached file content from the local store.

Behavioral notes:

optional start_line and end_line are 1-based inclusive
line ranges are clamped to file bounds
intended for line-oriented retrieval when symbol retrieval is not appropriate
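
The 1-based inclusive range with clamping can be sketched as follows. The function name slice_lines is hypothetical, and the handling of an inverted range (returning an empty string) is an assumption rather than documented behavior.

```python
from typing import Optional


def slice_lines(text: str, start_line: Optional[int] = None,
                end_line: Optional[int] = None) -> str:
    """Return lines start_line..end_line (1-based, inclusive), clamped to the file."""
    lines = text.splitlines(keepends=True)
    total = len(lines)
    if total == 0:
        return ""
    # Clamp both bounds into [1, total]; a missing bound means "to the edge".
    start = 1 if start_line is None else max(1, min(start_line, total))
    end = total if end_line is None else max(1, min(end_line, total))
    if start > end:
        return ""  # assumption: an inverted range yields no content
    return "".join(lines[start - 1:end])
```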

`get_repo_outline` — High-level repository overview

{
  "repo": "owner/repo"
}

Returns repository-level summary information such as file counts by directory, language breakdown, symbol-kind distribution, most-imported files, and most central symbols by PageRank.

Behavioral notes:

lighter than get_file_tree
intended as a coarse-grained entry point for unfamiliar repositories
includes most_central_symbols (top 10 by PageRank score) alongside the existing most_imported_files
_meta.is_stale reflects whether the git HEAD has moved since index time

`suggest_queries` — Suggest high-value initial queries

{
  "repo": "owner/repo"
}

Returns guidance for exploring unfamiliar repositories, including useful keywords, common entry points, frequently imported files, distributions, and candidate follow-up queries.

Behavioral notes:

intended to reduce cold-start friction
useful when users or agents do not yet know symbol names or subsystem terminology

Retrieval

`get_symbol_source` — Get full source of one or more symbols

Single symbol — returns flat symbol object:

{
  "repo": "owner/repo",
  "symbol_id": "src/main.py::MyClass.login#method",
  "verify": true,
  "context_lines": 3
}

Batch — returns {symbols, errors}:

{
  "repo": "owner/repo",
  "symbol_ids": ["id1", "id2", "id3"]
}

Behavioral notes:

retrieval is based on cached raw file content, not reparsing
symbol_id and symbol_ids are mutually exclusive — passing both is an error
verify re-hashes the retrieved source and compares it with the stored content_hash; applies to all symbols in batch mode
context_lines optionally adds surrounding lines; applies to all symbols in batch mode
in batch mode, missing symbols are reported in errors[] without causing other lookups to fail

`get_context_bundle` — Retrieve a bounded contextual package around a symbol or symbol set

{
  "repo": "owner/repo",
  "symbol_id": "src/auth.py::AuthService.login#method",
  "token_budget": 4000,
  "budget_strategy": "most_relevant",
  "include_budget_report": false
}

Returns a bounded retrieval package designed to support downstream reasoning tasks without requiring a series of separate calls.

Behavioral notes:

includes the target symbol, related imports, deduplicated dependencies, and optional callers
supports multi-symbol bundles via symbol_ids[]
token_budget (int) — when set, symbols are ranked and trimmed to fit; fully backward-compatible (omit to get existing behavior)
budget_strategy: "most_relevant" (default, ranks by import in-degree), "core_first" (primary symbol first, imports ranked by centrality), "compact" (signatures only, no bodies)
include_budget_report=true adds a budget_report field with budget_tokens, used_tokens, included_symbols, excluded_symbols, and strategy

`get_ranked_context` — Query-driven token-budgeted context assembler

{
  "repo": "owner/repo",
  "query": "authentication flow",
  "token_budget": 4000,
  "strategy": "combined"
}

Returns the best-fit symbols for a task description, ranked by relevance and centrality, greedily packed to fit within the token budget.

Behavioral notes:

strategy: "combined" (BM25 + PageRank weighted sum, default), "bm25" (pure text relevance), "centrality" (PageRank only)
include_kinds (list) — restrict candidates to specific symbol kinds
scope (string) — restrict candidates to a subdirectory
per-item response includes relevance_score, centrality_score, combined_score, tokens, and source
token counting uses a len(text) // 4 heuristic and upgrades to tiktoken when it is installed (tiktoken is not a hard dependency)
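
A minimal sketch of the len(text) // 4 estimate combined with greedy packing under a token budget. The function names are hypothetical, and skipping an oversized item while continuing with later ones is an assumption; the real packer may stop at the first item that does not fit.

```python
def estimate_tokens(text: str) -> int:
    """Heuristic token estimate: roughly four characters per token."""
    return len(text) // 4


def pack_by_budget(ranked: list[tuple[str, str]], token_budget: int):
    """Greedily pack (symbol_id, source) pairs, given in descending score order."""
    included, used = [], 0
    for symbol_id, source in ranked:
        cost = estimate_tokens(source)
        if used + cost > token_budget:
            continue  # assumption: skip this item, later smaller ones may still fit
        included.append(symbol_id)
        used += cost
    return included, used


ranked = [("a", "x" * 40), ("b", "y" * 100), ("c", "z" * 20)]
included, used = pack_by_budget(ranked, token_budget=16)
```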

Search

`search_symbols` — Search across indexed symbols

{
  "repo": "owner/repo",
  "query": "authenticate",
  "kind": "function",
  "language": "python",
  "file_pattern": "src/**/*.py",
  "max_results": 10,
  "fuzzy": false,
  "sort_by": "relevance",
  "semantic": false
}

Searches the symbol index using a structured ranking pipeline.

Behavioral notes:

all filters are optional
supports narrowing by symbol kind, language, and file path pattern
ranking uses multiple lexical and metadata signals rather than a single naive match rule
fuzzy matching — fuzzy=true enables a trigram Jaccard + Levenshtein fallback when BM25 confidence is low (top score < 0.1) or when explicitly requested. Fuzzy results carry match_type="fuzzy", fuzzy_similarity, and edit_distance fields. Zero behavioral change when fuzzy=false (default).
centrality-aware ranking — sort_by: "relevance" (default, BM25), "centrality" (filter by query match, rank by PageRank), "combined" (BM25 + PageRank weighted sum)
semantic search — semantic=true enables hybrid BM25 + embedding ranking. Requires an embedding provider. semantic_weight (float, default 0.5) controls the blend. semantic_only=true skips BM25 entirely. semantic=false (default) has zero performance impact and zero new imports. semantic=true with no provider configured returns a structured error (error: "no_embedding_provider").
intended as the primary entry point for locating code by meaningfully named program elements
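
The two fuzzy signals named above, trigram Jaccard similarity and Levenshtein edit distance, can be sketched in plain Python. The padding scheme, lowercasing, and function names are assumptions for illustration; how the project blends the two scores is not described here and is not shown.

```python
def trigrams(s: str) -> set[str]:
    """Character trigrams of a lowercased, space-padded string (padding is an assumption)."""
    padded = f"  {s.lower()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def jaccard(a: str, b: str) -> float:
    """Trigram Jaccard similarity in [0.0, 1.0]."""
    ta, tb = trigrams(a), trigrams(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def levenshtein(a: str, b: str) -> int:
    """Edit distance via the classic two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]
```

A misspelled query such as "authentcate" shares most trigrams with "authenticate" and sits one edit away, which is the kind of near miss a fuzzy fallback is meant to catch.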

`search_text` — Full-text search across cached file contents

{
  "repo": "owner/repo",
  "query": "TODO",
  "file_pattern": "*.py",
  "max_results": 20,
  "context_lines": 2
}

Performs case-insensitive text search across indexed file contents.

Behavioral notes:

intended for comments, strings, configuration values, TODO markers, or other non-symbol content
returns grouped matches in a file-oriented structure
can include surrounding lines through context_lines
supports semantic=true for embedding-based search (same provider requirements as search_symbols)

Representative result shape:

[
  {
    "file": "src/main.py",
    "matches": [
      {
        "line": 42,
        "text": "TODO: refactor this path",
        "before": ["..."],
        "after": ["..."]
      }
    ]
  }
]

`search_columns` — Search column metadata across indexed models

{
  "repo": "owner/repo",
  "query": "customer_id",
  "model_pattern": "stg_*",
  "max_results": 20
}

Searches column metadata emitted by context providers (dbt, SQLMesh, database catalogs, etc.). Returns model name, file path, column name, and description.

Behavioral notes:

only returns results when the index was built with an active provider that emits column data
model_pattern narrows results to matching model names
intended for data-engineering workflows and lineage exploration

Relationship and Impact Analysis

`get_related_symbols` — Find structurally or heuristically related symbols

{
  "repo": "owner/repo",
  "symbol_id": "src/auth.py::AuthService.login#method"
}

Finds symbols related to a target symbol using heuristics such as same-file co-location, shared importers, or token overlap in names.

`get_class_hierarchy` — Traverse inheritance structure

{
  "repo": "owner/repo",
  "symbol_id": "src/services.py::UserService#class"
}

Returns class hierarchy information above and below the target symbol, including known indexed bases and derived classes where they can be identified.

`get_blast_radius` — Estimate impacted files or symbols

{
  "repo": "owner/repo",
  "symbol_id": "src/core.py::process_order#function",
  "include_depth_scores": false
}

Estimates likely impact by traversing reverse import relationships and inspecting relevant importers.

Behavioral notes:

distinguishes confirmed from potential impact where enough evidence exists
always returns overall_risk_score (0.0–1.0, weighted by hop distance: 1/depth^0.7) and direct_dependents_count
include_depth_scores=true adds impact_by_depth — confirmed files grouped by BFS layer, each with a risk_score
flat confirmed/potential lists are preserved unchanged (backward compatible)
intended for change-planning and refactoring workflows
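
The BFS layering and the 1/depth^0.7 hop weight can be sketched as below. The graph shape (a mapping from a file to the files that import it) and the function names are hypothetical. How per-file weights are aggregated and normalized into overall_risk_score is not specified in this document, so that step is deliberately omitted.

```python
from collections import deque


def hop_weight(depth: int) -> float:
    """Weight for a dependent found `depth` hops away (depth >= 1)."""
    return 1.0 / depth ** 0.7


def impact_by_depth(importers: dict[str, list[str]], start: str) -> dict[int, list[str]]:
    """Group transitive importers of `start` by BFS layer over the reverse import graph."""
    layers: dict[int, list[str]] = {}
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        for parent in importers.get(node, []):
            if parent not in seen:
                seen.add(parent)
                layers.setdefault(depth + 1, []).append(parent)
                queue.append((parent, depth + 1))
    return layers


importers = {
    "core.py": ["api.py", "jobs.py"],
    "api.py": ["app.py"],
    "jobs.py": ["app.py"],
}
layers = impact_by_depth(importers, "core.py")
```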

`get_symbol_importance` — Rank symbols by architectural centrality

{
  "repo": "owner/repo",
  "top_n": 20,
  "algorithm": "pagerank"
}

Returns the most architecturally important symbols in a repo, ranked by PageRank or in-degree on the import graph.

Behavioral notes:

algorithm: "pagerank" (damping=0.85, convergence threshold=1e-6, max 100 iterations) or "degree" (raw in-degree count, O(1))
scope (string) — restrict to a subdirectory or file glob
response includes symbol_id, rank, score, in_degree, out_degree, kind, and iterations_to_converge
PageRank scores are cached per CodeIndex load and recomputed only on incremental reindex
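
A minimal power-iteration sketch using the parameters listed above (damping 0.85, threshold 1e-6, at most 100 iterations). Edges point from an importer to the file it imports. Treating nodes without outgoing edges as spreading their rank uniformly, and measuring convergence by the L1 change between iterations, are assumptions; the project's implementation may differ on both points.

```python
def pagerank(graph: dict[str, list[str]], damping: float = 0.85,
             tol: float = 1e-6, max_iter: int = 100):
    """Return (scores, iterations) for a graph of importer -> imported edges."""
    nodes = sorted(set(graph) | {t for targets in graph.values() for t in targets})
    n = len(nodes)
    if n == 0:
        return {}, 0
    rank = {node: 1.0 / n for node in nodes}
    for iteration in range(1, max_iter + 1):
        # Rank held by nodes with no outgoing edges is redistributed uniformly.
        dangling = sum(rank[v] for v in nodes if not graph.get(v))
        new = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for src, targets in graph.items():
            if targets:
                share = damping * rank[src] / len(targets)
                for target in targets:
                    new[target] += share
        delta = sum(abs(new[v] - rank[v]) for v in nodes)
        rank = new
        if delta < tol:
            break
    return rank, iteration


scores, iterations = pagerank({"a.py": ["core.py"], "b.py": ["core.py"], "core.py": []})
```

In this toy graph core.py is imported by both other files and therefore ends up with the highest score.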

`find_dead_code` — Find unreachable symbols and files

{
  "repo": "owner/repo",
  "granularity": "symbol",
  "min_confidence": 0.8,
  "include_tests": false
}

Finds symbols and files unreachable from any entry point via BFS over the import graph.

Behavioral notes:

entry points auto-detected: main.py, __main__.py, conftest.py, manage.py, __init__.py package roots, if __name__ == "__main__" guards (Python), and common framework decorators
granularity: "symbol" (default) or "file" (a file is dead only when all its symbols are dead)
min_confidence (float, default 0.8) — lower values surface more candidates including ambiguous cases
entry_point_patterns (list) — additional glob patterns to treat as live roots
confidence scoring: 1.0 = zero importers, no framework decoration; 0.9 = zero importers in a test file; 0.7 = all importers are themselves dead (cascading)
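
The reachability step can be sketched as a BFS from entry points over a file-level import graph. The function name and graph shape are hypothetical, and the sketch omits entry-point auto-detection and confidence scoring entirely.

```python
from collections import deque


def find_unreachable(imports: dict[str, list[str]], entry_points: list[str]) -> set[str]:
    """Return files not reachable from any entry point by following imports."""
    all_files = set(imports) | {t for targets in imports.values() for t in targets}
    live = {ep for ep in entry_points if ep in all_files}
    queue = deque(live)
    while queue:
        current = queue.popleft()
        for dep in imports.get(current, []):
            if dep not in live:
                live.add(dep)
                queue.append(dep)
    return all_files - live


imports = {
    "main.py": ["auth.py"],
    "auth.py": ["util.py"],
    "old.py": ["util.py"],
    "util.py": [],
}
dead = find_unreachable(imports, ["main.py"])
```

Here old.py imports a live file but nothing live imports it, so it is the only candidate reported.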

`get_changed_symbols` — Map a git diff to affected symbols

{
  "repo": "owner/repo",
  "since_sha": null,
  "until_sha": "HEAD",
  "include_blast_radius": false
}

Maps a git diff to the symbols that were added, modified, removed, or renamed.

Behavioral notes:

since_sha defaults to the SHA stored at index time; until_sha defaults to "HEAD"
change_type values: "added", "removed", "modified", "renamed" (body-hash-identical rename heuristic)
include_blast_radius=true appends downstream importers per changed symbol (up to max_blast_depth hops)
requires a locally indexed repo (index_folder); GitHub-indexed repos return a structured error
requires git on PATH; graceful error if not available
filters index-storage files from the diff when the storage dir is inside the repo

`find_importers` — Find files that import a given file

{
  "repo": "owner/repo",
  "file_path": "src/auth.py"
}

Answers "what uses this file?" by walking the stored import graph. Each result indicates whether the importer is itself imported by other files.

Behavioral notes:

accepts file_path (single) or file_paths (batch)
returns has_importers boolean per result for quick fan-out assessment
useful for understanding coupling before refactoring

`find_references` — Find files that reference an identifier

{
  "repo": "owner/repo",
  "identifier": "UserService"
}

Answers "where is this used?" by combining import-graph analysis and identifier-name matching. For dbt, also traces {{ ref() }} and {{ source() }} edges.

Behavioral notes:

accepts identifier (single) or identifiers (batch)
returns matches grouped by type: import references, content references, model references

`check_references` — Dead-code detection combining imports and content search

{
  "repo": "owner/repo",
  "identifier": "deprecated_helper"
}

Combines find_references and search_text into one call. Returns is_referenced boolean for quick dead-code checks plus detailed matches when references exist.

Behavioral notes:

accepts identifier (single) or identifiers (batch)
search_content controls whether to include text-search fallback (default true)
intended as a single-call replacement for the common "is this symbol used anywhere?" pattern

`get_dependency_graph` — File-level dependency graph

{
  "repo": "owner/repo",
  "file": "src/core.py",
  "direction": "both",
  "depth": 2
}

Traverses import relationships to build a file-level dependency graph.

Behavioral notes:

direction: "imports" (files this file depends on), "importers" (files that depend on this file), or "both"
depth: number of hops to traverse (1–3)
useful for understanding module coupling and change impact

`get_symbol_diff` — Compare indexed symbol states across snapshots

{
  "repo_a": "owner/repo-branch-a",
  "repo_b": "owner/repo-branch-b"
}

Reports added, removed, or changed symbols by comparing two indexed snapshots using (name, kind) and content_hash. Index the same repo under two names to compare branches.
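
The comparison described above can be sketched by keying each snapshot on (name, kind) and comparing content_hash values. The function name and the output shape are hypothetical; only the matching rule comes from the text.

```python
def diff_symbols(snapshot_a: list[dict], snapshot_b: list[dict]) -> dict[str, list[tuple[str, str]]]:
    """Compare two lists of symbol records keyed by (name, kind)."""
    index_a = {(s["name"], s["kind"]): s["content_hash"] for s in snapshot_a}
    index_b = {(s["name"], s["kind"]): s["content_hash"] for s in snapshot_b}
    return {
        "added": sorted(k for k in index_b if k not in index_a),
        "removed": sorted(k for k in index_a if k not in index_b),
        "changed": sorted(k for k in index_a.keys() & index_b.keys()
                          if index_a[k] != index_b[k]),
    }


a = [
    {"name": "login", "kind": "method", "content_hash": "h1"},
    {"name": "logout", "kind": "method", "content_hash": "h2"},
]
b = [
    {"name": "login", "kind": "method", "content_hash": "h9"},
    {"name": "refresh", "kind": "method", "content_hash": "h3"},
]
result = diff_symbols(a, b)
```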

Observability

`get_session_stats` — Token savings statistics

No input required.

Returns per-session and all-time token savings statistics, per-tool breakdown, session duration, and cost-avoidance estimates across model price points.

Data Models

Symbol

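from dataclasses import dataclass, field

@dataclass
class Symbol:
    id: str
    file: str
    name: str
    qualified_name: str
    kind: str
    language: str
    signature: str
    content_hash: str = ""
    docstring: str = ""
    summary: str = ""
    # Fields declared after a defaulted field must also carry defaults.
    decorators: list[str] = field(default_factory=list)
    keywords: list[str] = field(default_factory=list)
    parent: str | None = None
    line: int = 0
    end_line: int = 0
    byte_offset: int = 0
    byte_length: int = 0
    ecosystem_context: str = ""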

Symbol field semantics

id: stable identifier of the form {file_path}::{qualified_name}#{kind}
file: relative file path within the indexed repository
name: local symbol name
qualified_name: dotted or container-qualified path including parent context
kind: normalized symbol category such as function, class, method, constant, or type
language: normalized language label
signature: signature line or equivalent declaration text
content_hash: SHA-256 hash of the source bytes used for drift detection and diffing
docstring: docstring or nearest available inline documentation when extracted
summary: condensed human- or model-consumable description
decorators: decorators, attributes, annotations, or equivalent modifiers
keywords: auxiliary search keywords
parent: parent symbol ID where applicable
line / end_line: 1-indexed line span within the original file
byte_offset / byte_length: exact byte range in the cached raw file
ecosystem_context: provider-derived business or ecosystem metadata
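
The id format {file_path}::{qualified_name}#{kind} can be built and split with plain string operations. This is a sketch with hypothetical helper names; it assumes the file path contains no "::" and the kind contains no "#", and it ignores any overload disambiguation the indexer may add.

```python
def make_symbol_id(file_path: str, qualified_name: str, kind: str) -> str:
    """Compose an id of the form {file_path}::{qualified_name}#{kind}."""
    return f"{file_path}::{qualified_name}#{kind}"


def parse_symbol_id(symbol_id: str) -> tuple[str, str, str]:
    """Split an id into (file_path, qualified_name, kind)."""
    file_path, sep, rest = symbol_id.partition("::")    # first "::" ends the path
    qualified_name, hash_sep, kind = rest.rpartition("#")  # last "#" starts the kind
    if not sep or not hash_sep:
        raise ValueError(f"not a symbol id: {symbol_id!r}")
    return file_path, qualified_name, kind


parts = parse_symbol_id("src/main.py::MyClass.login#method")
```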

CodeIndex

from dataclasses import dataclass

@dataclass
class CodeIndex:
    repo: str
    owner: str
    name: str
    indexed_at: str
    index_version: int
    source_files: list[str]
    languages: dict[str, int]
    symbols: list[dict]
    file_hashes: dict[str, str]
    git_head: str
    source_root: str
    file_languages: dict[str, str]
    display_name: str

CodeIndex field semantics

repo: canonical repository identifier used by the local store
owner / name: remote repository components where applicable
indexed_at: ISO timestamp of index creation
index_version: schema version for compatibility control
source_files: included files after filtering
languages: file-count distribution by language
symbols: serialized symbol records, excluding raw source payloads
file_hashes: file-level hashes used for incremental indexing
git_head: repository revision marker where available
source_root: absolute local source root for local indexes
file_languages: language per file mapping
display_name: friendly display label for local repository identities

Repository Acquisition and File Discovery

GitHub Repositories

GitHub repositories are typically enumerated through a single recursive tree request:

GET /repos/{owner}/{repo}/git/trees/HEAD?recursive=1

File candidates are then filtered through the same safety and eligibility pipeline used for local folders before content is fetched and parsed.

Local Folders

Local folders are discovered through recursive directory walking with path safety, ignore handling, secret exclusion, and binary detection.

Filtering Pipeline

Both remote and local acquisition paths flow through the same conceptual filtering pipeline:

Extension filter: the file extension must map to a supported language or file type.
Skip patterns: excludes directories and files such as node_modules/, vendor/, .git/, build artifacts, lock files, minified assets, and other low-value or generated content.
.gitignore handling: ignore semantics are respected through pathspec-based matching where applicable.
Secret detection: files such as .env, *.pem, *.key, *.p12, and similar credential-bearing artifacts are excluded.
Binary detection: uses extension-based heuristics together with null-byte or content-based detection.
Size limit: files exceeding configured size bounds are skipped.
File count limit: indexing is capped by a configurable file-count limit, with priority typically given to high-value source directories before lower-priority remainder paths.
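
A simplified sketch of how several of these stages might compose into one eligibility check. The extension set, size bound, sniff window, and function name are illustrative assumptions, not the project's configuration. The .gitignore and file-count stages are left out because they need repository-wide state.

```python
from fnmatch import fnmatch
from pathlib import PurePosixPath

SUPPORTED_EXTENSIONS = {".py", ".js", ".ts", ".go"}    # illustrative subset
SKIP_DIRS = {"node_modules", "vendor", ".git"}
SECRET_PATTERNS = (".env", "*.pem", "*.key", "*.p12")
MAX_BYTES = 500_000                                    # hypothetical size bound


def is_eligible(rel_path: str, content: bytes) -> bool:
    """Apply extension, skip, secret, binary, and size checks to one file."""
    path = PurePosixPath(rel_path)
    if path.suffix not in SUPPORTED_EXTENSIONS:
        return False
    if SKIP_DIRS.intersection(path.parts[:-1]):
        return False
    if any(fnmatch(path.name, pattern) for pattern in SECRET_PATTERNS):
        return False
    if b"\x00" in content[:8192]:  # null byte in the first block suggests binary
        return False
    return len(content) <= MAX_BYTES
```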

Indexing Semantics

The indexing process performs the following conceptual stages:

discover candidate files
apply security and filtering rules
detect language support
parse source into AST form
extract normalized symbol records
post-process overloads and hashes
enrich symbols through context providers where available
generate summaries
persist index metadata and raw file cache

Incremental indexing

Incremental indexing avoids reprocessing unchanged files by comparing stored file hashes and repository metadata.

Key behaviors:

Mtime fast-path: files whose mtime is unchanged since the last index are skipped without reading content
Hash comparison: files with changed mtimes are hashed; if the hash matches the stored value, parsing is skipped
Watcher fast-path: when the watcher provides a pre-known change set via changed_paths, full directory discovery is skipped entirely — only the affected files are processed
Memory hash cache: the watcher supplies old_hash from its in-memory cache, allowing index_folder to skip loading the stored index from SQLite
Git tree SHA short-circuiting: unchanged remote indexes can be detected via the tree SHA
Cross-process locking: file locks prevent concurrent index corruption
Atomic writes: index updates use atomic write patterns to prevent partial-write corruption
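
The mtime fast-path followed by hash comparison can be sketched as a single decision function. The name needs_reparse and its parameters are hypothetical. Content is supplied through a callable so that the fast-path demonstrably never reads the file, and the use of a SHA-256 hex digest for file hashes is an assumption.

```python
import hashlib
from typing import Callable, Optional


def needs_reparse(stored_mtime_ns: Optional[int], stored_hash: str,
                  current_mtime_ns: int, read_bytes: Callable[[], bytes]) -> bool:
    """Decide whether a file must be parsed again."""
    # Mtime fast-path: unchanged mtime means the content is never read.
    if stored_mtime_ns is not None and stored_mtime_ns == current_mtime_ns:
        return False
    # Mtime changed: hash the content and compare with the stored value.
    digest = hashlib.sha256(read_bytes()).hexdigest()
    return digest != stored_hash
```

A file that was touched but not edited therefore costs one read and one hash, but no parse.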

Watcher and Index Freshness

Watcher architecture

The file watcher monitors local folders for changes and triggers incremental reindexing automatically. It runs as a background task in the same event loop as the MCP server (when --watcher is enabled) or as a standalone process via the watch subcommand.

Key components:

Debounce: filesystem events are debounced (default 2000ms, configurable via JCODEMUNCH_WATCH_DEBOUNCE_MS) before triggering a reindex
Memory hash cache: the watcher maintains an in-memory dict[str, str] mapping rel_path → content_hash. On each debounce tick, the watcher compares incoming file hashes against in-memory hashes rather than loading the full SQLite index (~57ms savings per reindex)
WatcherChange: changed files are communicated to index_folder as WatcherChange(change_type, abs_path, old_hash) NamedTuples, where old_hash is provided from the memory cache
Fast path: when changed_paths is provided, index_folder skips full directory discovery (~3s on Windows) and only processes affected files

Deferred summarization

When AI summaries are enabled, the fast path splits the indexing pipeline into two phases:

Immediate (critical path): tree-sitter parse → incremental_save with empty summaries
Deferred (background daemon thread): AI summarization → second incremental_save to update summaries

A monotonic generation counter prevents stale deferred work from overwriting fresher data: the deferred thread captures the current generation before starting and checks it again before saving.
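
The capture-then-recheck pattern can be sketched with a small guard object. The class and method names are hypothetical; the project tracks the counter as deferred_generation in per-repo state, and its locking details are not described here.

```python
import threading
from typing import Callable


class DeferredGuard:
    """Monotonic generation counter guarding deferred saves."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._generation = 0

    def bump(self) -> int:
        """Start a new generation and return it for the deferred worker to capture."""
        with self._lock:
            self._generation += 1
            return self._generation

    def save_if_current(self, captured: int, save: Callable[[], None]) -> bool:
        """Run save() only if no newer generation started since `captured`."""
        with self._lock:
            if captured != self._generation:
                return False  # stale deferred work is dropped
            save()
            return True


guard = DeferredGuard()
saved: list[str] = []
first = guard.bump()    # deferred summarization for edit 1 starts
second = guard.bump()   # edit 2 arrives before edit 1's summaries are ready
```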

Per-repo reindex state

Each repo tracked by the watcher has independent reindex state:

reindexing: whether a reindex is actively in progress
stale_since: monotonic timestamp when the index first became stale
consecutive_failures: incremented on failure, reset on success
deferred_generation: monotonic counter for deferred-work cancellation

State is managed through three lifecycle functions:

mark_reindex_start(repo) — sets reindexing flag, clears event, records stale_since
mark_reindex_done(repo) — clears reindexing, sets event, clears stale_since and failures
mark_reindex_failed(repo, error) — clears reindexing, sets event (unblocks waiters), increments failures, preserves stale_since

Freshness modes

The server supports two freshness modes, controlled by the freshness_mode config key (or the --freshness-mode CLI flag):

relaxed (default): query tools return immediately regardless of reindex state.
strict: all read query tools automatically wait up to 500ms for any in-progress reindex to complete before returning. Write tools (index_folder, index_repo, index_file, invalidate_cache) and utility tools (list_repos, get_session_stats) are excluded from the wait.

The strict-mode wait uses threading.Event.wait() dispatched via asyncio.to_thread to avoid blocking the event loop.
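
A minimal sketch of that wait, assuming a per-repo threading.Event that is set when no reindex is in progress. The function name wait_for_fresh is hypothetical; asyncio.to_thread requires Python 3.9 or later.

```python
import asyncio
import threading


async def wait_for_fresh(done: threading.Event, timeout_s: float = 0.5) -> bool:
    """Wait up to timeout_s for a reindex to finish without blocking the event loop.

    Returns True if the event was set in time, False on timeout.
    """
    return await asyncio.to_thread(done.wait, timeout_s)


idle = threading.Event()
idle.set()                      # no reindex running
busy = threading.Event()        # reindex still in progress

fresh = asyncio.run(wait_for_fresh(idle))
timed_out = asyncio.run(wait_for_fresh(busy, timeout_s=0.01))
```

Because the blocking wait runs on a worker thread, other coroutines on the loop keep making progress during the timeout window.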

Search and Ranking Semantics

Symbol search model

Symbol search combines multiple ranking signals, including:

exact name matches
substring matches
token overlap
signature terms
summary terms
docstring and keyword relevance
structure-derived signals such as file centrality

The search model is therefore hybrid rather than purely lexical.

Bounded result handling

When only top-k results are required, the system may use bounded heap strategies rather than full sorting over all candidates.
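
One way to realize a bounded top-k selection in Python is heapq.nlargest, which keeps at most k candidates in a heap instead of sorting the full list. The function name and the (symbol_id, score) tuple shape are assumptions for illustration.

```python
import heapq


def top_k(scored: list[tuple[str, float]], k: int) -> list[tuple[str, float]]:
    """Return the k highest-scoring (symbol_id, score) pairs, best first."""
    return heapq.nlargest(k, scored, key=lambda item: item[1])


best = top_k([("a", 0.2), ("b", 0.9), ("c", 0.5)], k=2)
```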

Text search model

Text search is file-content oriented and is intended for cases where the desired material is not naturally represented as a symbol. Matching is case-insensitive and may return surrounding context lines.

Retrieval Semantics

Byte-offset retrieval

get_symbol_source and related retrieval tools access cached raw file content by stored byte offsets and lengths. This avoids reparsing or rescanning full files and preserves exact source fidelity.
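
Byte-offset access amounts to a seek followed by a bounded read. The sketch below uses an in-memory buffer in place of the cached file; the function name read_span is hypothetical. Offsets are byte positions, not character positions, which matters as soon as a file contains non-ASCII text.

```python
import io
from typing import BinaryIO


def read_span(fh: BinaryIO, byte_offset: int, byte_length: int) -> bytes:
    """Read exactly byte_length bytes starting at byte_offset."""
    fh.seek(byte_offset)
    data = fh.read(byte_length)
    if len(data) != byte_length:
        raise ValueError("cached file is shorter than the stored span")
    return data


raw = "# café\ndef f():\n    return 1\n".encode("utf-8")
offset = raw.index(b"def")  # 8, not 7: "é" occupies two bytes in UTF-8
source = read_span(io.BytesIO(raw), offset, len(raw) - offset)
```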

Verification

When verify is requested, the retrieved content is re-hashed and compared to the stored content_hash. Verification status is surfaced through _meta.
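
The check reduces to hashing the retrieved bytes and comparing digests. A minimal sketch, assuming content_hash is stored as a SHA-256 hex digest; the function name is hypothetical.

```python
import hashlib


def verify_content(source_bytes: bytes, content_hash: str) -> bool:
    """Return True when the retrieved bytes still match the stored hash."""
    return hashlib.sha256(source_bytes).hexdigest() == content_hash


original = b"def f():\n    return 1\n"
stored_hash = hashlib.sha256(original).hexdigest()
```

A False result indicates that the cached bytes no longer match what was indexed, which is the drift condition the flag is meant to expose.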

Batch retrieval

get_symbol_source accepts either symbol_id (string, returns flat object) or symbol_ids (array, returns {symbols, errors}). This reduces overhead for workflows that require several related definitions simultaneously.

Contextual retrieval

get_context_bundle is intended to package useful surrounding material while preserving a bounded payload. This capability is designed to reduce repeated tool orchestration for common reasoning workflows.

Response Envelope

All tool responses include an _meta object containing operational metadata.

Representative envelope:

{
  "_meta": {
    "powered_by": "jcodemunch-mcp by jgravelle · https://github.com/jgravelle/jcodemunch-mcp",
    "timing_ms": 42,
    "repo": "owner/repo",
    "symbol_count": 387,
    "truncated": false,
    "content_verified": true,
    "tokens_saved": 2450,
    "total_tokens_saved": 184320,
    "estimate_method": "..."
  }
}

Common `_meta` fields

timing_ms: elapsed execution time in milliseconds
repo: repository identifier
symbol_count: symbol count where relevant to the operation
truncated: whether the result was truncated or bounded
content_verified: whether verification succeeded when requested
tokens_saved: per-call token-savings estimate
total_tokens_saved: cumulative saved-token estimate across calls
estimate_method: label describing how the savings estimate was computed

The exact _meta shape may vary by tool, but the response contract emphasizes explicit operational metadata rather than opaque output.

Error Handling

Errors return a structured object containing a human-readable message and minimal timing metadata.

Representative shape:

{
  "error": "Human-readable message",
  "_meta": {
    "timing_ms": 1
  }
}

Expected error behaviors

| Scenario | Behavior |
| --- | --- |
| Repository not found | returns an error message |
| GitHub rate limited | returns an error with reset guidance and recommends `GITHUB_TOKEN` |
| Individual file fetch fails | file is skipped; indexing continues |
| Individual file parse fails | file is skipped; indexing continues |
| No source files found | returns an error |
| Symbol ID not found | returns an error or per-item error entry |
| Repository not indexed | returns an error suggesting indexing first |
| AI summarization fails | falls back to docstring or signature |
| Index version mismatch | old index is ignored; reindex required |

The error model is designed so that partial failures during indexing do not necessarily abort the entire operation.

Transport Modes and CLI

Subcommands

| Subcommand | Purpose |
| --- | --- |
| `serve` | Run the MCP server (default when no subcommand given) |
| `watch` | Watch folders for changes and auto-reindex (standalone) |
| `watch-claude` | Auto-discover and watch Claude Code worktrees |
| `hook-event` | Record a worktree lifecycle event (used by hooks) |

Transport modes (`serve`)

The serve subcommand supports three transport modes via --transport or JCODEMUNCH_TRANSPORT:

stdio (default): standard input/output, suitable for MCP clients that launch the server as a subprocess
sse: HTTP with Server-Sent Events, for persistent connections
streamable-http: HTTP with streamable response bodies

HTTP transports bind to --host / --port (defaults: 127.0.0.1:8901) and support optional bearer token authentication via JCODEMUNCH_HTTP_TOKEN.
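
Optional bearer authentication of this kind is commonly implemented as a header check that is skipped when no token is configured. The sketch below shows one way such a check could look. It is an assumption about the general technique, not a description of the server's code: `is_authorized` is a hypothetical helper, header names are assumed to be already normalized, and the constant-time comparison is a choice made for this sketch.

```python
import hmac
from typing import Mapping, Optional


def is_authorized(headers: Mapping[str, str], expected_token: Optional[str]) -> bool:
    """Return True if no token is configured or the bearer token matches."""
    if not expected_token:
        return True  # authentication is optional when no token is set
    scheme, _, token = headers.get("Authorization", "").partition(" ")
    if scheme.lower() != "bearer":
        return False
    # compare_digest avoids leaking match position through timing.
    return hmac.compare_digest(token.strip().encode(), expected_token.encode())
```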

Serve flags

Flag	Purpose
`--transport {stdio,sse,streamable-http}`	Transport mode
`--host HOST`	HTTP bind address
`--port PORT`	HTTP listen port
`--watcher[=BOOL]`	Enable background file watcher
`--watcher-path PATH [PATH...]`	Folders to watch (default: cwd)
`--watcher-debounce MS`	Debounce interval in ms
`--watcher-idle-timeout MINUTES`	Auto-stop after N idle minutes
`--watcher-no-ai-summaries`	Disable AI summaries for watcher
`--watcher-log [PATH]`	Log watcher output to file
`--freshness-mode {relaxed,strict}`	Freshness mode for query tools

Environment Variables

Variable	Purpose	Required
`GITHUB_TOKEN`	GitHub API authentication, higher rate limits, private repository support	No
`ANTHROPIC_API_KEY`	enables Anthropic-based summaries	No
`ANTHROPIC_MODEL`	overrides the Anthropic summary model	No
`GOOGLE_API_KEY`	enables Gemini-based summaries when Anthropic is not configured	No
`GOOGLE_MODEL`	overrides the Gemini summary model	No
`OPENAI_API_BASE`	enables local or remote OpenAI-compatible summary backends	No
`OPENAI_MODEL`	model name for OpenAI-compatible summary backends	No
`OPENAI_API_KEY`	authentication for OpenAI-compatible summary backends	No
`OPENAI_CONCURRENCY`	concurrency control for summary batching	No
`OPENAI_BATCH_SIZE`	batch sizing for OpenAI-compatible summarization	No
`OPENAI_MAX_TOKENS`	max output tokens for OpenAI-compatible summarizers	No
`CODE_INDEX_PATH`	custom storage path	No
`JCODEMUNCH_CONTEXT_PROVIDERS`	enables or disables provider enrichment	No
`JCODEMUNCH_MAX_INDEX_FILES`	overrides the default file-count limit	No
`JCODEMUNCH_LOG_FILE`	directs logging to file instead of stderr in stdio sessions	No
`JCODEMUNCH_SHARE_SAVINGS`	enables or disables community savings reporting	No
`JCODEMUNCH_PATH_MAP`	remaps stored path prefixes at retrieval time; format: `orig1=new1,orig2=new2` — allows an index built on one machine (e.g. Linux `/home/user`) to be reused on another (e.g. Windows `C:\Users\user`) without re-indexing. Each pair is split on the last `=`, so `=` signs within path components are preserved. Pairs are comma-separated; path components containing commas are not supported. First matching prefix wins.	No
`JCODEMUNCH_REDACT_SOURCE_ROOT`	redacts absolute path details from output	No
`JCODEMUNCH_TRANSPORT`	transport mode: `stdio`, `sse`, or `streamable-http`	No
`JCODEMUNCH_HOST`	HTTP bind address (default `127.0.0.1`)	No
`JCODEMUNCH_PORT`	HTTP listen port (default `8901`)	No
`JCODEMUNCH_HTTP_TOKEN`	bearer token for HTTP transport authentication	No
`JCODEMUNCH_FRESHNESS_MODE`	freshness mode: `relaxed` (default) or `strict`	No
`JCODEMUNCH_WATCH_DEBOUNCE_MS`	watcher debounce interval in ms (default `2000`)	No
`JCODEMUNCH_USE_AI_SUMMARIES`	default for `use_ai_summaries` flag (`true`/`false`)	No
`JCODEMUNCH_CLAUDE_POLL_INTERVAL`	poll interval in seconds for `watch-claude` git polling	No
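
The `JCODEMUNCH_PATH_MAP` rules in the table (comma-separated pairs, each split on the last `=`, first matching prefix wins) can be sketched as follows. The function names are hypothetical, and two behaviors are assumptions of this sketch rather than documented semantics: entries without an `=` are skipped, and path separators are not normalized after remapping.

```python
from typing import List, Tuple


def parse_path_map(raw: str) -> List[Tuple[str, str]]:
    """Parse 'orig1=new1,orig2=new2' into ordered (orig, new) pairs."""
    pairs = []
    for item in raw.split(","):
        item = item.strip()
        if "=" not in item:
            continue  # assumption: malformed or empty entries are ignored
        # Split on the LAST '=', so '=' inside the original path is preserved.
        orig, new = item.rsplit("=", 1)
        pairs.append((orig, new))
    return pairs


def remap_path(path: str, pairs: List[Tuple[str, str]]) -> str:
    """Rewrite the first matching prefix; later pairs are not consulted."""
    for orig, new in pairs:
        if path.startswith(orig):
            return new + path[len(orig):]
    return path
```

Because the first match wins, more specific prefixes should be listed before broader ones.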

Security and Safety Controls

The specification assumes the following security controls are part of compliant operation:

path traversal prevention
symlink escape protection
secret-file exclusion
binary file exclusion
safe encoding handling
.gitignore respect where appropriate
SSRF prevention for configurable API base URLs
ReDoS protection in text search
safe temporary-file behavior
optional HTTP bearer authentication for HTTP transport (via JCODEMUNCH_HTTP_TOKEN)
source-root redaction when configured

These protections apply to repository discovery, file loading, search, retrieval, and optional external-summary integrations.
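
As an illustration of the first two controls, path traversal and symlink escape are typically blocked by resolving a candidate path and confirming it still lies under the source root. The sketch below shows that general technique. `resolve_inside` is a hypothetical helper, and nothing here is taken from the jCodeMunch codebase.

```python
from pathlib import Path


def resolve_inside(root: str, relative: str) -> Path:
    """Resolve `relative` under `root`, rejecting anything that escapes it."""
    root_path = Path(root).resolve()
    # resolve() collapses '..' segments and follows symlinks, so both
    # traversal and symlink escapes surface as a path outside root_path.
    candidate = (root_path / relative).resolve()
    try:
        candidate.relative_to(root_path)
    except ValueError:
        raise PermissionError(f"path escapes source root: {relative}")
    return candidate
```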

Performance and Operational Notes

Local-first persistence

Indexes and raw file caches are stored locally to make repeat search and retrieval fast and to avoid redundant remote fetches.

Sidecars and metadata shortcuts

Metadata sidecars may be used so repo-listing operations do not require loading full index payloads.

Cache behavior

The store may use LRU-like caching and mtime invalidation to reduce repeated disk and parse costs.
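
A minimal sketch of an LRU cache with mtime invalidation follows. It illustrates the general pattern only: the class name, the `loader` callback, the default capacity, and the injectable `mtime_of` function are all assumptions made for this example.

```python
import os
from collections import OrderedDict


class MtimeLRUCache:
    """LRU cache whose entries are dropped when the file's mtime changes."""

    def __init__(self, loader, max_entries=32, mtime_of=os.path.getmtime):
        self._loader = loader          # called on a miss: loader(path) -> value
        self._max = max_entries
        self._mtime_of = mtime_of      # injectable so it can be faked in tests
        self._entries = OrderedDict()  # path -> (mtime, value), oldest first

    def get(self, path):
        mtime = self._mtime_of(path)
        hit = self._entries.get(path)
        if hit is not None and hit[0] == mtime:
            self._entries.move_to_end(path)  # mark as most recently used
            return hit[1]
        # Miss or stale entry: reload and store under the current mtime.
        value = self._loader(path)
        self._entries[path] = (mtime, value)
        self._entries.move_to_end(path)
        if len(self._entries) > self._max:
            self._entries.popitem(last=False)  # evict least recently used
        return value
```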

File locking

Cross-process locking is used to reduce the risk of index corruption under concurrent access.

Watch mode

The file watcher monitors directories and triggers incremental reindexing automatically. It can run embedded in the `serve` process (via `--watcher`) or standalone (via the `watch` subcommand).

Key optimizations:

Memory hash cache: avoids loading the full SQLite index on each debounce tick (~57ms savings)
Watcher fast path: when the watcher knows the exact changed files, index_folder skips full directory discovery (~3s → ~50ms on Windows)
Deferred summarization: AI summaries are computed in a background thread, so the index is available immediately with empty summaries that are filled in asynchronously
Per-repo backpressure: each watched repo has independent reindex state with threading.Event-based signaling; in strict freshness mode, query tools automatically wait for reindex completion
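
The per-repo backpressure item can be sketched with one `threading.Event` per repository. This is a simplified illustration rather than the actual implementation: the class and method names are hypothetical, the `timeout` parameter is an addition for the sketch, and relaxed mode is assumed to report state without blocking.

```python
import threading


class RepoReindexState:
    """Tracks whether a reindex is in flight for one watched repository."""

    def __init__(self):
        self._idle = threading.Event()
        self._idle.set()  # no reindex is running initially

    def begin(self):
        """Called by the watcher when a reindex starts."""
        self._idle.clear()

    def finish(self):
        """Called by the watcher when the reindex completes."""
        self._idle.set()

    def wait_until_fresh(self, mode="relaxed", timeout=None):
        """Return True if the index is fresh; only strict mode blocks."""
        if mode == "strict":
            return self._idle.wait(timeout)
        return self._idle.is_set()
```

Keeping one state object per repository means a slow reindex in one repo does not delay queries against another.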

The `watch-claude` variant extends this for Claude Code specifically: it discovers worktrees via hook-driven events (WorktreeCreate/WorktreeRemove writing to a JSONL manifest) and/or by polling `git worktree list` on specified repositories. Both mechanisms are cross-platform and layout-agnostic.

Token Savings Semantics

jCodeMunch reports token savings as an operational estimate.

Conceptual basis

Savings are derived from the difference between a larger baseline payload, such as raw file content or broader retrieval, and the smaller actual response returned by the tool.

Reporting fields

tokens_saved refers to the current call
total_tokens_saved refers to the cumulative persisted total
estimate_method indicates how the figure was calculated
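
The relationship between the per-call and cumulative figures can be sketched as below. The token estimator is a placeholder: this specification does not state the actual estimation method, so the four-characters-per-token heuristic, the `SavingsTracker` class, and the in-memory total (the real total is persisted) are all assumptions for illustration.

```python
def estimate_tokens(text: str) -> int:
    """Placeholder estimator: roughly four characters per token (assumed)."""
    return len(text) // 4


class SavingsTracker:
    """Accumulates per-call savings into a running total."""

    def __init__(self, total_tokens_saved: int = 0):
        self.total_tokens_saved = total_tokens_saved

    def record(self, baseline: str, response: str) -> dict:
        # Savings = baseline payload minus actual response, never negative.
        saved = max(0, estimate_tokens(baseline) - estimate_tokens(response))
        self.total_tokens_saved += saved
        return {
            "tokens_saved": saved,
            "total_tokens_saved": self.total_tokens_saved,
        }
```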

Important interpretation note

Savings are strongest when clients use structured retrieval instead of brute-force file reading. Installing the system alone does not guarantee savings unless the client actually uses the tool surface for code lookup and navigation.

Compatibility and Evolution Notes

This specification describes the current contract and intended behavior at a high level. The following evolution principles apply:

tool capabilities may expand over time
ranking internals may change without altering the conceptual contract
data-model fields may grow as compatibility permits
index-version changes may require reindexing
optional integrations and providers may broaden without changing the core retrieval model

The stable foundation of the specification is:

repository or folder indexing
structured symbol extraction
local persistence of metadata and raw source
search and retrieval through explicit tools
operational metadata through _meta
bounded, deterministic access to code for AI-assisted workflows
