GitHub - drewburchfield/obsidian-graph: Semantic knowledge graph navigation for Obsidian or markdown vaults using AI-powered vector embeddings and PostgreSQL+pgvector

Semantic knowledge graph engine for markdown vaults. Discovers hidden connections between notes using AI-powered vector embeddings and PostgreSQL+pgvector. Accessible to any AI app or harness compatible with MCP.

Overview

Obsidian Graph builds a semantic knowledge graph of your markdown vault, discovering relationships between notes that go beyond keywords and explicit links. It embeds your notes as vectors using Voyage Context-3, stores them in PostgreSQL+pgvector, and provides tools for semantic search, multi-hop graph traversal, hub detection, and orphan analysis.

Designed for Obsidian vaults but works with any folder of markdown files. Connects to any AI app or harness compatible with the Model Context Protocol (MCP).

Features

Semantic Search: Find notes by meaning, not just keywords
Connection Discovery: Multi-hop BFS graph traversal to map note relationships
Hub Analysis: Identify highly connected conceptual anchors (MOC candidates)
Orphan Detection: Find isolated insights that need integration
Auto-Indexing: Automatic file watching with 30-second debounce
Higher-Dimensional Embeddings: Voyage Context-3 (1024 dimensions, contextualized) versus the 384-dimensional embeddings of models such as all-MiniLM-L6-v2

Architecture

┌─ obsidian-graph container ─────────────────┐
│                                            │
│  MCP Client ◄──stdio──► server.py          │
│                            │               │
│                     ┌──────┴──────┐        │
│                     ▼             ▼        │
│              graph_builder   hub_analyzer  │
│              embedder.py     file_watcher  │
│                  │               │         │
│                  │ HTTPS         │ watch   │
│                  ▼               ▼         │
│            Voyage AI API    /vault (ro)    │
│                  │                         │
│                  │ 1024d vectors           │
│                  ▼                         │
│           vector_store.py                  │
│                  │                         │
└──────────────────┼─────────────────────────┘
                   │ SQL
                   ▼
┌─ obsidian-graph-pgvector container ────────┐
│  PostgreSQL 15 + pgvector (HNSW index)     │
└────────────────────────────────────────────┘

Embeddings: Voyage Context-3 (1024 dimensions, contextualized)
Vector Store: PostgreSQL 15+ with pgvector HNSW indexing
Performance: 0.9ms search (555x better than target), <2s graph building
File Watching: Watchdog with polling mode for cloud sync compatibility
Transport: Docker stdio for MCP communication

MCP Tools

Overview

All tools use semantic similarity via 1024-dimensional Voyage Context-3 embeddings. Similarity scores range from 0.0 (unrelated) to 1.0 (identical). Default threshold is 0.5 (clear connection).

How it works:

Notes are embedded as vectors in 1024-dimensional space
Cosine similarity measures semantic closeness between vectors
HNSW index enables sub-millisecond vector search
Results ranked by similarity score (0.0-1.0)
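
The cosine similarity step above can be illustrated with a small, dependency-free sketch. The function name `cosine_similarity` and the zero-vector convention are assumptions made for this example; per the architecture above, the real comparison runs inside PostgreSQL+pgvector rather than in Python.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption for this sketch: a zero vector is similar to nothing.
        return 0.0
    return dot / (norm_a * norm_b)


# Toy 2-dimensional vectors stand in for real 1024-dimensional embeddings.
same_direction = cosine_similarity([1.0, 0.0], [2.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])
```

Because the score depends only on direction, scaling a vector does not change its similarity to another vector.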

Tool Reference

| Tool | Purpose | Method | Performance | Use Case |
| --- | --- | --- | --- | --- |
| `search_notes` | Semantic search across vault | Query embedding → vector search | <1ms | Find notes by concept |
| `get_similar_notes` | Find notes similar to given note | Note embedding → vector search | <300ms | Discover related ideas |
| `get_connection_graph` | Multi-hop BFS graph traversal | Recursive similarity search | <2s | Map knowledge networks |
| `get_hub_notes` | Identify highly connected notes | Materialized connection counts | <100ms | Find conceptual anchors |
| `get_orphaned_notes` | Find isolated notes | Materialized connection counts | <100ms | Unintegrated insights |

Methodology Details

search_notes:

Generates query embedding using Voyage Context-3
Performs cosine similarity search against all note embeddings
Returns top-k most similar notes above threshold
HNSW index provides approximate nearest-neighbor search in roughly O(log n) time
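
The "top-k above threshold" selection can be sketched as follows. This is an illustration only: `top_k_above_threshold` and the `scores` mapping are hypothetical names, and the sketch scans every score linearly, whereas the server relies on the HNSW index to avoid a full scan.

```python
import heapq


def top_k_above_threshold(scores, k=10, threshold=0.5):
    """Return up to k (path, score) pairs with score >= threshold, best first.

    `scores` is a hypothetical mapping of note path -> similarity to the query.
    """
    eligible = (
        (path, score) for path, score in scores.items() if score >= threshold
    )
    # nlargest returns its results sorted from highest to lowest score.
    return heapq.nlargest(k, eligible, key=lambda item: item[1])


example_scores = {"a.md": 0.91, "b.md": 0.42, "c.md": 0.66, "d.md": 0.5}
best_two = top_k_above_threshold(example_scores, k=2)
```

Here `b.md` is dropped by the threshold before the top-k cut is applied, so a low `limit` never surfaces weak matches.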

get_similar_notes:

Fetches source note's embedding from database
Searches for notes with similar embeddings
Excludes source note from results
Useful for exploring conceptual neighborhoods

get_connection_graph:

Uses Breadth-First Search (BFS) for level-by-level exploration
Prevents cycles by tracking visited nodes
Builds multi-hop network (depth 1-5 levels)
Each level: finds top-k most similar notes from previous level
Returns: nodes (with level), edges (with similarity), stats
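
A minimal sketch of the traversal described above, assuming a hypothetical `neighbors(path)` callable that returns `(other_path, similarity)` pairs. In the real server that lookup would be a vector search; the function name, parameters, and return shape here are illustrative and not taken from the source code.

```python
from collections import deque


def build_connection_graph(start, neighbors, depth=3, max_per_level=5, threshold=0.5):
    """Level-by-level BFS over a similarity lookup (illustrative sketch)."""
    nodes = {start: 0}  # path -> level at which the note was first reached
    edges = []          # (source, target, similarity)
    queue = deque([start])
    while queue:
        current = queue.popleft()
        level = nodes[current]
        if level >= depth:
            continue  # do not expand beyond the requested depth
        candidates = [
            (path, score)
            for path, score in neighbors(current)
            if score >= threshold and path != current
        ]
        candidates.sort(key=lambda item: item[1], reverse=True)
        for path, score in candidates[:max_per_level]:
            if path in nodes:
                continue  # already visited: this is what prevents cycles
            nodes[path] = level + 1
            edges.append((current, path, score))
            queue.append(path)
    stats = {"node_count": len(nodes), "edge_count": len(edges)}
    return {"nodes": nodes, "edges": edges, "stats": stats}


# Toy similarity data standing in for vector search results.
toy = {
    "a": [("b", 0.9), ("c", 0.7), ("d", 0.4)],
    "b": [("a", 0.9), ("e", 0.8)],
    "c": [("a", 0.7)],
    "e": [("b", 0.8)],
}
graph = build_connection_graph("a", lambda p: toy.get(p, []), depth=2)
```

One design choice to note: this sketch takes the top-k candidates first and then skips visited ones, so a level can yield fewer than `max_per_level` new nodes. Selecting the top-k among unvisited notes only would be an equally reasonable variant.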

get_hub_notes:

Uses materialized connection_count column (reads a precomputed value, so no similarity search runs at query time)
Connection count = # of notes above threshold similarity
Background refresh when >50% of counts are stale
Identifies notes with many semantic connections
High hub scores → good MOC (Map of Content) candidates

get_orphaned_notes:

Uses materialized connection_count column
Finds notes with few semantic connections
Sorted by: connection count (ASC), modified date (DESC)
Shows recent notes first (likely new insights)
Helps identify notes needing integration
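
The hub and orphan selection rules can be sketched over precomputed counts. `NoteStats`, `find_hubs`, and `find_orphans` are hypothetical names for this example, and ordering hubs by descending count is an assumption; only the orphan sort order (count ascending, modified date descending) comes from the description above.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class NoteStats:
    path: str
    connection_count: int  # stands in for the materialized column
    modified_at: datetime


def find_orphans(notes, max_connections=2, limit=20):
    """Fewest connections first; among ties, most recently modified first."""
    candidates = [n for n in notes if n.connection_count <= max_connections]
    # Two stable sorts: secondary key first, then primary key.
    candidates.sort(key=lambda n: n.modified_at, reverse=True)
    candidates.sort(key=lambda n: n.connection_count)
    return candidates[:limit]


def find_hubs(notes, min_connections=10, limit=20):
    """Notes at or above the connection floor, most connected first (assumed order)."""
    hubs = [n for n in notes if n.connection_count >= min_connections]
    hubs.sort(key=lambda n: n.connection_count, reverse=True)
    return hubs[:limit]


sample = [
    NoteStats("a.md", 0, datetime(2024, 1, 1)),
    NoteStats("b.md", 0, datetime(2024, 6, 1)),
    NoteStats("c.md", 2, datetime(2024, 7, 1)),
    NoteStats("d.md", 12, datetime(2024, 3, 1)),
    NoteStats("e.md", 30, datetime(2024, 2, 1)),
]
orphan_paths = [n.path for n in find_orphans(sample)]
hub_paths = [n.path for n in find_hubs(sample)]
```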

Chunking Support

For large notes (>30k tokens):

Automatically split into sentence-aligned chunks (target: ~2000 characters, 0 overlap)
Chunking algorithm breaks at sentence boundaries (. or \n\n) for readability
Chunk sizes vary (1800-2200 chars) to preserve sentence integrity
Embedded in batches of 60 chunks (preserves context)
Voyage Context-3 maintains semantic coherence across chunks
Each chunk stored separately with chunk_index
Search returns individual chunks (can aggregate by path)

Example: 168k-char note → ~87 variable-sized chunks → 2 batches (60+27) → context preserved
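
The chunk-then-batch flow can be sketched roughly as below. This is not the project's chunker: the regex boundary rule, the space-joining of sentences, and the names `split_into_chunks` and `batches` are assumptions for illustration. The sketch also does not enforce the 1800-character lower bound, and a single sentence longer than the target becomes its own oversized chunk.

```python
import re


def split_into_chunks(text, target_chars=2000):
    """Greedy sentence-aligned chunking with no overlap (illustrative)."""
    # Break after a period followed by whitespace, or at a blank line.
    pieces = re.split(r"(?<=\.)\s+|\n\n+", text)
    chunks, current = [], ""
    for piece in pieces:
        piece = piece.strip()
        if not piece:
            continue
        if current and len(current) + 1 + len(piece) > target_chars:
            chunks.append(current)  # adding this sentence would exceed the target
            current = piece
        else:
            current = f"{current} {piece}" if current else piece
    if current:
        chunks.append(current)
    return chunks


def batches(items, size=60):
    """Group items into consecutive batches of at most `size`."""
    return [items[i:i + size] for i in range(0, len(items), size)]


sentence = "x" * 99 + "."                      # one 100-character sentence
chunks = split_into_chunks(" ".join([sentence] * 50))
batch_sizes = [len(b) for b in batches(list(range(87)))]
```

With 87 chunks and a batch size of 60, `batches` yields groups of 60 and 27, matching the example above.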

Most Obsidian notes are <10k tokens and embedded whole (single chunk).

Prerequisites

Voyage AI Account Setup

This server requires a Voyage AI API key for generating embeddings:

Create account: Sign up at https://www.voyageai.com/
Get API key: Visit https://dashboard.voyageai.com/ → API Keys → Create new key
Add payment method (Important!):
- Go to https://dashboard.voyageai.com/billing
- Add a payment method (credit card)
- Why: Without payment, rate limit is only 3 RPM (unusable)
- With payment: 300 RPM rate limit unlocked
Free tier: Voyage Context-3 includes 200M free tokens (one-time per account):
- First 200M tokens are FREE
- Sufficient for indexing ~50,000 notes
- After free tier: ~$0.12 per 1M tokens

Cost estimate: Indexing 1,000 notes ≈ 4M tokens ≈ $0.48 (or free if within 200M token limit)
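
The arithmetic behind this estimate can be written out as a short sketch. The constants are the figures quoted above; the ~4,000 tokens-per-note average is implied by "1,000 notes ≈ 4M tokens", and `estimate_cost` is a hypothetical helper, not part of the project.

```python
FREE_TOKENS = 200_000_000        # one-time free allowance quoted above
PRICE_PER_MILLION_USD = 0.12     # approximate post-free-tier price quoted above
TOKENS_PER_NOTE = 4_000          # implied average: 4M tokens / 1,000 notes


def estimate_cost(note_count, free_tokens_remaining=FREE_TOKENS):
    """Return (total_tokens, estimated_usd) for indexing `note_count` notes."""
    tokens = note_count * TOKENS_PER_NOTE
    billable = max(0, tokens - free_tokens_remaining)
    return tokens, billable * PRICE_PER_MILLION_USD / 1_000_000


tokens_1k, cost_without_free_tier = estimate_cost(1_000, free_tokens_remaining=0)
tokens_50k, cost_with_free_tier = estimate_cost(50_000)
```

At the assumed average, 50,000 notes consume exactly the 200M-token allowance, which is where the "~50,000 notes" figure comes from. Actual token counts depend on note length.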

Installation

Clone the repository:

git clone https://github.com/drewburchfield/obsidian-graph.git
cd obsidian-graph

Configure environment:

cp .env.example .env
# Edit .env with your settings:
#   - VOYAGE_API_KEY (from https://dashboard.voyageai.com/)
#   - OBSIDIAN_VAULT_PATH (absolute path to your vault)
#   - POSTGRES_PASSWORD (generate with: openssl rand -base64 36)

Start services:

docker-compose up -d

Initial indexing (first time only):

docker exec -i obsidian-graph python -m src.indexer

This indexes the entire vault (30-60 min for large vaults). After this, file watching handles incremental updates.

Add to MCP client (~/.mcp.json):

{
  "mcpServers": {
    "obsidian-graph": {
      "command": "docker",
      "args": ["exec", "-i", "obsidian-graph", "python", "-m", "src.server"],
      "disabled": false
    }
  }
}

Configuration

Required Environment Variables

# Voyage AI
VOYAGE_API_KEY=your_key_here  # Get from https://dashboard.voyageai.com/

# PostgreSQL (POSTGRES_HOST is set by docker-compose.yml, no need to set in .env)
POSTGRES_PASSWORD=your_secure_password_here  # Generate with: openssl rand -base64 36

# Obsidian Vault
OBSIDIAN_VAULT_PATH=/path/to/your/vault  # Absolute path on your system

Optional Tuning

# File watching
OBSIDIAN_WATCH_ENABLED=true
OBSIDIAN_DEBOUNCE_SECONDS=30

# Polling mode (auto-enabled for Docker and cloud-synced vaults)
# OBSIDIAN_WATCH_USE_POLLING=       # true | false (unset = auto-detect)
# OBSIDIAN_WATCH_POLLING_INTERVAL=30  # seconds between polls (default: 30)

# Performance
POSTGRES_MIN_CONNECTIONS=5
POSTGRES_MAX_CONNECTIONS=20
EMBEDDING_BATCH_SIZE=128
EMBEDDING_REQUESTS_PER_MINUTE=300

Cloud Sync Support (iCloud, Google Drive, Dropbox, OneDrive)

If your Obsidian vault is stored in a cloud-synced folder, the file watcher automatically uses polling mode for reliable change detection. Native filesystem events do not propagate reliably into Docker through cloud sync virtualization layers, so event-based watching can miss changes.

Auto-detection: Polling mode is automatically enabled when:

Running inside Docker (always uses polling for reliability)
Vault path contains cloud sync patterns (Library/Mobile Documents, Library/CloudStorage, etc.)

How it works:

Polling mode compares directory snapshots every 30 seconds (configurable)
Detects file creates, modifications, moves, and deletions
Slightly higher CPU than native filesystem events, but works reliably everywhere
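
Snapshot comparison can be sketched with the standard library alone. This is an independent illustration, not the Watchdog-based watcher the project uses: `take_snapshot` and `diff_snapshots` are hypothetical names, and a move shows up here as one deletion plus one creation rather than as a distinct event.

```python
import os


def take_snapshot(root):
    """Map each markdown file under `root` to its (mtime_ns, size)."""
    snapshot = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(".md"):
                continue
            full = os.path.join(dirpath, name)
            try:
                st = os.stat(full)
            except OSError:
                continue  # file vanished between listing and stat
            snapshot[os.path.relpath(full, root)] = (st.st_mtime_ns, st.st_size)
    return snapshot


def diff_snapshots(old, new):
    """Return (created, modified, deleted) path lists between two snapshots."""
    created = sorted(new.keys() - old.keys())
    deleted = sorted(old.keys() - new.keys())
    modified = sorted(p for p in new.keys() & old.keys() if new[p] != old[p])
    return created, modified, deleted


before = {"a.md": (100, 10), "b.md": (100, 20), "c.md": (100, 30)}
after = {"a.md": (100, 10), "b.md": (250, 25), "d.md": (300, 5)}
changes = diff_snapshots(before, after)
```

A polling loop would call `take_snapshot` once per interval and feed consecutive snapshots to `diff_snapshots`, which is why a shorter interval costs more CPU.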

Mobile workflow: Edit notes on mobile (iOS/Android) via Obsidian's iCloud/Google Drive sync. Changes sync to your Mac, and the polling watcher detects them within the polling interval.

Override behavior:

# Force polling on (for edge cases)
OBSIDIAN_WATCH_USE_POLLING=true

# Force native events (may miss changes with cloud sync)
OBSIDIAN_WATCH_USE_POLLING=false

# Faster detection (higher CPU)
OBSIDIAN_WATCH_POLLING_INTERVAL=15

Excluding Folders from Indexing

By default, the indexer excludes common system and tool folders:

.obsidian/ / .trash/ / .Trash/ (Obsidian system)
.git/ / .github/ (version control)
.vscode/ / .cursor/ (editor config)
.claude/ / .aider/ / .smart-env/ (AI tools)

Custom Exclusions: To exclude additional folders (like a soft-delete folder), create .obsidian-graph.conf in your vault root:

# Exclude soft delete folder
07_Archive/Trash/

# Exclude drafts
drafts/

See docs/obsidian-graph.conf.example for more patterns and examples.

Pattern Syntax:

| Pattern | Matches |
| --- | --- |
| `folder/` | All files in `folder/` and subfolders |
| `drafts/*` | All files directly in `drafts/` |
| `*.tmp.md` | All files ending in `.tmp.md` |
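
One way the three pattern forms could be interpreted is sketched below. This is not the project's matcher: `parse_patterns` and `is_excluded` are hypothetical, and the sketch assumes that folder patterns are anchored at the vault root and that paths use forward slashes. Consult docs/obsidian-graph.conf.example for the authoritative rules.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath


def parse_patterns(conf_text):
    """Keep non-empty lines that are not `#` comments."""
    lines = (line.strip() for line in conf_text.splitlines())
    return [line for line in lines if line and not line.startswith("#")]


def is_excluded(rel_path, patterns):
    """Check a vault-relative path against exclusion patterns (illustrative)."""
    path = PurePosixPath(rel_path)
    for pattern in patterns:
        if pattern.endswith("/"):
            # `folder/`: the folder itself and everything beneath it.
            prefix = pattern.rstrip("/")
            if rel_path == prefix or rel_path.startswith(prefix + "/"):
                return True
        elif "/" in pattern:
            # `drafts/*`: only files directly inside the named folder.
            parent, name_pattern = pattern.rsplit("/", 1)
            if str(path.parent) == parent and fnmatchcase(path.name, name_pattern):
                return True
        elif fnmatchcase(path.name, pattern):
            # `*.tmp.md`: match on the file name anywhere in the vault.
            return True
    return False


conf = "# Exclude soft delete folder\n07_Archive/Trash/\n\n# Exclude drafts\ndrafts/\n"
patterns = parse_patterns(conf)
```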

Security

This server implements multiple security layers to protect your vault:

Path Traversal Protection: Validates all file paths stay within vault (src/security_utils.py)
Input Validation: All parameters validated before processing (src/validation.py)
Secure Credentials: Randomly generated database passwords (scripts/generate-db-password.sh)
Container Isolation: Read-only vault mount, dropped capabilities, non-root user
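
Path traversal protection generally follows the pattern sketched here: resolve the requested path and confirm it still lies inside the vault. This is a generic illustration under that assumption; `resolve_in_vault` is a hypothetical name, and the actual checks live in src/security_utils.py.

```python
from pathlib import Path


def resolve_in_vault(vault_root, note_path):
    """Resolve `note_path` under `vault_root`, rejecting anything outside it."""
    root = Path(vault_root).resolve()
    # resolve() collapses `..` segments and follows symlinks before the check.
    candidate = (root / note_path).resolve()
    if candidate != root and root not in candidate.parents:
        raise ValueError(f"path escapes vault: {note_path!r}")
    return candidate


safe = resolve_in_vault("/vault", "notes/a.md")

try:
    resolve_in_vault("/vault", "../etc/passwd")
    escaped = True
except ValueError:
    escaped = False
```

Checking the resolved path rather than the raw string matters because inputs such as `notes/../../etc/passwd` look harmless at the start but leave the vault once normalized.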

Concurrency: See docs/CONCURRENCY.md for thread-safety guarantees and race condition prevention.

Running Security Tests

# Security tests
pytest tests/test_security*.py -v

# Input validation tests
pytest tests/test_validation.py -v

# Race condition tests
pytest tests/test_race_conditions.py -v

# All tests with coverage
pytest tests/ --cov=src --cov-report=html

Usage Examples

Semantic Search

search_notes(query="neural networks and consciousness", limit=10, threshold=0.5)

Returns notes semantically related to the query, even if they don't contain
the exact keywords.

Find Similar Notes

get_similar_notes(note_path="neuroscience/dopamine.md", limit=10, threshold=0.6)

Discovers notes conceptually similar to the dopamine note (it might find
reward-systems.md, motivation.md, or decision-making.md).

Build Connection Graph

get_connection_graph(
  note_path="philosophy/free-will.md",
  depth=3,
  max_per_level=5,
  threshold=0.65
)

Maps a 3-level network showing how the free-will note connects to neuroscience,
psychology, and ethics notes through semantic similarity.

Identify Hubs

get_hub_notes(min_connections=10, threshold=0.5, limit=20)

Finds notes with >=10 connections - candidates for Maps of Content (MOCs).
Example: "decision-making.md" might connect to psychology, neuroscience,
economics, and philosophy notes.

Find Orphans

get_orphaned_notes(max_connections=2, limit=20)

Identifies isolated notes that need integration into the knowledge graph.
Sorted by connection count (ascending), then by modification date (newest first), to surface recent unconnected insights.

Performance

Validated metrics:

| Metric | Target | Actual | Status |
| --- | --- | --- | --- |
| Search latency | <500ms | 0.9ms | ✅ 555x better |
| Graph building (depth=3) | <2s | <2s | ✅ On target |
| Hub/orphan queries | <100ms | <100ms | ✅ Materialized |
| Similarity range | [0.0-1.0] | [0.0-1.0] | ✅ Validated |
| Embedding quality | 1024-dim | 1024-dim | ✅ Voyage Context-3 |

Performance Note: Metrics were measured on a development vault (~500 notes, M1 MacBook Pro). Actual performance depends on vault size, hardware (CPU/RAM/SSD), and database configuration. HNSW indexing provides approximate search in roughly O(log n) time, so latency should grow slowly as the vault grows.

Troubleshooting

"Reduced rate limits of 3 RPM"

Cause: No payment method on Voyage account
Solution: Add payment method at https://dashboard.voyageai.com/billing
Note: 200M free tokens still apply

"PostgreSQL connection failed"

# Check postgres container
docker ps | grep obsidian-graph-pgvector
docker logs obsidian-graph-pgvector

# Verify credentials
grep POSTGRES_ .env

"Note not found" errors

Ensure initial indexing completed: docker exec -i obsidian-graph python -m src.indexer
Check vault path is mounted: docker exec -i obsidian-graph ls /vault

File changes not detected

Verify OBSIDIAN_WATCH_ENABLED=true
Check logs: docker logs obsidian-graph
Look for: Watching vault: /vault [polling (interval: 30s)]
The file watcher starts only after the PostgreSQL connection is established
Cloud sync users: Changes take up to polling interval (default 30s) plus cloud sync time
Reduce detection time: Set OBSIDIAN_WATCH_POLLING_INTERVAL=15 in .env

Development

Running Tests

# Quick validation
docker exec -i obsidian-graph python test_e2e.py

# Unit tests (requires 300 RPM rate limits)
docker exec -i obsidian-graph pytest tests/ -v

Rebuilding

docker-compose build obsidian-graph
docker-compose restart obsidian-graph

Debugging

# View logs
docker logs -f obsidian-graph

# Interactive shell
docker exec -it obsidian-graph /bin/bash

# Check database
docker exec -it obsidian-graph-pgvector psql -U obsidian -d obsidian_graph

Comparison to mcp-obsidian

| Feature | mcp-obsidian | obsidian-graph |
| --- | --- | --- |
| Embeddings | 384-dim (all-MiniLM-L6-v2) | 1024-dim (Voyage Context-3) |
| Vector Store | ChromaDB | PostgreSQL+pgvector |
| Tools | 2 (search, reindex) | 5 (search, similar, graph, hubs, orphans) |
| Search perf | Unknown | 0.9ms validated |
| Graph traversal | ❌ No | ✅ BFS with cycle prevention |
| Hub detection | ❌ No | ✅ Materialized stats |

License

See LICENSE file for details.

Links

Voyage AI: https://www.voyageai.com/
pgvector: https://github.com/pgvector/pgvector
MCP Protocol: https://modelcontextprotocol.io/
