feat(graph): investigate GraphRAG with Neo4j #36
Description
Summary
Investigate adding a graph layer to Axon's RAG pipeline using Neo4j. Pure vector search finds semantically similar chunks but is blind to relationships — who authored what, which issues link to which PRs, which sessions reference which repos, which skills call which tools. A knowledge graph captures these connections and enables traversal-based retrieval that vector search cannot do.
Why GraphRAG
Standard vector RAG limitation: each chunk is retrieved independently. There's no way to ask:
- "Show me all issues linked to this PR and the files it changed"
- "Which of my sessions discussed this GitHub repo, and what did I conclude?"
- "What skills reference this MCP tool, and what docs explain it?"
- "Find the dependency chain: this function → the module it's in → the repo → related issues"
A graph adds edges between nodes that vector similarity can't express:
Issue --FIXED_BY--> PR --MODIFIES--> File
Session --REFERENCES--> Repo --HAS_ISSUE--> Issue
Skill --USES--> McpTool --DOCUMENTED_BY--> DocChunk
Commit --AUTHORED_BY--> User --ALSO_AUTHORED--> Issue
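A minimal sketch of how typed edges answer a chained question like "which files did the fix for this issue touch?", assuming an in-memory list of (source, relation, target) triples stands in for the graph store. The node IDs and the `build_index` / `follow` helpers are hypothetical and exist only for illustration.

```python
from collections import defaultdict

# Hypothetical triples; in a real system these would live in the graph store.
EDGES = [
    ("issue:423", "FIXED_BY", "pr:87"),
    ("pr:87", "MODIFIES", "file:src/lib.rs"),
    ("session:example", "REFERENCES", "repo:owner/name"),
    ("repo:owner/name", "HAS_ISSUE", "issue:423"),
]


def build_index(edges):
    """Index targets by (source, relation) so each hop is a dict lookup."""
    index = defaultdict(list)
    for src, rel, dst in edges:
        index[(src, rel)].append(dst)
    return index


def follow(index, start, relations):
    """Follow a fixed chain of relation types from `start`.

    Returns the nodes reached after the last hop (empty if the chain breaks).
    """
    frontier = [start]
    for rel in relations:
        frontier = [dst for node in frontier for dst in index.get((node, rel), [])]
    return frontier


if __name__ == "__main__":
    index = build_index(EDGES)
    print(follow(index, "issue:423", ["FIXED_BY", "MODIFIES"]))
```

Vector similarity has no equivalent of this lookup: the file chunk need not be textually similar to the issue at all.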
Investigation Scope
This is a research + prototype issue. The goal is to determine whether Neo4j adds enough value to justify the operational complexity, and if so, how to integrate it with the existing Qdrant vector store.
Phase 1: Feasibility
Questions to answer:
- What graph traversal queries would meaningfully improve RAG quality for Axon's use cases?
- Which metadata we already collect maps naturally to graph edges?
- What's the operational cost? (Neo4j memory, query latency, sync complexity)
- Neo4j vs alternatives: neo4j (full-featured, heavy), EdgeDB, SurrealDB (multi-model), kuzu (embedded graph DB, no separate service), memgraph
- Is a lightweight embedded graph (kuzu) sufficient, or do we need full Neo4j?
Recommended starting point: kuzu — embedded (no Docker service), Rust bindings available, zero ops cost, sufficient for our scale.
Phase 2: Data Model (if feasibility passes)
Proposed node types:
(:Repo {owner, name, stars, language, url})
(:File {path, extension, language, url})
(:Issue {number, state, title, labels[], url})
(:PR {number, state, merged, url})
(:Chunk {qdrant_id, url, source_type}) ← bridge to vector store
(:Session {platform, project, date, url})
(:Skill {name, platform, path})
(:User {login, platform})
Proposed edge types:
(:Repo)-[:HAS_FILE]->(:File)
(:Repo)-[:HAS_ISSUE]->(:Issue)
(:Repo)-[:HAS_PR]->(:PR)
(:PR)-[:FIXES]->(:Issue)
(:PR)-[:MODIFIES]->(:File)
(:Issue)-[:AUTHORED_BY]->(:User)
(:Chunk)-[:BELONGS_TO]->(:Repo|:File|:Issue|:PR|:Session)
(:Session)-[:REFERENCES]->(:Repo)
(:Skill)-[:INVOKES]->(:Skill)
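To make the "which metadata maps to edges" question concrete, here is a rough sketch of deriving FIXES and MODIFIES edges from pull request metadata. It assumes the ingest step already has the PR number, body text, and list of changed file paths. The `pr_edges` function and the `pr:` / `issue:` / `file:` ID scheme are hypothetical. The regex covers only the simple same-repo form of GitHub's closing keywords (`Fixes #423`), not cross-repo or URL references.

```python
import re

# GitHub closing keywords (close/fix/resolve and their inflections)
# followed by "#<number>". Same-repo references only.
CLOSING_REF = re.compile(
    r"\b(?:close[sd]?|fix(?:e[sd])?|resolve[sd]?)\s+#(\d+)",
    re.IGNORECASE,
)


def pr_edges(pr_number, body, changed_files):
    """Return (source, relation, target) triples for one pull request."""
    edges = []
    # (:PR)-[:FIXES]->(:Issue), parsed from the PR description.
    for issue_number in CLOSING_REF.findall(body or ""):
        edges.append((f"pr:{pr_number}", "FIXES", f"issue:{issue_number}"))
    # (:PR)-[:MODIFIES]->(:File), one edge per changed path.
    for path in changed_files:
        edges.append((f"pr:{pr_number}", "MODIFIES", f"file:{path}"))
    return edges


if __name__ == "__main__":
    for edge in pr_edges(87, "Fixes #423", ["src/lib.rs"]):
        print(edge)
```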
Phase 3: Hybrid Retrieval
GraphRAG pipeline:
- Vector search (Qdrant) → top-K candidate chunks
- Graph expansion (Neo4j/kuzu) → traverse edges from candidates to pull related nodes
- Merge + rerank → combined result set, deduplicated
- LLM context assembly → richer, more connected context
Example: query "fix for issue 423" →
- Vector finds: PR #87 description chunk
- Graph expands: PR #87 → fixes Issue #423 → Issue body + comments → modifies src/lib.rs → File chunks
- Result: full context from vector + relationship traversal
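The pipeline can be sketched roughly as follows, with plain dicts standing in for Qdrant and the graph store. Everything here is an assumption for illustration: the `hybrid_retrieve` name, the chunk and node IDs, the one-hop expansion, and the `hop_penalty` discount. The final sort is only a placeholder for a real reranker.

```python
def hybrid_retrieve(vector_hits, belongs_to, neighbors, chunks_of,
                    hop_penalty=0.5, top_n=5):
    """Expand vector hits by one graph hop, then merge, dedupe, and rank.

    vector_hits: list of (chunk_id, score) from vector search.
    belongs_to:  chunk_id -> node_id (the Chunk BELONGS_TO bridge).
    neighbors:   node_id -> list of adjacent node_ids.
    chunks_of:   node_id -> list of chunk_ids attached to that node.
    """
    # Step 1: start from the vector candidates.
    scores = dict(vector_hits)
    # Step 2: graph expansion from the node each candidate belongs to.
    for chunk_id, score in vector_hits:
        node = belongs_to.get(chunk_id)
        for related in neighbors.get(node, []):
            for extra in chunks_of.get(related, []):
                expanded = score * hop_penalty
                # Step 3: dedupe by chunk id, keeping the best score seen.
                if expanded > scores.get(extra, 0.0):
                    scores[extra] = expanded
    # Placeholder rerank: sort by score, highest first.
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_n]


if __name__ == "__main__":
    hits = [("chunk:pr87-desc", 0.9)]
    belongs_to = {"chunk:pr87-desc": "pr:87"}
    neighbors = {"pr:87": ["issue:423", "file:src/lib.rs"]}
    chunks_of = {
        "issue:423": ["chunk:issue423-body"],
        "file:src/lib.rs": ["chunk:lib-rs-0"],
    }
    for chunk_id, score in hybrid_retrieve(hits, belongs_to, neighbors, chunks_of):
        print(chunk_id, score)
```

Discounting expanded chunks keeps direct vector hits on top. Whether a fixed per-hop penalty is the right weighting is one of the things the prototype should measure.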
Phase 4: Integration Points
- Ingest pipeline: write graph nodes/edges during GitHub/Reddit/YouTube/sessions ingest
- axon ask: optional graph expansion step (--graph flag)
- axon query: --traverse <depth> for graph-expanded results
- Web UI: knowledge graph visualization (D3 / react-force-graph) on the /cortex page
- MCP tool: { "action": "graph", "subaction": "traverse", "start": "repo:owner/name", "depth": 2 }
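A possible shape for the traverse handler, sketched under assumptions: the request fields come from the example payload above, while the `handle_request` / `traverse` names, the adjacency dict, and the `{node: hops}` response shape are hypothetical. Depth is enforced with a breadth-first walk so that `depth: 2` returns nodes at most two hops from the start.

```python
import json
from collections import deque


def traverse(adjacency, start, depth):
    """Breadth-first walk; returns {node: hops} for nodes within `depth` hops."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if seen[node] == depth:
            continue  # do not expand past the requested depth
        for nxt in adjacency.get(node, []):
            if nxt not in seen:
                seen[nxt] = seen[node] + 1
                queue.append(nxt)
    return seen


def handle_request(raw, adjacency):
    """Validate a graph/traverse request and run the walk."""
    request = json.loads(raw)
    if request.get("action") != "graph" or request.get("subaction") != "traverse":
        raise ValueError("unsupported action")
    depth = int(request.get("depth", 1))
    if depth < 0:
        raise ValueError("depth must be non-negative")
    return traverse(adjacency, request["start"], depth)


if __name__ == "__main__":
    adjacency = {
        "repo:owner/name": ["issue:423"],
        "issue:423": ["pr:87"],
        "pr:87": ["file:src/lib.rs"],
    }
    raw = '{ "action": "graph", "subaction": "traverse", "start": "repo:owner/name", "depth": 2 }'
    print(handle_request(raw, adjacency))
```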
Stack Recommendation (to validate)
| Option | Pros | Cons |
|---|---|---|
| kuzu | Embedded, no service, fast, Rust bindings | Less mature, smaller community |
| Neo4j | Industry standard, rich query language (Cypher), excellent tooling | Heavy (JVM), requires Docker service, licensing |
| memgraph | Neo4j-compatible Cypher, lighter weight | Less adoption |
| SurrealDB | Multi-model (graph + document + relational), single service | Rust but opinionated |
Recommendation: prototype with kuzu (embedded, zero ops cost). If it proves insufficient, migrate to Neo4j.
Deliverables for This Issue
- Feasibility report: which queries justify graph, what's the ops cost, which DB to use
- Proof of concept: ingest one GitHub repo into a graph, run 3 traversal queries, measure quality vs pure vector
- Data model proposal: node/edge types for all current ingest sources
- Decision: adopt + plan full implementation, or defer with documented rationale
- If adopted: implementation plan filed as a follow-up issue
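For the "measure quality vs pure vector" part of the proof of concept, one option is recall@k over a small hand-labeled query set. This is a sketch only: the issue does not specify a metric, so recall@k and the `recall_at_k` helper are assumptions, and the chunk IDs in the usage lines are placeholders.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant chunk IDs that appear in the top-k retrieved IDs."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    top_k = set(retrieved[:k])
    return len(top_k & relevant) / len(relevant)


if __name__ == "__main__":
    relevant = {"a", "c", "d"}          # hand-labeled relevant chunks for one query
    vector_only = ["a", "b", "c"]       # placeholder ranking from pure vector search
    graph_expanded = ["a", "c", "d"]    # placeholder ranking after graph expansion
    print(recall_at_k(vector_only, relevant, 3))
    print(recall_at_k(graph_expanded, relevant, 3))
```

Running both retrievers over the same labeled queries gives a direct, if coarse, comparison to put in the feasibility report.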
References
- Microsoft GraphRAG paper: https://arxiv.org/abs/2404.16130 (index for crawling)
- kuzu Rust bindings: https://github.com/kuzudb/kuzu
- Neo4j memory MCP server already in our stack (see mcp__neo4j-memory__* tools), potential reuse