feat(graph): investigate GraphRAG with Neo4j #36

@jmagar

Summary

Investigate adding a graph layer to Axon's RAG pipeline using Neo4j. Pure vector search finds semantically similar chunks but is blind to relationships — who authored what, which issues link to which PRs, which sessions reference which repos, which skills call which tools. A knowledge graph captures these connections and enables traversal-based retrieval that vector search cannot do.

Why GraphRAG

Standard vector RAG limitation: each chunk is retrieved independently. There's no way to ask:

  • "Show me all issues linked to this PR and the files it changed"
  • "Which of my sessions discussed this GitHub repo, and what did I conclude?"
  • "What skills reference this MCP tool, and what docs explain it?"
  • "Find the dependency chain: this function → the module it's in → the repo → related issues"

A graph adds edges between nodes that vector similarity can't express:

  • Issue --FIXED_BY--> PR --MODIFIES--> File
  • Session --REFERENCES--> Repo --HAS_ISSUE--> Issue
  • Skill --USES--> McpTool --DOCUMENTED_BY--> DocChunk
  • Commit --AUTHORED_BY--> User --ALSO_AUTHORED--> Issue

Investigation Scope

This is a research + prototype issue. The goal is to determine whether Neo4j adds enough value to justify the operational complexity, and if so, how to integrate it with the existing Qdrant vector store.

Phase 1: Feasibility

Questions to answer:

  1. What graph traversal queries would meaningfully improve RAG quality for Axon's use cases?
  2. Which metadata we already collect maps naturally to graph edges?
  3. What's the operational cost? (Neo4j memory, query latency, sync complexity)
  4. Neo4j vs alternatives: Neo4j (full-featured, heavy), EdgeDB, SurrealDB (multi-model), kuzu (embedded graph DB, no separate service), Memgraph
  5. Is a lightweight embedded graph (kuzu) sufficient, or do we need full Neo4j?

Recommended starting point: kuzu. It is embedded (no Docker service), has Rust bindings, adds no separate service to operate, and is likely sufficient for our scale (to be confirmed by the prototype).

Phase 2: Data Model (if feasibility passes)

Proposed node types:

(:Repo   {owner, name, stars, language, url})
(:File   {path, extension, language, url})
(:Issue  {number, state, title, labels[], url})
(:PR     {number, state, merged, url})
(:Chunk  {qdrant_id, url, source_type})   ← bridge to vector store
(:Session {platform, project, date, url})
(:Skill  {name, platform, path})
(:User   {login, platform})

Proposed edge types:

(:Repo)-[:HAS_FILE]->(:File)
(:Repo)-[:HAS_ISSUE]->(:Issue)
(:Repo)-[:HAS_PR]->(:PR)
(:PR)-[:FIXES]->(:Issue)
(:PR)-[:MODIFIES]->(:File)
(:Issue)-[:AUTHORED_BY]->(:User)
(:Chunk)-[:BELONGS_TO]->(:Repo|:File|:Issue|:PR|:Session)
(:Session)-[:REFERENCES]->(:Repo)
(:Skill)-[:INVOKES]->(:Skill)
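
To make the proposed model concrete, here is a minimal in-memory sketch in Python. It is not tied to kuzu or Neo4j. The `Node` and `Graph` classes and the `key` field are illustrative assumptions. The labels and edge types come from the lists above, and the sample entities (PR #87, Issue #423, src/lib.rs) are the ones used in the Phase 3 example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Node:
    label: str  # node type from the proposal, e.g. "Repo", "PR", "Issue"
    key: str    # illustrative unique key within a label (an assumption)


@dataclass
class Graph:
    # Maps a source node to its outgoing (edge_type, target) pairs.
    out_edges: dict = field(default_factory=dict)

    def add_edge(self, src: Node, edge_type: str, dst: Node) -> None:
        self.out_edges.setdefault(src, []).append((edge_type, dst))

    def neighbors(self, src: Node, edge_type: str) -> list:
        return [dst for etype, dst in self.out_edges.get(src, []) if etype == edge_type]


def build_example_graph() -> Graph:
    g = Graph()
    repo = Node("Repo", "owner/name")
    pr = Node("PR", "87")
    issue = Node("Issue", "423")
    src_file = Node("File", "src/lib.rs")
    chunk = Node("Chunk", "example-qdrant-id")  # placeholder qdrant_id

    g.add_edge(repo, "HAS_PR", pr)
    g.add_edge(repo, "HAS_ISSUE", issue)
    g.add_edge(repo, "HAS_FILE", src_file)
    g.add_edge(pr, "FIXES", issue)
    g.add_edge(pr, "MODIFIES", src_file)
    g.add_edge(chunk, "BELONGS_TO", pr)
    return g


graph = build_example_graph()
fixed = graph.neighbors(Node("PR", "87"), "FIXES")
print([f"{n.label}:{n.key}" for n in fixed])  # prints ['Issue:423']
```

A real prototype would express the same schema in the chosen database's DDL. The sketch only shows that the edge list above is enough to answer "which issue does this PR fix?" with a single hop.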

Phase 3: Hybrid Retrieval

GraphRAG pipeline:

  1. Vector search (Qdrant) → top-K candidate chunks
  2. Graph expansion (Neo4j/kuzu) → traverse edges from candidates to pull related nodes
  3. Merge + rerank → combined result set, deduplicated
  4. LLM context assembly → richer, more connected context

Example: query "fix for issue 423" →

  • Vector finds: PR #87 description chunk
  • Graph expands: PR #87 → fixes Issue #423 → Issue body + comments → modifies src/lib.rs → File chunks
  • Result: full context from vector + relationship traversal
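
The pipeline above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Axon's implementation. The vector hits are passed in as plain `(id, similarity)` pairs rather than fetched from Qdrant. The graph is a dict of adjacency lists that can already be walked in the directions needed. The "rerank" step is a simple hop-decay heuristic (`hop_penalty` is a made-up parameter). All node ids are placeholders.

```python
from collections import deque


def expand(graph, seed, max_depth):
    """Breadth-first expansion from one seed; returns {node_id: hop_count}."""
    dist = {seed: 0}
    queue = deque([seed])
    while queue:
        node = queue.popleft()
        if dist[node] == max_depth:
            continue  # do not walk past the depth limit
        for nxt in graph.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def hybrid_retrieve(vector_hits, graph, max_depth=2, hop_penalty=0.5):
    """Merge vector hits with graph neighbors, deduplicated and scored.

    Each reached node gets similarity * hop_penalty ** hops; when a node is
    reachable from several hits, the best score wins.
    """
    scores = {}
    for node_id, similarity in vector_hits:
        for node, hops in expand(graph, node_id, max_depth).items():
            score = similarity * (hop_penalty ** hops)
            scores[node] = max(scores.get(node, 0.0), score)
    # Highest score first; ties broken by id for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy adjacency mirroring the "fix for issue 423" example.
graph = {
    "chunk:pr-87": ["pr:87"],
    "pr:87": ["issue:423", "file:src/lib.rs"],
    "issue:423": ["chunk:issue-423"],
    "file:src/lib.rs": ["chunk:lib-rs"],
}
results = hybrid_retrieve([("chunk:pr-87", 0.9)], graph, max_depth=3)
for node_id, score in results:
    print(f"{score:.4f}  {node_id}")
```

With `max_depth=3` the issue and file chunks are pulled in even though only the PR chunk matched the query. With `max_depth=1` only the PR node itself would be added.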

Phase 4: Integration Points

  • Ingest pipeline: write graph nodes/edges during GitHub/Reddit/YouTube/sessions ingest
  • axon ask: optional graph expansion step (--graph flag)
  • axon query: --traverse <depth> for graph-expanded results
  • Web UI: knowledge graph visualization (D3 / react-force-graph) on the /cortex page
  • MCP tool: { "action": "graph", "subaction": "traverse", "start": "repo:owner/name", "depth": 2 }
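
As a rough sketch of how the MCP payload above might be handled, the Python below validates the request and returns the nodes reached at each hop. The function names, the `MAX_DEPTH` cap, and the `kind:key` reading of `start` are assumptions made for illustration. The issue only specifies the JSON shape.

```python
import json

MAX_DEPTH = 5  # hypothetical safety cap, not part of the proposal


def parse_traverse_request(raw):
    """Validate a graph/traverse payload and return (kind, key, depth)."""
    req = json.loads(raw)
    if req.get("action") != "graph" or req.get("subaction") != "traverse":
        raise ValueError("not a graph traverse request")
    start = req.get("start")
    if not isinstance(start, str):
        raise ValueError("start must be a string")
    kind, sep, key = start.partition(":")
    if not (sep and kind and key):
        raise ValueError("start must look like '<kind>:<key>'")
    depth = req.get("depth", 1)
    if not isinstance(depth, int) or not 0 <= depth <= MAX_DEPTH:
        raise ValueError(f"depth must be an integer in 0..{MAX_DEPTH}")
    return kind, key, depth


def traverse_levels(adjacency, start, depth):
    """Return a list of levels: levels[i] holds nodes first reached at hop i."""
    seen = {start}
    levels = [[start]]
    for _ in range(depth):
        frontier = []
        for node in levels[-1]:
            for nxt in adjacency.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append(nxt)
        if not frontier:
            break  # nothing new to visit
        levels.append(frontier)
    return levels


raw = '{ "action": "graph", "subaction": "traverse", "start": "repo:owner/name", "depth": 2 }'
kind, key, depth = parse_traverse_request(raw)

# Toy adjacency with placeholder ids.
adjacency = {
    "repo:owner/name": ["issue:423", "pr:87"],
    "pr:87": ["issue:423", "file:src/lib.rs"],
}
levels = traverse_levels(adjacency, f"{kind}:{key}", depth)
print(levels)
```

Returning levels instead of a flat list keeps the hop distance available to callers such as `--traverse <depth>`, which may want to weight closer nodes more heavily.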

Stack Recommendation (to validate)

| Option | Pros | Cons |
|---|---|---|
| kuzu | Embedded, no service, fast, Rust bindings | Less mature, smaller community |
| Neo4j | Industry standard, rich query language (Cypher), excellent tooling | Heavy (JVM), requires Docker service, licensing |
| Memgraph | Neo4j-compatible Cypher, lighter weight | Less adoption |
| SurrealDB | Multi-model (graph + document + relational), single service | Rust but opinionated |

Recommendation: prototype with kuzu (embedded, zero ops cost). If it proves insufficient, migrate to Neo4j.

Deliverables for This Issue

  • Feasibility report: which queries justify graph, what's the ops cost, which DB to use
  • Proof of concept: ingest one GitHub repo into a graph, run 3 traversal queries, measure quality vs pure vector
  • Data model proposal: node/edge types for all current ingest sources
  • Decision: adopt + plan full implementation, or defer with documented rationale
  • If adopted: implementation plan filed as a follow-up issue
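
For the "measure quality vs pure vector" deliverable, one possible metric is recall@k over a small hand-labeled query set. The Python sketch below assumes that setup. The case structure, the ids, and the numbers are toy placeholders to show the calculation, not measurements.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant ids that appear in the top-k retrieved ids."""
    relevant_set = set(relevant)
    if not relevant_set:
        return 0.0
    hits = relevant_set.intersection(retrieved[:k])
    return len(hits) / len(relevant_set)


def mean_recall(cases, system, k):
    """Average recall@k for one system ("vector" or "graph") across cases."""
    return sum(recall_at_k(c[system], c["relevant"], k) for c in cases) / len(cases)


# Toy, hand-written cases: each lists the ids a human judged relevant and
# the ranked ids each retrieval mode returned. Placeholder data only.
cases = [
    {
        "relevant": ["c1", "c2"],
        "vector": ["c1", "c9"],
        "graph": ["c1", "c2"],
    },
    {
        "relevant": ["c3", "c4", "c5", "c6"],
        "vector": ["c3", "c9", "c8", "c7"],
        "graph": ["c3", "c4", "c5", "c8"],
    },
]

vector_recall = mean_recall(cases, "vector", k=4)
graph_recall = mean_recall(cases, "graph", k=4)
print(f"vector recall@4: {vector_recall:.3f}")
print(f"graph  recall@4: {graph_recall:.3f}")
```

Recall alone does not capture noise added by graph expansion, so the feasibility report would probably want precision@k or an LLM-answer comparison alongside it.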

References

  • Microsoft GraphRAG paper: https://arxiv.org/abs/2404.16130 — index for crawling
  • kuzu Rust bindings: https://github.com/kuzudb/kuzu
  • Neo4j memory MCP server already in our stack (see mcp__neo4j-memory__* tools) — potential reuse

Metadata

Labels: enhancement (New feature or request)
