feat(graph): investigate GraphRAG with Neo4j #36

## Summary

Investigate adding a graph layer to Axon's RAG pipeline using Neo4j. Pure vector search finds semantically similar chunks but is blind to relationships — who authored what, which issues link to which PRs, which sessions reference which repos, which skills call which tools. A knowledge graph captures these connections and enables traversal-based retrieval that vector search cannot do.

## Why GraphRAG

Standard vector RAG retrieves each chunk independently, so there is no way to ask:
- "Show me all issues linked to this PR and the files it changed"
- "Which of my sessions discussed this GitHub repo, and what did I conclude?"
- "What skills reference this MCP tool, and what docs explain it?"
- "Find the dependency chain: this function → the module it's in → the repo → related issues"

A graph adds edges between nodes that vector similarity can't express:
- `Issue <--FIXES-- PR --MODIFIES--> File`
- `Session --REFERENCES--> Repo --HAS_ISSUE--> Issue`
- `Skill --USES--> McpTool --DOCUMENTED_BY--> DocChunk`
- `Commit --AUTHORED_BY--> User --ALSO_AUTHORED--> Issue`

## Investigation Scope

This is a research + prototype issue. The goal is to determine whether Neo4j adds enough value to justify the operational complexity, and if so, how to integrate it with the existing Qdrant vector store.

### Phase 1: Feasibility

**Questions to answer:**
1. What graph traversal queries would meaningfully improve RAG quality for Axon's use cases?
2. Which metadata we already collect maps naturally to graph edges?
3. What's the operational cost? (Neo4j memory, query latency, sync complexity)
4. How does Neo4j compare with the alternatives? Candidates: `neo4j` (full-featured, heavy), `EdgeDB`, `SurrealDB` (multi-model), `kuzu` (embedded graph DB, no separate service), `memgraph`
5. Is a lightweight embedded graph (`kuzu`) sufficient, or do we need full Neo4j?

**Recommended starting point:** `kuzu`. It is embedded (no Docker service), has Rust bindings, and carries zero ops cost; whether it is sufficient for our scale is what question 5 should confirm.

### Phase 2: Data Model (if feasibility passes)

Proposed node types:
```
(:Repo   {owner, name, stars, language, url})
(:File   {path, extension, language, url})
(:Issue  {number, state, title, labels[], url})
(:PR     {number, state, merged, url})
(:Chunk  {qdrant_id, url, source_type})   ← bridge to vector store
(:Session {platform, project, date, url})
(:Skill  {name, platform, path})
(:User   {login, platform})
```

Proposed edge types:
```
(:Repo)-[:HAS_FILE]->(:File)
(:Repo)-[:HAS_ISSUE]->(:Issue)
(:Repo)-[:HAS_PR]->(:PR)
(:PR)-[:FIXES]->(:Issue)
(:PR)-[:MODIFIES]->(:File)
(:Issue)-[:AUTHORED_BY]->(:User)
(:Chunk)-[:BELONGS_TO]->(:Repo|:File|:Issue|:PR|:Session)
(:Session)-[:REFERENCES]->(:Repo)
(:Skill)-[:INVOKES]->(:Skill)
```
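
A minimal sketch of how the proposed edge list could be enforced at ingest time, written in Python for brevity rather than as Axon's implementation. The `Graph` class, the `SCHEMA` set, and the `type:key` node ids are illustrative assumptions; only the labels and edge types come from the lists above.

```python
from dataclasses import dataclass, field

# Allowed (source label, edge type, target label) triples, copied from the
# proposed edge list. BELONGS_TO is expanded to one triple per target label.
SCHEMA = {
    ("Repo", "HAS_FILE", "File"),
    ("Repo", "HAS_ISSUE", "Issue"),
    ("Repo", "HAS_PR", "PR"),
    ("PR", "FIXES", "Issue"),
    ("PR", "MODIFIES", "File"),
    ("Issue", "AUTHORED_BY", "User"),
    ("Session", "REFERENCES", "Repo"),
    ("Skill", "INVOKES", "Skill"),
} | {("Chunk", "BELONGS_TO", t) for t in ("Repo", "File", "Issue", "PR", "Session")}


@dataclass
class Graph:
    """Hypothetical in-memory stand-in for the graph store."""
    labels: dict = field(default_factory=dict)  # node id -> label
    edges: set = field(default_factory=set)     # (src id, edge type, dst id)

    def add_node(self, node_id, label):
        self.labels[node_id] = label

    def add_edge(self, src, edge_type, dst):
        # Reject any edge whose (label, type, label) triple is not in the schema.
        triple = (self.labels[src], edge_type, self.labels[dst])
        if triple not in SCHEMA:
            raise ValueError(f"edge not in schema: {triple}")
        self.edges.add((src, edge_type, dst))


g = Graph()
g.add_node("pr:87", "PR")
g.add_node("issue:423", "Issue")
g.add_edge("pr:87", "FIXES", "issue:423")
```

Validating against an explicit schema at write time would catch direction mistakes (for example `Issue FIXES PR`) before they reach the store, whichever graph DB is chosen.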

### Phase 3: Hybrid Retrieval

GraphRAG pipeline:
1. Vector search (Qdrant) → top-K candidate chunks
2. Graph expansion (Neo4j/kuzu) → traverse edges from candidates to pull related nodes
3. Merge + rerank → combined result set, deduplicated
4. LLM context assembly → richer, more connected context

Example: query "fix for issue 423" →
- Vector finds: PR #87 description chunk
- Graph expands: PR #87 → fixes Issue #423 → Issue body + comments → modifies `src/lib.rs` → File chunks
- Result: full context from vector + relationship traversal
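
The pipeline and example above can be sketched with a toy in-memory graph. This is an illustration under stated assumptions, not Axon's implementation: the node ids, chunk ids, the similarity score, and the rerank rule (direct vector hits first, then graph-only chunks by hop distance) are all made up, and the Qdrant call is replaced by a hard-coded hit list.

```python
from collections import deque

# Toy graph for the "fix for issue 423" example, stored as undirected pairs.
# All ids are hypothetical.
EDGES = [
    ("chunk:pr87-desc", "pr:87"),
    ("pr:87", "issue:423"),
    ("pr:87", "file:src/lib.rs"),
    ("chunk:issue423-body", "issue:423"),
    ("chunk:lib-rs-0", "file:src/lib.rs"),
]


def neighbors(node):
    """Yield every node sharing an edge with `node`, in either direction."""
    for a, b in EDGES:
        if a == node:
            yield b
        elif b == node:
            yield a


def expand(seed_chunks, max_hops):
    """Breadth-first expansion from the vector hits.

    Returns {chunk id: hop distance} for every chunk within max_hops.
    """
    dist = {c: 0 for c in seed_chunks}
    queue = deque(seed_chunks)
    while queue:
        node = queue.popleft()
        if dist[node] == max_hops:
            continue
        for nxt in neighbors(node):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return {n: d for n, d in dist.items() if n.startswith("chunk:")}


def hybrid_retrieve(vector_hits, max_hops=3):
    """vector_hits: list of (chunk id, similarity). Returns ranked, deduplicated chunk ids."""
    scores = dict(vector_hits)
    hops = expand([c for c, _ in vector_hits], max_hops)
    # Placeholder rerank: vector score first, then hop distance, then id for stability.
    return sorted(hops, key=lambda c: (-scores.get(c, 0.0), hops[c], c))


result = hybrid_retrieve([("chunk:pr87-desc", 0.82)])
```

In this layout a chunk reaches a sibling chunk in three hops (chunk, owner node, related node, chunk), so a hop limit below 3 would return only the original vector hits. A real reranker would replace the placeholder sort key.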

### Phase 4: Integration Points

- **Ingest pipeline**: write graph nodes/edges during GitHub/Reddit/YouTube/sessions ingest
- **`axon ask`**: optional graph expansion step (`--graph` flag)
- **`axon query`**: `--traverse <depth>` for graph-expanded results
- **Web UI**: knowledge graph visualization (D3 / react-force-graph) on the `/cortex` page
- **MCP tool**: `{ "action": "graph", "subaction": "traverse", "start": "repo:owner/name", "depth": 2 }`
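
A rough sketch of how the MCP `traverse` request above might be handled. Only the request fields (`action`, `subaction`, `start`, `depth`) come from this issue; the adjacency data, the `MAX_DEPTH` cap, the handler name, and the response shape are assumptions for illustration.

```python
import json

# Hypothetical adjacency list keyed by "type:key" ids, matching the `start` format.
GRAPH = {
    "repo:owner/name": ["issue:423", "pr:87"],
    "pr:87": ["issue:423", "file:src/lib.rs"],
    "issue:423": ["user:alice"],
}

MAX_DEPTH = 3  # assumed cap so a request cannot trigger an unbounded traversal


def handle_graph_request(raw):
    """Parse a traverse request and return every node within `depth` hops of `start`."""
    req = json.loads(raw)
    if req.get("action") != "graph" or req.get("subaction") != "traverse":
        raise ValueError("unsupported action")
    depth = req.get("depth", 1)
    if not isinstance(depth, int) or not 0 <= depth <= MAX_DEPTH:
        raise ValueError("depth out of range")
    start = req["start"]

    # Level-by-level breadth-first walk; `seen` maps node id -> hop distance.
    seen = {start: 0}
    frontier = [start]
    for level in range(1, depth + 1):
        next_frontier = []
        for node in frontier:
            for nxt in GRAPH.get(node, []):
                if nxt not in seen:
                    seen[nxt] = level
                    next_frontier.append(nxt)
        frontier = next_frontier
    return {"start": start, "nodes": seen}


response = handle_graph_request(
    '{ "action": "graph", "subaction": "traverse", "start": "repo:owner/name", "depth": 2 }'
)
```

The same depth-limited walk could back `axon query --traverse <depth>`, with the depth cap guarding against expensive traversals on densely connected repos.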

## Stack Recommendation (to validate)

| Option | Pros | Cons |
|--------|------|------|
| **kuzu** | Embedded, no service, fast, Rust bindings | Less mature, smaller community |
| **Neo4j** | Industry standard, rich query language (Cypher), excellent tooling | Heavy (JVM), requires Docker service, licensing (GPLv3 Community Edition, commercial Enterprise Edition) |
| **memgraph** | Neo4j-compatible Cypher, lighter weight | Less adoption |
| **SurrealDB** | Multi-model (graph + document + relational), single service | Written in Rust, but opinionated (own query language, SurrealQL) |

**Recommendation**: prototype with `kuzu` (embedded, zero ops cost). If it proves insufficient, migrate to Neo4j.

## Deliverables for This Issue

- [ ] Feasibility report: which queries justify graph, what's the ops cost, which DB to use
- [ ] Proof of concept: ingest one GitHub repo into a graph, run 3 traversal queries, measure quality vs pure vector
- [ ] Data model proposal: node/edge types for all current ingest sources
- [ ] Decision: adopt + plan full implementation, or defer with documented rationale
- [ ] If adopted: implementation plan filed as a follow-up issue

## References

- Microsoft GraphRAG paper: `https://arxiv.org/abs/2404.16130` — index for crawling
- kuzu Rust bindings: `https://github.com/kuzudb/kuzu`
- Neo4j memory MCP server already in our stack (see `mcp__neo4j-memory__*` tools) — potential reuse
