Agent Brain 2026 Strategic Recommendations
Date: February 4, 2026
Document Version: 1.0
Classification: Technical Architecture & Roadmap
Executive Summary
This document provides a comprehensive analysis of Agent Brain's current architecture and recommends bleeding-edge enhancements aligned with the 2026 state-of-the-art in RAG systems, vector databases, embedding models, and AI agent integration.
Agent Brain has strong foundations: AST-aware code chunking, multi-modal retrieval (BM25/Vector/Graph/Hybrid), and per-project isolation. However, significant opportunities exist to leverage 2026 advances in:
- Late Interaction Reranking (ColBERTv2) for sub-100ms precision improvements
- Streaming Vector Updates (LiveVectorLake architecture) for real-time indexing
- Voyage 4 Embeddings outperforming OpenAI by 14%
- Native MCP Integration eliminating the plugin→CLI→server latency chain
- Agentic GraphRAG with LlamaIndex Workflows for multi-step reasoning
Table of Contents
- Current State Assessment
- 2026 Technology Landscape
- Strategic Recommendations
- Implementation Roadmap
- Architecture Evolution
- Risk Analysis
- Sources & References
1. Current State Assessment
1.1 Architectural Strengths
| Component | Implementation | Assessment |
|---|---|---|
| AST-Aware Chunking | tree-sitter for 9 languages | Industry-leading approach |
| Multi-Modal Retrieval | 5 modes with RSF/RRF fusion | Comprehensive coverage |
| Per-Project Isolation | Separate servers, auto-port allocation | Clean architecture |
| Provider Abstraction | OpenAI/Ollama/Cohere/Anthropic | Good extensibility |
| LlamaIndex Foundation | BM25Retriever, PropertyGraphIndex | Solid primitives |
1.2 Critical Gaps Identified
Technical Debt (34 Pending Tasks)
GraphRAG Implementation:
├── T017-T029: Graph query mode - NOT STARTED
├── T030-T042: Multi-mode fusion - PARTIAL
├── T043-T047: AST-based code relationships - NOT STARTED
└── Kuzu backend support - NOT STARTED
Pluggable Providers:
├── T047-T052: Offline operation (Ollama) - NOT STARTED
├── T053-T058: API key security - NOT STARTED
└── T063-T067: Provider mismatch detection - NOT STARTED
Multi-Instance:
├── T059-T061: Integration tests - NOT STARTED
└── T062-T067: Shared daemon mode - NOT STARTED
Performance Bottlenecks
| Issue | Impact | Current State |
|---|---|---|
| Embedding Generation | 50-90% of indexing time | Sequential batch processing |
| BM25 Post-Filtering | 3x over-fetch, unvalidated | No native metadata filtering |
| Graph Memory Limit | ~100K triplets max | SimplePropertyGraphStore in RAM |
| No Query Caching | Repeated queries recomputed | No LRU cache |
| Blocking Indexing | 409 errors during index | Single-threaded, no queue |
Testing Coverage Gaps
| Area | Status | Risk |
|---|---|---|
| GraphRAG E2E | 0% | HIGH - Feature non-functional |
| Provider E2E | ~40% | MEDIUM - 5 providers untested |
| Multi-instance E2E | 0% | HIGH - Isolation unvalidated |
| Performance benchmarks | 0% | MEDIUM - No baseline metrics |
2. 2026 Technology Landscape
2.1 RAG State-of-the-Art: Two-Stage Retrieval with Late Interaction
The industry has converged on two-stage RAG architectures combining:
- Stage 1: Fast Retrieval - BM25/SPLADE + Vector search (high recall)
- Stage 2: Precision Reranking - ColBERTv2 late interaction (high precision)
ColBERTv2 Performance (January 2026 Research)
"On PubMedQA, ColBERTv2 re-ranking yields up to +4.2 pp gain in Recall@3 and +3.13 pp average accuracy improvement when fine-tuned with in-batch negatives."
"Inference latency is approximately 31.4 ms for query encoding and 26.3 ms for re-ranking, totaling 57.7 ms per query. Sub-100ms latency enables interactive applications."
How Late Interaction Works:
Traditional Bi-Encoder:
  Query → single [CLS] embedding
  Doc   → single [CLS] embedding
  Score = cosine(q, d)

Late Interaction (ColBERT):
  Query → [token₁, token₂, ..., tokenₙ] embeddings
  Doc   → [token₁, token₂, ..., tokenₘ] embeddings
  Score = Σᵢ maxⱼ sim(qᵢ, dⱼ)   (MaxSim)
ColBERT precomputes document token embeddings offline, enabling fast scoring at query time while maintaining token-level precision.
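To make the MaxSim computation concrete, here is a minimal sketch in NumPy; the shapes are illustrative and the token matrices are assumed to be L2-normalized, as ColBERT produces them (this is not ColBERT's actual implementation).

import numpy as np

def maxsim_score(query_tokens: np.ndarray, doc_tokens: np.ndarray) -> float:
    """Late-interaction MaxSim: each query token keeps only its best-matching
    document token, and those maxima are summed.

    query_tokens: (n_query_tokens, dim), L2-normalized
    doc_tokens:   (n_doc_tokens, dim),   L2-normalized
    """
    sim = query_tokens @ doc_tokens.T        # cosine similarity via dot product
    return float(sim.max(axis=1).sum())      # max over doc tokens, sum over query tokens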
Recommended Pipeline for Agent Brain
┌─────────────────────────────────────────────────────────────────┐
│ Agent Brain Query Pipeline │
├─────────────────────────────────────────────────────────────────┤
│ Query │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Stage 1: Candidate Retrieval (top_k=100) │ │
│ │ ├── BM25 (keyword precision) │ │
│ │ ├── Vector (semantic recall) │ │
│ │ └── Graph (relationship traversal) │ │
│ │ → Reciprocal Rank Fusion │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ ~50 candidates │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Stage 2: ColBERTv2 Reranking (top_k=10) │ │
│ │ └── Token-level MaxSim scoring │ │
│ │ → Sub-100ms latency │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ Final results │
└─────────────────────────────────────────────────────────────────┘
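The Reciprocal Rank Fusion step in Stage 1 is itself only a few lines; a minimal sketch, assuming each retriever returns an ordered list of document ids (k=60 is the conventional smoothing constant, not an Agent Brain setting):

from collections import defaultdict

def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document ids into a single ranking.

    A document's fused score is the sum of 1 / (k + rank) over every list
    in which it appears; k dampens the dominance of the very top ranks.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranked_ids in result_lists:
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# e.g. candidates = reciprocal_rank_fusion([bm25_ids, vector_ids, graph_ids])[:50]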
2.2 Vector Database Evolution
2026 Benchmark Comparison
| Database | QPS @ 50M vectors | Latency (p50) | Billions Scale | Local Option |
|---|---|---|---|---|
| Qdrant | 41.47 QPS @ 99% recall | 20-50ms | Yes | Yes |
| pgvectorscale | 471 QPS | 10-20ms | No (10-100M max) | Yes |
| Milvus/Zilliz | Best-in-class | <10ms | Yes | Yes |
| Pinecone | Enterprise-grade | 20-50ms | Yes | No |
| ChromaDB (current) | Not benchmarked | Variable | No | Yes |
Key Insight: pgvector with pgvectorscale now outperforms Qdrant for workloads under 100M vectors, while providing PostgreSQL's full-text search (replacing BM25) and ACID transactions.
Recommendation: Prioritize Phase 6 (PostgreSQL Backend)
-- Single PostgreSQL instance replaces 3 storage backends:
-- 1. pgvector for vector similarity (replaces ChromaDB)
-- 2. tsvector for full-text search (replaces BM25)
-- 3. JSONB for graph storage (replaces SimplePropertyGraphStore)
CREATE TABLE chunks (
id UUID PRIMARY KEY,
content TEXT,
embedding vector(3072),
content_tsv tsvector GENERATED ALWAYS AS (to_tsvector('english', content)) STORED,
metadata JSONB,
graph_triplets JSONB
);
CREATE INDEX ON chunks USING hnsw (embedding vector_cosine_ops);
CREATE INDEX ON chunks USING gin (content_tsv);
CREATE INDEX ON chunks USING gin (graph_triplets jsonb_path_ops);
2.3 Embedding Models: Voyage 4 Dominance
2026 Embedding Benchmark Results
| Model | Relative Performance | Cost | Best For |
|---|---|---|---|
| Voyage 4-large | Baseline (+0%) | $$$ | Maximum accuracy |
| Voyage 4 | -1.87% | $$ | Production balance |
| Voyage 3.5-lite | -4.80% | $ | Cost-effective RAG |
| Gemini Embedding 001 | -3.87% | $$ | Google ecosystem |
| Cohere Embed v4 | -8.20% | $$ | Multilingual |
| OpenAI v3 Large | -14.05% | $$ | Legacy compatibility |
Critical Finding: Agent Brain's current default (OpenAI text-embedding-3-large) is 14% less accurate than Voyage 4-large.
Specialized Code Embeddings
For code-specific retrieval, consider GraphCodeBERT which:
- Encodes semantic-level structure via data flow graphs
- Captures "where-the-value-comes-from" relationships between variables
- Pre-trained on 6 programming languages
GraphCodeBERT Architecture:
┌─────────────────────────────────────────────┐
│ Code: def foo(x): return x + 1 │
│ │
│ Token Embedding + Data Flow Graph │
│ [def][foo][x]... x ──defines──> param_x │
│ return ──uses──> x │
│ │
│ → Semantic-aware code representation │
└─────────────────────────────────────────────┘
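For experimentation, GraphCodeBERT loads as an ordinary Hugging Face encoder; the sketch below skips the data flow graph input and uses simple mean pooling over token embeddings, both of which are simplifying assumptions rather than the model's full recipe.

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/graphcodebert-base")
model = AutoModel.from_pretrained("microsoft/graphcodebert-base")

def embed_code(snippet: str) -> torch.Tensor:
    """Encode a code snippet into a single vector (token input only, mean-pooled)."""
    inputs = tokenizer(snippet, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state.mean(dim=1).squeeze(0)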
2.4 GraphRAG: Microsoft's Production Architecture
Microsoft's GraphRAG (now in Azure Discovery) uses:
- LLM Entity Extraction - Extract named entities and descriptions from text chunks
- Hierarchical Leiden Clustering - Form semantic communities in the graph
- Community Summarization - LLM-generated summaries for each cluster
- Query-Focused Synthesis - Traverse graph + summaries at query time
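As a rough, single-level sketch of steps 2 and 3, the entity graph from step 1 can be clustered with Leiden via python-igraph and each community summarized by an LLM; the entities/relations inputs and the summarize_community helper are hypothetical placeholders, not Microsoft GraphRAG's actual API.

import igraph as ig

def cluster_and_summarize(entities: list[str],
                          relations: list[tuple[str, str]],
                          summarize_community) -> dict[int, str]:
    """Leiden clustering over the extracted entity graph, then one
    LLM-written summary per community (summarize_community is a placeholder)."""
    graph = ig.Graph()
    graph.add_vertices(entities)                 # one vertex per extracted entity
    graph.add_edges(relations)                   # (subject, object) pairs from step 1
    communities = graph.community_leiden(objective_function="modularity")

    summaries = {}
    for community_id, members in enumerate(communities):
        names = [graph.vs[v]["name"] for v in members]
        summaries[community_id] = summarize_community(names)
    return summaries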
LlamaIndex Integration (Agentic GraphRAG)
# LlamaIndex 2026 PropertyGraph + Agentic Workflow
from llama_index.core import PropertyGraphIndex
from llama_index.core.workflow import Workflow, step
class AgenticGraphRAG(Workflow):
@step
async def retrieve(self, query: str) -> list[Node]:
# Stage 1: Multi-modal retrieval
vector_results = await self.vector_index.aretrieve(query)
graph_results = await self.graph_index.aretrieve(query)
return self.fuse_rrf(vector_results, graph_results)
@step
async def reflect(self, results: list[Node]) -> ReflectionOutput:
# Stage 2: Agent reflection - are results sufficient?
return await self.llm.areflect(results, self.query)
@step
async def synthesize(self, results: list[Node]) -> str:
# Stage 3: Generate answer with citations
return await self.llm.asynthesize(results)
2.5 Real-Time Indexing: LiveVectorLake Architecture
The LiveVectorLake paper (January 2026) introduces a production architecture for streaming vector updates:
LiveVectorLake Architecture:
┌─────────────────────────────────────────────────────────────────┐
│ Change Detection Layer │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ File Watcher │ │ Git Hooks │ │ DB Triggers │ │
│ └──────┬───────┘ └──────┬───────┘ └──────┬───────┘ │
│ └────────────────┼────────────────┘ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Content-Addressable Hashing │ │
│ │ SHA256(chunk_content) → embedding_cache │ │
│ │ Skip embedding if hash exists (50-80% speedup) │ │
│ └─────────────────────────────────────────────────────────┘ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Dual-Tier Storage │ │
│ │ Hot Tier: In-memory for recent changes │ │
│ │ Cold Tier: Persistent for historical data │ │
│ └─────────────────────────────────────────────────────────┘ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ ACID Transactions │ │
│ │ Atomic index updates │ │
│ │ Consistent query results during updates │ │
│ │ Isolated concurrent access │ │
│ └─────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
Performance Results:
- 10-15% content re-processing during updates (vs 100% for full re-index)
- Sub-100ms query latency during indexing
- 100% temporal query accuracy
2.6 MCP Native Integration
The Model Context Protocol has evolved significantly:
- 75+ connectors in Claude's directory
- MCP Apps for interactive UIs within Claude
- Tool Search for optimizing 1000s of tools at scale
- Donated to Agentic AI Foundation (Linux Foundation) in December 2025
Current Agent Brain Integration
Plugin → subprocess (~50-100ms latency) → CLI → HTTP (~10-20ms latency) → Server
Recommended MCP Native Integration
Claude ←──MCP──→ Agent Brain Server
↑
~5-10ms latency (direct protocol)
3. Strategic Recommendations
3.1 Critical Priority (P0) - Complete Before Production
R1: Complete GraphRAG Implementation
Current State: Foundation done, query execution not implemented
Tasks: T017-T029 (Graph queries), T043-T047 (AST code relationships)
Effort: ~120 hours
Impact: Unlocks the "what calls this function" use case - a core differentiator
Implementation Approach:
# Use LlamaIndex's PropertyGraphIndex with LLM extraction
from llama_index.core.indices.property_graph import PropertyGraphIndex
from llama_index.core.indices.property_graph.extractors import (
ImplicitPathExtractor,
SimpleLLMPathExtractor,
)
# For code, augment with AST-derived relationships
class CodeRelationshipExtractor:
def extract(self, code_chunk: CodeChunk) -> list[GraphTriple]:
relationships = []
# From AST metadata already extracted by CodeChunker
for import_stmt in code_chunk.imports:
relationships.append(GraphTriple(
subject=code_chunk.symbol_name,
predicate="imports",
object=import_stmt
))
for call in code_chunk.function_calls:
relationships.append(GraphTriple(
subject=code_chunk.symbol_name,
predicate="calls",
object=call
))
return relationships
R2: Implement Embedding Cache with Content Hashing
Current State: Every re-index regenerates all embeddings
Expected Improvement: 50-80% reduction in indexing time
Effort: ~40 hours
Implementation:
import hashlib
from pathlib import Path
from typing import Callable

import numpy as np
class EmbeddingCache:
def __init__(self, cache_dir: Path):
self.cache_dir = cache_dir
self.cache_dir.mkdir(exist_ok=True)
def get_or_compute(self, content: str, embed_fn: Callable) -> list[float]:
content_hash = hashlib.sha256(content.encode()).hexdigest()
cache_file = self.cache_dir / f"{content_hash}.npy"
if cache_file.exists():
return np.load(cache_file).tolist()
embedding = embed_fn(content)
np.save(cache_file, np.array(embedding))
return embedding
R3: Add ColBERTv2 Reranking Stage
Current State: Single-stage retrieval only
Expected Improvement: +3-4% accuracy, sub-100ms additional latency
Effort: ~60 hours
Implementation Options:
- RAGatouille - Python library wrapping ColBERTv2
- Jina Reranker API - Hosted ColBERT-style reranking
- Self-hosted ColBERTv2 - Maximum control
from ragatouille import RAGPretrainedModel
class TwoStageRetriever:
def __init__(self):
self.colbert = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")
async def retrieve(self, query: str, top_k: int = 10) -> list[Result]:
# Stage 1: Fast retrieval (existing)
candidates = await self.hybrid_retrieve(query, top_k=100)
# Stage 2: ColBERT reranking
docs = [c.content for c in candidates]
reranked = self.colbert.rerank(query=query, documents=docs, k=top_k)
# Map reranked passages back to the original candidate objects by content
# (RAGatouille returns the reranked passage text alongside its score)
by_content = {c.content: c for c in candidates}
return [by_content[r["content"]] for r in reranked]
3.2 High Priority (P1) - Next Release Cycle
R4: Upgrade Default Embedding Provider to Voyage 4
Current State: OpenAI text-embedding-3-large (14% less accurate)
Expected Improvement: +14% retrieval accuracy
Effort: ~20 hours (provider already pluggable)
# config.yaml
embedding:
provider: voyage
model: voyage-4-large # or voyage-3.5-lite for cost-effective
dimensions: 1024
R5: Implement Native MCP Server
Current State: Plugin → subprocess → CLI → HTTP → Server
Expected Improvement: 5-10x latency reduction, simplified architecture
Effort: ~80 hours
from mcp.server.fastmcp import FastMCP

# Sketch using the MCP Python SDK's FastMCP server; query_service and
# indexing_service are the existing Agent Brain service objects.
mcp = FastMCP("agent-brain")

@mcp.tool()
async def search(query: str, mode: str = "hybrid") -> list[dict]:
    """Search the indexed knowledge base."""
    return await query_service.execute_query(
        QueryRequest(query=query, mode=mode)
    )

@mcp.tool()
async def index(path: str, include_code: bool = True) -> dict:
    """Index documents and code."""
    return await indexing_service.start_indexing(
        IndexRequest(folder_path=path, include_code=include_code)
    )
R6: Background Indexing Queue
Current State: Indexing blocks server, returns 409 for concurrent requests
Expected Improvement: Non-blocking indexing, query during index
Effort: ~60 hours
from asyncio import Queue
from dataclasses import dataclass
from uuid import uuid4
@dataclass
class IndexJob:
job_id: str
request: IndexRequest
priority: int = 0
class BackgroundIndexer:
def __init__(self):
self.queue: Queue[IndexJob] = Queue()
self.current_job: IndexJob | None = None
async def enqueue(self, request: IndexRequest) -> str:
job = IndexJob(job_id=uuid4().hex, request=request)
await self.queue.put(job)
return job.job_id
async def process_queue(self):
while True:
self.current_job = await self.queue.get()
await self._run_indexing(self.current_job)
self.current_job = None
R7: File Watcher for Auto-Indexing
Current State: Manual re-index required after code changes
Expected Improvement: Zero-friction index maintenance
Effort: ~40 hours
import asyncio
from pathlib import Path
from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler

class AutoIndexer(FileSystemEventHandler):
    def __init__(self, indexer: BackgroundIndexer, loop: asyncio.AbstractEventLoop,
                 debounce_ms: int = 5000):
        self.indexer = indexer
        self.loop = loop  # the server's running event loop
        self.debounce_ms = debounce_ms
        self.pending_files: set[Path] = set()

    def on_modified(self, event):
        # watchdog callbacks run on a worker thread, so hand the coroutine
        # off to the event loop instead of awaiting it here
        if self._should_index(event.src_path):
            self.pending_files.add(Path(event.src_path))
            asyncio.run_coroutine_threadsafe(self._schedule_debounced_index(), self.loop)

    async def _schedule_debounced_index(self):
        await asyncio.sleep(self.debounce_ms / 1000)
        files = self.pending_files.copy()
        self.pending_files.clear()
        await self.indexer.enqueue_incremental(files)
3.3 Medium Priority (P2) - Q2 2026
R8: PostgreSQL Backend (Consolidate Storage)
Current State: 3 separate storage systems (ChromaDB, BM25, SimplePropertyGraphStore)
Expected Improvement: Unified storage, ACID transactions, better scaling
Effort: ~160 hours
Benefits of PostgreSQL Consolidation:
- pgvector - Vector similarity with HNSW indexing
- tsvector - Native full-text search (replaces BM25)
- JSONB - Graph storage with path queries
- ACID - Transactional consistency during updates
- Scaling - Well-understood operational model to 100M+ vectors
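To illustrate the consolidation, a minimal sketch of one hybrid query against the chunks table from Section 2.2, assuming psycopg 3; the 0.7/0.3 score weighting and the string-literal vector cast are illustrative choices, not a proposed Agent Brain API.

import psycopg

HYBRID_SQL = """
SELECT id, content,
       1 - (embedding <=> %(qvec)s::vector) AS vector_score,
       ts_rank(content_tsv, plainto_tsquery('english', %(qtext)s)) AS text_score
FROM chunks
ORDER BY 0.7 * (1 - (embedding <=> %(qvec)s::vector))
       + 0.3 * ts_rank(content_tsv, plainto_tsquery('english', %(qtext)s)) DESC
LIMIT 50;
"""

def hybrid_search(conn: psycopg.Connection, query_text: str, query_vec: list[float]):
    """Run vector similarity and full-text ranking in a single SQL statement."""
    qvec = "[" + ",".join(str(x) for x in query_vec) + "]"   # pgvector literal syntax
    with conn.cursor() as cur:
        cur.execute(HYBRID_SQL, {"qvec": qvec, "qtext": query_text})
        return cur.fetchall()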
R9: Agentic GraphRAG with LlamaIndex Workflows
Current State: Static query execution
Expected Improvement: Multi-step reasoning, self-correction
Effort: ~80 hours
from llama_index.core.workflow import Event, StartEvent, StopEvent, Workflow, step

class QueryEvent(Event):
    query: str

class RetrieveEvent(Event):
    query: str
    results: list

class EvaluateEvent(Event):
    results: list

class AgenticRAGWorkflow(Workflow):
    @step
    async def retrieve(self, ev: StartEvent | QueryEvent) -> RetrieveEvent:
        results = await self.retriever.aretrieve(ev.query)
        return RetrieveEvent(query=ev.query, results=results)

    @step
    async def evaluate(self, ev: RetrieveEvent) -> QueryEvent | EvaluateEvent:
        # LLM judges if the retrieved results are sufficient
        judgment = await self.llm.ajudge_relevance(ev.results, ev.query)
        if judgment.needs_more_context:
            # Reformulate the query and loop back to the retrieve step
            return QueryEvent(query=judgment.reformulated_query)
        return EvaluateEvent(results=ev.results)

    @step
    async def synthesize(self, ev: EvaluateEvent) -> StopEvent:
        answer = await self.llm.asynthesize(ev.results)
        return StopEvent(result=answer)
R10: Code-Specific Embedding Model
Current State: Generic text embeddings for code
Expected Improvement: Better code search accuracy
Effort: ~40 hours
Options:
- Voyage Code - Specialized code embedding model
- GraphCodeBERT - Open source, data-flow aware
- StarCoder Embeddings - 80+ languages, 15B parameters
# config.yaml - per source_type embedding
embedding:
document:
provider: voyage
model: voyage-4-large
code:
provider: voyage
model: voyage-code-3
3.4 Lower Priority (P3) - Q3-Q4 2026
R11: LiveVectorLake-Style Streaming Updates
Implement the full streaming architecture from the LiveVectorLake paper:
- Content-addressable hashing (R2 is first step)
- Dual-tier storage (hot/cold)
- ACID transactions during updates
- Temporal queries ("what was indexed yesterday?")
R12: Multi-Repository Federated Search
Enable searching across multiple projects simultaneously:
- Shared daemon mode (already spec'd as Phase 5)
- Cross-project RRF fusion
- Organization-wide code search
R13: VS Code Extension
Native IDE integration:
- Sidebar search panel
- Inline results with code preview
- "Find in Knowledge Base" command
- Hover documentation from indexed docs
R14: Query Explanation and Debugging
Help users understand search results:
- Score breakdown (vector_score, bm25_score, graph_score)
- Matching term highlighting
- Entity path visualization for graph results
- explain=true query parameter
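For illustration, a hypothetical shape for the per-result explanation returned when explain=true is set; every field name below is a placeholder, not the current API.

# Hypothetical explain=true payload attached to each result
explanation = {
    "final_score": 0.82,
    "score_breakdown": {"vector_score": 0.74, "bm25_score": 0.61, "graph_score": 0.33},
    "fusion": "rrf",
    "matched_terms": ["start_indexing", "IndexRequest"],
    "graph_path": ["IndexingService", "calls", "start_indexing"],
}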
4. Implementation Roadmap
Phase 1: Foundation Fixes (February-March 2026)
| Week | Deliverable | Owner | Dependencies |
|---|---|---|---|
| 1-2 | Embedding Cache (R2) | Core | None |
| 2-3 | Complete GraphRAG queries (R1 partial) | Core | None |
| 3-4 | ColBERTv2 Reranking (R3) | Core | None |
| 4 | Integration tests for all above | QA | R1, R2, R3 |
Exit Criteria:
- Incremental indexing 50%+ faster
- Graph queries functional end-to-end
- Reranking improves top-5 precision by 3%+
Phase 2: Performance & Integration (April-May 2026)
| Week | Deliverable | Owner | Dependencies |
|---|---|---|---|
| 1-2 | Voyage 4 embedding upgrade (R4) | Core | None |
| 2-4 | Native MCP Server (R5) | Core | None |
| 3-4 | Background indexing queue (R6) | Core | None |
| 4 | File watcher auto-index (R7) | Core | R6 |
Exit Criteria:
- 14% accuracy improvement from Voyage 4
- Sub-20ms query latency via MCP
- Non-blocking indexing with progress streaming
Phase 3: Architecture Evolution (June-August 2026)
| Week | Deliverable | Owner | Dependencies |
|---|---|---|---|
| 1-4 | PostgreSQL backend (R8) | Core | None |
| 4-6 | Agentic GraphRAG workflows (R9) | Core | R1, R8 |
| 6-8 | Code-specific embeddings (R10) | Core | R4 |
Exit Criteria:
- Single PostgreSQL instance replaces 3 storage backends
- Multi-step agentic queries functional
- Code search accuracy improved by measured benchmark
Phase 4: Polish & Extensions (September-December 2026)
| Deliverable | Priority | Effort |
|---|---|---|
| LiveVectorLake streaming (R11) | P3 | 120h |
| Multi-repo federation (R12) | P3 | 80h |
| VS Code extension (R13) | P3 | 120h |
| Query explanation (R14) | P3 | 40h |
5. Architecture Evolution
Current Architecture (v1.2.0)
┌─────────────────────────────────────────────────────────────────┐
│ Claude Code │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Plugin (Markdown) → subprocess → CLI → HTTP → Server │ │
│ └─────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────┐
│ agent-brain-server │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ ChromaDB │ │ BM25 JSON │ │ SimpleGraph │ │
│ │ (vectors) │ │ (keywords) │ │ (in RAM) │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
└─────────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────┐
│ External Providers │
│ OpenAI (embeddings) │ Anthropic (summaries) │ Ollama (local) │
└─────────────────────────────────────────────────────────────────┘
Target Architecture (v2.0.0 - Q4 2026)
┌─────────────────────────────────────────────────────────────────┐
│ Claude Code │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Native MCP Integration (direct protocol, sub-20ms) │ │
│ └─────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────┐
│ agent-brain-server v2 │
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Agentic Query Workflow │ │
│ │ Retrieve → Reflect → Rerank (ColBERTv2) → Synthesize │ │
│ └─────────────────────────────────────────────────────────┘ │
│ ↓ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Unified PostgreSQL Backend │ │
│ │ pgvector (HNSW) │ tsvector (FTS) │ JSONB (graph) │ │
│ │ + Embedding Cache (content-addressable) │ │
│ └─────────────────────────────────────────────────────────┘ │
│ ↓ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Background Processing │ │
│ │ File Watcher → Job Queue → Incremental Indexer │ │
│ └─────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────┐
│ Embedding Providers │
│ Voyage 4 (default) │ GraphCodeBERT (code) │ Ollama (offline) │
└─────────────────────────────────────────────────────────────────┘
Key Architectural Changes
| Aspect | Current | Target | Benefit |
|---|---|---|---|
| Protocol | HTTP via subprocess | Native MCP | 5-10x latency reduction |
| Storage | 3 separate systems | Unified PostgreSQL | ACID, simpler ops |
| Retrieval | Single-stage | Two-stage + agentic | +3-4% accuracy |
| Embeddings | OpenAI only | Voyage 4 + code-specific | +14% accuracy |
| Indexing | Blocking, full re-index | Background, incremental | 50-80% faster |
| Queries | Static execution | Agentic workflows | Multi-step reasoning |
6. Risk Analysis
Technical Risks
| Risk | Probability | Impact | Mitigation |
|---|---|---|---|
| PostgreSQL migration breaks existing indexes | Medium | High | Implement migration tool, keep ChromaDB fallback |
| ColBERTv2 adds unacceptable latency | Low | Medium | Make reranking optional, benchmark first |
| Voyage 4 API stability | Low | Medium | Keep OpenAI as fallback provider |
| MCP protocol changes | Medium | Medium | Pin MCP version, abstract integration |
Operational Risks
| Risk | Probability | Impact | Mitigation |
|---|---|---|---|
| Team bandwidth for all recommendations | High | High | Prioritize P0/P1, defer P2/P3 |
| Breaking changes for existing users | Medium | High | Semantic versioning, migration guides |
| Increased infrastructure complexity | Medium | Medium | PostgreSQL actually simplifies (1 vs 3 systems) |
Dependency Risks
| Risk | Probability | Impact | Mitigation |
|---|---|---|---|
| LlamaIndex breaking changes | Medium | Medium | Pin versions, maintain fork if needed |
| Voyage AI pricing/availability | Low | Medium | Pluggable provider architecture |
| ColBERTv2 model updates | Low | Low | Version-pin model, benchmark updates |
7. Sources & References
RAG & Reranking
- Two-Stage RAG Framework with Late Interaction and Reranking - January 2026 research on ColBERTv2 + cross-encoder pipelines
- Late Interaction Overview: ColBERT, ColPali, ColQwen - Weaviate's comprehensive guide
- ModernBERT + ColBERT for Biomedical RAG - Performance benchmarks
- Advanced RAG: Hybrid Search and Re-ranking - Implementation patterns
Vector Databases
- Top 9 Vector Databases January 2026 - Shakudo comparison
- Qdrant Benchmarks - Official performance data
- pgvector vs Qdrant - PostgreSQL scaling analysis
- Vector Database Comparison 2025 - Comprehensive feature matrix
Embedding Models
- 13 Best Embedding Models 2026 - Elephas benchmark
- Voyage 4 Model Family - Official Voyage AI announcement
- OpenAI vs Voyage vs Cohere 2026 - Head-to-head comparison
- Best Embedding Models for RAG - ZenML guide
GraphRAG
- Microsoft GraphRAG Documentation - Official docs
- GraphRAG: Unlocking LLM Discovery - Microsoft Research
- GraphRAG Complete Guide 2026 - Meilisearch overview
- LlamaIndex Agentic GraphRAG - Implementation guide
MCP & Integration
- Anthropic Model Context Protocol - Official announcement
- MCP Documentation - Protocol specification
- Claude MCP Integration - January 2026 update
- MCP Donation to Agentic AI Foundation - Foundation establishment
Code Embeddings
- GraphCodeBERT - Data flow-aware code representations
- Code Embedding Guide - Unite.AI overview
- LoRACode: LoRA Adapters for Code - Fine-tuning approaches
Real-Time Indexing
- LiveVectorLake Architecture - January 2026 streaming vector paper
- Real-Time RAG with Striim - Production patterns
- VectraFlow Stream Processing - VLDB paper on incremental updates
- RAG in 2026 for Enterprise AI - Industry trends
LlamaIndex & Agentic RAG
- Goodbye Basic RAG, Hello Agents: 2026 Playbook - Architecture patterns
- Building Knowledge Graph Agents with LlamaIndex - Workflow implementation
- Multi-Agent RAG with LlamaIndex - Memgraph integration
- Agentic RAG with PageRank - Graph ranking
Appendix A: Quick Reference - Priority Matrix
|  | Impact: Low | Impact: Medium | Impact: High |
|---|---|---|---|
| Effort: Low |  | R14 |  |
| Effort: Medium | R4 | R7, R10 | R3, R6 |
| Effort: High | R11-R13 | R9 | R1, R2, R5, R8 |
Priority:
- P0 (Critical): R1, R2, R3
- P1 (High): R4, R5, R6, R7
- P2 (Medium): R8, R9, R10
- P3 (Lower): R11, R12, R13, R14
Appendix B: Estimated Resource Requirements
| Phase | Duration | Engineering Hours | Infrastructure |
|---|---|---|---|
| Phase 1 | 8 weeks | 280 hours | None (existing) |
| Phase 2 | 8 weeks | 300 hours | MCP test environment |
| Phase 3 | 12 weeks | 400 hours | PostgreSQL instance |
| Phase 4 | 16 weeks | 360 hours | VS Code marketplace |
| Total | 44 weeks | 1,340 hours | |
Document prepared by Claude Opus 4.5 based on comprehensive analysis of Agent Brain wiki documentation and 2026 state-of-the-art research.