
Commit 6bf3194

feat: Introduce core RAG system for code, including agent, indexer, chunker, embedding, and vector store components, with updated service and API integrations.

1 parent 8a63644 · commit 6bf3194

31 files changed: 2,337 additions & 16 deletions

CHANGELOG.md

Lines changed: 24 additions & 1 deletion

@@ -1,4 +1,27 @@
-## [Unreleased] - 2025-12-17
+## [2.1.0] - 2025-12-19
+
+**Focus:** Semantic Search & Retrieval Quality
+
+### 🚀 Features
+* **Semantic Search**: Implemented dense vector retrieval using FAISS and OpenAI embeddings.
+* **Hybrid Retrieval**: Combined BM25 sparse search with dense embeddings using Reciprocal Rank Fusion (RRF).
+* **Code Chunking**: Added intelligent code chunking for modules, imports, and entities.
+* **Watch Mode**: Integrated file system monitoring for real-time background re-indexing.
+* **Dependency Expansion**: Improved context quality by automatically including caller/callee dependencies in search results.
+* **New CLI Commands**: Added `knowcode index` and `knowcode semantic-search`.
+* **API Enhancement**: Added `/api/v1/context/query` endpoint for rich semantic queries.
+
+### 🐛 Fixes
+* Fixed `VectorStore` persistence bug where `id_map` was reset after loading.
+* Fixed `Chunker` instability issue where collected chunks were reset mid-parsing.
+* Resolved stubbed implementation in `completeness.py`.
+
+### 🏗️ Architectural Impact
+* Introduced a new retrieval pipeline: `Indexer` -> `ChunkRepository` -> `VectorStore` -> `HybridIndex` -> `SearchEngine`.
+* Added background processing and file monitoring for improved live updates.
+
+---
+
 
 **Focus:** Feature Development
 
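The Reciprocal Rank Fusion named in the hybrid retrieval entry is a simple rank-based merge. A minimal sketch (the function name is hypothetical, not the actual `HybridIndex` code):

```python
def rrf_merge(dense_ranking, sparse_ranking, k=60):
    """Fuse two ranked lists of chunk ids with Reciprocal Rank Fusion.

    Each id scores sum(1 / (k + rank)) over the lists it appears in;
    k=60 is the constant suggested in the original RRF paper.
    """
    scores = {}
    for ranking in (dense_ranking, sparse_ranking):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

# A chunk ranked well by both searches beats one ranked well by only one:
merged = rrf_merge(["a", "b", "c"], ["b", "d", "a"])  # -> ["b", "a", "d", "c"]
```

Because RRF only looks at ranks, it needs no score normalization between the BM25 and cosine-similarity scales, which is why it is a common choice for hybrid fusion.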

KnowCode.md

Lines changed: 39 additions & 0 deletions

@@ -135,6 +135,7 @@ Entity:
   id: UUID
   kind: function | class | module | config_key | feature_flag | api_endpoint
   source_location: Location
+  embeddings: vector (1536d)
   confidence: float (0.0-1.0)
   provenance: static_analysis | runtime_trace | llm_inference | human_annotation
   created_at: timestamp

@@ -163,6 +164,44 @@ Entity:
 
 ---
 
+---
+
+## **4a. [NEW] Semantic Search & Indexing Layer (v2.1)**
+
+### **Purpose**
+
+Enable **retrieval-augmented generation (RAG)** by indexing code semantics in a high-dimensional vector space alongside traditional lexical search.
+
+### **Responsibilities**
+
+* **Chunking**: Break code into logical units (functions, classes, module headers)
+* **Embedding**: Generate dense vector representations (e.g., OpenAI text-embedding-3-small)
+* **Vector Storage**: Persist vectors for fast nearest-neighbor search
+* **Hybrid Retrieval**: Combine dense (vector) and sparse (BM25) search results
+* **Reranking**: Optimize results based on metadata, recency, and completeness
+* **[HARDENED]** Sliding-window chunking with overlap
+* **[HARDENED]** Real-time incremental indexing (Watch Mode)
+* **[HARDENED]** Dependency-aware result expansion (Completeness)
+
+### **Inputs**
+
+* Code entities from the Semantic Graph
+* Raw source code
+
+### **Outputs**
+
+* FAISS Vector Index
+* In-memory Chunk Repository
+* Ranked search results
+
+### **Downstream Consumers**
+
+* API `/context/query` endpoint
+* CLI `semantic-search` command
+* Context Synthesis Layer
+
+---
+
 ## **4\. Static Behavioral Analysis Layer**
 
 ### **Purpose**
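The hardened sliding-window chunking with overlap can be sketched as follows. This is an illustration over lines of source, not the actual `Chunker`; the window and overlap sizes are hypothetical defaults:

```python
def sliding_window_chunks(lines, window=50, overlap=10):
    """Split a list of source lines into overlapping chunks.

    Consecutive chunks share `overlap` lines, so an entity that straddles
    a chunk boundary still appears whole in at least one chunk.
    """
    if window <= overlap:
        raise ValueError("window must be larger than overlap")
    chunks = []
    step = window - overlap  # advance by window minus the shared tail
    for start in range(0, len(lines), step):
        chunks.append("\n".join(lines[start:start + window]))
        if start + window >= len(lines):
            break  # last window already reached the end of the file
    return chunks
```

The overlap trades a little index size for recall: a function defined near line 50 shows up both at the tail of chunk 0 and the head of chunk 1.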

README.md

Lines changed: 36 additions & 8 deletions

@@ -42,10 +42,16 @@ knowcode context "MyClass.important_method"
 # 4. Export documentation
 knowcode export -o docs/
 
-# 5. Start the intelligence server
-knowcode server --port 8080
+# 5. Build semantic search index
+knowcode index src/
+
+# 6. Perform semantic search
+knowcode semantic-search "How does parsing work?"
 
-# 6. View statistics
+# 7. Start the intelligence server with watch mode
+knowcode server --port 8080 --watch
+
+# 8. View statistics
 knowcode stats
 ```
 

@@ -114,11 +120,31 @@ Show statistics about the knowledge store.
 knowcode stats [--store <path>]
 ```
 
+### `index`
+Build a semantic search index for your codebase.
+
+```bash
+knowcode index <directory> [--output <path>]
+```
+
+### `semantic-search`
+Perform a natural language search against the semantic index.
+
+```bash
+knowcode semantic-search <query> [--index <path>] [--limit <n>]
+```
+
+**Example:**
+```bash
+knowcode semantic-search "Where is the graph built?"
+```
+
 ### `server`
 Start the FastAPI intelligence server. This is the preferred way for locally hosted AI agents (IDEs) to interact with KnowCode.
 
 ```bash
-knowcode server [--host <host>] [--port <port>] [--store <path>]
+knowcode server [--host <host>] [--port <port>] [--store <path>] [--watch]
 ```
 
 **Example:**

@@ -128,7 +154,8 @@ knowcode server --port 8080
 
 Once running, you can access endpoints like:
 - `GET /api/v1/context?target=MyClass`
-- `GET /api/v1/search?q=parser`
+- `GET /api/v1/search?q=parser` (lexical search)
+- `POST /api/v1/context/query` (semantic search)
 - `POST /api/v1/reload` (to refresh data after a new `analyze` run)
 

@@ -147,8 +174,9 @@ KnowCode follows a layered architecture:
 2. **Parsers** - Language-specific parsing (Python AST, Tree-sitter for others)
 3. **Graph Builder** - Constructs semantic graph with entities and relationships
 4. **Knowledge Store** - In-memory graph with JSON persistence
-5. **Context Synthesizer** - Generates token-efficient context bundles with priority ranking
-6. **CLI** - User interface for all operations
+5. **Indexer** - Vector embedding and hybrid retrieval engine (FAISS + BM25)
+6. **Context Synthesizer** - Generates token-efficient context bundles with priority ranking
+7. **CLI** - User interface for all operations
 
 See [KnowCode.md](KnowCode.md) for the complete reference architecture.

@@ -225,9 +253,9 @@ See [KnowCode.md](KnowCode.md) for the full vision. The MVP focuses on:
 - ✅ v1.3: Token budget optimization, priority ranking
 - ✅ v1.4: Runtime signal integration
 - ✅ v2.0: Intelligence Server mode (local API for local IDE agents)
+- ✅ v2.1: Semantic search with embeddings, hybrid retrieval, and watch mode
 
 **Future releases:**
-- v2.1: Semantic search with embeddings
 - v3.0: Team sharing & Enterprise features (RBAC, SSO, etc.)
 
 ## License

environment_variables.env

Lines changed: 1 addition & 0 deletions

@@ -0,0 +1 @@
+GOOGLE_API_KEY="AIzaSyCDHxIUW-sHcVmtTLhPJ1rT2C13xqI7Xho"

pyproject.toml

Lines changed: 4 additions & 0 deletions

@@ -15,6 +15,10 @@ dependencies = [
     "tiktoken>=0.7.0",
     "fastapi>=0.100.0",
     "uvicorn>=0.22.0",
+    "openai>=1.0.0",
+    "faiss-cpu>=1.7.0",
+    "numpy>=1.24.0",
+    "watchdog>=3.0.0",
 ]
 
 [project.scripts]
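The new `faiss-cpu` and `numpy` dependencies back the dense retrieval path. Conceptually, FAISS accelerates the same brute-force nearest-neighbor search that a plain NumPy version performs; a sketch (not the actual `VectorStore`, and the function name is made up):

```python
import numpy as np


def top_k_cosine(query_vec, matrix, k=5):
    """Return indices of the k rows of `matrix` most cosine-similar to `query_vec`.

    Over L2-normalized vectors this matches what a FAISS inner-product
    index computes, just without the optimized kernels.
    """
    q = query_vec / np.linalg.norm(query_vec)
    m = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    sims = m @ q                      # cosine similarity per row
    return np.argsort(-sims)[:k]      # best-first indices
```

At the ~1536-dimensional embeddings this project uses, the brute-force version stays workable for thousands of chunks; FAISS matters as the index grows.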

scripts/evaluate.py

Lines changed: 94 additions & 0 deletions

```python
"""Evaluation script for retrieval quality."""

import json
import sys
from pathlib import Path

from knowcode.chunk_repository import InMemoryChunkRepository
from knowcode.vector_store import VectorStore
from knowcode.hybrid_index import HybridIndex
from knowcode.embedding import OpenAIEmbeddingProvider
from knowcode.models import EmbeddingConfig, CodeChunk


def evaluate(ground_truth_path: Path, index_path: Path) -> dict:
    """Evaluate retrieval quality against ground truth."""
    if not ground_truth_path.exists():
        return {"error": "Ground truth file not found"}

    with open(ground_truth_path) as f:
        ground_truth = json.load(f)

    # Load index components. index_path is the directory written by the
    # Indexer: it contains chunks.json and a "vectors" subdirectory.
    repo = InMemoryChunkRepository()
    chunks_file = index_path / "chunks.json"
    if chunks_file.exists():
        with open(chunks_file) as f:
            data = json.load(f)
        for c_data in data["chunks"]:
            repo.add(CodeChunk(**c_data))

    vs = VectorStore(dimension=1536, index_path=index_path / "vectors")

    # Queries need a live embedding provider; use the same provider that
    # built the index (OpenAI).
    try:
        provider = OpenAIEmbeddingProvider(EmbeddingConfig())
    except Exception:
        print("Skipping evaluation: no OpenAI API key found")
        return {}

    hybrid = HybridIndex(repo, vs)

    # Metrics
    hits_at_5 = 0
    hits_at_10 = 0
    mrr_sum = 0.0
    total_queries = len(ground_truth)

    for item in ground_truth:
        query = item.get("query")
        expected_ids = set(item.get("expected_ids", []))
        if not query or not expected_ids:
            continue

        q_vec = provider.embed_single(query)
        # Search the hybrid index directly, skipping the SearchEngine
        # wrapper, so we measure raw retrieval quality.
        results = hybrid.search(query, q_vec, limit=10)
        found_ids = [c.id for c, _ in results]

        # Hit rate @k: did any expected chunk appear in the top k?
        if any(fid in expected_ids for fid in found_ids[:5]):
            hits_at_5 += 1
        if any(fid in expected_ids for fid in found_ids[:10]):
            hits_at_10 += 1

        # MRR: reciprocal rank of the first expected chunk found.
        rank = 0
        for i, fid in enumerate(found_ids):
            if fid in expected_ids:
                rank = i + 1
                break
        if rank > 0:
            mrr_sum += 1.0 / rank

    return {
        "hit_rate_at_5": hits_at_5 / total_queries if total_queries else 0,
        "hit_rate_at_10": hits_at_10 / total_queries if total_queries else 0,
        "mrr": mrr_sum / total_queries if total_queries else 0,
    }


if __name__ == "__main__":
    if len(sys.argv) < 3:
        print("Usage: python evaluate.py <ground_truth.json> <index_dir>")
        sys.exit(1)

    gt_path = Path(sys.argv[1])
    idx_path = Path(sys.argv[2])
    print(evaluate(gt_path, idx_path))
```
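The script reads a JSON list of `{"query", "expected_ids"}` objects. A toy run of the same first-relevant-rank (MRR) arithmetic, with made-up chunk ids standing in for real search results:

```python
import json

# Shape of ground_truth.json as the script reads it (ids are invented):
ground_truth = json.loads("""
[
  {"query": "How does parsing work?", "expected_ids": ["chunk-parser-01"]},
  {"query": "Where is the graph built?", "expected_ids": ["chunk-graph-07"]}
]
""")

# Toy rankings standing in for the hybrid search output:
rankings = {
    "How does parsing work?": ["chunk-parser-01", "chunk-x", "chunk-y"],
    "Where is the graph built?": ["chunk-x", "chunk-y", "chunk-graph-07"],
}

mrr_sum = 0.0
for item in ground_truth:
    found = rankings[item["query"]]
    expected = set(item["expected_ids"])
    # Reciprocal rank of the first expected id, 0 if absent.
    rank = next((i + 1 for i, fid in enumerate(found) if fid in expected), 0)
    if rank:
        mrr_sum += 1.0 / rank

mrr = mrr_sum / len(ground_truth)  # (1/1 + 1/3) / 2 = 2/3
```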

src/knowcode/__init__.py

Lines changed: 3 additions & 0 deletions

@@ -1,3 +1,6 @@
 """KnowCode - Transform your codebase into an effective knowledge base."""
 
 __version__ = "0.1.0"
+
+from knowcode.models import CodeChunk, EmbeddingConfig
+from knowcode.chunk_repository import ChunkRepository, InMemoryChunkRepository

src/knowcode/agent.py

Lines changed: 93 additions & 0 deletions

```python
"""Agent module for KnowCode."""

import os

from openai import OpenAI

from knowcode.service import KnowCodeService


class Agent:
    """Agent that answers questions about the codebase using an LLM."""

    def __init__(self, service: KnowCodeService, model: str = "gpt-4o") -> None:
        """Initialize the agent.

        Args:
            service: KnowCodeService instance for context retrieval.
            model: OpenAI model to use.
        """
        self.service = service
        self.model = model
        # Allow initialization without a key so the CLI can start up;
        # answer() raises if no key is available when it is actually called.
        api_key = os.environ.get("OPENAI_API_KEY")
        self.client = OpenAI(api_key=api_key) if api_key else None

    def answer(self, query: str) -> str:
        """Answer a question about the codebase.

        Args:
            query: User's question.

        Returns:
            The agent's answer.

        Raises:
            ValueError: If OPENAI_API_KEY is not set.
            OpenAIError: If the API call fails.
        """
        if not self.client:
            api_key = os.environ.get("OPENAI_API_KEY")
            if not api_key:
                raise ValueError("OPENAI_API_KEY environment variable is not set.")
            self.client = OpenAI(api_key=api_key)

        # 1. Retrieve knowledge. Simple strategy: search the graph for the
        # query terms, then fetch context for the top matches. A vector
        # store search would be ideal here; the MVP uses graph search.
        search_results = self.service.search(query)

        context_str = ""
        if search_results:
            # Use up to 3 relevant entities, capping tokens per bundle so
            # the combined context fits the model's window comfortably.
            context_parts = []
            for entity in search_results[:3]:
                try:
                    bundle = self.service.get_context(entity.id, max_tokens=1500)
                    context_parts.append(bundle["context_text"])
                except Exception:
                    continue
            if context_parts:
                context_str = "\n\n".join(context_parts)

        if not context_str:
            context_str = "No specific entities found in the codebase matching the query terms."

        # 2. Construct the prompt.
        system_prompt = (
            "You are an expert software engineering assistant. "
            "You have access to context from the user's codebase. "
            "Answer the user's question based strictly on the provided context. "
            "If the context doesn't contain the answer, say so, but try to be "
            "helpful based on the visible code structures."
        )
        user_message = f"Context:\n{context_str}\n\nQuestion: {query}"

        # 3. Call the LLM.
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_message},
            ],
            temperature=0.0,
        )
        return response.choices[0].message.content or "No response from LLM."
```
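The retrieval step in `Agent.answer()` reduces to joining a handful of context bundles into one user message. A minimal sketch of that assembly with plain strings (the helper name is hypothetical; the real code pulls the parts from `KnowCodeService`):

```python
def build_user_message(query, context_parts, max_entities=3):
    """Assemble the user message the way Agent.answer() does: up to
    `max_entities` context bundles joined by blank lines, then the question."""
    parts = context_parts[:max_entities]
    context_str = (
        "\n\n".join(parts)
        if parts
        else "No specific entities found in the codebase matching the query terms."
    )
    return f"Context:\n{context_str}\n\nQuestion: {query}"


msg = build_user_message(
    "What does the parser do?",
    ["def parse(): ...", "class Parser: ..."],
)
```

Keeping the fallback sentence in the context slot, rather than sending an empty context, tells the model explicitly that retrieval found nothing instead of leaving it to guess.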
