Open
Labels: enhancement, help wanted
Description
Add a local, vector-indexed pentesting knowledge base (KB) that the agent queries before falling back to web search. The KB would hold curated content from ExploitDB, NVD, the OWASP Testing Guide, and tool documentation, which should be faster and more reliable than Tavily for known techniques.
Why this matters
The agent currently relies on Tavily web search for all external knowledge. This has four problems:
- Latency: web search takes 2-5 seconds per query. During a 50-iteration engagement, the agent may search 15-20 times, which adds roughly 30-100 seconds of waiting. A local vector DB returns results in <100ms.
- Reliability: web search returns blog posts, Stack Overflow answers, and marketing pages mixed with actual exploit documentation. The agent wastes reasoning tokens parsing irrelevant results. A curated KB returns only verified, actionable content.
- Knowledge gaps on recent or niche CVEs: the LLM's training data has a cutoff. For CVEs published after the cutoff, the agent must web search — but search results for fresh CVEs are often incomplete or contradictory. A regularly updated local KB with NVD data fills this gap.
- Tool documentation accuracy: the agent frequently uses wrong Metasploit module syntax, incorrect sqlmap flags, or invalid Hydra protocol strings. Embedding exact tool documentation (module options, syntax, examples) in a retrievable KB eliminates these errors.
Architecture
Agent needs knowledge → Query local KB first (fast, curated)
↓ sufficient?
YES → use it
NO → fall back to Tavily web search
↓
Merge and deduplicate results
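The flow above can be sketched as a single function. This is a minimal illustration, not the project's implementation: the `Hit` type, the `retrieve` name, and the `min_score` / `min_hits` thresholds used to decide "sufficient" are all assumptions, and the two search backends are passed in as plain callables so that no Tavily or vector-store API is implied.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Hit:
    source: str   # where the result came from, e.g. "kb" or "web"
    ref: str      # stable identifier used for deduplication (URL, CVE ID, ...)
    text: str
    score: float  # similarity in [0, 1]; higher is better


Search = Callable[[str], List[Hit]]


def retrieve(query: str, local_search: Search, web_search: Search,
             min_score: float = 0.6, min_hits: int = 2) -> List[Hit]:
    """Query the local KB first; call web search only if the KB looks insufficient."""
    local_hits = local_search(query)
    strong = [h for h in local_hits if h.score >= min_score]
    if len(strong) >= min_hits:
        # Sufficient: answer from the KB alone, best match first.
        return sorted(strong, key=lambda h: h.score, reverse=True)

    # Insufficient: keep whatever the KB returned and add web results.
    merged: Dict[str, Hit] = {}
    for hit in local_hits + web_search(query):
        key = hit.ref.strip().lower()
        # On duplicate refs, keep the higher-scoring copy.
        if key not in merged or hit.score > merged[key].score:
            merged[key] = hit
    return sorted(merged.values(), key=lambda h: h.score, reverse=True)
```

Deduplicating on a normalised `ref` is the simplest option; near-duplicate text from different URLs would need a content-based check instead.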
Proposed knowledge sources
| Source | Content | Update frequency |
|---|---|---|
| NVD/CVE database | CVE descriptions, CVSS scores, affected products | Daily (via NVD API) |
| ExploitDB | Exploit code, proof-of-concepts, vulnerability details | Weekly |
| OWASP Testing Guide | Web app testing methodology and techniques | Static (versioned) |
| Metasploit module docs | Module options, targets, payloads, examples | On image build |
| Nuclei template metadata | Template IDs, severity, tags, CVE mappings | On image build |
| Tool manuals | sqlmap, Hydra, nmap flags and usage patterns | Static |
| GTFOBins/LOLBAS | Privilege escalation one-liners per binary | Monthly |
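One way to drive re-ingestion from the table above is a small registry that records a maximum age per source. The sketch below is hypothetical: the `Source` type, the source names, and `stale_sources` are illustrative, "Monthly" is approximated as 30 days, and sources marked "Static" or "On image build" are treated as never expiring at runtime.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Source:
    name: str
    max_age: Optional[timedelta]  # None = static or refreshed only on image build


# Intervals mirror the "Update frequency" column; names are placeholders.
SOURCES = [
    Source("nvd", timedelta(days=1)),
    Source("exploitdb", timedelta(weeks=1)),
    Source("owasp_testing_guide", None),
    Source("metasploit_modules", None),
    Source("nuclei_templates", None),
    Source("tool_manuals", None),
    Source("gtfobins_lolbas", timedelta(days=30)),
]


def stale_sources(last_ingested: Dict[str, datetime], now: datetime) -> List[str]:
    """Return the names of sources that are due for (re-)ingestion."""
    due = []
    for src in SOURCES:
        last = last_ingested.get(src.name)
        if last is None:
            due.append(src.name)  # never ingested, so always due
        elif src.max_age is not None and now - last >= src.max_age:
            due.append(src.name)  # older than its allowed age
    return due
```

A scheduled job could call `stale_sources` with timestamps stored next to the index and re-run ingestion only for the names it returns.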
What already exists
- Tavily web search integration (`tools.py:402-481`)
- Neo4j for graph data (could double as document store)
- MITRE CWE/CAPEC enrichment system with caching
- NVD API key support in `.env.example`
What needs to be built
- Vector store setup (FAISS, ChromaDB, or Qdrant — FAISS is simplest, no extra container)
- Document ingestion pipeline for each knowledge source
- Embedding model selection (sentence-transformers or OpenAI embeddings)
- `PentestKnowledgeBase` class with `query` + `adaptive_retrieve` methods
- Integration with existing Tavily: local-first, web-fallback pattern
- Update pipeline (scheduled re-ingestion for NVD, ExploitDB)
- Docker volume for persistent vector index
- Tool documentation extraction script (parse Metasploit module info, sqlmap --help, etc.)
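To make the shape of the proposed class concrete, here is a deliberately simplified, stdlib-only sketch of the `query` side. Everything in it is an assumption for illustration: the `embed` function is a hashed bag-of-words stand-in rather than sentence-transformers or OpenAI embeddings, the index is a plain in-memory list rather than FAISS, ChromaDB, or Qdrant, and the `add` method and document IDs are hypothetical.

```python
import math
import re
import zlib
from typing import List, Tuple

DIM = 256  # toy vector width; real embedding models are larger


def embed(text: str) -> List[float]:
    """Stand-in embedding: hashed bag-of-words, L2-normalised. Not a semantic model."""
    vec = [0.0] * DIM
    for token in re.findall(r"[a-z0-9_\-]+", text.lower()):
        vec[zlib.crc32(token.encode()) % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


class PentestKnowledgeBase:
    """Hypothetical in-memory sketch; a real build would persist the index to a volume."""

    def __init__(self) -> None:
        self._docs: List[Tuple[str, str, List[float]]] = []  # (doc_id, text, vector)

    def add(self, doc_id: str, text: str) -> None:
        """Embed one document chunk and store it."""
        self._docs.append((doc_id, text, embed(text)))

    def query(self, question: str, k: int = 3) -> List[Tuple[str, float]]:
        """Return up to k (doc_id, cosine similarity) pairs, best first."""
        q = embed(question)
        # Vectors are unit length, so the dot product equals cosine similarity.
        scored = [(doc_id, sum(a * b for a, b in zip(q, vec)))
                  for doc_id, _text, vec in self._docs]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:k]


kb = PentestKnowledgeBase()
kb.add("sqlmap-dbs", "sqlmap enumerate databases with --dbs flag")
kb.add("hydra-ssh", "hydra ssh brute force login")
top_id, top_score = kb.query("sqlmap databases", k=1)[0]
```

Swapping `embed` for a real model and the list scan for a vector index would leave the `add` / `query` interface unchanged, which is the main reason to hide both behind the class.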
Status: Up for grabs