Agent Intelligence: RAG-Enhanced Pentesting Knowledge Base #53

@samugit83

Description

Add a local, vector-indexed pentesting knowledge base that the agent queries before falling back to web search. It would hold curated content from ExploitDB, NVD, the OWASP Testing Guide, and tool documentation, which should be faster and more reliable than Tavily for known techniques.

Why this matters

The agent currently relies on Tavily web search for all external knowledge. This has four problems:

  1. Latency: web search takes 2-5 seconds per query. During a 50-iteration engagement, the agent may search 15-20 times, which adds up to roughly 30-100 seconds of waiting. A local vector DB should return results in under 100 ms.
  2. Reliability: web search returns blog posts, Stack Overflow answers, and marketing pages mixed with actual exploit documentation. The agent wastes reasoning tokens parsing irrelevant results. A curated KB returns only verified, actionable content.
  3. Knowledge gaps on recent or niche CVEs: the LLM's training data has a cutoff. For CVEs published after the cutoff, the agent must search the web, but search results for fresh CVEs are often incomplete or contradictory. A regularly updated local KB with NVD data fills this gap.
  4. Tool documentation accuracy: the agent frequently uses wrong Metasploit module syntax, incorrect sqlmap flags, or invalid Hydra protocol strings. Embedding exact tool documentation (module options, syntax, examples) in a retrievable KB should sharply reduce these errors.
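To put the latency point in numbers, here is a back-of-the-envelope calculation that uses only the figures quoted in item 1. The constant and function names are illustrative, not part of the codebase.

```python
# Latency budget per engagement, using only the figures quoted in item 1.
SEARCHES_PER_ENGAGEMENT = (15, 20)   # low and high estimate
WEB_SECONDS_PER_QUERY = (2.0, 5.0)   # Tavily, per the issue text
LOCAL_SECONDS_PER_QUERY = 0.1        # upper bound of "<100ms"


def total_seconds(searches: int, seconds_per_query: float) -> float:
    """Total time spent waiting on searches during one engagement."""
    return searches * seconds_per_query


web_best = total_seconds(SEARCHES_PER_ENGAGEMENT[0], WEB_SECONDS_PER_QUERY[0])
web_worst = total_seconds(SEARCHES_PER_ENGAGEMENT[1], WEB_SECONDS_PER_QUERY[1])
local_worst = total_seconds(SEARCHES_PER_ENGAGEMENT[1], LOCAL_SECONDS_PER_QUERY)

print(f"web search: {web_best:.0f}-{web_worst:.0f} s per engagement")
print(f"local KB:   under {local_worst:.0f} s per engagement")
```

The local figure is an upper bound that assumes every query is answered locally; queries that fall back to Tavily still pay the web latency.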

Architecture

Agent needs knowledge → Query local KB first (fast, curated)
                          ↓ sufficient?
                        YES → use it
                        NO → fall back to Tavily web search
                          ↓
                        Merge and deduplicate results
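
The flow above could be sketched roughly as follows. This is a minimal illustration, not the proposed implementation: `Result`, `retrieve`, `merge_and_dedupe`, and the `min_score` / `min_hits` sufficiency thresholds are all assumptions, and the two search functions are passed in as plain callables so the sketch does not depend on the Tavily client or any particular vector store.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Result:
    text: str
    score: float   # similarity or relevance score in [0, 1] (assumed scale)
    source: str    # e.g. "kb" or "web"


SearchFn = Callable[[str], List[Result]]


def merge_and_dedupe(results: List[Result]) -> List[Result]:
    """Drop results whose normalized text was already seen, keeping the first."""
    seen = set()
    merged = []
    for r in results:
        key = " ".join(r.text.lower().split())  # case- and whitespace-insensitive
        if key not in seen:
            seen.add(key)
            merged.append(r)
    return merged


def retrieve(
    query: str,
    local_search: SearchFn,
    web_search: SearchFn,
    min_score: float = 0.75,  # assumed threshold for a "good" local hit
    min_hits: int = 2,        # assumed number of good hits that counts as sufficient
) -> List[Result]:
    """Query the local KB first; fall back to web search if it is not sufficient."""
    local_hits = [r for r in local_search(query) if r.score >= min_score]
    if len(local_hits) >= min_hits:
        return local_hits                 # YES branch: skip the web entirely

    web_hits = web_search(query)          # NO branch: fall back to web search
    return merge_and_dedupe(local_hits + web_hits)
```

Local hits are placed first in the merged list, so when the same text appears in both sources the curated copy wins. How "sufficient" should actually be defined (score threshold, hit count, or an LLM judgment) is left open here.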

Proposed knowledge sources

| Source | Content | Update frequency |
|---|---|---|
| NVD/CVE database | CVE descriptions, CVSS scores, affected products | Daily (via NVD API) |
| ExploitDB | Exploit code, proof-of-concepts, vulnerability details | Weekly |
| OWASP Testing Guide | Web app testing methodology and techniques | Static (versioned) |
| Metasploit module docs | Module options, targets, payloads, examples | On image build |
| Nuclei template metadata | Template IDs, severity, tags, CVE mappings | On image build |
| Tool manuals | sqlmap, Hydra, nmap flags and usage patterns | Static |
| GTFOBins/LOLBAS | Privilege escalation one-liners per binary | Monthly |
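
One way the update frequencies listed above might be encoded for a scheduled re-ingestion job is sketched below. The source keys, the `KnowledgeSource` type, and `due_for_refresh` are hypothetical, and "monthly" is approximated as 30 days. Sources that are static or rebuilt with the Docker image carry no interval and are never reported as due.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List, Optional


@dataclass(frozen=True)
class KnowledgeSource:
    name: str
    refresh_every: Optional[timedelta]  # None = static, or refreshed on image build


SOURCES = [
    KnowledgeSource("nvd", timedelta(days=1)),
    KnowledgeSource("exploitdb", timedelta(weeks=1)),
    KnowledgeSource("owasp_testing_guide", None),
    KnowledgeSource("metasploit_modules", None),
    KnowledgeSource("nuclei_templates", None),
    KnowledgeSource("tool_manuals", None),
    KnowledgeSource("gtfobins_lolbas", timedelta(days=30)),  # "monthly", approximated
]


def due_for_refresh(
    sources: List[KnowledgeSource],
    last_ingested: Dict[str, datetime],
    now: datetime,
) -> List[str]:
    """Return the names of scheduled sources whose refresh interval has elapsed."""
    due = []
    for src in sources:
        if src.refresh_every is None:
            continue  # not on a schedule
        last = last_ingested.get(src.name)
        if last is None or now - last >= src.refresh_every:
            due.append(src.name)  # never ingested, or stale
    return due
```

A cron job or container entrypoint could call `due_for_refresh` with timestamps stored next to the vector index and re-ingest only the sources it returns.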

What already exists

  • Tavily web search integration (tools.py:402-481)
  • Neo4j for graph data (could double as document store)
  • MITRE CWE/CAPEC enrichment system with caching
  • NVD API key support in .env.example

What needs to be built

  • Vector store setup (FAISS, ChromaDB, or Qdrant — FAISS is simplest, no extra container)
  • Document ingestion pipeline for each knowledge source
  • Embedding model selection (sentence-transformers or OpenAI embeddings)
  • PentestKnowledgeBase class with query + adaptive_retrieve methods
  • Integration with existing Tavily: local-first, web-fallback pattern
  • Update pipeline (scheduled re-ingestion for NVD, ExploitDB)
  • Docker volume for persistent vector index
  • Tool documentation extraction script (parse Metasploit module info, sqlmap --help, etc.)
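
As a starting point for discussion, the `PentestKnowledgeBase` class might look something like the sketch below. Several things here are assumptions: the issue does not define what `adaptive_retrieve` should do, so this version simply relaxes a score threshold until enough hits are found; the `add` method and the return shape are invented for illustration; and the bag-of-words `embed` function is a toy stand-in so the example runs without FAISS, ChromaDB, Qdrant, or an embedding model.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Sequence, Tuple

Hit = Tuple[float, str, Dict[str, str]]  # (score, text, metadata)


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a real build would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9_]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class PentestKnowledgeBase:
    def __init__(self) -> None:
        # In-memory list standing in for a persistent vector index.
        self._docs: List[Tuple[str, Dict[str, str], Counter]] = []

    def add(self, text: str, metadata: Dict[str, str]) -> None:
        self._docs.append((text, metadata, embed(text)))

    def query(self, question: str, k: int = 3, min_score: float = 0.0) -> List[Hit]:
        """Return up to k documents scoring strictly above min_score, best first."""
        q = embed(question)
        scored = [(cosine(q, vec), text, meta) for text, meta, vec in self._docs]
        scored = [s for s in scored if s[0] > min_score]
        scored.sort(key=lambda s: s[0], reverse=True)
        return scored[:k]

    def adaptive_retrieve(
        self,
        question: str,
        k: int = 3,
        thresholds: Sequence[float] = (0.5, 0.3, 0.1),
    ) -> List[Hit]:
        """Assumed behaviour: try strict thresholds first, relax until k hits are found."""
        hits: List[Hit] = []
        for threshold in thresholds:
            hits = self.query(question, k=k, min_score=threshold)
            if len(hits) >= k:
                break
        return hits
```

Usage would be along the lines of `kb.add("nmap -sV detects service versions", {"source": "tool_manuals"})` followed by `kb.adaptive_retrieve("nmap service versions", k=1)`. Swapping the toy `embed` / `cosine` pair for a real embedding model and vector store should not need to change the two public methods.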

Metadata

Assignees: No one assigned
Labels: enhancement (New feature or request), help wanted (Extra attention is needed)
Status: Up for grabs
Milestone: No milestone