What we learned building a multi-model security scanning platform from scratch.
When we started building KENSAI, the premise was simple: penetration testing is too expensive, too slow, and too infrequent for modern development cycles. Teams ship code daily but test security quarterly. That gap is where breaches happen.
Two years and 462+ scans later, here's what we've learned about building AI that finds vulnerabilities autonomously — the technical challenges nobody warns you about, the architectural decisions that matter, and the things we got wrong.
The first thing most people get wrong about AI-powered security scanning is assuming you need one really good model. You don't. You need multiple models with different strengths, and you need an orchestration layer smart enough to know when to use which.
Here's why: different language models have fundamentally different reasoning patterns. A model that excels at identifying SQL injection patterns might completely miss business logic flaws. A model trained heavily on code analysis might overlook infrastructure misconfigurations.
KENSAI's scanning engine uses what we call a "panel of experts" architecture:
```
Target → Reconnaissance → [Model A: Infrastructure]
                        → [Model B: Application Logic] → Correlation → Report
                        → [Model C: Code Analysis]
```
Each model generates findings independently. The correlation layer then does something crucial: it chains findings together. A low-severity information disclosure finding from Model A combined with a medium-severity authentication weakness from Model B might constitute a critical attack path that neither model would flag alone.
This isn't ensemble learning in the traditional ML sense. It's closer to how a real red team operates — different specialists examine the same target and then compare notes.
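To make the chaining idea concrete, here is a minimal sketch of a correlation layer. All names, rules, and severities are hypothetical illustrations under the assumption that findings carry a component, a kind, and a severity; this is not KENSAI's actual implementation:

```python
from dataclasses import dataclass
from itertools import combinations

SEVERITY = {"info": 1, "low": 2, "medium": 3, "high": 4, "critical": 5}

@dataclass(frozen=True)
class Finding:
    model: str       # which expert produced it
    component: str   # affected host/service/file
    kind: str        # e.g. "info_disclosure", "weak_auth"
    severity: str

# Hypothetical rule: pairs of finding kinds that, combined on the
# same component, form an attack path more severe than either alone.
CHAIN_RULES = {
    frozenset({"info_disclosure", "weak_auth"}): "critical",
}

def correlate(findings):
    """Chain independent findings into composite attack paths."""
    chains = []
    for a, b in combinations(findings, 2):
        if a.component != b.component:
            continue  # only chain findings on the same component
        escalated = CHAIN_RULES.get(frozenset({a.kind, b.kind}))
        if escalated and SEVERITY[escalated] > max(SEVERITY[a.severity],
                                                   SEVERITY[b.severity]):
            chains.append((a, b, escalated))
    return chains

found = [
    Finding("infra-model", "api.example.com", "info_disclosure", "low"),
    Finding("applogic-model", "api.example.com", "weak_auth", "medium"),
]
print(correlate(found))  # one chained critical attack path
```

A real system would use learned or analyst-curated chain rules rather than a static table, but the shape is the same: pairwise (or graph-based) matching over findings from different models, with escalation when a combination exceeds its parts.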
Training AI to find vulnerabilities requires vulnerable applications. Lots of them. But here's the catch — you can't train on production systems (legal and ethical issues), and synthetic vulnerable applications don't capture the messy reality of real-world code.
We addressed this through a three-layer training approach:
- Intentionally vulnerable applications: OWASP WebGoat, DVWA, HackTheBox machines, and custom vulnerable apps we built specifically for edge cases.
- Bug bounty data: Anonymized vulnerability reports from public bug bounty programs provided real-world patterns. The signal-to-noise ratio is terrible (lots of duplicates and invalid reports), but the valid findings are gold for training.
- Synthetic mutation: We built a system that takes known vulnerability patterns and generates variations — different frameworks, different languages, different architectural patterns. This dramatically expanded our training corpus without requiring new real-world data.
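The synthetic mutation idea can be sketched as template expansion: take one vulnerability pattern and instantiate it across languages and taint sources. The templates and source lists below are hypothetical examples, not the actual mutation system:

```python
# Hypothetical template: a SQL-injection-prone query, parameterized
# by language idiom and by where the tainted input comes from.
TEMPLATES = {
    "python": 'cursor.execute("SELECT * FROM users WHERE name = \'" + {src} + "\'")',
    "php":    '$db->query("SELECT * FROM users WHERE name = \'" . {src} . "\'");',
}
TAINT_SOURCES = {
    "python": ["request.args['name']", "os.environ['NAME']"],
    "php":    ["$_GET['name']", "$_POST['name']"],
}

def mutate(templates, sources):
    """Expand one vulnerability pattern into many concrete variants."""
    variants = []
    for lang, template in templates.items():
        for src in sources[lang]:
            variants.append((lang, template.format(src=src)))
    return variants

corpus = mutate(TEMPLATES, TAINT_SOURCES)
print(len(corpus))  # 4 variants from a single pattern
```

Multiply this across frameworks, ORMs, and architectural patterns and one known vulnerability class yields thousands of labeled training samples.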
The cold start problem never fully goes away. Every new technology stack, every new framework, every new cloud service creates new attack surface that our models haven't seen. Continuous learning isn't optional — it's the core product requirement.
Security scanning requires understanding context across large codebases and infrastructure. A vulnerability might span multiple files, services, or network segments. But language models have finite context windows.
Our solution: hierarchical scanning with progressive detail. The first pass is broad — identify technologies, map the attack surface, flag areas of interest. Subsequent passes zoom into specific areas with full context about the surrounding architecture.
Think of it like a human pentester. You don't read every line of code first. You map the application, identify high-value targets, and then dig deep where it matters.
```python
# Simplified scanning pipeline
async def scan(target):
    # Phase 1: Reconnaissance (broad, fast)
    surface = await map_attack_surface(target)

    # Phase 2: Targeted analysis (deep, focused)
    findings = []
    for area in surface.high_value_targets:
        context = await gather_context(area, depth=3)
        result = await analyze_with_models(area, context)
        findings.extend(result)

    # Phase 3: Correlation (chain findings)
    chains = await correlate_findings(findings, surface)
    return prioritize(findings + chains)
```

Nothing kills trust in a security tool faster than false positives. If your scanner cries wolf ten times, nobody investigates the eleventh alert — which is the real one.
Our false positive rate in early versions was around 35%. Unacceptable. We got it down to under 8% through three mechanisms:
Verification scanning: When the AI identifies a potential vulnerability, a second pass attempts to verify it — sometimes by generating and safely executing a proof-of-concept, sometimes by analyzing the code path to confirm the vulnerability is reachable.
Confidence scoring: Every finding gets a confidence score based on how many indicators support it. A SQL injection finding supported by error-based responses, time-based confirmation, and code analysis gets a higher score than one based on a single heuristic.
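One simple way to implement this kind of scoring is to treat each indicator as independent evidence and combine via a complement product, so corroborating indicators monotonically raise the score. The weights below are made-up illustrations; a real system would learn or tune them:

```python
# Hypothetical indicator weights, read as "probability this indicator
# alone implies a true positive". Real weights would be learned/tuned.
INDICATOR_WEIGHTS = {
    "error_based_response": 0.4,
    "time_based_confirmation": 0.35,
    "static_code_path": 0.25,
    "single_heuristic": 0.2,
}

def confidence(indicators):
    """Combine independent indicators into one score in [0, 1]."""
    remaining_doubt = 1.0
    for name in indicators:
        remaining_doubt *= 1.0 - INDICATOR_WEIGHTS.get(name, 0.0)
    return 1.0 - remaining_doubt

strong = confidence(["error_based_response",
                     "time_based_confirmation",
                     "static_code_path"])
weak = confidence(["single_heuristic"])
print(round(strong, 3), round(weak, 3))  # the corroborated finding scores higher
```

The SQL injection example from the text falls out directly: three supporting indicators push the score well above a single heuristic.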
Feedback loops: When users mark findings as false positives, that data feeds back into the model. Over time, the system learns which patterns in specific contexts are genuine versus benign.
This is the ethical tightrope of autonomous pentesting. To verify a vulnerability, you sometimes need to exploit it. But autonomous exploitation of production systems without human oversight is a liability nightmare.
We drew a clear line: KENSAI's scanning engine can perform non-destructive verification (sending payloads that confirm a vulnerability without causing damage) but stops short of full exploitation. For anything that could modify data or affect availability, the system generates a detailed proof-of-concept that a human can execute in a controlled environment.
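Enforcing that line mechanically amounts to classifying every verification technique by its worst-case impact and only auto-running the read-only ones. The technique names and classifications below are hypothetical illustrations of the policy, not KENSAI's actual gate:

```python
from enum import Enum

class Impact(Enum):
    READ_ONLY = "read_only"        # safe to execute autonomously
    STATE_CHANGING = "state"       # human sign-off required
    AVAILABILITY = "availability"  # human sign-off required

# Hypothetical classification of techniques by worst-case impact.
TECHNIQUE_IMPACT = {
    "boolean_blind_sqli_probe": Impact.READ_ONLY,
    "reflected_xss_canary": Impact.READ_ONLY,
    "sqli_stacked_update": Impact.STATE_CHANGING,
    "slowloris_check": Impact.AVAILABILITY,
}

def may_auto_verify(technique: str) -> bool:
    """Unknown techniques default to the restrictive side; everything
    that is not provably read-only becomes a PoC handed to a human."""
    return TECHNIQUE_IMPACT.get(technique, Impact.STATE_CHANGING) is Impact.READ_ONLY

print(may_auto_verify("boolean_blind_sqli_probe"))  # True
print(may_auto_verify("sqli_stacked_update"))       # False
```

The key design choice is the default: an unclassified technique is treated as state-changing, so new capabilities cannot silently cross the line.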
This is a philosophical decision as much as a technical one. The cybersecurity community has strong opinions about automated exploitation, and rightfully so. We chose to optimize for trust over capability.
A thorough security scan can take hours. Modern development teams expect feedback in minutes. Balancing thoroughness with speed is an ongoing challenge.
Our approach: tiered scanning profiles.
- Quick scan (2-5 minutes): External attack surface, known CVEs, configuration issues. Good for CI/CD pipelines.
- Standard scan (15-30 minutes): Comprehensive vulnerability assessment including application-level testing. Good for weekly or sprint-based scanning.
- Deep scan (1-2 hours): Full autonomous pentesting with multi-model analysis, chained attack paths, and business logic testing. Good for pre-release or regulatory compliance.
The quick scan catches 70% of critical issues in 5% of the time. For most teams, running quick scans continuously and deep scans weekly provides excellent coverage.
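The tiers above map naturally onto profile objects that gate which pipeline phases run under which time budget. A minimal sketch, with hypothetical phase names and trigger mapping:

```python
from dataclasses import dataclass

@dataclass
class ScanProfile:
    name: str
    budget_minutes: int
    phases: list

# Hypothetical profiles mirroring the three tiers described above.
PROFILES = {
    "quick": ScanProfile("quick", 5,
        ["external_surface", "known_cves", "config_checks"]),
    "standard": ScanProfile("standard", 30,
        ["external_surface", "known_cves", "config_checks", "app_testing"]),
    "deep": ScanProfile("deep", 120,
        ["external_surface", "known_cves", "config_checks", "app_testing",
         "multi_model", "chained_paths", "business_logic"]),
}

def select_profile(trigger: str) -> ScanProfile:
    """Pick a tier by what triggered the scan."""
    if trigger == "ci":
        return PROFILES["quick"]
    if trigger == "pre_release":
        return PROFILES["deep"]
    return PROFILES["standard"]

print(select_profile("ci").budget_minutes)  # 5
```

Because each deeper profile is a superset of the phases below it, results from a quick CI scan remain comparable to the weekly deep scan.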
Over-engineering the first version. We spent three months building a sophisticated orchestration system before validating that our core scanning models actually produced useful results. Should have started with a single model, one scan type, and iterated from user feedback.
Ignoring the reporting problem. Finding vulnerabilities is half the battle. Communicating them in a way that gets developers to actually fix them is the other half. Our early reports were technically accurate but practically useless — too much jargon, no prioritization, no remediation guidance. We rewrote the reporting engine three times.
Underestimating infrastructure costs. Running multiple large language models for security analysis is computationally expensive. Our initial pricing model didn't account for the actual cost of deep scans. We learned this lesson with our bank account.
Assuming developers want security tools. They don't. They want to ship features. Security tools that add friction get disabled. We redesigned the entire UX around minimal friction — one-click scans, results in Slack, auto-created tickets, fix suggestions in PR comments.
As of early 2026, KENSAI has processed over 462 scans across organizations ranging from two-person startups to enterprise customers. The platform supports 11 languages in its interface (targeting the DACH and EU markets specifically) and maps findings to compliance frameworks including NIS2, DORA, and GDPR.
The scan results are available at kensai.app — you can run a free external scan to see the output format and finding quality for yourself.
Three areas we're actively investing in:
- Supply chain scanning. Most organizations have no visibility into the security posture of their dependencies. We're building capabilities to recursively scan dependency trees and flag transitive risks.
- Remediation automation. Finding vulnerabilities is solved (mostly). Fixing them is the bottleneck. We're working on AI-generated patches that developers can review and merge — turning findings into pull requests automatically.
- Continuous compliance monitoring. NIS2, DORA, and the EU AI Act are creating massive compliance overhead. Mapping security findings to regulatory requirements in real time eliminates the audit scramble.
If you're building in the AI security space, here's what I'd tell you:
- Multi-model beats single-model. Every time. The diversity of reasoning approaches catches what any individual model misses.
- Trust is your product. False positives destroy trust faster than missed vulnerabilities. Optimize for precision before recall.
- Meet developers where they are. CLI tools, IDE plugins, CI/CD integrations, Slack bots. Nobody opens a separate security dashboard voluntarily.
- Compliance is a feature, not a product. Regulations create urgency. Use that urgency to drive adoption of tools that provide genuine security value.
- Start scanning real targets immediately. Synthetic benchmarks are meaningless. Real-world applications are messy, inconsistent, and surprising in ways that matter.
The autonomous pentesting space is still early. The tools are getting better fast, but they're not replacing human security researchers anytime soon. The sweet spot is augmentation — AI handles the breadth, humans handle the depth.
Build for that.
KENSAI is an autonomous security scanning platform. Run a free scan and see what we find.