Build a multi-agent security auditing system where a red-team attacker and blue-team defender iterate on vulnerabilities, and an arbiter validates fixes with tool-driven checks. Output is a professional security report with evidence, patches, and verification results.
Scope (MVP for hackathon):
- Target Python codebases first (FastAPI/Flask-style apps and simple libraries).
- Support a curated set of high-signal vulnerability classes: injection, broken authn/authz checks, insecure deserialization, path traversal, SSRF, unsafe subprocess use, secret leakage, weak crypto usage.
- Provide CLI-driven workflow: scan -> fix -> verify -> report.
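To make the vulnerability classes concrete, here is the kind of injection pattern the red team hunts for, next to the minimal parameterized-query patch the blue team would propose (an illustrative sketch, not code from the demo app):

```python
import sqlite3

def get_user_vulnerable(conn, username):
    # CWE-89: user input interpolated directly into SQL (red team target)
    return conn.execute(
        f"SELECT id FROM users WHERE name = '{username}'"
    ).fetchone()

def get_user_fixed(conn, username):
    # Minimal patch: parameterized query, leaving surrounding code untouched
    return conn.execute(
        "SELECT id FROM users WHERE name = ?", (username,)
    ).fetchone()
```

A payload like `alice' OR '1'='1` returns a row from the vulnerable version but nothing from the fixed one, which is exactly the pre-fix/post-fix behavior the demo tests should capture.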
Architecture:
- CLI (Typer + Rich) orchestrates the pipeline.
- Agents (OpenAI Agents SDK) implement roles with strict JSON contracts.
- Shared state is a JSON run bundle stored on disk for reproducibility.
- Verification uses local tools (ruff, bandit, pytest) to ground the arbiter.
Directory layout (initial):
src/audit/          core package
  agents/           role definitions + system prompts
  flows/            orchestrators (scan/fix/verify/report)
  tools/            local tooling wrappers (ruff/bandit/pytest, patch apply)
  contracts/        JSON schemas + data models
examples/vuln_app/  intentionally vulnerable demo
tests/              unit + integration tests
docs/               supporting docs
Agent roles:
Red Team (Attacker)
- Finds vulnerabilities with evidence: file path, line range, code snippet, CWE tag, severity, exploit rationale.
- Outputs a structured finding list + proposed test/verification steps.
Blue Team (Defender)
- Proposes minimal patches with rationale.
- Outputs unified diff + updated finding status.
Arbiter (Security Auditor)
- Verifies fixes using tools and the red team’s test plan.
- Accepts/rejects with explicit reasons and remaining risk.
Reporter
- Generates REPORT.md with summary, findings, patches, and verification logs.
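Report assembly can be a plain string template over the run bundle; a minimal sketch over finding dicts (field names mirror the Finding contract below):

```python
def render_report(findings: list[dict]) -> str:
    # Assemble REPORT.md from structured findings; patch diffs and
    # verification logs would be appended per finding in the full version
    lines = [
        "# Security Audit Report",
        "",
        "## Summary",
        f"- Total findings: {len(findings)}",
        f"- Fixed: {sum(1 for f in findings if f['status'] == 'fixed')}",
        "",
        "## Findings",
    ]
    for f in findings:
        lines += [
            f"### {f['id']}: {f['impact']} ({f['cwe']}, {f['severity']})",
            f"- Location: `{f['file']}:{f['line']}`",
            f"- Status: {f['status']}",
            "",
        ]
    return "\n".join(lines)
```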
Pipeline:
- Scan: Red Team analyzes repo and emits findings.
- Fix: Blue Team proposes minimal patches for accepted findings.
- Reattack (optional): Red Team re-scans post-fix code to see what still breaks.
- Verify: Arbiter runs local checks, evaluates reattack output, and accepts/rejects.
- Loop: If rejected, send back to Defender with arbiter feedback (max 2 iterations).
- Report: Reporter generates final report and scorecard.
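The fix/verify loop with its two-iteration cap can be sketched as follows; `blue_fix` and `arbiter_verify` are hypothetical stand-ins for the agent and tool calls:

```python
MAX_ITERATIONS = 2  # per-finding cap on defender retries

def run_pipeline(findings, blue_fix, arbiter_verify):
    # blue_fix(finding, feedback) -> patch
    # arbiter_verify(finding, patch) -> (accepted, feedback)
    decision_log = []
    for finding in findings:
        feedback = None
        for attempt in range(MAX_ITERATIONS):
            patch = blue_fix(finding, feedback)
            accepted, feedback = arbiter_verify(finding, patch)
            decision_log.append({"finding": finding["id"],
                                 "attempt": attempt + 1,
                                 "accepted": accepted,
                                 "feedback": feedback})
            if accepted:
                finding["status"] = "fixed"
                break
        else:
            # cap exhausted without an accepted patch
            finding["status"] = "rejected"
    return decision_log
```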
Key rule: the arbiter’s decision must be tool-grounded (tests/scanners), not just model opinion.
Core object: Finding
{
"id": "F-001",
"cwe": "CWE-89",
"severity": "high",
"file": "src/app/db.py",
"line": 42,
"evidence": "...code snippet...",
"impact": "SQL injection via user input",
"fix_plan": "parameterized query",
"status": "open|fixed|rejected"
}
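This contract maps naturally to a typed model; a minimal dataclass sketch (the real agents' JSON contracts would likely use pydantic for schema validation):

```python
from dataclasses import dataclass, asdict

VALID_STATUSES = {"open", "fixed", "rejected"}

@dataclass
class Finding:
    id: str
    cwe: str
    severity: str
    file: str
    line: int
    evidence: str
    impact: str
    fix_plan: str
    status: str = "open"

    def __post_init__(self):
        # Reject model output that strays from the status enum
        if self.status not in VALID_STATUSES:
            raise ValueError(f"invalid status: {self.status!r}")

    def to_json_dict(self) -> dict:
        return asdict(self)
```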
Run bundle (runs/<run_id>.json) includes:
- findings[]
- patches[] (unified diffs)
- verification[] (tool outputs + pass/fail)
- decision_log[] (arbiter decisions)
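Persisting and reloading the bundle is straightforward; a minimal sketch of the disk round-trip that makes runs reproducible:

```python
import json
from pathlib import Path

def save_bundle(run_id: str, bundle: dict, runs_dir: str = "runs") -> Path:
    # Persist full run state so any audit can be replayed and inspected
    path = Path(runs_dir) / f"{run_id}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(bundle, indent=2))
    return path

def load_bundle(run_id: str, runs_dir: str = "runs") -> dict:
    return json.loads((Path(runs_dir) / f"{run_id}.json").read_text())
```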
Verification tools:
- ruff: lint + quick static checks
- bandit: security-specific static analysis
- pytest: regression checks on example app
Arbiter must report:
- Which tools ran, exit codes, and failing rules
- Which findings were verified as fixed
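A thin subprocess wrapper is enough to capture this evidence (tool name, exit code, raw output) so the arbiter can cite it verbatim; a minimal sketch:

```python
import subprocess
import sys
from dataclasses import dataclass

@dataclass
class ToolResult:
    tool: str
    exit_code: int
    output: str

    @property
    def passed(self) -> bool:
        # ruff, bandit, and pytest all signal failure via non-zero exit
        return self.exit_code == 0

def run_tool(argv: list[str], cwd: str = ".") -> ToolResult:
    # Capture exit code and combined stdout/stderr for the decision log
    proc = subprocess.run(argv, cwd=cwd, capture_output=True, text=True)
    return ToolResult(tool=argv[0], exit_code=proc.returncode,
                      output=proc.stdout + proc.stderr)
```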
CLI usage:
# scan
python -m audit scan <path>
# fix (writes patch file)
python -m audit fix <path> --autofix
# verify
python -m audit verify <path>
# full pipeline
python -m audit run <path>
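These commands can dispatch through a small entry point; a dependency-free sketch using stdlib argparse (the real CLI would use Typer + Rich, as stated in the architecture):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="audit")
    sub = parser.add_subparsers(dest="command", required=True)
    # scan / verify / run share the same positional target path
    for name in ("scan", "verify", "run"):
        cmd = sub.add_parser(name)
        cmd.add_argument("path")
    # fix additionally takes the --autofix flag
    fix = sub.add_parser("fix")
    fix.add_argument("path")
    fix.add_argument("--autofix", action="store_true")
    return parser
```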
examples/vuln_app/ will include:
- A minimal API with seeded vulnerabilities
- Basic tests that fail pre-fix and pass post-fix
- A known “answer key” of expected findings
Evaluation:
- Baseline: single-agent audit pass (no loop).
- Scorecard:
- confirmed findings
- false positives
- fixes applied
- verification pass rate
- time to fix
- baseline delta and overlap
We will show a simple comparison table in the report.
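A sketch of how the scorecard could be computed against the answer key, assuming findings are matched on (file, CWE) pairs; the pass-rate formula here is a deliberate simplification:

```python
def scorecard(findings: list[dict], answer_key: set[tuple[str, str]]) -> dict:
    # Match findings to the answer key on (file, CWE); anything
    # reported outside the key counts as a false positive
    reported = {(f["file"], f["cwe"]) for f in findings}
    confirmed = reported & answer_key
    fixed = sum(1 for f in findings if f["status"] == "fixed")
    return {
        "confirmed": len(confirmed),
        "false_positives": len(reported - answer_key),
        "missed": len(answer_key - reported),
        "fixes_applied": fixed,
        # simplified: share of reported findings that reached "fixed"
        "verification_pass_rate": fixed / len(findings) if findings else 0.0,
    }
```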
Risks and mitigations:
- Hallucinated findings: require evidence snippet + file/line + tool corroboration when possible.
- Over-patching: blue team must minimize diffs; arbiter rejects invasive changes.
- Time: keep scope to Python + curated vulnerabilities; add languages later only if time remains.
Milestones:
- Scaffold repo + CLI skeleton
- Implement scan/fix/verify flows
- Build demo app + tests
- Add report generator + metrics
- Record 2-min demo