# Cyber Threat Defense World Modeling System
Bandjacks is a comprehensive cyber threat intelligence (CTI) system that:
- Extracts MITRE ATT&CK techniques from threat reports in 12-40 seconds
- Builds a knowledge graph of threat actors, techniques, and defenses
- Generates STIX 2.1 compliant bundles with full provenance tracking
- Integrates D3FEND ontology for defensive recommendations
- Provides vector search and graph analytics capabilities
- Computes co-occurrence analytics to identify technique patterns
- Achieves 94% faster extraction than earlier versions via LLM response caching
- Includes a Next.js frontend for report review and analytics visualization
| Guide | Description |
|---|---|
| Quick Start | Get running in 5 minutes |
| Full Setup | Complete environment setup |
| CLI Usage | Command-line interface guide |
| API Reference | REST API documentation |
| Co-occurrence Analytics | Analytics documentation |
| AttackFlow Generation | Flow generation guide |
| Review System | Human-in-the-loop review |
- In-memory cache of all MITRE ATT&CK techniques loaded at startup
- O(1) lookups by `external_id` (e.g., `T1557`) for instant name resolution
- 1376 techniques cached with full metadata (name, description, tactics, platforms)
- Consistent naming ensures the review UI always shows human-readable technique names
- In-memory cache of all intrusion sets and threat actors
- Fast lookups for actor name resolution and search
- Supports alias matching and fuzzy search
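As a mental model for these caches, here is a minimal sketch of the load-once, O(1)-lookup pattern (class and method names are illustrative, not the actual `technique_cache.py` API):

```python
from dataclasses import dataclass, field

@dataclass
class Technique:
    external_id: str  # e.g., "T1557"
    name: str
    description: str = ""
    tactics: list[str] = field(default_factory=list)
    platforms: list[str] = field(default_factory=list)

class TechniqueCache:
    """Loads all ATT&CK techniques once, then serves O(1) lookups."""

    def __init__(self) -> None:
        self._by_external_id: dict[str, Technique] = {}

    def load(self, techniques: list[Technique]) -> None:
        # Called once at startup; dict lookups are O(1) afterwards.
        self._by_external_id = {t.external_id: t for t in techniques}

    def resolve_name(self, external_id: str) -> str | None:
        tech = self._by_external_id.get(external_id)
        return tech.name if tech else None
```

Usage: `cache.resolve_name("T1557")` returns the human-readable technique name the review UI displays.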
- Python 3.11+
- Neo4j 5.x (graph database)
- OpenSearch 2.x (vector store)
- Redis (optional, for caching)
- Node.js 18+ (for frontend)
- API keys for LLM access (Gemini or OpenAI)
```bash
# Clone the repository
git clone https://github.com/yourusername/bandjacks.git
cd bandjacks
# Install Python dependencies with uv (recommended)
uv sync
# Or with pip
pip install -e .
# Install frontend dependencies
cd ui && npm install && cd ..
```

**IMPORTANT:** You must configure environment variables before starting the application. The application requires `NEO4J_PASSWORD` to be set.
Create a `.env` file in the project root:

```bash
# Copy the sample file
cp infra/env.sample .env
# Edit .env and set your actual passwords
nano .env
```

Required configuration in `.env`:

```bash
# Neo4j Configuration (REQUIRED)
NEO4J_URI=bolt://localhost:7687
NEO4J_USER=neo4j
NEO4J_PASSWORD=your-actual-neo4j-password # MUST BE SET - no default provided
# OpenSearch Configuration
OPENSEARCH_URL=http://localhost:9200
OPENSEARCH_USER=admin
OPENSEARCH_PASSWORD=your-opensearch-password # Optional if security is disabled
# LLM Configuration (Required for LLM features)
PRIMARY_LLM=gemini
GOOGLE_API_KEY=your-gemini-api-key
# Optional: OpenAI as fallback
OPENAI_API_KEY=your-openai-api-key
# ATT&CK Configuration
ATTACK_INDEX_URL=https://raw.githubusercontent.com/mitre-attack/attack-stix-data/master/index.json
ATTACK_COLLECTION=enterprise-attack
ATTACK_VERSION=latest
# Redis (optional, for caching)
REDIS_URL=redis://localhost:6379
```

Note: The application will fail to start if `NEO4J_PASSWORD` is not set. See Environment Variables Fix for details.
```bash
# Start the FastAPI backend server
uv run uvicorn bandjacks.services.api.main:app --reload --port 8000
# In another terminal, start the Next.js frontend
cd ui && npm run dev
# Access the applications
open http://localhost:8000/docs # API documentation
open http://localhost:3000 # Frontend UI
```

Bandjacks includes a comprehensive CLI for threat intelligence operations:

```bash
# Show all available commands
uv run python -m bandjacks.cli.main --help
```

Note: The CLI requires environment variables to be set (`NEO4J_PASSWORD`, etc.). Run from the project root where `.env` is located.

```bash
# Search for threat intelligence
uv run python -m bandjacks.cli.main query search "ransomware encryption techniques" --top-k 10
# Explore graph relationships
uv run python -m bandjacks.cli.main query graph "attack-pattern--abc123" --depth 2

# Show review queue
uv run python -m bandjacks.cli.main review queue --status pending --limit 20
# Approve a candidate
uv run python -m bandjacks.cli.main review approve "candidate-123" --reviewer analyst-1
# Reject with reason
uv run python -m bandjacks.cli.main review reject "candidate-456" --reviewer analyst-1 --reason "False positive"

# Extract CTI from a document
uv run python -m bandjacks.cli.main extract document ./report.pdf --confidence-threshold 80 --show-evidence
```

Note: Analytics commands require `AttackEpisode` data in Neo4j to return results.

```bash
# Show top co-occurring technique pairs
uv run python -m bandjacks.cli.main analytics top-cooccurrence --limit 25 --min-episode-size 2
# Compute conditional co-occurrence P(B|A) for a technique
uv run python -m bandjacks.cli.main analytics conditional "attack-pattern--abc123" --limit 25
# Analyze a specific threat actor
uv run python -m bandjacks.cli.main analytics actor "intrusion-set--xyz789" --metric npmi
# Extract technique bundles
uv run python -m bandjacks.cli.main analytics bundles --min-support 3 --min-size 3 --max-size 5 --format json --output bundles.json
# Global co-occurrence metrics
uv run python -m bandjacks.cli.main analytics global --min-support 2 --limit 50 --format csv --output pairs.csv

# Process a directory of reports with analytics
uv run python -m bandjacks.cli.main workflow process-reports ./reports/ --workers 3 --analyze --export-dir ./results/
# Bulk export all analytics data
uv run python -m bandjacks.cli.main workflow bulk-export --export-dir ./analytics_export/

# Check system health
uv run python -m bandjacks.cli.main admin health
# View cache statistics
uv run python -m bandjacks.cli.main admin cache-stats
# Clear cache
uv run python -m bandjacks.cli.main admin cache-clear --pattern "search:*"
# Optimize database
uv run python -m bandjacks.cli.main admin optimize
```

The Next.js frontend provides a modern interface for working with the system.
- Report List: View all ingested reports with status and technique counts
- New Report (`/reports/new`): Upload PDF/TXT files or paste report content
- Report Detail (`/reports/[id]`): View extracted techniques, entities, and evidence
- Review Interface (`/reports/[id]/review`): Human-in-the-loop review workflow
Note: These pages require `AttackEpisode` data in Neo4j. Process reports through the extraction pipeline first, or use `POST /v1/flows/build` to generate episodes from intrusion set data.
- Hub Page: Overview with episode/technique/actor counts
- Top Pairs (`/pairs`): Co-occurring technique pairs with NPMI/Lift metrics
- Conditional (`/conditional`): P(B|A) conditional probabilities
- Bundles (`/bundles`): Frequently co-occurring technique bundles
- Actors (`/actors`): Actor-specific technique patterns
- Bridging (`/bridging`): Techniques used across multiple actors
- Real-time health status of all components (Neo4j, OpenSearch, Redis)
- Cache statistics and memory usage
- Kubernetes-compatible health endpoints
```bash
cd ui
npm run dev # Development mode with hot reload
npm run build # Production build
npm run start # Start production server
# Ensure backend is running
# API_URL defaults to http://localhost:8000/v1
```

First, load the MITRE ATT&CK framework into your knowledge graph:

```bash
# Load the latest enterprise ATT&CK release
curl -X POST "http://localhost:8000/v1/stix/load/attack" \
-H "Content-Type: application/json" \
-d '{
"collection": "enterprise-attack",
"version": "latest",
"adm_strict": false
}'
```

Extract MITRE ATT&CK techniques from threat intelligence reports:

```python
import httpx
import time
# For small reports (<5KB) - synchronous processing
response = httpx.post(
"http://localhost:8000/v1/reports/ingest",
json={
"content": "APT29 used spearphishing emails with malicious attachments...",
"title": "APT29 Campaign Analysis",
"config": {
"use_optimized_extractor": True,
"span_score_threshold": 0.7,
"top_k": 5
}
}
)
result = response.json()
print(f"Extracted {len(result['extraction']['techniques'])} techniques")
# For large reports (>5KB) - asynchronous processing
response = httpx.post(
"http://localhost:8000/v1/reports/ingest_async",
json={
"content": large_report_text,
"title": "Large Report Analysis"
}
)
job_id = response.json()["job_id"]
# Check job status
status = httpx.get(f"http://localhost:8000/v1/reports/jobs/{job_id}/status")
while status.json()["status"] == "processing":
    time.sleep(2)
    status = httpx.get(f"http://localhost:8000/v1/reports/jobs/{job_id}/status")
# Get results from completed job
result = status.json()["result"]
print(f"Extracted {result['techniques_count']} techniques in {result['elapsed_time']} seconds")For programmatic access without the API:
from bandjacks.llm.extraction_pipeline import run_extraction_pipeline
# Configure extraction
config = {
"use_optimized_extractor": True, # Use optimized pipeline
"span_score_threshold": 0.7, # Minimum span confidence
"max_spans": 20,
"top_k": 5,
"chunk_size": 2000, # For large documents
"max_chunks": 100
}
# Run extraction pipeline
result = run_extraction_pipeline(
report_text,
config,
source_id="report_123",
neo4j_config=neo4j_config
)
# Access results
techniques = result["techniques"] # Dict of technique_id -> details
bundle = result.get("bundle") # STIX 2.1 bundle if configured
entities = result.get("entities") # Extracted entities
# Example: Print extracted techniques
for tech_id, info in techniques.items():
print(f"{tech_id}: {info['name']}")
print(f" Confidence: {info['confidence']}%")
print(f" Evidence: {info['evidence']}")The Bandjacks extraction pipeline uses a multi-agent architecture to extract structured threat intelligence:
- Detects text spans containing threat behaviors using pattern matching
- Identifies explicit technique IDs (`T1566.001`) and behavioral patterns
- Scores spans by confidence and deduplicates overlapping detections
- Processes documents in chunks for scalability
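As a rough illustration of the explicit-ID part of span detection (a sketch only; the real SpanFinder also scores behavioral patterns, and these helper names are hypothetical):

```python
import re

# ATT&CK technique IDs: "T" + 4 digits, with an optional .NNN sub-technique
TECHNIQUE_ID = re.compile(r"\bT\d{4}(?:\.\d{3})?\b")

def find_id_spans(text: str, window: int = 120) -> list[dict]:
    """Return text spans surrounding explicit technique IDs."""
    spans = []
    for match in TECHNIQUE_ID.finditer(text):
        start = max(0, match.start() - window)
        end = min(len(text), match.end() + window)
        spans.append({
            "technique_id": match.group(),
            "text": text[start:end],
            # Explicit IDs are high-confidence by construction
            "score": 1.0,
        })
    return spans
```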
- Uses vector search (OpenSearch KNN) to find candidate techniques
- Batch processes all spans in a single LLM call for efficiency
- Extracts ALL relevant techniques per span (not just best match)
- Provides confidence scores and evidence for each mapping
- Merges duplicate techniques found across multiple spans
- Aggregates evidence from different text locations
- Tracks line references for provenance
- Produces final technique list with consolidated confidence scores
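A hedged sketch of that merge step, assuming per-span results shaped as `{technique_id: {"confidence": ..., "evidence": [...]}}` (the shape and helper name are illustrative):

```python
def consolidate(span_results: list[dict]) -> dict:
    """Merge per-span technique hits into one record per technique."""
    merged: dict[str, dict] = {}
    for result in span_results:
        for tech_id, info in result.items():
            if tech_id not in merged:
                merged[tech_id] = {"confidence": info["confidence"],
                                   "evidence": list(info["evidence"])}
            else:
                entry = merged[tech_id]
                # Keep the strongest confidence seen across spans
                entry["confidence"] = max(entry["confidence"], info["confidence"])
                # Aggregate evidence from different text locations
                entry["evidence"].extend(info["evidence"])
    return merged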
- Extracts threat actors, malware, tools, and campaigns
- Uses few-shot prompting with examples for consistency
- Tracks entity mentions with line references
- Identifies aliases and coreferences
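Few-shot prompting here simply means the extraction prompt embeds worked examples; a minimal sketch (prompt text and helper name are illustrative, not the shipped `entity_extractor.py` prompt):

```python
FEW_SHOT_EXAMPLES = """\
Text: "APT29 deployed Mimikatz to harvest credentials."
Entities: [{"name": "APT29", "type": "threat-actor"},
           {"name": "Mimikatz", "type": "tool"}]
"""

def build_entity_prompt(report_text: str) -> str:
    # The worked examples anchor the output format so responses stay consistent.
    return (
        "Extract threat actors, malware, tools, and campaigns as JSON.\n\n"
        f"Examples:\n{FEW_SHOT_EXAMPLES}\n"
        f'Text: "{report_text}"\nEntities:'
    )
```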
- Analyzes temporal markers ("first", "then", "after")
- Infers causal relationships from narrative
- Creates STIX Attack Flow objects with probabilistic edges
- Falls back to co-occurrence modeling when sequence unclear
- Smart Chunking: Documents split into 2KB chunks with overlap
- Batch Processing: Multiple operations combined in single LLM calls
- Parallel Processing: Chunks processed concurrently
- Response Caching: LLM responses cached for 15 minutes
- Early Termination: High-confidence extractions skip verification
- TechniqueCache: All ATT&CK techniques loaded at startup for O(1) lookups
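The response cache is keyed on the request itself; a minimal sketch of the 15-minute TTL pattern, assuming in-process storage (the shipped `bandjacks/llm/cache.py` may differ):

```python
import hashlib
import time

TTL_SECONDS = 15 * 60
_cache: dict[str, tuple[float, str]] = {}

def cache_key(model: str, prompt: str) -> str:
    return hashlib.sha256(f"{model}:{prompt}".encode()).hexdigest()

def cached_call(model: str, prompt: str, call_llm) -> str:
    key = cache_key(model, prompt)
    hit = _cache.get(key)
    if hit and time.monotonic() - hit[0] < TTL_SECONDS:
        return hit[1]                    # cache hit: skip the LLM call
    response = call_llm(model, prompt)   # cache miss: call and store
    _cache[key] = (time.monotonic(), response)
    return response
```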
| Document Size | Processing Time | Techniques Extracted |
|---|---|---|
| Small (<5KB) | 10-20 seconds | 5-10 techniques |
| Medium (5-15KB) | 20-40 seconds | 10-15 techniques |
| Large (>15KB) | 30-60 seconds | 15-25 techniques |
Bandjacks provides analytics for understanding technique relationships.
Note: Analytics require `AttackEpisode` and `AttackAction` data in Neo4j. These are created when:

- Reports are processed through the extraction pipeline
- Attack flows are built via `/v1/flows/build`
- STIX bundles with attack episodes are ingested
If no episodes exist, analytics will return empty results.
Compute which techniques frequently appear together across all attack episodes:

```python
# Via API
response = httpx.post(
"http://localhost:8000/v1/analytics/cooccurrence/global",
json={"min_support": 2, "min_episodes_per_pair": 2, "limit": 50}
)
for pair in response.json()["pairs"]:
print(f"{pair['name_a']} + {pair['name_b']}: NPMI={pair['npmi']:.3f}")Calculate P(B|A) - given technique A was used, what's the probability of technique B:
response = httpx.get(
"http://localhost:8000/v1/analytics/cooccurrence/conditional",
params={"technique_id": "attack-pattern--abc123", "limit": 25}
)
```

Identify frequently co-occurring technique bundles (3-5 techniques):

```python
response = httpx.post(
"http://localhost:8000/v1/analytics/cooccurrence/bundles",
json={"min_support": 3, "min_size": 3, "max_size": 5}
)
```

Analyze technique patterns for specific threat actors:

```python
response = httpx.post(
"http://localhost:8000/v1/analytics/cooccurrence/actor",
json={"intrusion_set_id": "intrusion-set--xyz789", "min_support": 1}
)
```
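For intuition, the NPMI, lift, and P(B|A) values these endpoints return can be derived from simple episode counts; a worked sketch under the standard definitions (not the server's exact implementation):

```python
import math

def pair_metrics(n_a: int, n_b: int, n_ab: int, n_total: int) -> dict:
    """Co-occurrence metrics from episode counts.

    n_a, n_b -- episodes containing technique A / technique B
    n_ab     -- episodes containing both
    n_total  -- all episodes
    """
    p_a, p_b, p_ab = n_a / n_total, n_b / n_total, n_ab / n_total
    pmi = math.log(p_ab / (p_a * p_b))
    return {
        "lift": p_ab / (p_a * p_b),     # >1 means A and B co-occur more than chance
        "npmi": pmi / -math.log(p_ab),  # PMI normalized into [-1, 1]
        "p_b_given_a": n_ab / n_a,      # conditional probability P(B|A)
    }

# e.g., pair_metrics(40, 30, 20, 200) -> lift ≈ 3.33, npmi ≈ 0.52, P(B|A) = 0.5
```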
Bandjacks includes a comprehensive review system for validating extracted intelligence. The review system presents all extracted items in a single interface:

```
// Review workflow
1. Upload/ingest report → Extraction pipeline runs
2. Navigate to /reports/{id}/review
3. Review extracted items across three tabs:
- Entities (threat actors, malware, tools)
- Techniques (ATT&CK mappings with evidence)
- Attack Flow (sequenced steps)
4. Take actions on each item:
- Approve: Accept as correct
- Reject: Mark as incorrect
- Edit: Modify details (name, confidence, etc.)
5. Submit all decisions atomically
```

- Evidence Links: Direct links to source text with line numbers
- Confidence Adjustment: Modify confidence scores based on analyst knowledge
- Bulk Operations: Select multiple items for batch approve/reject
- Keyboard Shortcuts: A (approve), R (reject), E (edit), Space (next)
- Progress Tracking: Visual indicators of review completion
- Filtering: Filter by type, confidence level, or status
```python
# Submit review decisions
response = httpx.post(
f"http://localhost:8000/v1/reports/{report_id}/unified-review",
json={
"decisions": [
{
"item_id": "technique-0",
"action": "approve",
"confidence_adjustment": 5,
"notes": "Confirmed via external CTI"
},
{
"item_id": "entity-malware-1",
"action": "edit",
"edited_value": {
"name": "Corrected Malware Name",
"confidence": 95
}
}
],
"global_notes": "Review completed by analyst-1"
}
)
# Review creates:
# - Approved entities as Neo4j nodes
# - Technique-to-report relationships
# - Audit trail of decisions
```

Search for ATT&CK techniques using natural language:

```python
# Vector search for similar techniques
response = httpx.post(
"http://localhost:8000/v1/search/ttx",
json={
"query": "ransomware that encrypts files and demands payment",
"top_k": 5
}
)
techniques = response.json()["results"]
for tech in techniques:
print(f"{tech['external_id']}: {tech['name']} (score: {tech['score']:.2f})")Query the knowledge graph for relationships:
# Get all techniques used by a specific group
response = httpx.get(
"http://localhost:8000/v1/graph/group/G0016/techniques"
)
# Get defensive techniques for an attack
response = httpx.get(
"http://localhost:8000/v1/defense/technique/T1566.001"
)
```

Create co-occurrence models that show how threat actors use techniques together:

```python
# Generate flow for a specific intrusion set (e.g., APT29)
response = httpx.post(
"http://localhost:8000/v1/flows/build",
json={
"intrusion_set_id": "intrusion-set--899ce53f-13a0-479b-a0e4-67d46e241542"
}
)
flow = response.json()
print(f"Generated flow '{flow['name']}' with {len(flow['steps'])} techniques")
print(f"Co-occurrence edges: {len(flow['edges'])}")Bulk Generation: Generate flows for all threat actors with techniques:
# Run the bulk generation script
uv run python scripts/build_intrusion_flows_simple.py
# Monitor progress - creates flows for 165+ intrusion sets
# Handles rate limiting automatically
# Skips existing flows to avoid duplicates
```

AttackFlow models use co-occurrence rather than sequential ordering, since intrusion sets don't have inherent sequence information. Techniques are connected by:
- Intra-tactic edges: Between techniques in the same kill chain tactic
- Cross-tactic edges: Between techniques across adjacent tactics
- Hub-spoke patterns: For large technique sets to avoid edge explosion
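A hedged sketch of these connection rules, assuming each technique carries an ordered tactic index (names and thresholds are illustrative):

```python
def build_edges(techniques: list[dict], hub_threshold: int = 8) -> list[tuple]:
    """techniques: [{"id": "...", "tactic_index": int}, ...]."""
    edges: list[tuple] = []
    by_tactic: dict[int, list[dict]] = {}
    for tech in techniques:
        by_tactic.setdefault(tech["tactic_index"], []).append(tech)

    for idx, group in sorted(by_tactic.items()):
        if len(group) > hub_threshold:
            # Hub-spoke: connect everything to one hub to avoid edge explosion
            hub = group[0]
            edges += [(hub["id"], t["id"]) for t in group[1:]]
        else:
            # Intra-tactic: pairwise edges within the same kill-chain tactic
            edges += [(a["id"], b["id"]) for i, a in enumerate(group)
                      for b in group[i + 1:]]
        # Cross-tactic: link into the next adjacent tactic
        nxt = by_tactic.get(idx + 1)
        if nxt:
            edges.append((group[0]["id"], nxt[0]["id"]))
    return edges
```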
See the AttackFlow Generation Guide for detailed usage.
The extraction pipeline supports multiple input formats:
- Plain Text - Direct text content
- Markdown - Formatted markdown documents
- PDF - Via pdfplumber extraction
- HTML - Via BeautifulSoup parsing
- JSON - Structured data extraction
```python
import asyncio

# run_agentic_v2_async is the async entry point of the core extraction
# agents (see bandjacks/llm/agents_v2.py)

# Direct text extraction
plaintext_report = """
The threat actors used spearphishing emails with malicious attachments.
After gaining access, they deployed Mimikatz to harvest credentials and
used RDP for lateral movement across the network.
"""
result = asyncio.run(run_agentic_v2_async(plaintext_report, {
"cache_llm_responses": True,
"single_pass_threshold": 500
}))
```

```python
# Markdown document extraction
markdown_report = """
# APT Campaign Analysis
## Attack Methods
- **Initial Access**: Spearphishing with malicious Office documents
- **Execution**: PowerShell scripts and scheduled tasks
- **Persistence**: Registry modifications and service installation
## Tools Used
| Tool | Purpose |
|------|---------|
| Mimikatz | Credential dumping |
| PsExec | Remote execution |
| Cobalt Strike | C2 communications |
"""
result = run_extraction_pipeline(markdown_report, {
"use_optimized_extractor": True,
"span_score_threshold": 0.7
}, source_id="markdown_report")
```

```python
import pdfplumber
from bandjacks.llm.extraction_pipeline import run_extraction_pipeline
# Read PDF with pdfplumber (recommended)
with pdfplumber.open("threat_report.pdf") as pdf:
text = ""
for page in pdf.pages:
page_text = page.extract_text()
if page_text:
text += page_text + "\n"
# Extract techniques using extraction pipeline
result = run_extraction_pipeline(text, {
"use_optimized_extractor": True,
"span_score_threshold": 0.7,
"chunk_size": 2000
}, source_id="threat_report")
print(f"Found {len(result['techniques'])} techniques")from pathlib import Path
import json
reports_dir = Path("./reports")
results = []
for pdf_file in reports_dir.glob("*.pdf"):
    # Extract text and techniques
    # ... (see above)
    results.append({
        "file": pdf_file.name,
        "techniques": list(result["techniques"].keys()),
        "count": len(result["techniques"])
    })
# Save summary
with open("extraction_summary.json", "w") as f:
    json.dump(results, f, indent=2)
```

```python
# Generate attack flow from extracted techniques
response = httpx.post(
"http://localhost:8000/v1/flows/build",
json={
"source_id": "report-123",
"technique_ids": ["T1566.001", "T1059.001", "T1003.001"]
}
)
flow = response.json()
print(f"Generated flow with {len(flow['steps'])} steps")Run the test suite to verify your installation:
# Run all tests
uv run pytest
# Test extraction pipeline
python tests/test_optimized_extraction.py
# Test graph integration
python tests/test_graph_upsert.py
# Test STIX validation
python tests/test_bundle_validation.py
# Run frontend tests
cd ui && npm test
```

**Core**

- `POST /v1/stix/load/attack` - Load MITRE ATT&CK data
- `POST /v1/reports/ingest` - Synchronous report ingestion (<5KB)
- `POST /v1/reports/ingest_async` - Asynchronous report ingestion (>5KB)
- `POST /v1/reports/ingest/upload` - Upload PDF/TXT files
- `GET /v1/reports/jobs/{id}/status` - Check job status
- `POST /v1/reports/{id}/unified-review` - Submit review decisions
- `POST /v1/search/ttx` - Search for techniques
- `GET /v1/graph/technique/{id}` - Get technique details
**Flows**

- `POST /v1/flows/build` - Generate AttackFlow co-occurrence models
- `GET /v1/flows/{flow_id}` - Retrieve specific AttackFlow details
- `POST /v1/flows/search` - Search for similar attack flows
**Analytics**

- `GET /v1/analytics/cooccurrence/global` - Global co-occurrence metrics
- `GET /v1/analytics/cooccurrence/conditional` - Conditional probabilities
- `GET /v1/analytics/cooccurrence/bundles` - Technique bundles
- `GET /v1/analytics/cooccurrence/actor` - Actor-specific patterns
- `GET /v1/coverage/gaps` - Technique coverage gaps
**Defense & Detections**

- `GET /v1/defense/technique/{id}` - Get defensive recommendations
- `GET /v1/detections/technique/{id}` - Detection strategies
- `POST /v1/sigma/validate` - Validate Sigma rules
**Health & Admin**

- `GET /health` - Basic health check
- `GET /health/live` - Kubernetes liveness probe
- `GET /health/ready` - Kubernetes readiness probe
- `GET /health/components/{component}` - Individual component health
- `GET /v1/cache/stats` - Get LLM cache statistics
- `POST /v1/cache/clear` - Clear LLM cache
- `GET /v1/compliance/report` - Compliance metrics
- `GET /v1/drift/status` - Drift detection status
- `GET /v1/ml-metrics/performance` - ML model metrics
**Actors & Provenance**

- `GET /v1/actors` - List threat actors
- `GET /v1/actors/{id}` - Get actor details
- `GET /v1/provenance/{object_id}` - Object provenance
- `GET /v1/provenance/{object_id}/lineage` - Full lineage chain
- `GET /v1/provenance/{object_id}/evidence` - Evidence snippets
These endpoints are fully functional but are accessed via REST API only (no frontend pages or CLI commands):
**Attack Simulation**

- `POST /v1/simulation/paths` - Simulate attack paths from starting technique/group
- `POST /v1/simulation/predict` - Predict next likely techniques given current state
- `POST /v1/simulation/whatif` - What-if analysis for defensive scenarios
- `POST /v1/simulation/scenario` - Simulate from sets of groups/software/techniques
- `GET /v1/simulation/statistics/{technique_id}` - Technique usage statistics
- `GET /v1/simulation/groups/{group_id}/patterns` - Group attack patterns
- `POST /v1/simulation/compare` - Compare multiple attack paths
**PTG Simulation**

- `POST /v1/simulate/rollout` - PTG rollout simulation
- `POST /v1/simulate/mdp` - Compute MDP optimal defense policy
- `GET /v1/simulate/models` - List available PTG models
**Drift Detection**

- `GET /v1/drift/status` - Current drift status across all metrics
- `POST /v1/drift/analyze` - Run drift analysis with custom thresholds
- `GET /v1/drift/alerts` - Get active drift alerts
- `POST /v1/drift/alerts/{alert_id}/acknowledge` - Acknowledge alert
- `GET /v1/drift/metrics/{metric_name}` - Get specific drift metric
**ML Metrics**

- `POST /v1/ml-metrics/prediction` - Record model prediction for tracking
- `POST /v1/ml-metrics/review` - Record review decision metrics
- `POST /v1/ml-metrics/coverage-gap` - Record coverage gap
- `GET /v1/ml-metrics/performance` - Get model performance metrics
- `GET /v1/ml-metrics/dashboard` - Export dashboard metrics
**Notifications**

- `GET /v1/notifications/history` - Get notification history
- `POST /v1/notifications/clear-history` - Clear notification history
- `GET /v1/notifications/config` - Get notification configuration
- `POST /v1/notifications/test` - Send test notification
**Vector Management**

- `GET /v1/vectors/status` - Vector update system status
- `GET /v1/vectors/metrics` - Detailed vector update metrics
- `POST /v1/vectors/update` - Manually trigger vector update
- `POST /v1/vectors/process-batch` - Force batch processing
- `DELETE /v1/vectors/queue` - Clear pending update queue
- `GET /v1/vectors/health` - Vector system health check
**Entity Ignorelist**

- `GET /v1/ignorelist` - Get current ignorelist status
- `POST /v1/ignorelist/add` - Add entity to ignorelist
- `DELETE /v1/ignorelist/remove` - Remove entity from ignorelist
- `POST /v1/ignorelist/reload` - Reload ignorelist from disk
**Candidate Review**

- `GET /v1/review/candidates` - List candidate attack patterns
- `POST /v1/review/candidates` - Create candidate pattern
- `GET /v1/review/candidates/{id}` - Get candidate details
- `POST /v1/review/candidates/{id}/approve` - Approve candidate
- `POST /v1/review/candidates/{id}/reject` - Reject candidate
- `GET /v1/review/candidates/{id}/similar` - Find similar patterns
- `GET /v1/review/candidates/stats/summary` - Candidate statistics
Access the full API documentation at:
- Swagger UI: http://localhost:8000/docs
- ReDoc: http://localhost:8000/redoc
- OpenAPI JSON: http://localhost:8000/openapi.json
```
bandjacks/
├── bandjacks/
│ ├── analysis/ # Graph analysis & interdiction
│ │ ├── graph_analyzer.py
│ │ └── interdiction.py
│ ├── analytics/ # Co-occurrence & clustering
│ │ ├── clustering.py
│ │ ├── cooccurrence.py
│ │ └── detection_bundles.py
│ ├── cli/ # Command-line interface
│ │ ├── main.py # CLI entry point
│ │ ├── batch_extract.py
│ │ ├── formatters.py
│ │ └── workflows.py
│ ├── config/ # Configuration files
│ │ └── entity_ignorelist.yaml
│ ├── core/ # Core utilities
│ │ ├── cache.py # Redis caching
│ │ ├── connection_pool.py
│ │ └── query_optimizer.py
│ ├── llm/ # Extraction pipeline
│ │ ├── extraction_pipeline.py
│ │ ├── agents_v2.py # Core extraction agents
│ │ ├── chunked_extractor.py
│ │ ├── optimized_chunked_extractor.py
│ │ ├── entity_extractor.py
│ │ ├── flow_builder.py
│ │ ├── cache.py # LLM response caching
│ │ └── experimental/ # Experimental features
│ ├── loaders/ # Data loading & indexing
│ │ ├── attack_catalog.py
│ │ ├── attack_upsert.py
│ │ ├── opensearch_index.py
│ │ ├── hybrid_search.py
│ │ └── sigma_loader.py
│ ├── monitoring/ # Metrics & monitoring
│ │ ├── compliance_metrics.py
│ │ ├── defense_metrics.py
│ │ ├── drift_detector.py
│ │ └── ml_metrics.py
│ ├── services/ # API & services
│ │ ├── api/ # FastAPI application
│ │ │ ├── main.py
│ │ │ ├── routes/ # API route handlers
│ │ │ └── middleware/
│ │ ├── technique_cache.py
│ │ └── actor_cache.py
│ ├── simulation/ # Attack simulation
│ │ ├── attack_simulator.py
│ │ ├── mdp_solver.py
│ │ └── ptg_rollout.py
│ └── store/ # Data stores
│ ├── report_store.py
│ ├── candidate_store.py
│ └── review_store.py
├── ui/ # Next.js frontend
│ ├── app/ # App Router pages
│ │ ├── reports/ # Report management
│ │ ├── analytics/ # Analytics dashboards
│ │ └── health/ # Health monitoring
│ ├── components/ # React components
│ └── hooks/ # Custom React hooks
├── tests/ # Test suite
├── samples/ # Sample reports
├── scripts/ # Utility scripts
└── docs/ # Documentation
```
- **Extraction Pipeline** (`bandjacks/llm/`)
  - `extraction_pipeline.py` - Main extraction orchestrator
  - `chunked_extractor.py` - Standard chunked processing
  - `optimized_chunked_extractor.py` - Advanced optimized processing
  - `agents_v2.py` - Core extraction agents (SpanFinder, Mapper, Consolidator)
  - `entity_extractor.py` - Entity recognition agent
  - `flow_builder.py` - Attack flow generation
  - `memory.py` - Shared working memory
  - `cache.py` - LLM response caching
- **Data Layer** (`bandjacks/loaders/`)
  - Neo4j property graph for relationships
  - OpenSearch for vector embeddings
  - STIX 2.1 data model
- **API Layer** (`bandjacks/services/api/`)
  - FastAPI REST endpoints
  - WebSocket support for real-time updates
  - Comprehensive OpenAPI documentation
- **Frontend** (`ui/`)
  - Next.js 15 with App Router
  - React Query for data fetching
  - Radix UI + Tailwind for components
  - ReactFlow for graph visualization
- Extraction Speed: 12-40 seconds per report (94% faster than v1)
- Small Documents: 4-8 seconds with single-pass extraction
- Caching: 87.5% speedup on repeated extractions
- Search: <300ms for vector similarity search
- Graph queries: <100ms for most traversals
The system supports multiple LLMs:
```bash
# In your .env or config
PRIMARY_LLM=gemini/gemini-2.5-flash # Recommended
# PRIMARY_LLM=gpt-4o-mini # Alternative
# PRIMARY_LLM=gpt-4-turbo # Higher quality, higher cost
```

The system uses a single high-performance async pipeline with configurable options:
```python
{
    "cache_llm_responses": True,         # Enable LLM caching (default: True)
    "single_pass_threshold": 500,        # Max words for single-pass (default: 500)
    "early_termination_confidence": 90,  # Skip verification above this (default: 90)
    "disable_discovery": False,          # Disable LLM discovery agent
    "max_spans": 20,                     # Maximum spans to process
    "span_score_threshold": 0.7,         # Minimum span quality
    "top_k": 5                           # Candidates per span
}
```

Control extraction quality:

```python
{
    "confidence_threshold": 50.0,  # Minimum confidence (0-100)
    "auto_ingest": True            # Auto-add high-confidence results
}
```

The API provides comprehensive health monitoring endpoints for operational oversight and Kubernetes deployments:

```bash
# Basic health check (always returns 200 if API is running)
curl http://localhost:8000/health
# Kubernetes liveness probe (process alive check)
curl http://localhost:8000/health/live
# Kubernetes readiness probe (full dependency checks)
curl http://localhost:8000/health/ready
# Individual component health
curl http://localhost:8000/health/components/neo4j
curl http://localhost:8000/health/components/opensearch
curl http://localhost:8000/health/components/redis
curl http://localhost:8000/health/components/caches
curl http://localhost:8000/health/components/system
```

Example health response:

```json
{
  "status": "healthy",
  "timestamp": "2025-01-28T17:43:30.184036Z",
  "version": "1.0.0",
  "components": {
    "neo4j": {
      "status": "healthy",
      "latency_ms": 5
    },
    "opensearch": {
      "status": "degraded",
      "cluster_status": "yellow",
      "indices": {
        "attack_nodes": false,
        "bandjacks_reports": true
      }
    },
    "redis": {
      "status": "healthy",
      "latency_ms": 2,
      "memory_mb": 1.69
    },
    "caches": {
      "status": "healthy",
      "technique_cache": {
        "count": 993,
        "loaded": true
      },
      "actor_cache": {
        "count": 145,
        "loaded": true
      }
    },
    "system": {
      "status": "healthy",
      "memory": {
        "available_gb": 8.84,
        "percent_used": 72.4
      },
      "disk": {
        "available_gb": 353.11,
        "percent_used": 2.9
      },
      "cpu": {
        "percent_used": 7.7
      }
    }
  }
}
```

- healthy: Component fully operational
- degraded: Partially functional (e.g., missing some indices but operational)
- unhealthy: Component failed or unreachable
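Overall status is naturally the worst of the component statuses; a minimal roll-up sketch (an assumption about the handler, not its actual code):

```python
SEVERITY = {"healthy": 0, "degraded": 1, "unhealthy": 2}

def overall_status(components: dict[str, dict]) -> str:
    """Roll component statuses up into a single system status."""
    worst = max(components.values(), key=lambda c: SEVERITY[c["status"]])
    return worst["status"]

# e.g., {"neo4j": {"status": "healthy"}, "opensearch": {"status": "degraded"}}
# -> "degraded"
```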
For Kubernetes deployments, configure probes as follows:
```yaml
livenessProbe:
  httpGet:
    path: /health/live
    port: 8000
  initialDelaySeconds: 30
  periodSeconds: 10

readinessProbe:
  httpGet:
    path: /health/ready
    port: 8000
  initialDelaySeconds: 45
  periodSeconds: 5
```

The system includes automatic LLM response caching for improved performance:

```python
# Check cache statistics
response = httpx.get("http://localhost:8000/v1/cache/stats")
stats = response.json()
print(f"Cache hit rate: {stats['hit_rate']}")
# Clear cache if needed
httpx.post("http://localhost:8000/v1/cache/clear")
```

Choose a profile based on your needs:

```python
# Fast extraction (4-15 seconds)
fast_config = {
    "single_pass_threshold": 1000,
    "max_spans": 5,
    "skip_verification": True,
    "top_k": 3
}

# Balanced (default, 12-40 seconds)
balanced_config = {
    "single_pass_threshold": 500,
    "max_spans": 10,
    "early_termination_confidence": 90,
    "top_k": 5
}

# High quality (40-120 seconds)
quality_config = {
    "single_pass_threshold": 200,
    "max_spans": 20,
    "disable_discovery": False,
    "min_quotes": 3,
    "top_k": 10
}
```

Every extracted entity includes full provenance:

```python
# Get provenance for an object
response = httpx.get(
"http://localhost:8000/v1/provenance/attack-pattern--abc123"
)
```

The system includes a review queue for improving extraction:

```python
# Get next item for review
response = httpx.get("http://localhost:8000/v1/review_queue/next")
# Submit feedback
response = httpx.post(
"http://localhost:8000/v1/feedback/extraction",
json={
"extraction_id": "ext-123",
"correct": True,
"corrections": []
}
)
```

Analyze your threat intelligence coverage:

```python
# Get coverage analysis
response = httpx.get("http://localhost:8000/v1/analytics/coverage")
coverage = response.json()
print(f"Summary: {coverage['summary']}")
for tactic in coverage['tactics']:
print(f" {tactic['tactic']}: {tactic['coverage_percentage']}%")Note: Platform coverage (
_analyze_platforms_coverage) currently returns placeholder data. Tactic and group coverage use real Neo4j queries.
The simulation module provides MDP-based attack path prediction:
```python
# Note: This feature is experimental and may require additional setup
from bandjacks.simulation.attack_simulator import AttackSimulator
from bandjacks.simulation.mdp_solver import MDPSolver
# See bandjacks/simulation/ for implementation details
```
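To make the MDP framing concrete, here is a toy value-iteration sketch (illustrative only; the real `mdp_solver.py` works over technique/defense states from the graph):

```python
def value_iteration(states, actions, transition, reward,
                    gamma: float = 0.9, eps: float = 1e-6):
    """transition(s, a) -> [(next_state, prob), ...]; reward(s, a) -> float."""
    values = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            best = max(
                reward(s, a) + gamma * sum(p * values[s2]
                                           for s2, p in transition(s, a))
                for a in actions
            )
            delta = max(delta, abs(best - values[s]))
            values[s] = best
        if delta < eps:
            # Greedy policy: the action maximizing expected discounted value
            policy = {
                s: max(actions, key=lambda a: reward(s, a) + gamma *
                       sum(p * values[s2] for s2, p in transition(s, a)))
                for s in states
            }
            return values, policy
```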
This section provides transparency about the implementation status of various features:

- Report Extraction Pipeline - LLM-based technique extraction works end-to-end
- MITRE ATT&CK Loading - Load enterprise/mobile/ICS ATT&CK data into Neo4j
- Vector Search - OpenSearch-based semantic search for techniques
- Review System - Human-in-the-loop review workflow via API and UI
- Health Monitoring - Component health checks and Kubernetes probes
- CLI Query/Admin Commands - Search, graph traversal, cache management
- Attack Flow Generation - Co-occurrence-based flow building for intrusion sets
- Attack Simulation - MDP-based path simulation via `/simulation/*` and `/simulate/*`
- Coverage Reports - JSON reports for executive, technical, tactical, and operational views
- Co-occurrence Analytics - Requires `AttackEpisode` nodes from report processing
- Actor Analytics - Requires episodes attributed to intrusion sets
- Technique Bundles - Requires sufficient episode data for pattern mining
- CLI Analytics Commands - Work but return empty if no episodes exist
These are fully implemented features accessible only via REST API:
- Attack Path Simulation - `/simulation/*` routes for path prediction and what-if analysis
- MDP Policy Solver - `/simulate/mdp` for optimal defense policy computation
- Drift Detection - `/drift/*` routes for monitoring data quality drift
- ML Metrics - `/ml-metrics/*` for tracking model performance over time
- Vector Management - `/vectors/*` for managing vector embeddings
- Entity Ignorelist - `/ignorelist/*` for filtering false positive entities
- Candidate Patterns - `/review/candidates/*` for novel technique candidates
- Notifications - `/notifications/*` for alert configuration and history
- Provenance - `/provenance/*` for extraction lineage tracking
- Compliance - `/compliance/*` for compliance metrics reporting
- PTG (Probabilistic Threat Graph) - Core logic implemented, limited testing
- Judge Integration - LLM-based sequence validation
- Attack Flow Simulator - Flow-based simulation engine
- Sequence Extractor - Extract sequences from flows
The following stub features have been removed from the API:
- Platform Coverage Analysis - Was returning hardcoded stub data
- Trend Analysis - Was returning random synthetic data
- CSV/PDF Report Export - Was returning 501; export is JSON-only now
- Gemini Sequence Inference - Was a 501 stub; use `/sequence/propose` instead
| Feature Area | Frontend UI | CLI | REST API |
|---|---|---|---|
| Report Management | ✅ | ✅ | ✅ |
| Review Workflow | ✅ | ✅ | ✅ |
| Search (TTX) | ✅ | ✅ | ✅ |
| Co-occurrence Analytics | ✅ | ✅ | ✅ |
| Coverage Analytics | ✅ | - | ✅ |
| Health Monitoring | ✅ | - | ✅ |
| Detections/Sigma | ✅ | - | ✅ |
| Attack Flows | ✅ | - | ✅ |
| Defense Overlay | ✅ | - | ✅ |
| Sequences/PTG | ✅ | - | ✅ |
| Actors | ✅ | - | ✅ |
| Attack Simulation | - | - | ✅ |
| Drift Detection | - | - | ✅ |
| ML Metrics | - | - | ✅ |
| Vector Management | - | - | ✅ |
| Entity Ignorelist | - | - | ✅ |
| Candidate Patterns | - | - | ✅ |
| Notifications | - | - | ✅ |
| Provenance | - | - | ✅ |
| Compliance | - | - | ✅ |
| Page | Status | Notes |
|---|---|---|
| `/reports` | ✅ Working | List, create, view reports |
| `/reports/[id]/review` | ✅ Working | Full review workflow |
| `/analytics/cooccurrence` | Shows KPIs if episodes exist | |
| `/analytics/cooccurrence/pairs` | Calls real API | |
| `/analytics/cooccurrence/bundles` | Calls real API | |
| `/analytics/cooccurrence/actors` | Calls real API | |
| `/health` | ✅ Working | Real-time health status |
- **OpenSearch connection failed**
  - Ensure OpenSearch is running: `curl http://localhost:9200`
  - Check index exists: `curl http://localhost:9200/bandjacks_attack_nodes-v1`
- **Neo4j connection failed**
  - Verify Neo4j is running: `neo4j status`
  - Check that `NEO4J_PASSWORD` is set in the `.env` file
  - Ensure the password matches your Neo4j instance
  - If you see "NEO4J_PASSWORD environment variable is required", set it in your `.env` file
- **Low extraction recall**
  - Ensure you're using the `agentic_v2` method
  - Check that the LLM API key is valid
  - Verify the model name is correct (`gemini-2.5-flash`)
- **Timeout errors**
  - Increase timeout settings for large documents
  - Consider chunking very large reports
- **Frontend not connecting to API**
  - Ensure the API is running on port 8000
  - Check CORS settings in the API configuration
Enable detailed logging:
```python
import logging
logging.basicConfig(level=logging.DEBUG)
# Run extraction with debug output
result = run_agentic_v2(text, config)
```

```bash
# Unit tests
uv run pytest tests/unit
# Integration tests
uv run pytest tests/integration
# Specific test
uv run pytest tests/test_agentic_v2.py::test_extraction
# With coverage
uv run pytest --cov=bandjacks
# Frontend tests
cd ui && npm test
cd ui && npm run test:coverage
```

- Fork the repository
- Create a feature branch
- Make your changes
- Run tests: `uv run pytest`
- Run linting: `uv run ruff check`
- Submit a pull request
```bash
# Format code
uv run ruff format
# Check linting
uv run ruff check
# Type checking
uv run mypy bandjacks
```

[Your License Here]
- Quick Start: docs/QUICKSTART.md
- Full Setup: docs/SETUP.md
- API Docs: http://localhost:8000/docs (when running)
- GitHub Issues: [Report bugs or request features]
- MITRE ATT&CK® framework
- D3FEND ontology
- STIX 2.1 specification