Complete Terraform infrastructure for an automated reconnaissance pipeline using AWS Step Functions, Batch, Lambda, S3, and DynamoDB.
This project implements a Stateful Reconnaissance Pipeline that:
- Fetch: Retrieves domains from DynamoDB in pages of 4000
- HTTPX: Probes domains and captures HTML source code
- Diff: Compares results between executions and detects changes
- Store: Saves history in DynamoDB and HTML files in S3
Step Functions (Orchestration)
│
├─→ GetDomains (Lambda) → DynamoDB
│
├─→ PrepareBatches (Lambda) → Divides into batches of 200
│
├─→ BatchMap (20 batches in parallel)
│ │
│ ├─→ HTTPXJob (Batch) → S3
│ │
│ ├─→ PrepareHTTPXUrls (Lambda) → Divides into chunks of 100
│ │
│ ├─→ DiffHTTPXMap (50 chunks in parallel)
│ │ │
│ │ └─→ DiffHTTPX (Lambda) → DynamoDB + S3
│ │
│ └─→ AggregateBatchResults (Lambda) → Aggregates and cleans up
│
└─→ CheckMoreDomains → Loop until all processed
- S3 Bucket: Stores raw results, HTMLs, and diffs
- DynamoDB Table: Manages state (URLs, metadata, history)
- AWS Batch: Executes HTTPX in Fargate Spot containers
- Lambda Functions: Process results and perform diff
- Step Functions: Orchestrates the complete pipeline
- EventBridge (optional): Automatic scheduling
- CodeBuild (optional): Automated Docker image build in the cloud
- Terraform >= 1.0
- AWS CLI configured
- Docker (to build the image)
- AWS Account with appropriate permissions
# Copy example file
cp terraform.tfvars.example terraform.tfvars
# Edit with your values
vim terraform.tfvars

Example terraform.tfvars:
aws_region = "us-east-1"
project_name = "infrarecon"
domain = "example.com"
# S3 bucket name (must be globally unique)
s3_bucket_name = ""
# ECR repository URL (will be created in next step)
ecr_repository_url = ""
# Batch configuration
batch_max_vcpus = 128
batch_vcpu = 1.0
batch_memory = 2048
# Automatic scheduling (optional)
enable_schedule = false
schedule_expression = "cron(0 2 * * ? *)"

You can build the Docker image locally or in the cloud using AWS CodeBuild.
# Use automated script
./scripts/build-and-push-docker.sh
# Or manually:
cd src/docker
docker build -t recon-tools:latest .
# Create ECR repository
aws ecr create-repository --repository-name recon-tools --region us-east-1
# Login to ECR
ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin ${ACCOUNT_ID}.dkr.ecr.us-east-1.amazonaws.com
# Tag and push
docker tag recon-tools:latest ${ACCOUNT_ID}.dkr.ecr.us-east-1.amazonaws.com/recon-tools:latest
docker push ${ACCOUNT_ID}.dkr.ecr.us-east-1.amazonaws.com/recon-tools:latest

Terraform automatically creates a CodeBuild project that builds and publishes the Docker image to ECR. This option is useful when you don't have Docker installed locally or want to automate the build.
Cloud Build Advantages:
- No need to have Docker installed locally
- Consistent builds in isolated environment
- Easy CI/CD integration
- Centralized logs in CloudWatch
Steps:
# 1. First, apply Terraform (this creates the CodeBuild project and ECR repository)
cd terraform
terraform apply
# 2. Execute the build in the cloud
cd ..
./scripts/run-cloud-build.sh

The run-cloud-build.sh script (see the sketch after this list):
- Packages the code (`src/docker` and `buildspec.yml`)
- Uploads it to S3 (`s3://{bucket}/build/source.zip`)
- Starts the build in CodeBuild
- CodeBuild automatically:
  - Logs in to ECR
  - Builds the Docker image using `buildspec.yml`
  - Pushes to ECR with the `latest` tag and the commit hash
  - Makes logs available in CloudWatch
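For reference, a minimal Python sketch of the same flow with boto3, assuming a source.zip containing `src/docker` and `buildspec.yml` has already been packaged; the bucket and project names are placeholders for your own values:

```python
# Minimal sketch, not the script itself: upload the packaged source and start the build.
import boto3

BUCKET = "my-recon-bucket"              # placeholder for your s3_bucket_name
PROJECT = "infrarecon-docker-build"     # placeholder for {project_name}-docker-build

# Upload the packaged source to the location the CodeBuild project reads from
boto3.client("s3").upload_file("source.zip", BUCKET, "build/source.zip")

# Start the build; CodeBuild logs in to ECR, builds the image, and pushes it
build = boto3.client("codebuild").start_build(projectName=PROJECT)
print("Started build:", build["build"]["id"])
```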
Monitor the build:
- AWS Console: CodeBuild → Build projects → `{project_name}-docker-build`
- Or use the link provided by the script
Note: The ECR repository is automatically created by Terraform. The URL will be:
{account_id}.dkr.ecr.{region}.amazonaws.com/recon-tools:latest
Set it in terraform.tfvars:
ecr_repository_url = "123456789.dkr.ecr.us-east-1.amazonaws.com/recon-tools:latest"

cd terraform
# Initialize Terraform
terraform init
# Validate configuration
terraform validate
# See what will be created
terraform plan
# Apply
terraform apply

This will create all necessary resources. Estimated time: 5-10 minutes.
# Use script
./scripts/run-pipeline.sh
# Or via AWS CLI
STEP_FUNCTION_ARN=$(cd terraform && terraform output -raw step_function_arn)
aws stepfunctions start-execution \
--state-machine-arn $STEP_FUNCTION_ARN \
--input '{}' \
    --region us-east-1

Monitor execution:
- AWS Console: Step Functions → Executions
- Or use the link provided by the script
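Equivalently, a minimal boto3 sketch that starts an execution and polls its status; the state machine ARN below is a placeholder (use `terraform output step_function_arn`):

```python
# Minimal sketch: start the pipeline and poll until the execution finishes.
import json
import time
import boto3

sfn = boto3.client("stepfunctions", region_name="us-east-1")
arn = "arn:aws:states:us-east-1:123456789012:stateMachine:infrarecon-recon-pipeline"  # placeholder

execution = sfn.start_execution(stateMachineArn=arn, input=json.dumps({}))

while True:
    status = sfn.describe_execution(executionArn=execution["executionArn"])["status"]
    if status != "RUNNING":
        print("Execution finished with status:", status)
        break
    time.sleep(30)
```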
START
│
▼
PrepareGetDomains (page_size = 4000)
│
▼
GetDomains (Lambda) → 4000 domains from DynamoDB
│
▼
CheckDomainsResult
├─ error → HandleError
├─ no domains → EndPipeline
└─ with domains → PrepareBatches
│
▼
PrepareBatches → 20 batches of 200 domains
│
▼
BatchMap (20 in parallel, ResultPath = null)
│
└─ For each batch:
│
├─ PrepareBatchInput
│
├─ HTTPXJob (Batch) → Saves to S3
│
├─ PrepareHTTPXUrls → Divides into chunks of 100
│
├─ CheckHTTPXPrep
│ ├─ error → SkipBatchOnError
│ └─ ok → DiffHTTPXMap
│
├─ DiffHTTPXMap (50 chunks in parallel, ResultPath = null)
│ └─ For each chunk:
│ └─ DiffHTTPX (Lambda)
│ • Processes 100 URLs
│ • Saves to DynamoDB
│ • Saves result to S3 (temp/)
│
└─ AggregateBatchResults
• Lists S3 files
• Aggregates statistics
• Saves summary to S3
• Deletes temporary files
│
▼
CheckMoreDomains
├─ has_more = true → PrepareGetMoreDomains → GetMoreDomains → (loop)
└─ Default → EndPipeline
- Pagination: 4000 domains per page
- Batches: Up to 20 batches in parallel (200 domains each)
- Chunks: Up to 50 chunks in parallel (100 URLs each)
- Total: Up to 1000 simultaneous chunks (20 × 50)
- No size limit: Uses S3 for intermediate results
- Automatic cleanup: Deletes temporary files after processing
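A minimal sketch of the page → batch → chunk splitting described above, using the default sizes; in the real pipeline the chunks are built from the HTTPX output URLs rather than directly from the domain list:

```python
# Minimal sketch of the splitting logic (default sizes: 4000 / 200 / 100).
def split(items, size):
    """Yield consecutive slices of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

page = [f"host{i}.example.com" for i in range(4000)]     # one DynamoDB page
batches = list(split(page, 200))                         # 20 batches for BatchMap
chunks = [list(split(batch, 100)) for batch in batches]  # chunks of 100 per batch

print(len(batches), len(chunks[0]))                      # 20 2
```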
s3://{bucket}/
├── tool=httpx/
│ └── batch={batch_id}/
│ └── dt={date}/
│ └── data.json # Raw HTTPX results
│
├── html_source/
│ └── {domain_safe}/
│ └── {hash}.html # HTML source code (deduplicated)
│
├── html_diff/
│ └── {domain_safe}/
│ └── {url_hash}_v{version}.diff # Diff between versions
│
└── temp/
├── batch={batch_id}/
│ └── chunk={chunk_id}/
│ └── dt={date}/
│ └── result.json # Result of each chunk (temporary)
│
└── batch_summaries/
└── batch={batch_id}/
└── dt={date}/
└── summary.json # Aggregated batch summary
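As an illustration of this layout, a minimal boto3 sketch that reads one aggregated batch summary; the bucket name, batch id, and date are placeholders:

```python
# Minimal sketch: read an aggregated batch summary from the temp/ prefix.
import json
import boto3

s3 = boto3.client("s3")
key = "temp/batch_summaries/batch=0/dt=2024-01-01/summary.json"  # placeholder ids
obj = s3.get_object(Bucket="my-recon-bucket", Key=key)           # placeholder bucket
summary = json.loads(obj["Body"].read())
print(summary)
```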
Table: {project_name}-ReconState

| PK | SK | Attributes |
|---|---|---|
| `URL#{url}` | `METADATA` | status_code, content_length, page_title, body_hash, tech, cname, content_type, word_count, cdn_name, cdn_type, domain, first_seen, last_seen, last_updated, change_count, last_change_at |
| `URL#{url}` | `S3_REF` | s3_path_latest_html, updated_at |
| `URL#{url}` | `HISTORY#{timestamp}#{version}` | (all fields from the previous version), changes_detected, html_diff_s3_key, timestamp, version |
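A minimal sketch of reading the items stored for one URL with boto3, assuming the PK/SK key names from the table above and the default table name:

```python
# Minimal sketch: fetch the METADATA, S3_REF, and HISTORY items for one URL.
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("infrarecon-ReconState")  # {project_name}-ReconState

response = table.query(
    KeyConditionExpression=Key("PK").eq("URL#https://example.com")
)
for item in response["Items"]:
    print(item["SK"], item.get("last_updated"))
```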
The DiffHTTPX Lambda:
- Calculates the MD5 hash of the HTML
- Compares it with the previous hash in DynamoDB
- If different:
  - Saves history in DynamoDB
  - Calculates a textual diff using `difflib`
  - Saves the diff to S3
  - Updates metadata (status_code, title, tech, CDN, etc.)
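A minimal sketch of the change-detection step; the previous HTML is assumed to have been retrieved already, and the function names are illustrative rather than the project's exact code:

```python
# Minimal sketch of the hash comparison and textual diff described above.
import difflib
import hashlib

def detect_change(previous_html: str, current_html: str):
    previous_hash = hashlib.md5(previous_html.encode()).hexdigest()
    current_hash = hashlib.md5(current_html.encode()).hexdigest()
    if previous_hash == current_hash:
        return None  # unchanged: nothing to store

    # Textual diff, later stored under html_diff/{domain_safe}/{url_hash}_v{version}.diff
    diff = "\n".join(difflib.unified_diff(
        previous_html.splitlines(),
        current_html.splitlines(),
        fromfile="previous",
        tofile="current",
        lineterm="",
    ))
    return current_hash, diff
```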
# Add domains
./scripts/manage-domains.sh add example.com
# List domains
./scripts/manage-domains.sh list
# Remove domain
./scripts/manage-domains.sh remove example.com

# Test with domain limit
./scripts/test-pipeline-limited.sh 10

# Test container locally
docker run -it \
-e S3_BUCKET=my-bucket \
-e JOB_TYPE=httpx \
-e DOMAINS_LIST='["example.com"]' \
recon-tools:latest \
  httpx -l /tmp/domains.txt -json -sc -title

# Test Lambda locally
aws lambda invoke \
--function-name infrarecon-diff-httpx \
--payload '{"chunk_id": 0, "urls": ["https://example.com"], "batch_id": 0, "job_id": "job-123"}' \
  response.json

Helper scripts:
- `build-and-push-docker.sh` - Builds and publishes the Docker image locally
- `run-cloud-build.sh` - Builds the Docker image in the cloud (CodeBuild)
- `run-pipeline.sh` - Executes the pipeline
- `manage-domains.sh` - Manages domains in DynamoDB
- `test-pipeline-limited.sh` - Tests with a domain limit
- `upload-domains.sh` - Bulk uploads domains
# VPC and Networking (optional)
vpc_id = ""
subnet_ids = []
security_group_ids = []
# Webhooks (optional)
slack_webhook_url = ""
discord_webhook_url = ""
# Bounty Targets Synchronization (optional)
enable_bounty_sync = false
bounty_sync_schedule = "cron(0 */6 * * ? *)"

Environment variables used by the pipeline components:
- `S3_BUCKET` - S3 bucket name
- `DYNAMODB_TABLE` - DynamoDB table name
- `BATCH_SIZE` - Batch size (default: 200)
- `CHUNK_SIZE` - Chunk size (default: 100)
- `PAGE_SIZE` - Page size (default: 4000)
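For example, a Lambda can read these values with the documented defaults; this is a sketch, not the project's exact code:

```python
# Minimal sketch: read the pipeline configuration from environment variables.
import os

S3_BUCKET = os.environ["S3_BUCKET"]
DYNAMODB_TABLE = os.environ["DYNAMODB_TABLE"]
BATCH_SIZE = int(os.environ.get("BATCH_SIZE", "200"))
CHUNK_SIZE = int(os.environ.get("CHUNK_SIZE", "100"))
PAGE_SIZE = int(os.environ.get("PAGE_SIZE", "4000"))
```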
- Check logs in CloudWatch: `/aws/batch/infrarecon`
- Check the IAM permissions of the task role
- Check network connectivity (subnets, security groups)
- Increase `timeout` and `memory_size` in `terraform/lambda.tf`
- Check file sizes in S3
- Optimize Python code (cache, async processing)
- Check logs in CloudWatch: `/aws/vendedlogs/states/infrarecon-recon-pipeline`
- Check the input JSON format
- Check Step Function IAM permissions
- Check that the 256KB payload limit is not exceeded (use S3 for intermediate results)
- Step Functions has a 256KB limit per state payload
- Solution: intermediate results are saved to S3 (see the sketch after this list)
- Check that `ResultPath = null` is configured in the Map states
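A minimal sketch of the pattern: the chunk Lambda writes its full result to the temp/ prefix in S3 and returns only a small payload, while the Map states use `ResultPath = null` to discard outputs. The bucket name and date are placeholders:

```python
# Minimal sketch of the "large payload goes to S3" pattern (placeholder bucket/date).
import json
import boto3

s3 = boto3.client("s3")

def lambda_handler(event, context):
    result = {"processed_urls": 100, "changes_detected": 3}   # potentially large in practice
    key = (f"temp/batch={event['batch_id']}/chunk={event['chunk_id']}/"
           f"dt=2024-01-01/result.json")                      # date is illustrative
    s3.put_object(Bucket="my-recon-bucket", Key=key, Body=json.dumps(result).encode())
    return {"s3_key": key}  # small payload back to Step Functions
```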
- Check logs in CodeBuild console
- Check CodeBuild IAM permissions
- Check if ECR repository was created
- Check that `buildspec.yml` is correct
- Fargate Spot: ~70% cheaper than On-Demand
- S3: Hash deduplication saves storage
- DynamoDB: Pay-per-request (no provisioning)
- Lambda: Pay-per-invocation
- Step Functions: Pay-per-state-transition
- Adjust `batch_max_vcpus` as needed
- Configure lifecycle policies in S3 (see the sketch after this list)
- Adjust log retention (default: 7 days)
- Use Fargate Spot for Batch jobs
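For the temp/ prefix, a lifecycle rule is a natural fit. A minimal boto3 sketch follows; in this project it would more likely live in terraform/s3.tf, and the bucket name and expiry are placeholders:

```python
# Minimal sketch: expire temporary objects under temp/ after 7 days (placeholder values).
import boto3

boto3.client("s3").put_bucket_lifecycle_configuration(
    Bucket="my-recon-bucket",  # placeholder bucket name
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-temp",
                "Filter": {"Prefix": "temp/"},
                "Status": "Enabled",
                "Expiration": {"Days": 7},
            }
        ]
    },
)
```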
- S3 bucket with AES256 encryption
- Block public access enabled
- IAM roles with least privilege
- VPC endpoints recommended for production
- Secrets Manager for sensitive credentials
.
├── terraform/ # Infrastructure as code
│ ├── main.tf
│ ├── variables.tf
│ ├── lambda.tf
│ ├── batch.tf
│ ├── stepfunctions.tf
│ ├── dynamodb.tf
│ ├── s3.tf
│ ├── iam.tf
│ └── codebuild.tf # Automated cloud build
├── src/
│ ├── docker/ # Dockerfile and scripts
│ └── lambdas/ # Lambda functions
├── scripts/ # Helper scripts
├── buildspec.yml # CodeBuild configuration
└── README.md # This file
- Configure Notifications
  - Add Slack/Discord webhooks to Lambdas
  - Configure SNS for alerts
- Optimize Costs
  - Adjust `batch_max_vcpus` as needed
  - Configure lifecycle policies in S3
  - Adjust log retention
- Monitoring
  - Configure CloudWatch Dashboards
  - Create alerts for failures
- Security
  - Review IAM policies (least privilege)
  - Enable VPC endpoints if needed
  - Configure DynamoDB backup
This project is provided "as is" for educational and research purposes.
If you encounter issues:
- Check logs (CloudWatch)
- Check IAM permissions
- Check if all resources were created
- Consult the Troubleshooting section above