InfraRecon - Stateful Reconnaissance Pipeline

Complete Terraform infrastructure for an automated reconnaissance pipeline using AWS Step Functions, Batch, Lambda, S3, and DynamoDB.

📋 Overview

This project implements a Stateful Reconnaissance Pipeline that:

  1. GetDomains: Fetches domains from DynamoDB in pages of 4000 (see the sketch below)
  2. HTTPX: Probes each domain and captures the HTML source code
  3. Diff: Compares results between executions and detects changes
  4. Store: Saves change history in DynamoDB and HTML snapshots in S3
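
A minimal boto3 sketch of the pagination behind step 1 (illustrative only: the table name, the use of scan, and the key handling are assumptions, not the actual GetDomains Lambda):

import boto3

# Hypothetical paging helper: pulls one page of up to 4000 items from DynamoDB
# and returns the key needed to resume, which is what drives the pipeline's
# "has_more" loop.
dynamodb = boto3.client("dynamodb")

def get_domains_page(table_name, page_size=4000, start_key=None):
    kwargs = {"TableName": table_name, "Limit": page_size}
    if start_key:
        kwargs["ExclusiveStartKey"] = start_key
    resp = dynamodb.scan(**kwargs)
    # LastEvaluatedKey is absent once the table has been fully scanned
    return resp["Items"], resp.get("LastEvaluatedKey")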

Architecture

Step Functions (Orchestration)
    │
    ├─→ GetDomains (Lambda) → DynamoDB
    │
    ├─→ PrepareBatches (Lambda) → Divides into batches of 200
    │
    ├─→ BatchMap (20 batches in parallel)
    │   │
    │   ├─→ HTTPXJob (Batch) → S3
    │   │
    │   ├─→ PrepareHTTPXUrls (Lambda) → Divides into chunks of 100
    │   │
    │   ├─→ DiffHTTPXMap (50 chunks in parallel)
    │   │   │
    │   │   └─→ DiffHTTPX (Lambda) → DynamoDB + S3
    │   │
    │   └─→ AggregateBatchResults (Lambda) → Aggregates and cleans up
    │
    └─→ CheckMoreDomains → Loop until all processed

🏗️ Components

AWS Resources Created

  • S3 Bucket: Stores raw results, HTML snapshots, and diffs
  • DynamoDB Table: Manages state (URLs, metadata, history)
  • AWS Batch: Executes HTTPX in Fargate Spot containers
  • Lambda Functions: Process results and perform diff
  • Step Functions: Orchestrates the complete pipeline
  • EventBridge (optional): Automatic scheduling
  • CodeBuild (optional): Automated Docker image build in the cloud

🚀 Installation and Deployment

Prerequisites

  1. Terraform >= 1.0
  2. AWS CLI configured
  3. Docker (to build the image)
  4. AWS Account with appropriate permissions

Step 1: Configure Variables

# Copy example file
cp terraform.tfvars.example terraform.tfvars

# Edit with your values
vim terraform.tfvars

Example terraform.tfvars:

aws_region = "us-east-1"
project_name = "infrarecon"
domain = "example.com"

# S3 bucket name (must be globally unique)
s3_bucket_name = ""

# ECR repository URL (will be created in next step)
ecr_repository_url = ""

# Batch configuration
batch_max_vcpus = 128
batch_vcpu = 1.0
batch_memory = 2048

# Automatic scheduling (optional)
enable_schedule = false
schedule_expression = "cron(0 2 * * ? *)"

Step 2: Build and Publish Docker Image

You can build the Docker image locally or in the cloud using AWS CodeBuild.

Option A: Local Build

# Use automated script
./scripts/build-and-push-docker.sh

# Or manually:
cd src/docker
docker build -t recon-tools:latest .

# Create ECR repository
aws ecr create-repository --repository-name recon-tools --region us-east-1

# Login to ECR
ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin ${ACCOUNT_ID}.dkr.ecr.us-east-1.amazonaws.com

# Tag and push
docker tag recon-tools:latest ${ACCOUNT_ID}.dkr.ecr.us-east-1.amazonaws.com/recon-tools:latest
docker push ${ACCOUNT_ID}.dkr.ecr.us-east-1.amazonaws.com/recon-tools:latest

Option B: Cloud Build (CodeBuild)

Terraform automatically creates a CodeBuild project that builds and publishes the Docker image to ECR. This option is useful when you don't have Docker installed locally or want to automate the build.

Cloud Build Advantages:

  • No need to have Docker installed locally
  • Consistent builds in isolated environment
  • Easy CI/CD integration
  • Centralized logs in CloudWatch

Steps:

# 1. First, apply Terraform (this creates the CodeBuild project and ECR repository)
cd terraform
terraform apply

# 2. Execute the build in the cloud
cd ..
./scripts/run-cloud-build.sh

The run-cloud-build.sh script:

  • Packages the code (src/docker and buildspec.yml)
  • Uploads to S3 (s3://{bucket}/build/source.zip)
  • Starts the build in CodeBuild
  • CodeBuild automatically:
    • Logs in to ECR
    • Builds the Docker image using buildspec.yml
    • Pushes the image to ECR tagged latest and with the commit hash
    • Writes build logs to CloudWatch
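
A rough boto3 equivalent of the script's client-side steps (a sketch following the naming conventions above; the bucket and project names are placeholders, not outputs of this repo's Terraform):

import boto3

# Placeholders: substitute your s3_bucket_name and {project_name}-docker-build
bucket = "my-infrarecon-bucket"
project = "infrarecon-docker-build"

s3 = boto3.client("s3")
codebuild = boto3.client("codebuild")

# 1. Upload the packaged source (src/docker + buildspec.yml, zipped locally)
s3.upload_file("source.zip", bucket, "build/source.zip")

# 2. Start the CodeBuild build; the project is configured to read its source
#    from s3://{bucket}/build/source.zip
build = codebuild.start_build(projectName=project)
print("Started build:", build["build"]["id"])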

Monitor the build:

  • AWS Console: CodeBuild → Build projects → {project_name}-docker-build
  • Or use the link provided by the script

Note: The ECR repository is automatically created by Terraform. The URL will be:

{account_id}.dkr.ecr.{region}.amazonaws.com/recon-tools:latest

⚠️ IMPORTANT: After the build (local or cloud), copy the image URL and update terraform.tfvars:

ecr_repository_url = "123456789.dkr.ecr.us-east-1.amazonaws.com/recon-tools:latest"

Step 3: Apply Terraform

cd terraform

# Initialize Terraform
terraform init

# Validate configuration
terraform validate

# See what will be created
terraform plan

# Apply
terraform apply

This will create all necessary resources. Estimated time: 5-10 minutes

Step 4: Execute Pipeline

# Use script
./scripts/run-pipeline.sh

# Or via AWS CLI
STEP_FUNCTION_ARN=$(cd terraform && terraform output -raw step_function_arn)
aws stepfunctions start-execution \
  --state-machine-arn $STEP_FUNCTION_ARN \
  --input '{}' \
  --region us-east-1

Monitor execution:

  • AWS Console: Step Functions → Executions
  • Or use the link provided by the script

📊 Pipeline Flow

Complete Diagram

START
  │
  ▼
PrepareGetDomains (page_size = 4000)
  │
  ▼
GetDomains (Lambda) → 4000 domains from DynamoDB
  │
  ▼
CheckDomainsResult
  ├─ error → HandleError
  ├─ no domains → EndPipeline
  └─ with domains → PrepareBatches
                      │
                      ▼
                  PrepareBatches → 20 batches of 200 domains
                      │
                      ▼
                  BatchMap (20 in parallel, ResultPath = null)
                      │
                      └─ For each batch:
                          │
                          ├─ PrepareBatchInput
                          │
                          ├─ HTTPXJob (Batch) → Saves to S3
                          │
                          ├─ PrepareHTTPXUrls → Divides into chunks of 100
                          │
                          ├─ CheckHTTPXPrep
                          │   ├─ error → SkipBatchOnError
                          │   └─ ok → DiffHTTPXMap
                          │
                          ├─ DiffHTTPXMap (50 chunks in parallel, ResultPath = null)
                          │   └─ For each chunk:
                          │       └─ DiffHTTPX (Lambda)
                          │           • Processes 100 URLs
                          │           • Saves to DynamoDB
                          │           • Saves result to S3 (temp/)
                          │
                          └─ AggregateBatchResults
                              • Lists S3 files
                              • Aggregates statistics
                              • Saves summary to S3
                              • Deletes temporary files
                      │
                      ▼
                  CheckMoreDomains
                      ├─ has_more = true → PrepareGetMoreDomains → GetMoreDomains → (loop)
                      └─ Default → EndPipeline

Capabilities

  • Pagination: 4000 domains per page
  • Batches: Up to 20 batches in parallel (200 domains each)
  • Chunks: Up to 50 chunks in parallel (100 URLs each)
  • Total: Up to 1000 simultaneous chunks (20 × 50)
  • No payload size limit: Intermediate results are stored in S3, so the Step Functions 256 KB state limit is never hit
  • Automatic cleanup: Deletes temporary files after processing
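
The fan-out above reduces to simple list slicing; a hedged Python sketch of the sizes involved (mirroring, not reproducing, PrepareBatches and PrepareHTTPXUrls):

# Illustrative only: how one 4000-domain page fans out into batches and chunks.
def split(items, size):
    """Split a list into consecutive slices of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]

domains = [f"host{i}.example.com" for i in range(4000)]   # one DynamoDB page
batches = split(domains, 200)                             # 20 batches for BatchMap

# After HTTPX finishes for a batch, its resulting URLs are re-split into
# chunks of 100 for DiffHTTPXMap (up to 50 chunks, i.e. up to 5000 URLs).
httpx_urls = [f"https://{d}" for d in batches[0]]         # placeholder results
chunks = split(httpx_urls, 100)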

📦 Data Structure

S3 Bucket

s3://{bucket}/
├── tool=httpx/
│   └── batch={batch_id}/
│       └── dt={date}/
│           └── data.json                    # Raw HTTPX results
│
├── html_source/
│   └── {domain_safe}/
│       └── {hash}.html                      # HTML source code (deduplicated)
│
├── html_diff/
│   └── {domain_safe}/
│       └── {url_hash}_v{version}.diff      # Diff between versions
│
└── temp/
    ├── batch={batch_id}/
    │   └── chunk={chunk_id}/
    │       └── dt={date}/
    │           └── result.json              # Result of each chunk (temporary)
    │
    └── batch_summaries/
        └── batch={batch_id}/
            └── dt={date}/
                └── summary.json            # Aggregated batch summary
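
A small sketch of how keys in this layout might be constructed; the {domain_safe} sanitization shown here is an assumption (the actual Lambdas may normalize differently):

import hashlib
from datetime import date

def html_source_key(domain, html):
    """Content-addressed key: identical HTML bodies map to the same object (deduplication)."""
    domain_safe = domain.replace("/", "_").replace(":", "_")   # assumed sanitization
    body_hash = hashlib.md5(html.encode()).hexdigest()
    return f"html_source/{domain_safe}/{body_hash}.html"

def httpx_result_key(batch_id):
    """Raw HTTPX output for one batch, partitioned by tool/batch/date."""
    return f"tool=httpx/batch={batch_id}/dt={date.today().isoformat()}/data.json"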

DynamoDB Table

Table: {project_name}-ReconState

| PK | SK | Attributes |
|----|----|------------|
| URL#{url} | METADATA | status_code, content_length, page_title, body_hash, tech, cname, content_type, word_count, cdn_name, cdn_type, domain, first_seen, last_seen, last_updated, change_count, last_change_at |
| URL#{url} | S3_REF | s3_path_latest_html, updated_at |
| URL#{url} | HISTORY#{timestamp}#{version} | (all fields from the previous version), changes_detected, html_diff_s3_key, timestamp, version |
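
For example, everything stored for one URL can be fetched with a single query (a sketch assuming the key attributes are literally named PK and SK and the default project_name of infrarecon):

import boto3
from boto3.dynamodb.conditions import Key

# Assumptions: table "infrarecon-ReconState", key attributes named PK / SK.
table = boto3.resource("dynamodb").Table("infrarecon-ReconState")

url = "https://example.com"
items = table.query(KeyConditionExpression=Key("PK").eq(f"URL#{url}"))["Items"]

metadata = next(i for i in items if i["SK"] == "METADATA")
history = [i for i in items if i["SK"].startswith("HISTORY#")]
print(metadata.get("status_code"), metadata.get("body_hash"), len(history), "versions")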

🔍 Features

HTTPX Diff

  • Calculates MD5 hash of HTML
  • Compares with previous hash in DynamoDB
  • If different:
    • Saves history in DynamoDB
    • Calculates textual diff using difflib
    • Saves diff to S3
  • Updates metadata (status_code, title, tech, CDN, etc.)
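
A condensed sketch of that comparison using hashlib and difflib (the real DiffHTTPX Lambda also writes the history item and metadata update to DynamoDB; names here are illustrative):

import difflib
import hashlib

def diff_html(previous_html, current_html):
    """Return (changed, new_hash, unified diff text) for a new HTML snapshot."""
    new_hash = hashlib.md5(current_html.encode()).hexdigest()
    if new_hash == hashlib.md5(previous_html.encode()).hexdigest():
        return False, new_hash, ""
    diff = "\n".join(difflib.unified_diff(
        previous_html.splitlines(), current_html.splitlines(),
        fromfile="previous", tofile="current", lineterm=""))
    return True, new_hash, diff

# On change: write a HISTORY# item to DynamoDB and upload the diff text to
# html_diff/{domain_safe}/{url_hash}_v{version}.diff in S3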

Domain Management

# Add domains
./scripts/manage-domains.sh add example.com

# List domains
./scripts/manage-domains.sh list

# Remove domain
./scripts/manage-domains.sh remove example.com

Limited Test

# Test with domain limit
./scripts/test-pipeline-limited.sh 10

🧪 Testing

Local Test (Docker)

# Test container locally
docker run -it \
  -e S3_BUCKET=my-bucket \
  -e JOB_TYPE=httpx \
  -e DOMAINS_LIST='["example.com"]' \
  recon-tools:latest \
  httpx -l /tmp/domains.txt -json -sc -title

Lambda Test

# Invoke the deployed Lambda directly
# (with AWS CLI v2, add --cli-binary-format raw-in-base64-out so the JSON payload is sent as-is)
aws lambda invoke \
  --function-name infrarecon-diff-httpx \
  --payload '{"chunk_id": 0, "urls": ["https://example.com"], "batch_id": 0, "job_id": "job-123"}' \
  response.json

📝 Available Scripts

  • build-and-push-docker.sh - Builds and publishes Docker image locally
  • run-cloud-build.sh - Builds Docker image in the cloud (CodeBuild)
  • run-pipeline.sh - Executes the pipeline
  • manage-domains.sh - Manages domains in DynamoDB
  • test-pipeline-limited.sh - Tests with domain limit
  • upload-domains.sh - Bulk-uploads domains

🔧 Advanced Configuration

Terraform Variables

# VPC and Networking (optional)
vpc_id = ""
subnet_ids = []
security_group_ids = []

# Webhooks (optional)
slack_webhook_url = ""
discord_webhook_url = ""

# Bounty Targets Synchronization (optional)
enable_bounty_sync = false
bounty_sync_schedule = "cron(0 */6 * * ? *)"

Lambda Environment Variables

  • S3_BUCKET - S3 bucket name
  • DYNAMODB_TABLE - DynamoDB table name
  • BATCH_SIZE - Batch size (default: 200)
  • CHUNK_SIZE - Chunk size (default: 100)
  • PAGE_SIZE - Page size (default: 4000)
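
Inside a Lambda these are typically read with defaults matching the values above (illustrative, not the project's actual handler code):

import os

# Required settings plus defaults mirroring the documented values.
S3_BUCKET = os.environ["S3_BUCKET"]
DYNAMODB_TABLE = os.environ["DYNAMODB_TABLE"]
BATCH_SIZE = int(os.environ.get("BATCH_SIZE", "200"))
CHUNK_SIZE = int(os.environ.get("CHUNK_SIZE", "100"))
PAGE_SIZE = int(os.environ.get("PAGE_SIZE", "4000"))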

🐛 Troubleshooting

Batch Job Fails

  1. Check logs in CloudWatch: /aws/batch/infrarecon
  2. Check IAM permissions of task role
  3. Check network connectivity (subnets, security groups)

Lambda Timeout

  1. Increase timeout and memory_size in terraform/lambda.tf
  2. Check file sizes in S3
  3. Optimize Python code (cache, async processing)

Step Function Fails

  1. Check logs in CloudWatch: /aws/vendedlogs/states/infrarecon-recon-pipeline
  2. Check input JSON format
  3. Check Step Function IAM permissions
  4. Check that state payloads do not exceed the 256 KB limit (use S3 for intermediate results)

Error: "Size exceeding maximum bytes"

  • Step Functions limits each state's input/output payload to 256 KB
  • Solution: intermediate results are written to S3 instead of being passed between states
  • Check that ResultPath = null is configured on the Map states
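
The pattern that keeps payloads small is for each Lambda to park its full result in S3 and return only a pointer, for example (a sketch, not the project's exact handler; the event fields follow the test payload shown earlier):

import json
import os
from datetime import date

import boto3

s3 = boto3.client("s3")

def handler(event, context):
    # Heavy result goes to the temp/ prefix in S3 ...
    result = {"urls_processed": len(event.get("urls", [])), "changes": []}  # placeholder
    key = (f"temp/batch={event['batch_id']}/chunk={event['chunk_id']}/"
           f"dt={date.today().isoformat()}/result.json")
    s3.put_object(Bucket=os.environ["S3_BUCKET"], Key=key, Body=json.dumps(result))
    # ... and Step Functions receives only this small pointer, far below 256 KB.
    return {"result_s3_key": key}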

CodeBuild Fails

  1. Check logs in CodeBuild console
  2. Check CodeBuild IAM permissions
  3. Check if ECR repository was created
  4. Check if buildspec.yml is correct

💰 Costs

Cost Estimate (approximate)

  • Fargate Spot: ~70% cheaper than On-Demand
  • S3: Hash deduplication saves storage
  • DynamoDB: Pay-per-request (no provisioning)
  • Lambda: Pay-per-invocation
  • Step Functions: Pay-per-state-transition

Optimizations

  • Adjust batch_max_vcpus as needed
  • Configure lifecycle policies in S3
  • Adjust log retention (default: 7 days)
  • Use Fargate Spot for Batch jobs

🔒 Security

  • S3 bucket with AES256 encryption
  • Block public access enabled
  • IAM roles with least privilege
  • VPC endpoints recommended for production
  • Secrets Manager for sensitive credentials

📚 Project Structure

.
├── terraform/          # Infrastructure as code
│   ├── main.tf
│   ├── variables.tf
│   ├── lambda.tf
│   ├── batch.tf
│   ├── stepfunctions.tf
│   ├── dynamodb.tf
│   ├── s3.tf
│   ├── iam.tf
│   └── codebuild.tf   # Automated cloud build
├── src/
│   ├── docker/        # Dockerfile and scripts
│   └── lambdas/       # Lambda functions
├── scripts/           # Helper scripts
├── buildspec.yml      # CodeBuild configuration
└── README.md          # This file

🎯 Next Steps

  1. Configure Notifications

    • Add Slack/Discord webhooks to Lambdas
    • Configure SNS for alerts
  2. Optimize Costs

    • Adjust batch_max_vcpus as needed
    • Configure lifecycle policies in S3
    • Adjust log retention
  3. Monitoring

    • Configure CloudWatch Dashboards
    • Create alerts for failures
  4. Security

    • Review IAM policies (least privilege)
    • Enable VPC endpoints if needed
    • Configure DynamoDB backup

📄 License

This project is provided "as is" for educational and research purposes.

🆘 Support

If you encounter issues:

  1. Check logs (CloudWatch)
  2. Check IAM permissions
  3. Check if all resources were created
  4. Consult the Troubleshooting section above
