Complete Terraform infrastructure for an automated reconnaissance pipeline using AWS Step Functions, Batch, Lambda, S3, and DynamoDB.
This project implements a Stateful Reconnaissance Pipeline that:
- Fetch: Retrieves domains from DynamoDB in pages of 4000
- HTTPX: Probes domains and captures HTML source code
- Diff: Compares results between executions and detects changes
- Store: Saves history in DynamoDB and HTML files in S3
Step Functions (Orchestration)
│
├─→ GetDomains (Lambda) → DynamoDB
│
├─→ PrepareBatches (Lambda) → Divides into batches of 200
│
├─→ BatchMap (20 batches in parallel)
│ │
│ ├─→ HTTPXJob (Batch) → S3
│ │
│ ├─→ PrepareHTTPXUrls (Lambda) → Divides into chunks of 100
│ │
│ ├─→ DiffHTTPXMap (50 chunks in parallel)
│ │ │
│ │ └─→ DiffHTTPX (Lambda) → DynamoDB + S3
│ │
│ └─→ AggregateBatchResults (Lambda) → Aggregates and cleans up
│
└─→ CheckMoreDomains → Loop until all processed
- S3 Bucket: Stores raw results, HTMLs, and diffs
- DynamoDB Table: Manages state (URLs, metadata, history)
- AWS Batch: Executes HTTPX in Fargate Spot containers
- Lambda Functions: Process results and perform diff
- Step Functions: Orchestrates the complete pipeline
- EventBridge (optional): Automatic scheduling
- CodeBuild (optional): Automated Docker image build in the cloud
- Terraform >= 1.0
- AWS CLI configured
- Docker (to build the image)
- AWS Account with appropriate permissions
# Copy example file
cp terraform.tfvars.example terraform.tfvars
# Edit with your values
vim terraform.tfvars

Example terraform.tfvars:
aws_region = "us-east-1"
project_name = "infrarecon"
domain = "example.com"
# S3 bucket name (must be globally unique)
s3_bucket_name = ""
# ECR repository URL (will be created in next step)
ecr_repository_url = ""
# Batch configuration
batch_max_vcpus = 128
batch_vcpu = 1.0
batch_memory = 2048
# Automatic scheduling (optional)
enable_schedule = false
schedule_expression = "cron(0 2 * * ? *)"

You can build the Docker image locally or in the cloud using AWS CodeBuild.
# Use automated script
./scripts/build-and-push-docker.sh
# Or manually:
cd src/docker
docker build -t recon-tools:latest .
# Create ECR repository
aws ecr create-repository --repository-name recon-tools --region us-east-1
# Login to ECR
ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin ${ACCOUNT_ID}.dkr.ecr.us-east-1.amazonaws.com
# Tag and push
docker tag recon-tools:latest ${ACCOUNT_ID}.dkr.ecr.us-east-1.amazonaws.com/recon-tools:latest
docker push ${ACCOUNT_ID}.dkr.ecr.us-east-1.amazonaws.com/recon-tools:latest

Terraform automatically creates a CodeBuild project that builds and publishes the Docker image to ECR. This option is useful when you don't have Docker installed locally or want to automate the build.
Cloud Build Advantages:
- No need to have Docker installed locally
- Consistent builds in isolated environment
- Easy CI/CD integration
- Centralized logs in CloudWatch
Steps:
# 1. First, apply Terraform (this creates the CodeBuild project and ECR repository)
cd terraform
terraform apply
# 2. Execute the build in the cloud
cd ..
./scripts/run-cloud-build.sh

The run-cloud-build.sh script (see the sketch after this list):
- Packages the code (`src/docker` and `buildspec.yml`)
- Uploads it to S3 (`s3://{bucket}/build/source.zip`)
- Starts the build in CodeBuild
- CodeBuild automatically:
  - Logs in to ECR
  - Builds the Docker image using `buildspec.yml`
  - Pushes to ECR with the `latest` tag and the commit hash
  - Makes logs available in CloudWatch
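For reference, a minimal Python sketch of the same flow with boto3, assuming a source.zip containing `src/docker` and `buildspec.yml` has already been packaged; the bucket and project names are placeholders for your own values:

```python
# Minimal sketch, not the script itself: upload the packaged source and start the build.
import boto3

BUCKET = "my-recon-bucket"              # placeholder for your s3_bucket_name
PROJECT = "infrarecon-docker-build"     # placeholder for {project_name}-docker-build

# Upload the packaged source to the location the CodeBuild project reads from
boto3.client("s3").upload_file("source.zip", BUCKET, "build/source.zip")

# Start the build; CodeBuild logs in to ECR, builds the image, and pushes it
build = boto3.client("codebuild").start_build(projectName=PROJECT)
print("Started build:", build["build"]["id"])
```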
Monitor the build:
- AWS Console: CodeBuild → Build projects → `{project_name}-docker-build`
- Or use the link provided by the script
Note: The ECR repository is automatically created by Terraform. The URL will be:
{account_id}.dkr.ecr.{region}.amazonaws.com/recon-tools:latest
Set it in terraform.tfvars:
ecr_repository_url = "123456789.dkr.ecr.us-east-1.amazonaws.com/recon-tools:latest"

cd terraform
# Initialize Terraform
terraform init
# Validate configuration
terraform validate
# See what will be created
terraform plan
# Apply
terraform apply

This will create all necessary resources. Estimated time: 5-10 minutes.
# Use script
./scripts/run-pipeline.sh
# Or via AWS CLI
STEP_FUNCTION_ARN=$(cd terraform && terraform output -raw step_function_arn)
aws stepfunctions start-execution \
--state-machine-arn $STEP_FUNCTION_ARN \
--input '{}' \
    --region us-east-1

Monitor execution:
- AWS Console: Step Functions → Executions
- Or use the link provided by the script
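Equivalently, a minimal boto3 sketch that starts an execution and polls its status; the state machine ARN below is a placeholder (use `terraform output step_function_arn`):

```python
# Minimal sketch: start the pipeline and poll until the execution finishes.
import json
import time
import boto3

sfn = boto3.client("stepfunctions", region_name="us-east-1")
arn = "arn:aws:states:us-east-1:123456789012:stateMachine:infrarecon-recon-pipeline"  # placeholder

execution = sfn.start_execution(stateMachineArn=arn, input=json.dumps({}))

while True:
    status = sfn.describe_execution(executionArn=execution["executionArn"])["status"]
    if status != "RUNNING":
        print("Execution finished with status:", status)
        break
    time.sleep(30)
```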
START
│
▼
PrepareGetDomains (page_size = 4000)
│
▼
GetDomains (Lambda) → 4000 domains from DynamoDB
│
▼
CheckDomainsResult
├─ error → HandleError
├─ no domains → EndPipeline
└─ with domains → PrepareBatches
│
▼
PrepareBatches → 20 batches of 200 domains
│
▼
BatchMap (20 in parallel, ResultPath = null)
│
└─ For each batch:
│
├─ PrepareBatchInput
│
├─ HTTPXJob (Batch) → Saves to S3
│
├─ PrepareHTTPXUrls → Divides into chunks of 100
│
├─ CheckHTTPXPrep
│ ├─ error → SkipBatchOnError
│ └─ ok → DiffHTTPXMap
│
├─ DiffHTTPXMap (50 chunks in parallel, ResultPath = null)
│ └─ For each chunk:
│ └─ DiffHTTPX (Lambda)
│ • Processes 100 URLs
│ • Saves to DynamoDB
│ • Saves result to S3 (temp/)
│
└─ AggregateBatchResults
• Lists S3 files
• Aggregates statistics
• Saves summary to S3
• Deletes temporary files
│
▼
CheckMoreDomains
├─ has_more = true → PrepareGetMoreDomains → GetMoreDomains → (loop)
└─ Default → EndPipeline
- Pagination: 4000 domains per page
- Batches: Up to 20 batches in parallel (200 domains each)
- Chunks: Up to 50 chunks in parallel (100 URLs each)
- Total: Up to 1000 simultaneous chunks (20 × 50)
- No size limit: Uses S3 for intermediate results
- Automatic cleanup: Deletes temporary files after processing
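A minimal sketch of the page → batch → chunk splitting described above, using the default sizes; in the real pipeline the chunks are built from the HTTPX output URLs rather than directly from the domain list:

```python
# Minimal sketch of the splitting logic (default sizes: 4000 / 200 / 100).
def split(items, size):
    """Yield consecutive slices of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

page = [f"host{i}.example.com" for i in range(4000)]     # one DynamoDB page
batches = list(split(page, 200))                         # 20 batches for BatchMap
chunks = [list(split(batch, 100)) for batch in batches]  # chunks of 100 per batch

print(len(batches), len(chunks[0]))                      # 20 2
```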
s3://{bucket}/
├── tool=httpx/
│ └── batch={batch_id}/
│ └── dt={date}/
│ └── data.json # Raw HTTPX results
│
├── html_source/
│ └── {domain_safe}/
│ └── {hash}.html # HTML source code (deduplicated)
│
├── html_diff/
│ └── {domain_safe}/
│ └── {url_hash}_v{version}.diff # Diff between versions
│
└── temp/
├── batch={batch_id}/
│ └── chunk={chunk_id}/
│ └── dt={date}/
│ └── result.json # Result of each chunk (temporary)
│
└── batch_summaries/
└── batch={batch_id}/
└── dt={date}/
└── summary.json # Aggregated batch summary
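As an illustration of this layout, a minimal boto3 sketch that reads one aggregated batch summary; the bucket name, batch id, and date are placeholders:

```python
# Minimal sketch: read an aggregated batch summary from the temp/ prefix.
import json
import boto3

s3 = boto3.client("s3")
key = "temp/batch_summaries/batch=0/dt=2024-01-01/summary.json"  # placeholder ids
obj = s3.get_object(Bucket="my-recon-bucket", Key=key)           # placeholder bucket
summary = json.loads(obj["Body"].read())
print(summary)
```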
Table: {project_name}-ReconState

| PK | SK | Attributes |
|---|---|---|
| `URL#{url}` | `METADATA` | status_code, content_length, page_title, body_hash, tech, cname, content_type, word_count, cdn_name, cdn_type, domain, first_seen, last_seen, last_updated, change_count, last_change_at |
| `URL#{url}` | `S3_REF` | s3_path_latest_html, updated_at |
| `URL#{url}` | `HISTORY#{timestamp}#{version}` | (all fields from the previous version), changes_detected, html_diff_s3_key, timestamp, version |
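A minimal sketch of reading the items stored for one URL with boto3, assuming the PK/SK key names from the table above and the default table name:

```python
# Minimal sketch: fetch the METADATA, S3_REF, and HISTORY items for one URL.
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("infrarecon-ReconState")  # {project_name}-ReconState

response = table.query(
    KeyConditionExpression=Key("PK").eq("URL#https://example.com")
)
for item in response["Items"]:
    print(item["SK"], item.get("last_updated"))
```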
The DiffHTTPX Lambda:
- Calculates the MD5 hash of the HTML
- Compares it with the previous hash in DynamoDB
- If different:
  - Saves history in DynamoDB
  - Calculates a textual diff using `difflib`
  - Saves the diff to S3
  - Updates metadata (status_code, title, tech, CDN, etc.)
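A minimal sketch of the change-detection step; the previous HTML is assumed to have been retrieved already, and the function names are illustrative rather than the project's exact code:

```python
# Minimal sketch of the hash comparison and textual diff described above.
import difflib
import hashlib

def detect_change(previous_html: str, current_html: str):
    previous_hash = hashlib.md5(previous_html.encode()).hexdigest()
    current_hash = hashlib.md5(current_html.encode()).hexdigest()
    if previous_hash == current_hash:
        return None  # unchanged: nothing to store

    # Textual diff, later stored under html_diff/{domain_safe}/{url_hash}_v{version}.diff
    diff = "\n".join(difflib.unified_diff(
        previous_html.splitlines(),
        current_html.splitlines(),
        fromfile="previous",
        tofile="current",
        lineterm="",
    ))
    return current_hash, diff
```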
# Add domains
./scripts/manage-domains.sh add example.com
# List domains
./scripts/manage-domains.sh list
# Remove domain
./scripts/manage-domains.sh remove example.com

# Test with domain limit
./scripts/test-pipeline-limited.sh 10

# Test container locally
docker run -it \
-e S3_BUCKET=my-bucket \
-e JOB_TYPE=httpx \
-e DOMAINS_LIST='["example.com"]' \
recon-tools:latest \
  httpx -l /tmp/domains.txt -json -sc -title

# Test Lambda locally
aws lambda invoke \
--function-name infrarecon-diff-httpx \
--payload '{"chunk_id": 0, "urls": ["https://example.com"], "batch_id": 0, "job_id": "job-123"}' \
  response.json

Helper scripts:
- `build-and-push-docker.sh` - Builds and publishes the Docker image locally
- `run-cloud-build.sh` - Builds the Docker image in the cloud (CodeBuild)
- `run-pipeline.sh` - Executes the pipeline
- `manage-domains.sh` - Manages domains in DynamoDB
- `test-pipeline-limited.sh` - Tests with a domain limit
- `upload-domains.sh` - Bulk uploads domains
# VPC and Networking (optional)
vpc_id = ""
subnet_ids = []
security_group_ids = []
# Webhooks (optional)
slack_webhook_url = ""
discord_webhook_url = ""
# Bounty Targets Synchronization (optional)
enable_bounty_sync = false
bounty_sync_schedule = "cron(0 */6 * * ? *)"

Environment variables used by the pipeline components:
- `S3_BUCKET` - S3 bucket name
- `DYNAMODB_TABLE` - DynamoDB table name
- `BATCH_SIZE` - Batch size (default: 200)
- `CHUNK_SIZE` - Chunk size (default: 100)
- `PAGE_SIZE` - Page size (default: 4000)
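For example, a Lambda can read these values with the documented defaults; this is a sketch, not the project's exact code:

```python
# Minimal sketch: read the pipeline configuration from environment variables.
import os

S3_BUCKET = os.environ["S3_BUCKET"]
DYNAMODB_TABLE = os.environ["DYNAMODB_TABLE"]
BATCH_SIZE = int(os.environ.get("BATCH_SIZE", "200"))
CHUNK_SIZE = int(os.environ.get("CHUNK_SIZE", "100"))
PAGE_SIZE = int(os.environ.get("PAGE_SIZE", "4000"))
```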
- Check logs in CloudWatch: `/aws/batch/infrarecon`
- Check the IAM permissions of the task role
- Check network connectivity (subnets, security groups)
- Increase `timeout` and `memory_size` in `terraform/lambda.tf`
- Check file sizes in S3
- Optimize Python code (cache, async processing)
- Check logs in CloudWatch: `/aws/vendedlogs/states/infrarecon-recon-pipeline`
- Check the input JSON format
- Check Step Function IAM permissions
- Check that the 256KB payload limit is not exceeded (use S3 for intermediate results)
- Step Functions has a 256KB limit per state payload
- Solution: intermediate results are saved to S3 (see the sketch after this list)
- Check that `ResultPath = null` is configured in the Map states
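A minimal sketch of the pattern: the chunk Lambda writes its full result to the temp/ prefix in S3 and returns only a small payload, while the Map states use `ResultPath = null` to discard outputs. The bucket name and date are placeholders:

```python
# Minimal sketch of the "large payload goes to S3" pattern (placeholder bucket/date).
import json
import boto3

s3 = boto3.client("s3")

def lambda_handler(event, context):
    result = {"processed_urls": 100, "changes_detected": 3}   # potentially large in practice
    key = (f"temp/batch={event['batch_id']}/chunk={event['chunk_id']}/"
           f"dt=2024-01-01/result.json")                      # date is illustrative
    s3.put_object(Bucket="my-recon-bucket", Key=key, Body=json.dumps(result).encode())
    return {"s3_key": key}  # small payload back to Step Functions
```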
- Check logs in CodeBuild console
- Check CodeBuild IAM permissions
- Check if ECR repository was created
- Check that `buildspec.yml` is correct
- Fargate Spot: ~70% cheaper than On-Demand
- S3: Hash deduplication saves storage
- DynamoDB: Pay-per-request (no provisioning)
- Lambda: Pay-per-invocation
- Step Functions: Pay-per-state-transition
- Adjust `batch_max_vcpus` as needed
- Configure lifecycle policies in S3 (see the sketch after this list)
- Adjust log retention (default: 7 days)
- Use Fargate Spot for Batch jobs
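For the temp/ prefix, a lifecycle rule is a natural fit. A minimal boto3 sketch follows; in this project it would more likely live in terraform/s3.tf, and the bucket name and expiry are placeholders:

```python
# Minimal sketch: expire temporary objects under temp/ after 7 days (placeholder values).
import boto3

boto3.client("s3").put_bucket_lifecycle_configuration(
    Bucket="my-recon-bucket",  # placeholder bucket name
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-temp",
                "Filter": {"Prefix": "temp/"},
                "Status": "Enabled",
                "Expiration": {"Days": 7},
            }
        ]
    },
)
```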
- S3 bucket with AES256 encryption
- Block public access enabled
- IAM roles with least privilege
- VPC endpoints recommended for production
- Secrets Manager for sensitive credentials
.
├── terraform/ # Infrastructure as code
│ ├── main.tf
│ ├── variables.tf
│ ├── lambda.tf
│ ├── batch.tf
│ ├── stepfunctions.tf
│ ├── dynamodb.tf
│ ├── s3.tf
│ ├── iam.tf
│ └── codebuild.tf # Automated cloud build
├── src/
│ ├── docker/ # Dockerfile and scripts
│ └── lambdas/ # Lambda functions
├── scripts/ # Helper scripts
├── buildspec.yml # CodeBuild configuration
└── README.md # This file
- Configure Notifications
  - Add Slack/Discord webhooks to Lambdas
  - Configure SNS for alerts
- Optimize Costs
  - Adjust `batch_max_vcpus` as needed
  - Configure lifecycle policies in S3
  - Adjust log retention
- Monitoring
  - Configure CloudWatch Dashboards
  - Create alerts for failures
- Security
  - Review IAM policies (least privilege)
  - Enable VPC endpoints if needed
  - Configure DynamoDB backup
This project is provided "as is" for educational and research purposes.
If you encounter issues:
- Check logs (CloudWatch)
- Check IAM permissions
- Check if all resources were created
- Consult the Troubleshooting section above