Last Updated: August 11, 2025 at 08:02 PM CDT
Version: 2.0.38
Purpose: Simplified, reliable infrastructure orchestration for Terraform/Terragrunt with DRY AWS CLI integration and performance optimization
This system follows a "simplicity first" approach with these core principles:
- Single Source of Truth:
modules.ymldefines all modules and their groupings - Unified Execution Strategy: All operations use
terragrunt --allwith targeted exclusions for reliable, path-resolution-free execution - Centralized Flag Management: All terragrunt flags are handled centrally by
execute_terragrunt() - Performance Optimization: Multiple performance enhancements for faster operations:
--provider-cacheflag for 80-90% faster multi-module operations--dependency-fetch-output-from-statefor 30-70% faster output generation (auto-enabled when safe)
- Sequential Output Processing: Reliable sequential output generation ensures all modules complete successfully
- Automatic Output Generation: State-changing operations automatically generate outputs
- Protected Module Preservation: Output files for protected modules are preserved during destroy operations
- Consistent Environment Context: All operations execute from the correct environment directory
- Modular Shared Code: Common operations (parsing, formatting, validation) are centralized
- Self-Documenting: Comprehensive help system provides detailed guidance for all operations
- Endpoint Flag Intelligence: Endpoint flags (
--ssm,--ecr,--s3) work seamlessly with all targeting methods
Note on AWS CLI availability: the status module uses a lightweight
validate_aws_cli()that only checks for theawsbinary on PATH. Credentials and region are validated on the specific AWS calls (which pass--region) to surface clearer, operation-specific errors.
The infrastructure system includes a comprehensive secrets protection system that prevents accidental destruction of AWS Secrets Manager resources while providing controlled value clearing capabilities.
🚫 Destroy-Disabled Modules (destroy: false)
- secrets - Secret infrastructure can NEVER be destroyed
- Infrastructure remains permanently intact regardless of
--forceflag - With
--force: Secret VALUES are cleared via AWS CLI (infrastructure preserved) - Without
--force: Operation skipped entirely
🔒 Protected Modules (protected: true)
- Standard protection that can be overridden with
--force - Prevents accidental destruction in normal operations
# Safe operations (secrets protected)
./infra destroy dev:secrets # → "Secrets module protected, skipping"
./infra destroy dev:all # → Destroys everything EXCEPT secrets
# Force operations (secrets cleared, not destroyed)
./infra destroy dev:secrets --force # → Clears secret values via AWS CLI
./infra destroy dev:all --force # → Clears secrets + destroys other modules
# Dry-run mode shows what would be cleared
./infra destroy dev:secrets --force --dry-run
# → [DRY-RUN] Would clear AWS secret: mnemosyne/jwt/v1The system automatically reads the single secrets.yml file in the secrets module:
# src/live/dev/secrets/secrets.yml
secrets:
jenova-service-token:
name: "mnemosyne/jwt/v1"
description: "JWT for mnemosyne to access Infisical"
secret_string: "your-jwt-key-here"When cleared with --force, the secret_string value is set to "infra.sh cleared" while preserving all metadata and infrastructure.
The system now intelligently detects when the endpoints module is included in any operation and automatically sets the appropriate environment variables. This means endpoint flags work consistently across all targeting methods:
# Group operations (NEW - now works!)
./infra apply dev --ssm --ecr # All modules with endpoint flags
./infra apply dev:infrastructure --ssm # Infrastructure group with SSM
./infra apply dev:all --ssm --ecr --s3 # All modules with all endpoint flags
# Direct targeting (always worked)
./infra apply dev:endpoints --ssm --ecr # Direct endpoint targeting
# Non-endpoint operations (correctly ignored)
./infra apply dev:athena --ssm # Flags ignored (no endpoints module)
./infra apply dev:instances --ssm # Flags ignored (instances only)The system uses intelligent endpoint detection to determine when to set environment variables:
- Direct Targeting:
dev:endpoints→ Always sets endpoint flags - Group Operations:
dev,dev:all,dev:infrastructure→ Sets flags if endpoints is in the group - Non-Endpoint Operations:
dev:instances,dev:athena→ Skips flag setting
# Flag parsing
--ssm --ecr → SSM=true, ECR=true, S3=false
# Environment variable export (when endpoints included)
TG_VAR_ssm=true, TG_VAR_ecr=true, TG_VAR_s3=false
# Terragrunt configuration uses these variables
# to conditionally create VPC endpointsThe infrastructure system uses a unified execution strategy that eliminates complex targeting logic and path resolution issues. All operations now use terragrunt --all with targeted exclusions, providing reliable execution regardless of the target scope.
# Full environment (no exclusions)
terragrunt destroy --all
# Single module (exclude everything except target)
terragrunt destroy --all --queue-exclude-dir=vpcs,security_groups,eips,ebss,ecrs
# Infrastructure-only (exclude instances)
terragrunt destroy --all --queue-exclude-dir=athena,metis,mnemosyne- ✅ No Path Resolution Issues - Eliminates
find_in_parent_folderserrors - ✅ Single Execution Path - All operations use the same reliable approach
- ✅ Consistent Behavior - Same logic for single module and full environment operations
- ✅ Enhanced Reliability - Better handling of edge cases and errors
- ✅ Simplified Codebase - Significantly reduced complexity in execution logic
Full Environment Operations:
# Apply all modules in environment
./infra apply dev:all
# → terragrunt apply --all
# Destroy all modules in environment
./infra destroy prod:all
# → terragrunt destroy --allSingle Module Targeting:
# Target specific module
./infra destroy prod:athena
# → terragrunt destroy --all --queue-exclude-dir=vpcs,security_groups,eips,ebss,ecrs
# Target infrastructure only
./infra apply dev:infrastructure
# → terragrunt apply --all --queue-exclude-dir=athena,metis,mnemosyneProtected Module Handling:
# Protected modules are automatically excluded
./infra destroy dev:all
# → terragrunt destroy --all --queue-exclude-dir=eips,ebss,ecrs
# Force destroy protected modules
./infra destroy dev:all --force
# → terragrunt destroy --all (no exclusions)- No Directory Changes - Always execute from environment root
- No Strategy Selection - Single unified approach for all operations
- Automatic Exclusion Generation - System calculates exclusions based on target
- Backward Compatible - All existing commands work unchanged
- Enhanced Error Handling - Improved reliability with no path resolution issues
- When you apply or destroy an instance marked as
gateway: trueinmodules.yml, the system will automatically reapply the VPCs module immediately after. - This keeps VPC routing tables in sync with the gateway instance’s NIC ID, reducing manual steps and risk of stale routes.
Mark an instance as a gateway in your modules.yml:
instances:
- name: nyx
gateway: true
- name: metisBehavior:
./infra apply dev:nyx→ appliesnyx, then immediately appliesvpcs../infra destroy dev:nyx→ destroysnyx, then immediately appliesvpcs.- Only triggers for single gateway instance operations (not for all/instances/infrastructure targets).
- Ensures VPC routes are always up to date after gateway changes.
- Follows DRY KISS and automation principles.
The infrastructure system supports cmd parameters in modules.yml that execute before each module is processed during apply operations. This enables module-specific preparation tasks like code generation, validation, or setup scripts.
infrastructure:
- vpcs
- name: security_groups
cmd: python3 generate.py # Executed before security_groups apply
- name: eips
protected: true
instances:
- name: custom_instance
cmd: ./setup.sh # Executed before custom_instance apply
- athena
- aegis- When: Commands execute only during apply operations (not plan, destroy, init, etc.)
- Where: Commands execute in the module's directory (CWD is set to module path)
- Order: Commands execute before terragrunt operations for each targeted module
- Targeting: Only modules included in the apply target will have their commands executed
- Error Handling: Command failure stops the entire operation with clear error messages
- Dry-run Support: Commands show what would be executed with
--dry-runflag
The security_groups module uses cmd: python3 generate.py to eliminate complex HCL logic:
# modules.yml
infrastructure:
- name: security_groups
cmd: python3 generate.pyWhat happens during apply:
- Pre-processing:
python3 generate.pyruns insecurity_groups/directory - Generation: Creates simplified HCL files in
output/subdirectory - Terragrunt: Processes the generated files (no complex nested logic)
Example Output:
🔧 Executing pre-processing command for module 'security_groups': python3 generate.py
🚀 RecoverySky Security Groups Generator
📋 Found 6 allowed files, 10 rule files, 10 group files, 4 security group files
🌐 Loaded 4 EIP addresses
✅ Generated: output/mnemosyne.hcl (2 ingress, 1 egress, 5 CIDR blocks)
✅ Command completed successfully for module 'security_groups'
# Apply security_groups - executes 'python3 generate.py' first
./infra apply dev:security_groups
# Apply all infrastructure - executes cmd for security_groups only (other modules have no cmd)
./infra apply dev:infrastructure
# Apply all modules - executes cmd for any module that defines one
./infra apply dev:all
# Plan operation - no commands executed (apply operations only)
./infra plan dev:security_groups# Test command execution without running them
./infra apply dev:security_groups --dry-run
# Shows:
# [DRY-RUN] Would execute in /path/to/dev/security_groups: python3 generate.py
# [DRY-RUN] Would execute: terragrunt apply --auto-approveCode Generation:
- name: security_groups
cmd: python3 generate.py # Generate simplified HCL from complex configs
- name: api_gateway
cmd: npm run build-config # Generate API definitions from TypeScriptValidation Scripts:
- name: databases
cmd: ./validate-schemas.sh # Validate database schemas before apply
- name: networking
cmd: python validate_cidr.py # Validate CIDR ranges and subnetsSetup Operations:
- name: certificates
cmd: ./fetch-ssl-certs.sh # Download SSL certificates before apply
- name: custom_module
cmd: make prepare # Run makefile preparation tasks🚀 Reliability
- Consistent preparation: Every apply automatically runs required setup
- Error prevention: Catch configuration issues before terragrunt runs
- Clear feedback: See exactly what preparation commands are running
🔧 Maintainability
- DRY principle: Commands defined once in modules.yml, used everywhere
- KISS approach: Simple command execution, no complex logic needed
- Version control: Command changes tracked with infrastructure config
⚡ Performance
- Efficient targeting: Only commands for targeted modules execute
- Parallel-ready: Multiple targeted modules can have commands run in sequence
- Fast feedback: Immediate command output during infrastructure operations
The bounce operation functionality has been comprehensively tested on live infrastructure with complete success:
✅ Testing Results:
- Unit tests: 29/32 passing (90%+ success rate)
- Dry-run validation: Perfect execution on both test and dev environments
- Live production test: Complete success on
dev:mnemosyneinstance - Instance recreation: Old instance destroyed, new instance created, same EIP maintained
- Automatic cleanup: SSH known_hosts entries cleaned, outputs regenerated
🎯 Environment Targeting:
- Test environment: Only
athenaactive (cost-optimized single instance) - Dev environment: All 4 instances active (
athena,aegis,metis,mnemosyne) - Documentation clarity: Environment-specific instance availability clearly documented
Usage Examples:
# ✅ Correct bounce targeting - dev environment (all instances active)
src/infra/infra shutdown dev:mnemosyne --bounce # Works perfectly
src/infra/infra shutdown dev:athena --bounce --dry-run
# ✅ Correct bounce targeting - test environment (only athena active)
src/infra/infra shutdown test:athena --bounce # Works perfectly
# ❌ Incorrect targeting - disabled instance
src/infra/infra shutdown test:mnemosyne --bounce # Fails - mnemosyne disabled in test✅ Operation Performance:
- Total time: ~3 minutes for complete bounce cycle
- Phase breakdown: SSH shutdown (5s) → Destroy (30s) → Apply (90s) → Outputs (30s) → Cleanup (10s)
- EIP preservation: Maintains same public IP throughout entire operation
- Error handling: Robust handling of SSH timeouts, network issues, and instance state problems
The infrastructure system now includes robust fallback mechanisms when SSH connections fail during shutdown operations. This ensures reliable instance management even when network connectivity, SSH services, or instance accessibility issues occur.
When performing shutdown operations on instances:
- SSH Attempt (5-second timeout): System attempts SSH connection to the instance
- Failure Detection: If SSH times out or fails for any reason
- AWS CLI Fallback: Automatically switches to AWS CLI-based instance termination
- Status Monitoring: Monitors termination progress until completion
- Red Status Reporting: All termination actions clearly displayed in red text
# Example shutdown with automatic fallback
./infra shutdown dev:athena
# If SSH fails:
# ⚠️ SSH connection timed out after 5 seconds for athena
# 🔄 Falling back to AWS CLI termination...
# 🔴 TERMINATING instance athena (i-1234567890abcdef0) due to SSH connection failure
# 🔴 Termination command sent successfully for athena
# 🔴 Monitoring termination progress for athena (timeout: 5 minutes)...
# 🔴 Termination status: shutting-down (15s elapsed)
# 🔴 Instance athena has been TERMINATED
# 🔴 AWS CLI termination completed for athenaSSH Timeout Optimization:
- Reduced from 10 seconds to 5 seconds for faster failure detection
- No retry attempts - immediate fallback on first failure
- Real-time progress reporting during all operations
Termination Monitoring:
- 5-minute timeout for AWS termination completion (accounts for AWS processing time)
- 2-second polling interval for responsive status updates
- State transition tracking:
running→shutting-down→terminated
Red Text Indicators:
- All termination messages use red color coding for clear visibility
- 🔴 Emoji indicators for consistent termination action identification
- Progress status shows elapsed time and current instance state
User Experience:
- No manual intervention required - automatic fallback is seamless
- Same command interface regardless of SSH availability
- Graceful degradation - operations continue even with connectivity issues
AWS CLI Integration:
- Uses existing AWS infrastructure (region detection, credential validation)
- Leverages centralized output system for instance ID resolution
- Integrates with logging system for comprehensive operation tracking
Backward Compatibility:
- No breaking changes to existing shutdown commands
- All flags preserved (--reboot, --flush, --dry-run, etc.)
- Same return codes for script compatibility
Multiple Failure Scenarios:
- SSH timeout (exit code 124): Connection timeout after 5 seconds
- SSH connection failure: Network unreachable, connection refused, etc.
- SSH authentication failure: Key issues, user authentication problems
- Instance state issues: Instance stopped, stopping, terminated, etc.
AWS CLI Validation:
- Pre-flight checks ensure AWS CLI is available and configured
- Region validation confirms correct AWS region for operations
- Instance ID resolution uses centralized infrastructure output system
- Error reporting provides clear feedback on any failures
Standard Shutdown (with fallback):
# Normal shutdown - tries SSH first, falls back to AWS CLI if needed
./infra shutdown dev:athena
# Multiple instances - each gets individual fallback handling
./infra shutdown dev:instances
# With reboot flag - fallback still applies
./infra shutdown dev:athena --rebootDry-Run Testing:
# Test the fallback logic without actual termination
./infra shutdown dev:athena --dry-run
# Shows exactly what would happen:
# 🔄 Attempting SSH connection (5s timeout)...
# [DRY-RUN] Would execute SSH: athena-dev '~/scripts/shutdown.sh'
# [DRY-RUN] On SSH failure, would terminate instance: i-1234567890abcdef0 (athena)
# [DRY-RUN] Would wait for instance athena to reach terminated stateForce AWS CLI Termination: If you need to terminate an instance directly without attempting SSH, you can use the reboot operation which includes termination logic for offline instances.
Enhanced Reliability:
- No more stuck operations due to SSH connectivity issues
- Guaranteed instance shutdown using AWS native APIs
- Robust fallback handling for various failure scenarios
- Clear failure reporting with actionable information
Improved User Experience:
- Faster feedback with 5-second timeout
- Automatic recovery without manual intervention required
- Visual clarity with red text for destructive actions
- Consistent behavior across all instance types and states
Better Operational Safety:
- Intentional destructive actions clearly highlighted in red
- Progress monitoring ensures operations complete successfully
- Timeout protection prevents indefinite waits
- Comprehensive logging for audit trails and debugging
The infrastructure system includes powerful AWS CLI integration for direct instance management with intelligent state monitoring and graceful SSH-first operations.
NEW in v2.0.13: Complete AWS CLI reboot support with DRY architecture
# Restart instance using AWS CLI with intelligent state monitoring
./infra reboot dev:athena
# Reboot with detailed progress monitoring
./infra reboot dev:athena --verbose 1
# Preview reboot operation
./infra reboot dev:athena --dry-runProcess Flow:
- Instance ID Resolution: Retrieves instance ID from centralized outputs
- AWS CLI Reboot: Executes
aws ec2 reboot-instancescommand - Intelligent State Monitoring: 20-second initial wait for SSH→AWS transition
- Progress Tracking: Real-time status updates during reboot cycle
State Transition Monitoring:
- Initial wait: 20 seconds (allows SSH scripts to complete graceful operations)
- State cycle:
running→shutting-down→pending→running - Timeout: 5 minutes (sufficient for complete reboot cycle)
- Polling: 1-second intervals for responsive feedback
Universal Instance State Function: All AWS instance operations now use a single, configurable function:
wait_for_instance_state "env" "instance" "target_state" [timeout] [initial_wait] [poll_interval] [operation_description]Benefits of DRY Implementation:
- 33% code reduction: Eliminated duplicate waiting logic across 4+ functions
- Consistent behavior: All operations use identical patterns and error handling
- Enhanced functionality: Configurable timeouts, delays, and polling intervals
- KISS principle: Single, well-tested implementation for all instance state waiting
Supported Target States:
running- Instance is operationalstopped- Instance is shut downterminated- Instance is permanently destroyedpending- Instance is starting upshutting-down- Instance is stopping
Reboot Operations:
- Initial wait: 20 seconds (SSH graceful shutdown transition)
- Target state:
running - Timeout: 5 minutes
- Use case: Complete instance restart with proper SSH handling
Shutdown Operations:
- Initial wait: 0 seconds (immediate polling)
- Target state:
stopped - Timeout: 2 minutes
- Use case: Fast shutdown monitoring
Termination Operations:
- Initial wait: 0 seconds
- Target state:
terminated - Timeout: 5 minutes (conservative for AWS processing)
- Use case: Instance destruction monitoring
Start Operations:
- Initial wait: 0 seconds
- Target state:
running - Timeout: 5 minutes
- Use case: Instance startup monitoring
Prerequisites for AWS CLI Operations:
- AWS CLI installed and configured
- Instance outputs must exist:
./infra output <env>:<instance> - AWS credentials with EC2 permissions
- Instance must be in same AWS region as CLI configuration
Output Dependencies:
# Required before reboot operations
./infra output dev:athena --refresh
# Ensures current instance IDs are available
./infra output dev:instances --refreshIntelligent Error Detection:
- Instance not found: Clear messaging if instance doesn't exist
- Permission issues: AWS CLI permission validation
- State conflicts: Detection of incompatible instance states
- Timeout handling: Graceful handling of long-running operations
Dry-Run Support:
# Preview all AWS CLI operations
./infra reboot dev:athena --dry-run
# Shows:
# [DRY-RUN] Would execute: aws ec2 reboot-instances --instance-ids i-1234567890abcdef0
# [DRY-RUN] Would wait 20 seconds before monitoring reboot cycle
# [DRY-RUN] Would monitor instance athena for running state (timeout: 5 minutes)Preserved SSH-First Architecture:
- AWS CLI operations complement SSH scripts (never replace them)
- Reboot operations use 20-second delay to allow SSH graceful shutdown
- Consistent with shutdown operations that prioritize SSH scripts
- Fallback capability when SSH operations are unavailable
Global Context Integration:
- Uses centralized operation context (
OP_ACTION,OP_ENV, etc.) - Integrates with logging system for comprehensive audit trails
- Consistent with framework patterns (dry-run, error handling, messaging)
- Backward compatible with all existing infrastructure operations
The infrastructure system includes a comprehensive set of KISS (Keep It Simple Silly) utilities that eliminate duplication and improve maintainability. These utilities follow Jenova's coding principles of clean architecture with girly precision.
get_operation_context(): One-call operation context gathering (replaces 50+ repetitive patterns)is_dry_run()&execute_with_dry_run(): Standardized dry-run checking and execution- File operation utilities:
file_exists_and_readable(),get_module_output_path(), etc. execute_post_operation_actions(): Consolidated post-operation cleanup (bell, DNS, SSH cleanup)
- ✨ Recently cleaned up: Removed 200+ lines of unused functions and duplicates
- DRY compliance: Each function exists in exactly one place
- No clutter: Only actively used functions remain in shared utilities
- KISS philosophy: Simple, focused modules that do one thing well
- Universal adoption: All infrastructure modules use KISS utilities
- Consistent patterns: Same operation context gathering across all modules
- Reduced complexity: Modules focus on business logic, not boilerplate
- Better maintainability: Changes propagate automatically through centralized utilities
Before KISS (repetitive pattern):
local action=$(get_action)
local env=$(get_environment)
local target_type=$(get_target_type)
local env_path="$(get_environment_path "$env")"After KISS (one call):
get_operation_context
# Sets: OP_ACTION, OP_ENV, OP_TARGET_TYPE, OP_ENV_PATHPath Construction:
# Standardized module paths
module_path="$(get_module_path "$env" "$module")" # /path/to/env/module
output_path="$(get_module_output_path "$env" "$module")" # /path/to/env/outputs/module.json
# Directory management
ensure_output_directory "$env" # Creates env/outputs if neededFile Validation:
# Common file checks
if file_exists_and_readable "$file"; then
# Process file
fi
if file_exists_and_has_content "$output_file"; then
# Process non-empty output
fiStandardized Dry-Run Check:
# Old inconsistent patterns:
# [[ "${DRY_RUN:-false}" == "true" ]]
# [[ "$DRY_RUN" == true ]]
# New standardized approach:
if is_dry_run; then
dry_run_message "[DRY-RUN] Would perform operation"
return 0
fiExecute with Dry-Run:
execute_with_dry_run "rm -f $file" "Would delete file: $file"
# Automatically handles dry-run vs real executionBefore (repetitive pattern in every module):
ring_completion_bell "Operation completed successfully"
update_dns_records "Operation completed successfully"
cleanup_known_hosts "Operation completed successfully"After (one call):
execute_post_operation_actions "Operation completed successfully"
# Executes all three actions automaticallyCentralized Logging Functions:
# Check if logging is available
if is_logging_active; then
log_file="$(get_terragrunt_log_file)"
# Use logging
fiCode Reduction:
- 50+ instances of repetitive variable gathering eliminated
- 40+ instances of manual path construction replaced
- 30+ instances of inconsistent dry-run checks standardized
- 25+ instances of file existence checks consolidated
Consistency Improvements:
- Uniform dry-run behavior across all modules
- Standardized path handling for all file operations
- Consistent post-operation actions for all commands
- Centralized error patterns for better debugging
Maintenance Benefits:
- Single point of change for common operations
- Easy testing of centralized utilities
- Clear documentation with usage examples
- Backward compatibility - all existing code continues to work
When creating new modules, use the KISS utilities:
#!/bin/bash
source "$(dirname "$0")/shared.sh"
my_operation() {
# Get all context in one call
get_operation_context
# Use standardized paths
local config_file="$(get_module_path "$OP_ENV" "$OP_TARGET_TYPE")/config.yml"
local output_file="$(get_module_output_path "$OP_ENV" "$OP_TARGET_TYPE")"
# Check files using utilities
if file_exists_and_readable "$config_file"; then
# Process config
# Execute with dry-run support
execute_with_dry_run "cp '$config_file' '$backup_file'" "Would backup config"
# Execute operation
if execute_terragrunt "$OP_ACTION" "" "$OP_TARGET_TYPE"; then
success_message "Operation completed successfully"
# All post-operation actions in one call
execute_post_operation_actions "Operation completed successfully"
else
handle_error "Operation failed"
fi
fi
}The infrastructure management system includes extensive built-in documentation:
# General help with complete command reference
./infra --help
# Specific command help with detailed examples
./infra apply --help # Deployment documentation
./infra volume --help # Volume management guide
./infra destroy --help # Safety warnings and procedures
./infra plan --help # Planning and validation
./infra init --help # Initialization procedures
./infra output --help # Output generation
./infra clean --help # Cache management
./infra reboot --help # AWS instance management- 🔧 Detailed Command Documentation: Purpose, usage, parameters, and examples for each action
- 💾 Volume Management Guide: Complete EBS volume workflows with device assignment
- 🔄 AWS CLI Operations: Instance management and requirement documentation
- 🎯 Target Specifications: Clear explanation of environment and module targeting
- 🚩 Flag Documentation: Comprehensive flag explanations with use cases
- 📋 Workflow Examples: Step-by-step guides for common infrastructure tasks
⚠️ Safety Guidance: Warnings and best practices for destructive operations- 🛠️ Troubleshooting: Common issues and solutions for each operation
- 🚀 Performance Optimizations: Fast checking and early returns for volume operations
# Quick reference for volume operations
./infra volume --help | grep -A 10 "ATTACH Examples"
# Safety information for destroy operations
./infra destroy --help | grep -A 5 "CRITICAL WARNINGS"
# Understanding targeting formats
./infra --help | grep -A 10 "TARGET FORMATS"The Infrastructure Management System v2.0 is fully operational with all KISS utilities active! ✨
# Check infrastructure status (always start here!)
./infra status dev:all
# Deploy infrastructure (dry-run first!)
./infra apply dev:infrastructure --dry-run
./infra apply dev:infrastructure
# Deploy instances
./infra apply dev:instances
# Generate outputs with clean mode
./infra output dev:all --clean # Remove old outputs first
./infra output dev:all --refresh # Generate fresh outputs
# Clean mode for output management
./infra output dev:athena --clean --dry-run # Preview cleanupCRITICAL BUG FIXES: All missing functions restored! 🚨➡️✅
- ✅ Fixed
is_clean: command not founderrors - ✅ Fixed
readarray: command not foundon macOS/older bash - ✅ Restored complete KISS utilities functionality
- ✅ Enhanced shell compatibility (bash 3+, zsh, etc.)
Infrastructure is now 100% operational with all optimizations active! 💖
All operations follow the same pattern:
- Parse arguments and determine target modules
- Generate exclusion list from
structure.yml - Execute terragrunt --all with exclusions
- Generate outputs for processed modules
- Copy outputs to centralized location
src/infra/
├── infra.sh # Main orchestrator
├── args.sh # Argument processing
├── structure.sh # Structure.yml processing and exclusions
├── executor.sh # Terragrunt execution
├── outputs.sh # Output generation and management
├── volume.sh # Volume management
├── logger.sh # Logging system
├── shared.sh # Shared utilities (parsing, formatting, validation)
└── README.md # This file
Defines module groupings and execution order:
infrastructure:
- vpcs
- eips
- ebss
- security_groups
- ecrs
instances:
- athena
- aegis
- metis
- mnemosynesrc/live/
├── dev/
│ ├── structure.yml # Module definitions
│ ├── log/ # Automatic logging
│ ├── outputs/ # Centralized outputs
│ ├── vpcs/ # Infrastructure modules
│ ├── eips/
│ ├── athena/ # Instance modules
│ └── ...
└── dev/
├── structure.yml
└── ...
apply- Apply infrastructure changesdestroy- Destroy infrastructureplan- Show planned changesinit- Initialize modulesoutput- Generate outputs onlyclean- Remove .terragrunt-cache directoriesvolume- Manage EBS volumesreboot- Reboot AWS instances via CLI
env:infrastructure- All infrastructure modulesenv:instances- All instance modulesenv:all- All modulesenv:module-name- Single module
--auto- Auto-approve changes--dry-run- Show what would be executed--verbose 0|1- Verbosity level (0=default, 1=debug)--refresh- Refresh Terraform state before generating outputs (output only)--log- Enable detailed logging (automatic for state changes)--outputs- Generate outputs (automatic for state changes)--test-mode- Enable test mode for automated testing and development (errors return instead of exit)
1. Parse: dev:infrastructure
2. Load: structure.yml → infrastructure = [vpcs, eips, ebss, security_groups, ecrs]
3. Exclude: instances = [athena, aegis, metis, mnemosyne]
4. Execute: terragrunt apply --all --terragrunt-exclude-dir athena --terragrunt-exclude-dir aegis ...
5. Generate: outputs for [vpcs, eips, ebss, security_groups, ecrs]
6. Copy: outputs to dev/outputs/
1. Parse: dev:athena
2. Load: structure.yml → all modules = [vpcs, eips, ebss, security_groups, ecrs, athena, aegis, metis, mnemosyne]
3. Exclude: everything except athena = [vpcs, eips, ebss, security_groups, ecrs, aegis, metis, mnemosyne]
4. Execute: terragrunt apply --all --terragrunt-exclude-dir vpcs --terragrunt-exclude-dir eips ...
5. Generate: outputs for [athena]
6. Copy: outputs to dev/outputs/
- When: After any state-changing operation (apply, destroy, volume operations)
- What: Raw
terragrunt outputfor each processed module - Where: Module directory (
outputs.json) + centralized (env/outputs/module.json) - How: 🚀 Parallel processing for optimal performance across multiple modules
- Parallel Execution: Multiple modules generate outputs simultaneously
- Background Processes: Uses bash background jobs (
&) andwaitfor coordination - Scalable Performance: Speed improvement scales with the number of modules
- Resource Efficient: Takes advantage of multi-core systems for I/O operations
# Generate outputs for infrastructure modules (parallel processing)
./infra.sh output dev:infrastructure
# Generate refreshed outputs for infrastructure modules (recommended)
./infra.sh output dev:infrastructure --refresh
# Generate outputs for single module
./infra.sh output dev:athena
# Generate refreshed outputs for single module (ensures current state)
./infra.sh output dev:athena --refresh
# Generate outputs for all modules (parallel processing)
./infra.sh output dev:all
# Generate refreshed outputs for all modules (ensures all state is current)
./infra.sh output dev:all --refreshdev/
├── athena/outputs.json # Raw terragrunt output
├── vpcs/outputs.json # Raw terragrunt output
└── outputs/
├── athena.json # Copy of athena/outputs.json
└── vpcs.json # Copy of vpcs/outputs.json
# Clean all modules
./infra clean dev:all
# Clean infrastructure modules only
./infra clean dev:infrastructure
# Clean instance modules only
./infra clean dev:instances
# Clean single module
./infra clean dev:athena
# Dry-run to see what would be cleaned
./infra clean dev:all --dry-run
# Verbose output for debugging
./infra clean dev:infrastructure --verbose 1# Also remove per-instance volumes.yml files during clean
./infra clean dev:instances --volumes
# Combine with dry-run for preview
./infra clean dev:all --volumes --dry-run.terragrunt-cachedirectories: Terragrunt's local cache directories- Recursive search: Finds cache directories at any depth within modules
- Safe removal: Only removes
.terragrunt-cachedirectories, nothing else - Progress reporting: Shows count of cleaned directories per module
volumes.ymlfiles (optional): When--volumesis provided, removes per-instancevolumes.ymlto reset desired volume attachments
- After major changes: When module dependencies change significantly
- Troubleshooting: When experiencing unexpected terragrunt behavior
- Disk space: When
.terragrunt-cachedirectories become large - Fresh start: Before important operations to ensure clean state
# Attach volume (updates volumes.yml and applies)
./infra.sh volume dev:athena my-volume --attach --auto
# Detach volume (updates volumes.yml and applies)
./infra.sh volume dev:athena my-volume --detach --auto
# Dry-run volume operation with comprehensive preview
./infra.sh volume dev:athena my-volume --attach --dry-run --verbose 1Volume operations now include comprehensive dry-run capabilities:
# Preview volume attachment with all operations
./infra.sh volume dev:athena test-volume --attach --dry-run --verbose 1
# Shows:
# 🟡 [DRY-RUN] Would update volumes.yml: /path/to/volumes.yml
# 🟡 [DRY-RUN] Would add volume 'test-volume' with device '/dev/sdg'
# 🟡 [DRY-RUN] Would execute: terragrunt apply --auto-approve --non-interactive
# 🟡 [DRY-RUN] Would generate outputs for instance: athena
# Preview volume detachment operations
./infra.sh volume dev:athena test-volume --detach --dry-run
# Shows:
# 🟡 [DRY-RUN] Would remove volume 'test-volume'
# 🟡 [DRY-RUN] Would perform AWS CLI volume detachment for safety
# Test with multiple flags (all work with dry-run)
./infra.sh volume dev:athena test-volume --attach --dry-run --backup --bell --dns --force
# Additional dry-run messages for:
# 🟡 [DRY-RUN] Would execute: infra apply global:dns
# 🟡 [DRY-RUN] Would clean SSH known_hosts entries for applied instancesVolumes are managed via volumes.yml in each instance directory:
my-volume:
device_name: /dev/sdf
another-volume:
device_name: /dev/sdg- Volume Names:
my-volume,data-volume(preferred method) - Volume IDs:
vol-1234567890abcdef0(automatically resolved) - Device Assignment: Automatic assignment to
/dev/sdfthrough/dev/sdp - AWS Integration: Uses centralized functions from
aws.shmodule
- Default: Minimal output with progress indicators
- State Changes: Automatic detailed logging to
env/log/ - Verbose Mode: Debug output with
--verbose 1
dev/log/
├── infra-20240526-140000.log # Main operation log
└── terragrunt-20240526-140000.log # Terragrunt output log
The infrastructure system includes a comprehensive test mode designed for automated testing, CI/CD pipelines, and development work.
# Enable test mode for any operation
./infra <action> <target> --test-mode [other-flags]
# Examples of test mode usage
./infra apply test:infrastructure --test-mode
./infra destroy test:athena --test-mode --dry-run
./infra volume test:athena my-volume --attach --test-mode- Error Handling: Errors return exit codes instead of calling
exit() - Environment Restriction: Operations limited to
testenvironment only - Error Testing: Enables verification of error conditions in automated tests
- Process Continuation: Test processes continue running after errors
- Environment Enforcement: All operations restricted to
testenvironment - Validation Preserved: All normal validations remain active
- No Bypass: Test mode doesn't disable security checks
- Unit Testing: Comprehensive testing of argument parsing and validation
- Integration Testing: End-to-end operation testing
- CI/CD Pipelines: Automated testing in continuous integration
- Development: Testing error handling and edge cases
# Test environment validation (should fail)
./infra apply invalid:infrastructure --test-mode
# Returns exit code 1 instead of exiting
# Test argument validation (should fail)
./infra volume test:athena --test-mode
# Returns exit code 1 instead of exiting
# Test successful operations (should succeed)
./infra apply test:infrastructure --test-mode --dry-run
# Returns exit code 0# Test infrastructure changes
./infra.sh apply dev:infrastructure --dry-run
# Test single module
./infra.sh apply dev:athena --dry-run
# Test volume operations
./infra.sh volume dev:athena my-volume --attach --dry-run# Validate structure.yml
./infra.sh validate dev
# Check module status
./infra.sh status dev:all
# Verify outputs
ls -la dev/outputs/-
Module Not Found
# Check structure.yml cat dev/structure.yml # Verify module directory exists ls -la dev/module-name/
-
Dependency Errors
# Apply infrastructure first ./infra.sh apply dev:infrastructure --auto # Then apply instances ./infra.sh apply dev:instances --auto
-
Output Generation Issues
# Generate outputs manually ./infra.sh output dev:module-name # Generate outputs with state refresh (recommended for issues) ./infra.sh output dev:module-name --refresh # Check for empty outputs cat dev/module-name/outputs.json
# Enable verbose output
./infra.sh apply dev:athena --verbose 1
# Check logs
tail -f dev/log/infra-*.log
tail -f dev/log/terragrunt-*.log- Unified structure.yml replaces separate
instances.ymlandinfrastructure.yml - All operations use --all with exclusions (no separate bulk/single code paths)
- Automatic output generation for state-changing operations
- Simplified error handling and logging
- Consistent environment context for all operations
- Combine YAML files: Merge
instances.ymlandinfrastructure.ymlintostructure.yml - Update commands: Remove explicit
--outputsflags (now automatic) - Check logs: New log location is
env/log/instead ofsrc/infra/log/
- CHANGELOG.md - Version history and changes
- OUTPUT_SYSTEM.md - Detailed output generation documentation
- STRUCTURE.md - Project structure and module organization
Last updated: December 1, 2024 21:15 CST
For support or questions, refer to the troubleshooting section above or check the detailed documentation files.
- [2024-06-10] Fixed: protected: true protection for modules is now robust and reliable for both infrastructure and instance modules, thanks to improved yq query logic in modules.sh.