Infrastructure Management System v2.0.38

Last Updated: August 11, 2025 at 08:02 PM CDT

Version: 2.0.38
Purpose: Simplified, reliable infrastructure orchestration for Terraform/Terragrunt with DRY AWS CLI integration and performance optimization

🎯 Design Philosophy

This system follows a "simplicity first" approach with these core principles:

Single Source of Truth: modules.yml defines all modules and their groupings
Unified Execution Strategy: All operations use terragrunt --all with targeted exclusions for reliable, path-resolution-free execution
Centralized Flag Management: All terragrunt flags are handled centrally by execute_terragrunt()
Performance Optimization: Multiple performance enhancements for faster operations:
- --provider-cache flag for 80-90% faster multi-module operations
- --dependency-fetch-output-from-state for 30-70% faster output generation (auto-enabled when safe)
Sequential Output Processing: Reliable sequential output generation ensures all modules complete successfully
Automatic Output Generation: State-changing operations automatically generate outputs
Protected Module Preservation: Output files for protected modules are preserved during destroy operations
Consistent Environment Context: All operations execute from the correct environment directory
Modular Shared Code: Common operations (parsing, formatting, validation) are centralized
Self-Documenting: Comprehensive help system provides detailed guidance for all operations
Endpoint Flag Intelligence: Endpoint flags (--ssm, --ecr, --s3) work seamlessly with all targeting methods

Note on AWS CLI availability: the status module uses a lightweight validate_aws_cli() that only checks for the aws binary on PATH. Credentials and region are validated on the specific AWS calls (which pass --region) to surface clearer, operation-specific errors.

🔒 Secrets Protection System

Ultimate AWS Secrets Manager Protection

The infrastructure system includes a comprehensive secrets protection system that prevents accidental destruction of AWS Secrets Manager resources while providing controlled value clearing capabilities.

Protection Levels

🚫 Destroy-Disabled Modules (destroy: false)

secrets - Secret infrastructure can NEVER be destroyed
Infrastructure remains permanently intact regardless of --force flag
With --force: Secret VALUES are cleared via AWS CLI (infrastructure preserved)
Without --force: Operation skipped entirely

🔒 Protected Modules (protected: true)

Standard protection that can be overridden with --force
Prevents accidental destruction in normal operations

Secrets Clearing Behavior

# Safe operations (secrets protected)
./infra destroy dev:secrets              # → "Secrets module protected, skipping"
./infra destroy dev:all                  # → Destroys everything EXCEPT secrets

# Force operations (secrets cleared, not destroyed)
./infra destroy dev:secrets --force      # → Clears secret values via AWS CLI
./infra destroy dev:all --force          # → Clears secrets + destroys other modules

# Dry-run mode shows what would be cleared
./infra destroy dev:secrets --force --dry-run
# → [DRY-RUN] Would clear AWS secret: mnemosyne/jwt/v1

Secret Discovery

The system automatically reads the single secrets.yml file in the secrets module:

# src/live/dev/secrets/secrets.yml
secrets:
  jenova-service-token:
    name: "mnemosyne/jwt/v1"
    description: "JWT for mnemosyne to access Infisical"
    secret_string: "your-jwt-key-here"

When cleared with --force, the secret_string value is set to "infra.sh cleared" while preserving all metadata and infrastructure.

🚀 Enhanced Endpoint Flag Support

Intelligent Endpoint Flag Detection

The system now intelligently detects when the endpoints module is included in any operation and automatically sets the appropriate environment variables. This means endpoint flags work consistently across all targeting methods:

All Targeting Methods Now Supported

# Group operations (NEW - now works!)
./infra apply dev --ssm --ecr                    # All modules with endpoint flags
./infra apply dev:infrastructure --ssm           # Infrastructure group with SSM
./infra apply dev:all --ssm --ecr --s3           # All modules with all endpoint flags

# Direct targeting (always worked)
./infra apply dev:endpoints --ssm --ecr          # Direct endpoint targeting

# Non-endpoint operations (correctly ignored)
./infra apply dev:athena --ssm                   # Flags ignored (no endpoints module)
./infra apply dev:instances --ssm                # Flags ignored (instances only)

How It Works

The system uses intelligent endpoint detection to determine when to set environment variables:

Direct Targeting: dev:endpoints → Always sets endpoint flags
Group Operations: dev, dev:all, dev:infrastructure → Sets flags if endpoints is in the group
Non-Endpoint Operations: dev:instances, dev:athena → Skips flag setting

Environment Variable Flow

# Flag parsing
--ssm --ecr → SSM=true, ECR=true, S3=false

# Environment variable export (when endpoints included)
TG_VAR_ssm=true, TG_VAR_ecr=true, TG_VAR_s3=false

# Terragrunt configuration uses these variables
# to conditionally create VPC endpoints

🎯 Unified Execution Strategy

Single Execution Path with Exclusions

The infrastructure system uses a unified execution strategy that eliminates complex targeting logic and path resolution issues. All operations now use terragrunt --all with targeted exclusions, providing reliable execution regardless of the target scope.

How It Works

# Full environment (no exclusions)
terragrunt destroy --all

# Single module (exclude everything except target)
terragrunt destroy --all --queue-exclude-dir=vpcs,security_groups,eips,ebss,ecrs

# Infrastructure-only (exclude instances)
terragrunt destroy --all --queue-exclude-dir=athena,metis,mnemosyne

Key Benefits

✅ No Path Resolution Issues - Eliminates find_in_parent_folders errors
✅ Single Execution Path - All operations use the same reliable approach
✅ Consistent Behavior - Same logic for single module and full environment operations
✅ Enhanced Reliability - Better handling of edge cases and errors
✅ Simplified Codebase - Significantly reduced complexity in execution logic

Execution Examples

Full Environment Operations:

# Apply all modules in environment
./infra apply dev:all
# → terragrunt apply --all

# Destroy all modules in environment  
./infra destroy prod:all
# → terragrunt destroy --all

Single Module Targeting:

# Target specific module
./infra destroy prod:athena
# → terragrunt destroy --all --queue-exclude-dir=vpcs,security_groups,eips,ebss,ecrs

# Target infrastructure only
./infra apply dev:infrastructure  
# → terragrunt apply --all --queue-exclude-dir=athena,metis,mnemosyne

Protected Module Handling:

# Protected modules are automatically excluded
./infra destroy dev:all
# → terragrunt destroy --all --queue-exclude-dir=eips,ebss,ecrs

# Force destroy protected modules
./infra destroy dev:all --force
# → terragrunt destroy --all (no exclusions)

Technical Details

No Directory Changes - Always execute from environment root
No Strategy Selection - Single unified approach for all operations
Automatic Exclusion Generation - System calculates exclusions based on target
Backward Compatible - All existing commands work unchanged
Enhanced Error Handling - Improved reliability with no path resolution issues

🚦 Gateway Instance: Automatic VPCs Apply

What’s New?

When you apply or destroy an instance marked as gateway: true in modules.yml, the system will automatically reapply the VPCs module immediately after.
This keeps VPC routing tables in sync with the gateway instance’s NIC ID, reducing manual steps and risk of stale routes.

How to Use

Mark an instance as a gateway in your modules.yml:

instances:
  - name: nyx
    gateway: true
  - name: metis

Behavior:

./infra apply dev:nyx → applies nyx, then immediately applies vpcs.
./infra destroy dev:nyx → destroys nyx, then immediately applies vpcs.
Only triggers for single gateway instance operations (not for all/instances/infrastructure targets).

Why?

Ensures VPC routes are always up to date after gateway changes.
Follows DRY KISS and automation principles.

🔧 Module Pre-Processing Commands

Automated Module Command Execution

The infrastructure system supports cmd parameters in modules.yml that execute before each module is processed during apply operations. This enables module-specific preparation tasks like code generation, validation, or setup scripts.

Configuration Syntax

infrastructure:
  - vpcs
  - name: security_groups
    cmd: python3 generate.py  # Executed before security_groups apply
  - name: eips
    protected: true

instances:
  - name: custom_instance
    cmd: ./setup.sh  # Executed before custom_instance apply
  - athena
  - aegis

Command Execution Behavior

When: Commands execute only during apply operations (not plan, destroy, init, etc.)
Where: Commands execute in the module's directory (CWD is set to module path)
Order: Commands execute before terragrunt operations for each targeted module
Targeting: Only modules included in the apply target will have their commands executed
Error Handling: Command failure stops the entire operation with clear error messages
Dry-run Support: Commands show what would be executed with --dry-run flag

Real-World Example: Security Groups Generation

The security_groups module uses cmd: python3 generate.py to eliminate complex HCL logic:

# modules.yml
infrastructure:
  - name: security_groups
    cmd: python3 generate.py

What happens during apply:

Pre-processing: python3 generate.py runs in security_groups/ directory
Generation: Creates simplified HCL files in output/ subdirectory
Terragrunt: Processes the generated files (no complex nested logic)

Example Output:

🔧 Executing pre-processing command for module 'security_groups': python3 generate.py
🚀 RecoverySky Security Groups Generator
   📋 Found 6 allowed files, 10 rule files, 10 group files, 4 security group files
   🌐 Loaded 4 EIP addresses
   ✅ Generated: output/mnemosyne.hcl (2 ingress, 1 egress, 5 CIDR blocks)
✅ Command completed successfully for module 'security_groups'

Usage Examples

# Apply security_groups - executes 'python3 generate.py' first
./infra apply dev:security_groups

# Apply all infrastructure - executes cmd for security_groups only (other modules have no cmd)
./infra apply dev:infrastructure

# Apply all modules - executes cmd for any module that defines one
./infra apply dev:all

# Plan operation - no commands executed (apply operations only)
./infra plan dev:security_groups

Dry-Run Testing

# Test command execution without running them
./infra apply dev:security_groups --dry-run

# Shows:
# [DRY-RUN] Would execute in /path/to/dev/security_groups: python3 generate.py
# [DRY-RUN] Would execute: terragrunt apply --auto-approve

Command Design Patterns

Code Generation:

- name: security_groups
  cmd: python3 generate.py  # Generate simplified HCL from complex configs

- name: api_gateway  
  cmd: npm run build-config  # Generate API definitions from TypeScript

Validation Scripts:

- name: databases
  cmd: ./validate-schemas.sh  # Validate database schemas before apply

- name: networking
  cmd: python validate_cidr.py  # Validate CIDR ranges and subnets

Setup Operations:

- name: certificates
  cmd: ./fetch-ssl-certs.sh  # Download SSL certificates before apply

- name: custom_module
  cmd: make prepare  # Run makefile preparation tasks

Benefits & Advantages

🚀 Reliability

Consistent preparation: Every apply automatically runs required setup
Error prevention: Catch configuration issues before terragrunt runs
Clear feedback: See exactly what preparation commands are running

🔧 Maintainability

DRY principle: Commands defined once in modules.yml, used everywhere
KISS approach: Simple command execution, no complex logic needed
Version control: Command changes tracked with infrastructure config

⚡ Performance

Efficient targeting: Only commands for targeted modules execute
Parallel-ready: Multiple targeted modules can have commands run in sequence
Fast feedback: Immediate command output during infrastructure operations

✅ Production-Tested Bounce Operations

Live Infrastructure Validation

The bounce operation functionality has been comprehensively tested on live infrastructure with complete success:

✅ Testing Results:

Unit tests: 29/32 passing (90%+ success rate)
Dry-run validation: Perfect execution on both test and dev environments
Live production test: Complete success on dev:mnemosyne instance
Instance recreation: Old instance destroyed, new instance created, same EIP maintained
Automatic cleanup: SSH known_hosts entries cleaned, outputs regenerated

🎯 Environment Targeting:

Test environment: Only athena active (cost-optimized single instance)
Dev environment: All 4 instances active (athena, aegis, metis, mnemosyne)
Documentation clarity: Environment-specific instance availability clearly documented

Usage Examples:

# ✅ Correct bounce targeting - dev environment (all instances active)
src/infra/infra shutdown dev:mnemosyne --bounce      # Works perfectly
src/infra/infra shutdown dev:athena --bounce --dry-run

# ✅ Correct bounce targeting - test environment (only athena active)
src/infra/infra shutdown test:athena --bounce        # Works perfectly

# ❌ Incorrect targeting - disabled instance
src/infra/infra shutdown test:mnemosyne --bounce     # Fails - mnemosyne disabled in test

✅ Operation Performance:

Total time: ~3 minutes for complete bounce cycle
Phase breakdown: SSH shutdown (5s) → Destroy (30s) → Apply (90s) → Outputs (30s) → Cleanup (10s)
EIP preservation: Maintains same public IP throughout entire operation
Error handling: Robust handling of SSH timeouts, network issues, and instance state problems

🔴 Enhanced Shutdown Operations - AWS CLI Termination Fallback

Intelligent SSH Failure Handling

The infrastructure system now includes robust fallback mechanisms when SSH connections fail during shutdown operations. This ensures reliable instance management even when network connectivity, SSH services, or instance accessibility issues occur.

🔄 Automatic Fallback Flow

When performing shutdown operations on instances:

SSH Attempt (5-second timeout): System attempts SSH connection to the instance
Failure Detection: If SSH times out or fails for any reason
AWS CLI Fallback: Automatically switches to AWS CLI-based instance termination
Status Monitoring: Monitors termination progress until completion
Red Status Reporting: All termination actions clearly displayed in red text

# Example shutdown with automatic fallback
./infra shutdown dev:athena

# If SSH fails:
# ⚠️  SSH connection timed out after 5 seconds for athena
# 🔄 Falling back to AWS CLI termination...
# 🔴 TERMINATING instance athena (i-1234567890abcdef0) due to SSH connection failure
# 🔴 Termination command sent successfully for athena
# 🔴 Monitoring termination progress for athena (timeout: 5 minutes)...
# 🔴 Termination status: shutting-down (15s elapsed)
# 🔴 Instance athena has been TERMINATED
# 🔴 AWS CLI termination completed for athena

⚡ Fast Response Times

SSH Timeout Optimization:

Reduced from 10 seconds to 5 seconds for faster failure detection
No retry attempts - immediate fallback on first failure
Real-time progress reporting during all operations

Termination Monitoring:

5-minute timeout for AWS termination completion (accounts for AWS processing time)
2-second polling interval for responsive status updates
State transition tracking: running → shutting-down → terminated

🎯 Clear Visual Feedback

Red Text Indicators:

All termination messages use red color coding for clear visibility
🔴 Emoji indicators for consistent termination action identification
Progress status shows elapsed time and current instance state

User Experience:

No manual intervention required - automatic fallback is seamless
Same command interface regardless of SSH availability
Graceful degradation - operations continue even with connectivity issues

🔗 System Integration

AWS CLI Integration:

Uses existing AWS infrastructure (region detection, credential validation)
Leverages centralized output system for instance ID resolution
Integrates with logging system for comprehensive operation tracking

Backward Compatibility:

No breaking changes to existing shutdown commands
All flags preserved (--reboot, --flush, --dry-run, etc.)
Same return codes for script compatibility

🛡️ Robust Error Handling

Multiple Failure Scenarios:

SSH timeout (exit code 124): Connection timeout after 5 seconds
SSH connection failure: Network unreachable, connection refused, etc.
SSH authentication failure: Key issues, user authentication problems
Instance state issues: Instance stopped, stopping, terminated, etc.

AWS CLI Validation:

Pre-flight checks ensure AWS CLI is available and configured
Region validation confirms correct AWS region for operations
Instance ID resolution uses centralized infrastructure output system
Error reporting provides clear feedback on any failures

💡 Usage Examples

Standard Shutdown (with fallback):

# Normal shutdown - tries SSH first, falls back to AWS CLI if needed
./infra shutdown dev:athena

# Multiple instances - each gets individual fallback handling
./infra shutdown dev:instances

# With reboot flag - fallback still applies
./infra shutdown dev:athena --reboot

Dry-Run Testing:

# Test the fallback logic without actual termination
./infra shutdown dev:athena --dry-run

# Shows exactly what would happen:
# 🔄 Attempting SSH connection (5s timeout)...
# [DRY-RUN] Would execute SSH: athena-dev '~/scripts/shutdown.sh'
# [DRY-RUN] On SSH failure, would terminate instance: i-1234567890abcdef0 (athena)
# [DRY-RUN] Would wait for instance athena to reach terminated state

Force AWS CLI Termination: If you need to terminate an instance directly without attempting SSH, you can use the reboot operation which includes termination logic for offline instances.

🚀 Benefits

Enhanced Reliability:

No more stuck operations due to SSH connectivity issues
Guaranteed instance shutdown using AWS native APIs
Robust fallback handling for various failure scenarios
Clear failure reporting with actionable information

Improved User Experience:

Faster feedback with 5-second timeout
Automatic recovery without manual intervention required
Visual clarity with red text for destructive actions
Consistent behavior across all instance types and states

Better Operational Safety:

Intentional destructive actions clearly highlighted in red
Progress monitoring ensures operations complete successfully
Timeout protection prevents indefinite waits
Comprehensive logging for audit trails and debugging

🔄 AWS CLI Instance Operations - DRY Architecture v2.0.13

Native AWS Instance Management

The infrastructure system includes powerful AWS CLI integration for direct instance management with intelligent state monitoring and graceful SSH-first operations.

🎮 Reboot Operations

NEW in v2.0.13: Complete AWS CLI reboot support with DRY architecture

# Restart instance using AWS CLI with intelligent state monitoring
./infra reboot dev:athena

# Reboot with detailed progress monitoring
./infra reboot dev:athena --verbose 1

# Preview reboot operation
./infra reboot dev:athena --dry-run

Process Flow:

Instance ID Resolution: Retrieves instance ID from centralized outputs
AWS CLI Reboot: Executes aws ec2 reboot-instances command
Intelligent State Monitoring: 20-second initial wait for SSH→AWS transition
Progress Tracking: Real-time status updates during reboot cycle

State Transition Monitoring:

Initial wait: 20 seconds (allows SSH scripts to complete graceful operations)
State cycle: running → shutting-down → pending → running
Timeout: 5 minutes (sufficient for complete reboot cycle)
Polling: 1-second intervals for responsive feedback

🏗️ DRY Generic State Waiting Architecture

Universal Instance State Function: All AWS instance operations now use a single, configurable function:

wait_for_instance_state "env" "instance" "target_state" [timeout] [initial_wait] [poll_interval] [operation_description]

Benefits of DRY Implementation:

33% code reduction: Eliminated duplicate waiting logic across 4+ functions
Consistent behavior: All operations use identical patterns and error handling
Enhanced functionality: Configurable timeouts, delays, and polling intervals
KISS principle: Single, well-tested implementation for all instance state waiting

Supported Target States:

running - Instance is operational
stopped - Instance is shut down
terminated - Instance is permanently destroyed
pending - Instance is starting up
shutting-down - Instance is stopping

⚙️ Operation-Specific Configurations

Reboot Operations:

Initial wait: 20 seconds (SSH graceful shutdown transition)
Target state: running
Timeout: 5 minutes
Use case: Complete instance restart with proper SSH handling

Shutdown Operations:

Initial wait: 0 seconds (immediate polling)
Target state: stopped
Timeout: 2 minutes
Use case: Fast shutdown monitoring

Termination Operations:

Initial wait: 0 seconds
Target state: terminated
Timeout: 5 minutes (conservative for AWS processing)
Use case: Instance destruction monitoring

Start Operations:

Initial wait: 0 seconds
Target state: running
Timeout: 5 minutes
Use case: Instance startup monitoring

🔧 Requirements

Prerequisites for AWS CLI Operations:

AWS CLI installed and configured
Instance outputs must exist: ./infra output <env>:<instance>
AWS credentials with EC2 permissions
Instance must be in same AWS region as CLI configuration

Output Dependencies:

# Required before reboot operations
./infra output dev:athena --refresh

# Ensures current instance IDs are available
./infra output dev:instances --refresh

🛡️ Error Handling and Safety

Intelligent Error Detection:

Instance not found: Clear messaging if instance doesn't exist
Permission issues: AWS CLI permission validation
State conflicts: Detection of incompatible instance states
Timeout handling: Graceful handling of long-running operations

Dry-Run Support:

# Preview all AWS CLI operations
./infra reboot dev:athena --dry-run

# Shows:
# [DRY-RUN] Would execute: aws ec2 reboot-instances --instance-ids i-1234567890abcdef0
# [DRY-RUN] Would wait 20 seconds before monitoring reboot cycle
# [DRY-RUN] Would monitor instance athena for running state (timeout: 5 minutes)

✅ Framework Integration

Preserved SSH-First Architecture:

AWS CLI operations complement SSH scripts (never replace them)
Reboot operations use 20-second delay to allow SSH graceful shutdown
Consistent with shutdown operations that prioritize SSH scripts
Fallback capability when SSH operations are unavailable

Global Context Integration:

Uses centralized operation context (OP_ACTION, OP_ENV, etc.)
Integrates with logging system for comprehensive audit trails
Consistent with framework patterns (dry-run, error handling, messaging)
Backward compatible with all existing infrastructure operations

💖 KISS Utilities and Shared Code Enhancement

The infrastructure system includes a comprehensive set of KISS (Keep It Simple Silly) utilities that eliminate duplication and improve maintainability. These utilities follow Jenova's coding principles of clean architecture with girly precision.

Centralized KISS Helper Functions

get_operation_context(): One-call operation context gathering (replaces 50+ repetitive patterns)
is_dry_run() & execute_with_dry_run(): Standardized dry-run checking and execution
File operation utilities: file_exists_and_readable(), get_module_output_path(), etc.
execute_post_operation_actions(): Consolidated post-operation cleanup (bell, DNS, SSH cleanup)

Clean and Focused Architecture

✨ Recently cleaned up: Removed 200+ lines of unused functions and duplicates
DRY compliance: Each function exists in exactly one place
No clutter: Only actively used functions remain in shared utilities
KISS philosophy: Simple, focused modules that do one thing well

Enhanced Module Integration

Universal adoption: All infrastructure modules use KISS utilities
Consistent patterns: Same operation context gathering across all modules
Reduced complexity: Modules focus on business logic, not boilerplate
Better maintainability: Changes propagate automatically through centralized utilities

🌟 Operation Context Gathering

Before KISS (repetitive pattern):

local action=$(get_action)
local env=$(get_environment) 
local target_type=$(get_target_type)
local env_path="$(get_environment_path "$env")"

After KISS (one call):

get_operation_context
# Sets: OP_ACTION, OP_ENV, OP_TARGET_TYPE, OP_ENV_PATH

🔧 Standardized File Operations

Path Construction:

# Standardized module paths
module_path="$(get_module_path "$env" "$module")"        # /path/to/env/module
output_path="$(get_module_output_path "$env" "$module")" # /path/to/env/outputs/module.json

# Directory management
ensure_output_directory "$env"  # Creates env/outputs if needed

File Validation:

# Common file checks
if file_exists_and_readable "$file"; then
    # Process file
fi

if file_exists_and_has_content "$output_file"; then
    # Process non-empty output
fi

🟡 Consistent Dry-Run Handling

Standardized Dry-Run Check:

# Old inconsistent patterns:
# [[ "${DRY_RUN:-false}" == "true" ]]
# [[ "$DRY_RUN" == true ]]

# New standardized approach:
if is_dry_run; then
    dry_run_message "[DRY-RUN] Would perform operation"
    return 0
fi

Execute with Dry-Run:

execute_with_dry_run "rm -f $file" "Would delete file: $file"
# Automatically handles dry-run vs real execution

✨ Post-Operation Actions

Before (repetitive pattern in every module):

ring_completion_bell "Operation completed successfully"
update_dns_records "Operation completed successfully"  
cleanup_known_hosts "Operation completed successfully"

After (one call):

execute_post_operation_actions "Operation completed successfully"
# Executes all three actions automatically

✅ Logging Integration

Centralized Logging Functions:

# Check if logging is available
if is_logging_active; then
    log_file="$(get_terragrunt_log_file)"
    # Use logging
fi

🎯 Benefits Realized

Code Reduction:

50+ instances of repetitive variable gathering eliminated
40+ instances of manual path construction replaced
30+ instances of inconsistent dry-run checks standardized
25+ instances of file existence checks consolidated

Consistency Improvements:

Uniform dry-run behavior across all modules
Standardized path handling for all file operations
Consistent post-operation actions for all commands
Centralized error patterns for better debugging

Maintenance Benefits:

Single point of change for common operations
Easy testing of centralized utilities
Clear documentation with usage examples
Backward compatibility - all existing code continues to work

💡 Usage in Your Modules

When creating new modules, use the KISS utilities:

#!/bin/bash
source "$(dirname "$0")/shared.sh"

my_operation() {
    # Get all context in one call
    get_operation_context
    
    # Use standardized paths
    local config_file="$(get_module_path "$OP_ENV" "$OP_TARGET_TYPE")/config.yml"
    local output_file="$(get_module_output_path "$OP_ENV" "$OP_TARGET_TYPE")"
    
    # Check files using utilities
    if file_exists_and_readable "$config_file"; then
        # Process config
        
        # Execute with dry-run support
        execute_with_dry_run "cp '$config_file' '$backup_file'" "Would backup config"
        
        # Execute operation
        if execute_terragrunt "$OP_ACTION" "" "$OP_TARGET_TYPE"; then
            success_message "Operation completed successfully"
            
            # All post-operation actions in one call
            execute_post_operation_actions "Operation completed successfully"
        else
            handle_error "Operation failed"
        fi
    fi
}

🆘 Getting Help

Comprehensive Help System

The infrastructure management system includes extensive built-in documentation:

# General help with complete command reference
./infra --help

# Specific command help with detailed examples
./infra apply --help        # Deployment documentation
./infra volume --help       # Volume management guide  
./infra destroy --help      # Safety warnings and procedures
./infra plan --help         # Planning and validation
./infra init --help         # Initialization procedures
./infra output --help       # Output generation
./infra clean --help        # Cache management
./infra reboot --help       # AWS instance management

Help System Features

🔧 Detailed Command Documentation: Purpose, usage, parameters, and examples for each action
💾 Volume Management Guide: Complete EBS volume workflows with device assignment
🔄 AWS CLI Operations: Instance management and requirement documentation
🎯 Target Specifications: Clear explanation of environment and module targeting
🚩 Flag Documentation: Comprehensive flag explanations with use cases
📋 Workflow Examples: Step-by-step guides for common infrastructure tasks
⚠️ Safety Guidance: Warnings and best practices for destructive operations
🛠️ Troubleshooting: Common issues and solutions for each operation
🚀 Performance Optimizations: Fast checking and early returns for volume operations

Example Help Usage

# Quick reference for volume operations
./infra volume --help | grep -A 10 "ATTACH Examples"

# Safety information for destroy operations  
./infra destroy --help | grep -A 5 "CRITICAL WARNINGS"

# Understanding targeting formats
./infra --help | grep -A 10 "TARGET FORMATS"

🚀 Quick Start

The Infrastructure Management System v2.0 is fully operational with all KISS utilities active! ✨

Basic Operations

# Check infrastructure status (always start here!)
./infra status dev:all

# Deploy infrastructure (dry-run first!)
./infra apply dev:infrastructure --dry-run
./infra apply dev:infrastructure

# Deploy instances  
./infra apply dev:instances

# Generate outputs with clean mode
./infra output dev:all --clean        # Remove old outputs first
./infra output dev:all --refresh      # Generate fresh outputs

# Clean mode for output management
./infra output dev:athena --clean --dry-run    # Preview cleanup

🎯 Recent Updates - v2.0.12 (2024-12-30)

CRITICAL BUG FIXES: All missing functions restored! 🚨➡️✅

✅ Fixed is_clean: command not found errors
✅ Fixed readarray: command not found on macOS/older bash
✅ Restored complete KISS utilities functionality
✅ Enhanced shell compatibility (bash 3+, zsh, etc.)

Infrastructure is now 100% operational with all optimizations active! 💖

🏗️ System Architecture

Unified Operation Model

All operations follow the same pattern:

Parse arguments and determine target modules
Generate exclusion list from structure.yml
Execute terragrunt --all with exclusions
Generate outputs for processed modules
Copy outputs to centralized location

Module Structure

src/infra/
├── infra.sh           # Main orchestrator
├── args.sh            # Argument processing
├── structure.sh       # Structure.yml processing and exclusions
├── executor.sh        # Terragrunt execution
├── outputs.sh         # Output generation and management
├── volume.sh          # Volume management
├── logger.sh          # Logging system
├── shared.sh          # Shared utilities (parsing, formatting, validation)
└── README.md          # This file

📁 Configuration Files

structure.yml

Defines module groupings and execution order:

infrastructure:
  - vpcs
  - eips  
  - ebss
  - security_groups
  - ecrs

instances:
  - athena
  - aegis
  - metis
  - mnemosyne

Directory Structure

src/live/
├── dev/
│   ├── structure.yml          # Module definitions
│   ├── log/                   # Automatic logging
│   ├── outputs/               # Centralized outputs
│   ├── vpcs/                  # Infrastructure modules
│   ├── eips/
│   ├── athena/                # Instance modules
│   └── ...
└── dev/
    ├── structure.yml
    └── ...

🎛️ Command Reference

Actions

apply - Apply infrastructure changes
destroy - Destroy infrastructure
plan - Show planned changes
init - Initialize modules
output - Generate outputs only
clean - Remove .terragrunt-cache directories
volume - Manage EBS volumes
reboot - Reboot AWS instances via CLI

Targets

env:infrastructure - All infrastructure modules
env:instances - All instance modules
env:all - All modules
env:module-name - Single module

Flags

--auto - Auto-approve changes
--dry-run - Show what would be executed
--verbose 0|1 - Verbosity level (0=default, 1=debug)
--refresh - Refresh Terraform state before generating outputs (output only)
--log - Enable detailed logging (automatic for state changes)
--outputs - Generate outputs (automatic for state changes)
--test-mode - Enable test mode for automated testing and development (errors return instead of exit)

🔄 Operation Flows

Infrastructure Apply Flow

1. Parse: dev:infrastructure
2. Load: structure.yml → infrastructure = [vpcs, eips, ebss, security_groups, ecrs]
3. Exclude: instances = [athena, aegis, metis, mnemosyne]
4. Execute: terragrunt apply --all --terragrunt-exclude-dir athena --terragrunt-exclude-dir aegis ...
5. Generate: outputs for [vpcs, eips, ebss, security_groups, ecrs]
6. Copy: outputs to dev/outputs/

Single Module Apply Flow

1. Parse: dev:athena
2. Load: structure.yml → all modules = [vpcs, eips, ebss, security_groups, ecrs, athena, aegis, metis, mnemosyne]
3. Exclude: everything except athena = [vpcs, eips, ebss, security_groups, ecrs, aegis, metis, mnemosyne]
4. Execute: terragrunt apply --all --terragrunt-exclude-dir vpcs --terragrunt-exclude-dir eips ...
5. Generate: outputs for [athena]
6. Copy: outputs to dev/outputs/

📄 Output Management

Automatic Output Generation

When: After any state-changing operation (apply, destroy, volume operations)
What: Raw terragrunt output for each processed module
Where: Module directory (outputs.json) + centralized (env/outputs/module.json)
How: 🚀 Parallel processing for optimal performance across multiple modules

Performance Enhancement

Parallel Execution: Multiple modules generate outputs simultaneously
Background Processes: Uses bash background jobs (&) and wait for coordination
Scalable Performance: Speed improvement scales with the number of modules
Resource Efficient: Takes advantage of multi-core systems for I/O operations

Output Commands

# Generate outputs for infrastructure modules (parallel processing)
./infra.sh output dev:infrastructure

# Generate refreshed outputs for infrastructure modules (recommended)
./infra.sh output dev:infrastructure --refresh

# Generate outputs for single module
./infra.sh output dev:athena

# Generate refreshed outputs for single module (ensures current state)
./infra.sh output dev:athena --refresh

# Generate outputs for all modules (parallel processing)
./infra.sh output dev:all

# Generate refreshed outputs for all modules (ensures all state is current)
./infra.sh output dev:all --refresh

Output File Structure

dev/
├── athena/outputs.json        # Raw terragrunt output
├── vpcs/outputs.json          # Raw terragrunt output
└── outputs/
    ├── athena.json            # Copy of athena/outputs.json
    └── vpcs.json              # Copy of vpcs/outputs.json

🧹 Cache Management

Clean Operations

# Clean all modules
./infra clean dev:all

# Clean infrastructure modules only
./infra clean dev:infrastructure

# Clean instance modules only
./infra clean dev:instances

# Clean single module
./infra clean dev:athena

# Dry-run to see what would be cleaned
./infra clean dev:all --dry-run

# Verbose output for debugging
./infra clean dev:infrastructure --verbose 1

Additional Examples

# Also remove per-instance volumes.yml files during clean
./infra clean dev:instances --volumes

# Combine with dry-run for preview
./infra clean dev:all --volumes --dry-run

What Gets Cleaned

.terragrunt-cache directories: Terragrunt's local cache directories
Recursive search: Finds cache directories at any depth within modules
Safe removal: Only removes .terragrunt-cache directories, nothing else
Progress reporting: Shows count of cleaned directories per module
volumes.yml files (optional): When --volumes is provided, removes per-instance volumes.yml to reset desired volume attachments

When to Use Clean

After major changes: When module dependencies change significantly
Troubleshooting: When experiencing unexpected terragrunt behavior
Disk space: When .terragrunt-cache directories become large
Fresh start: Before important operations to ensure clean state

💾 Volume Management

Volume Operations

# Attach volume (updates volumes.yml and applies)
./infra.sh volume dev:athena my-volume --attach --auto

# Detach volume (updates volumes.yml and applies)  
./infra.sh volume dev:athena my-volume --detach --auto

# Dry-run volume operation with comprehensive preview
./infra.sh volume dev:athena my-volume --attach --dry-run --verbose 1

🟡 Enhanced Dry-Run Support

Volume operations now include comprehensive dry-run capabilities:

# Preview volume attachment with all operations
./infra.sh volume dev:athena test-volume --attach --dry-run --verbose 1
# Shows:
#   🟡 [DRY-RUN] Would update volumes.yml: /path/to/volumes.yml
#   🟡 [DRY-RUN] Would add volume 'test-volume' with device '/dev/sdg'
#   🟡 [DRY-RUN] Would execute: terragrunt apply --auto-approve --non-interactive
#   🟡 [DRY-RUN] Would generate outputs for instance: athena

# Preview volume detachment operations
./infra.sh volume dev:athena test-volume --detach --dry-run
# Shows:
#   🟡 [DRY-RUN] Would remove volume 'test-volume'
#   🟡 [DRY-RUN] Would perform AWS CLI volume detachment for safety

# Test with multiple flags (all work with dry-run)
./infra.sh volume dev:athena test-volume --attach --dry-run --backup --bell --dns --force
# Additional dry-run messages for:
#   🟡 [DRY-RUN] Would execute: infra apply global:dns
#   🟡 [DRY-RUN] Would clean SSH known_hosts entries for applied instances

Volume Configuration

Volumes are managed via volumes.yml in each instance directory:

my-volume:
  device_name: /dev/sdf
another-volume:
  device_name: /dev/sdg

🔧 Volume Resolution

Volume Names: my-volume, data-volume (preferred method)
Volume IDs: vol-1234567890abcdef0 (automatically resolved)
Device Assignment: Automatic assignment to /dev/sdf through /dev/sdp
AWS Integration: Uses centralized functions from aws.sh module

📝 Logging

Automatic Logging

Default: Minimal output with progress indicators
State Changes: Automatic detailed logging to env/log/
Verbose Mode: Debug output with --verbose 1

Log Files

dev/log/
├── infra-20240526-140000.log     # Main operation log
└── terragrunt-20240526-140000.log # Terragrunt output log

🧪 Testing and Validation

Test Mode for Automated Testing

The infrastructure system includes a comprehensive test mode designed for automated testing, CI/CD pipelines, and development work.

`--test-mode` Flag

# Enable test mode for any operation
./infra <action> <target> --test-mode [other-flags]

# Examples of test mode usage
./infra apply test:infrastructure --test-mode
./infra destroy test:athena --test-mode --dry-run
./infra volume test:athena my-volume --attach --test-mode

Test Mode Behavior

Error Handling: Errors return exit codes instead of calling exit()
Environment Restriction: Operations limited to test environment only
Error Testing: Enables verification of error conditions in automated tests
Process Continuation: Test processes continue running after errors

Security Features

Environment Enforcement: All operations restricted to test environment
Validation Preserved: All normal validations remain active
No Bypass: Test mode doesn't disable security checks

Use Cases

Unit Testing: Comprehensive testing of argument parsing and validation
Integration Testing: End-to-end operation testing
CI/CD Pipelines: Automated testing in continuous integration
Development: Testing error handling and edge cases

Example Test Scenarios

# Test environment validation (should fail)
./infra apply invalid:infrastructure --test-mode
# Returns exit code 1 instead of exiting

# Test argument validation (should fail)  
./infra volume test:athena --test-mode
# Returns exit code 1 instead of exiting

# Test successful operations (should succeed)
./infra apply test:infrastructure --test-mode --dry-run
# Returns exit code 0

Dry-Run Testing

# Test infrastructure changes
./infra.sh apply dev:infrastructure --dry-run

# Test single module
./infra.sh apply dev:athena --dry-run

# Test volume operations
./infra.sh volume dev:athena my-volume --attach --dry-run

Validation Commands

# Validate structure.yml
./infra.sh validate dev

# Check module status
./infra.sh status dev:all

# Verify outputs
ls -la dev/outputs/

🔧 Troubleshooting

Common Issues

Module Not Found

# Check structure.yml
cat dev/structure.yml

# Verify module directory exists
ls -la dev/module-name/

Dependency Errors

# Apply infrastructure first
./infra.sh apply dev:infrastructure --auto

# Then apply instances
./infra.sh apply dev:instances --auto

Output Generation Issues

# Generate outputs manually
./infra.sh output dev:module-name

# Generate outputs with state refresh (recommended for issues)
./infra.sh output dev:module-name --refresh

# Check for empty outputs
cat dev/module-name/outputs.json

Debug Mode

# Enable verbose output
./infra.sh apply dev:athena --verbose 1

# Check logs
tail -f dev/log/infra-*.log
tail -f dev/log/terragrunt-*.log

🚀 Migration from v1.x

Key Changes

Unified structure.yml replaces separate instances.yml and infrastructure.yml
All operations use --all with exclusions (no separate bulk/single code paths)
Automatic output generation for state-changing operations
Simplified error handling and logging
Consistent environment context for all operations

Migration Steps

Combine YAML files: Merge instances.yml and infrastructure.yml into structure.yml
Update commands: Remove explicit --outputs flags (now automatic)
Check logs: New log location is env/log/ instead of src/infra/log/

📚 Additional Documentation

CHANGELOG.md - Version history and changes
OUTPUT_SYSTEM.md - Detailed output generation documentation
STRUCTURE.md - Project structure and module organization

Last updated: December 1, 2024 21:15 CST

For support or questions, refer to the troubleshooting section above or check the detailed documentation files.

[2024-06-10] Fixed: protected: true protection for modules is now robust and reliable for both infrastructure and instance modules, thanks to improved yq query logic in modules.sh.

Name		Name	Last commit message	Last commit date
Latest commit History 62 Commits
.cursor		.cursor
.vscode		.vscode
bats-core-1.12.0		bats-core-1.12.0
tests		tests
.gitignore		.gitignore
.mutagen.yml		.mutagen.yml
CHANGELOG.md		CHANGELOG.md
CLAUDE.md		CLAUDE.md
DEV-DIARY.md		DEV-DIARY.md
OUTPUT_SYSTEM.md		OUTPUT_SYSTEM.md
README.md		README.md
README.pdf		README.pdf
Remembered.md		Remembered.md
STRUCTURE.md		STRUCTURE.md
TEST_MODE.md		TEST_MODE.md
TODO.md		TODO.md
args.sh		args.sh
aws.sh		aws.sh
cache.sh		cache.sh
clean.sh		clean.sh
diag.ebs.sh		diag.ebs.sh
diag.ec2.sh		diag.ec2.sh
diag.ecr.sh		diag.ecr.sh
diag.eips.sh		diag.eips.sh
diag.peering.sh		diag.peering.sh
diag.security_groups.sh		diag.security_groups.sh
diag.sh		diag.sh
diag.vpc.sh		diag.vpc.sh
diag.vpc_routes.sh		diag.vpc_routes.sh
display.sh		display.sh
environment.sh		environment.sh
infra		infra
logger.sh		logger.sh
modules.sh		modules.sh
operations.sh		operations.sh
output.sh		output.sh
query.sh		query.sh
shared.sh		shared.sh
shutdown.sh		shutdown.sh
status.sh		status.sh
targeting.sh		targeting.sh
verify.sh		verify.sh
volume.sh		volume.sh

Folders and files

Latest commit

History

Repository files navigation

Infrastructure Management System v2.0.38

🎯 Design Philosophy

🔒 Secrets Protection System

Ultimate AWS Secrets Manager Protection

Protection Levels

Secrets Clearing Behavior

Secret Discovery

🚀 Enhanced Endpoint Flag Support

Intelligent Endpoint Flag Detection

All Targeting Methods Now Supported

How It Works

Environment Variable Flow

🎯 Unified Execution Strategy

Single Execution Path with Exclusions

How It Works

Key Benefits

Execution Examples

Technical Details

🚦 Gateway Instance: Automatic VPCs Apply

What’s New?

How to Use

Why?

🔧 Module Pre-Processing Commands

Automated Module Command Execution

Configuration Syntax

Command Execution Behavior

Real-World Example: Security Groups Generation

Usage Examples

Dry-Run Testing

Command Design Patterns

Benefits & Advantages

✅ Production-Tested Bounce Operations

Live Infrastructure Validation

🔴 Enhanced Shutdown Operations - AWS CLI Termination Fallback

Intelligent SSH Failure Handling

🔄 Automatic Fallback Flow

⚡ Fast Response Times

🎯 Clear Visual Feedback

🔗 System Integration

🛡️ Robust Error Handling

💡 Usage Examples

🚀 Benefits

🔄 AWS CLI Instance Operations - DRY Architecture v2.0.13

Native AWS Instance Management

🎮 Reboot Operations

🏗️ DRY Generic State Waiting Architecture

⚙️ Operation-Specific Configurations

🔧 Requirements

🛡️ Error Handling and Safety

✅ Framework Integration

💖 KISS Utilities and Shared Code Enhancement

Centralized KISS Helper Functions

Clean and Focused Architecture

Enhanced Module Integration

🌟 Operation Context Gathering

🔧 Standardized File Operations

🟡 Consistent Dry-Run Handling

✨ Post-Operation Actions

✅ Logging Integration

🎯 Benefits Realized

💡 Usage in Your Modules

🆘 Getting Help

Comprehensive Help System

Help System Features

Example Help Usage

🚀 Quick Start

Basic Operations

🎯 Recent Updates - v2.0.12 (2024-12-30)

🏗️ System Architecture

Unified Operation Model

Module Structure

📁 Configuration Files

structure.yml

Directory Structure

🎛️ Command Reference

Actions

`--test-mode` Flag

Packages