Complete architecture documentation for Ansible Automation Platform with EnterpriseDB PostgreSQL
- Architecture Overview
- Component Details
- Network Connectivity
- Data Flow
- AAP Deployment Architecture
- AAP Cluster Management
- Disaster Recovery Scenarios
- Scaling Considerations
- Related Architecture Documentation
This architecture implements EnterpriseDB PostgreSQL deployed Active/Passive across two clusters in different datacenters, with in-datacenter replication, for Ansible Automation Platform (AAP). The result is a near-HA architecture: failover within a datacenter is fast and automatic because the databases sync in region, while cross-datacenter failover remains a deliberate DR action.
Key characteristics:
- Topology: Active-Passive multi-datacenter
- HA Strategy: In-datacenter automatic failover, cross-datacenter manual failover
- Replication: Physical streaming replication + WAL archiving
- RTO Target: <1 minute (in-datacenter), <5 minutes (cross-datacenter)
- RPO Target: <5 seconds (streaming replication)
A DR failover should be reserved for catastrophic failures. Failing over to an in-site database requires little to no intervention at the application layer. The key difference: a cross-datacenter DR failover loses any running jobs, whereas an in-site database failover lets jobs continue running unless the controller itself has failed.
The global load balancer provides a single entry point for AAP access:
- DNS: `aap.example.com`
- Type: Active-Passive (DC1 primary, DC2 standby)
- Health Checks: Monitors AAP Controller availability in both datacenters
- Failover: Automatic failover to DC2 if DC1 becomes unavailable
- Routing: Priority-based routing (100% traffic to DC1 when healthy)
- Failback: Automatic or manual failback to DC1 when it recovers
- Protocols: HTTPS (port 443), WebSocket support for real-time job updates
Implementation options:
- Cloud: AWS Route 53, Azure Traffic Manager, Google Cloud Load Balancing
- On-premises: F5 BIG-IP, HAProxy, NGINX Plus
- Hybrid: Cloudflare Load Balancing, Akamai Global Traffic Management
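As one concrete illustration, an active-passive failover record in AWS Route 53 might look like the following sketch (hosted zone ID, health check ID, and addresses are placeholders, not values from this architecture):

```bash
# Hypothetical active-passive failover records for aap.example.com.
# Hosted zone ID, health check ID, and IPs are placeholders.
aws route53 change-resource-record-sets \
  --hosted-zone-id Z0000000000000 \
  --change-batch '{
    "Changes": [
      {
        "Action": "UPSERT",
        "ResourceRecordSet": {
          "Name": "aap.example.com",
          "Type": "A",
          "SetIdentifier": "dc1-primary",
          "Failover": "PRIMARY",
          "TTL": 30,
          "HealthCheckId": "11111111-1111-1111-1111-111111111111",
          "ResourceRecords": [{"Value": "203.0.113.10"}]
        }
      },
      {
        "Action": "UPSERT",
        "ResourceRecordSet": {
          "Name": "aap.example.com",
          "Type": "A",
          "SetIdentifier": "dc2-standby",
          "Failover": "SECONDARY",
          "TTL": 30,
          "ResourceRecords": [{"Value": "198.51.100.10"}]
        }
      }
    ]
  }'
```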
Operator install with external EDB Postgres (sample namespace / cluster: edb-postgres / postgresql):
- See `aap-deploy/README.md` (overview)
- See `aap-deploy/openshift/README.md` (subscription + `AnsibleAutomationPlatform` CR)
For OpenShift, AAP is deployed on separate OpenShift clusters for high availability and geographic distribution. For RHEL, a single installation can span both datacenters, but the AAP services on DC2 must remain stopped while DC1 is active.
- Namespace: `ansible-automation-platform`
- AAP Gateway: 3 replicas for HA
- AAP Controller: 3 replicas for HA
- Automation Hub: 2 replicas
- Database: PostgreSQL cluster (1 primary + 2 replicas) managed by EDB operator
- Route: `aap-dc1.apps.ocp1.example.com`
- State: Active, serving production traffic
- Namespace: `ansible-automation-platform`
- AAP Gateway: Scaled to 0 (or 3 replicas if pre-warmed)
- AAP Controller: Scaled to 0 (or 3 replicas if pre-warmed)
- Automation Hub: Scaled to 0 (or 2 replicas if pre-warmed)
- Database: PostgreSQL cluster (1 designated primary + 2 replicas) managed by EDB operator
- Route: `aap-dc2.apps.ocp2.example.com`
- State: Standby, ready for failover
Scaling strategy:
- Cold standby: AAP scaled to 0, database replicating (5-10 min activation time)
- Warm standby: AAP running with 1 replica each, scaled up during failover (2-3 min activation)
- Hot standby: AAP fully scaled, ready for immediate traffic (30 sec activation)
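Activating a cold or warm standby is essentially a scale-up in DC2. A hedged sketch that patches replica counts through the `AnsibleAutomationPlatform` CR (the CR name `aap` and the replica field paths are assumptions; verify against your operator version's schema):

```bash
# Hypothetical DC2 activation: raise replica counts via the AAP custom resource.
# CR name and field paths are illustrative; check your operator's schema.
oc patch ansibleautomationplatform aap \
  -n ansible-automation-platform --type merge \
  -p '{"spec":{"controller":{"replicas":3},"hub":{"replicas":2}}}'

# Wait for the pods before letting the GLB shift traffic
oc get pods -n ansible-automation-platform -w
```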
The AAP databases are replicated from active to passive datacenter:
- Method: PostgreSQL logical replication (Active → Passive)
- Note: AAP's internal database uses logical replication for flexibility
- Direction: DC1 (Active) → DC2 (Passive)
- Mode: Asynchronous replication with minimal lag
- Shared Data:
  - Job templates
  - Inventory and host information
  - Credentials (encrypted)
  - Execution history and logs
  - RBAC settings
  - Workflow definitions
- Failover: DC2 database promoted to read-write during failover
- Failback: Data synchronized back to DC1 when it recovers
Lag monitoring:
- Monitor `pg_stat_replication` for lag metrics
- Alert if lag exceeds 30 seconds
- Dashboard display of replication health
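For a quick manual check, the view can be queried directly on the primary. A minimal sketch (the host is the in-cluster `-rw` service used elsewhere in this document; user and database are placeholders):

```bash
# Check streaming replication lag on the primary (bytes and seconds per standby).
psql "host=postgresql-rw.edb-postgres.svc.cluster.local port=5432 user=postgres dbname=postgres" <<'SQL'
SELECT application_name,
       state,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes,
       EXTRACT(EPOCH FROM replay_lag)                    AS replay_lag_seconds
FROM pg_stat_replication;
SQL
```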
EDB-managed application database clusters use physical replication:
- Method: PostgreSQL physical replication via streaming replication and WAL shipping
- Primary Method: Streaming replication from Primary to Designated Primary
- Fallback Method: WAL shipping via S3/object store (continuous WAL archiving)
- Within Cluster: Hot standby replicas use streaming replication from primary/designated primary
- Mode: Asynchronous streaming with optional synchronous mode
- Benefits:
  - Block-level replication (exact byte-for-byte replica)
  - Faster failover times
  - Lower overhead than logical replication
  - Supports all PostgreSQL features
Replication topology:

```
DC1 Primary Cluster:
postgresql-1 (primary) → postgresql-2 (hot standby)
                       → postgresql-3 (hot standby)
                       → DC2 Designated Primary (streaming)
                       → S3 bucket (WAL archive)

DC2 Replica Cluster:
postgresql-replica-1 (designated primary) → postgresql-replica-2 (hot standby)
                                          → postgresql-replica-3 (hot standby)
                                          → S3 bucket (WAL archive)
```
Users and automation clients connect to AAP through the global load balancer:
- URL: `https://aap.example.com`
- Protocol: HTTPS/443 with WebSocket support
- Load Balancing: Active-Passive (priority-based)
- Active Target: DC1 AAP (100% traffic when healthy)
- Passive Target: DC2 AAP (standby, only receives traffic during failover)
- Health Checks:
  - Layer 7 health checks to the AAP Controller `/api/v2/ping/` endpoint
  - Frequency: Every 10 seconds
  - Threshold: 3 consecutive failures trigger failover
- Session Affinity: Sticky sessions for long-running jobs
- TLS Termination: At load balancer or end-to-end encryption
- Failover Time: 30-60 seconds (health check detection + DNS propagation)
Network requirements:
- Bandwidth: 100 Mbps minimum, 1 Gbps recommended
- Latency: <50ms user-to-GLB, <100ms GLB-to-AAP
- Availability: 99.99% uptime SLA
AAP can only talk to one Read-Write (RW) database at a time:
- Protocol: PostgreSQL wire protocol (port 5432)
- Access:
  - Within OpenShift cluster: Via ClusterIP Services (`postgresql-rw.edb-postgres.svc.cluster.local`)
  - Cross-cluster: Via OpenShift Routes with TLS passthrough or LoadBalancer services
- Authentication:
  - Certificate-based (mutual TLS) - recommended
  - Password authentication (stored in Kubernetes secrets)
- Encryption: TLS/SSL enforced for all connections
- Connection Pooling: PgBouncer for efficient connection management
  - Pool size: 100 connections per AAP instance
  - Pool mode: Transaction pooling
  - Idle timeout: 600 seconds
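The PgBouncer layer can be declared through the operator's `Pooler` resource. A hedged sketch matching the pool settings above (the API group shown is the EDB operator's; upstream CloudNativePG uses `postgresql.cnpg.io/v1`, and the resource name is illustrative):

```yaml
# Hypothetical Pooler in front of the read-write service.
apiVersion: postgresql.k8s.enterprisedb.io/v1
kind: Pooler
metadata:
  name: postgresql-pooler-rw
  namespace: edb-postgres
spec:
  cluster:
    name: postgresql
  instances: 2
  type: rw
  pgbouncer:
    poolMode: transaction          # transaction pooling, as noted above
    parameters:
      max_client_conn: "400"       # illustrative ceiling
      default_pool_size: "100"     # 100 connections per AAP instance
      server_idle_timeout: "600"   # idle timeout in seconds
```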
Connection failover:
- AAP uses the `-rw` service, which automatically points to the current primary
- During failover, the EDB operator updates service endpoints
- AAP reconnects automatically on connection failure
- Connection retry logic: 3 attempts with exponential backoff
- Method: PostgreSQL physical replication (streaming + WAL shipping)
- Primary Mechanism: Streaming replication from Primary to Designated Secondaries
- Fallback Mechanism: WAL shipping via S3/object store
- Direction: DC1 (Primary Cluster) → DC2 (Replica Cluster)
- Network:
  - Encrypted tunnel (VPN/Direct Connect/WAN) for streaming replication
  - HTTPS for S3 WAL archiving
  - Dedicated VLAN or VPC peering recommended
- Replication Type:
  - Asynchronous (default): better performance
  - Synchronous (optional): zero data loss guarantee
- Lag Monitoring:
  - Both AAP instances monitor replication lag via EDB operator metrics
  - Prometheus metric: `cnpg_pg_replication_lag`
  - Grafana dashboards display real-time lag
- Alerting:
  - Alerts triggered if lag exceeds threshold (e.g., 30 seconds); a sample rule appears after this list
  - PagerDuty integration for critical alerts
- Automatic Service Updates:
  - EDB operator automatically updates the `-rw` service during failover
  - Service endpoints updated within 5-10 seconds
- Cross-Cluster Limitation:
  - Automated failover across OpenShift clusters must be handled externally
  - Integration via AAP automation or EDB Failover Manager (EFM)
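The 30-second lag alert can be expressed as a PrometheusRule against the operator's `cnpg_pg_replication_lag` metric. A minimal sketch (rule name and namespace are illustrative):

```yaml
# Hypothetical PrometheusRule implementing the lag alert described above.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: edb-replication-lag
  namespace: edb-postgres
spec:
  groups:
    - name: edb-replication
      rules:
        - alert: PostgresReplicationLagHigh
          expr: cnpg_pg_replication_lag > 30
          for: 2m
          labels:
            severity: critical
          annotations:
            summary: "PostgreSQL replication lag above 30s"
            description: "Replication lag on {{ $labels.pod }} has exceeded 30 seconds for 2 minutes."
```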
Network requirements for replication:
- Bandwidth: 10 Mbps minimum, 100 Mbps recommended
- Latency: <100ms for streaming replication
- Jitter: <10ms
- Packet loss: <0.1%
Replication slot configuration:

```yaml
# In DC1 primary cluster
spec:
  replicationSlots:
    highAvailability:
      enabled: true
      slotPrefix: _cnpg_
    updateInterval: 30
```

For EDB-Managed Application Databases:
1. Application → AAP Controller
   - User or API client submits job/workflow
   - AAP Controller receives request
2. AAP Controller → DC1 Primary Database (via `-rw` service)
   - AAP writes job data, inventory updates, credentials
   - Connection via `postgresql-rw.edb-postgres.svc.cluster.local:5432`
3. DC1 Primary → DC1 Hot Standby Replicas (streaming replication within cluster)
   - Primary replicates to 2 hot standby instances
   - Replication lag: <100ms
   - Used for read-only queries and HA
4. DC1 Primary → DC2 Designated Primary (streaming replication across clusters)
   - Replication via OpenShift Route with TLS passthrough
   - Typical lag: 1-5 seconds (depends on WAN latency)
   - Used for DR failover
5. DC1 Primary → S3/Object Store (continuous WAL archiving; fallback)
   - WAL files uploaded every 60 seconds or 16MB (whichever comes first)
   - Used for PITR and fallback replication
   - Retention: 30 days
6. DC2 Designated Primary → DC2 Hot Standby Replicas (streaming replication within cluster)
   - DC2 designated primary replicates to 2 hot standby instances
   - Ensures DC2 can serve reads and has HA ready for promotion
Data flow diagram:

```
User/API → GLB → AAP DC1 → PostgreSQL DC1 Primary
                                      │
            ┌───────────────┬─────────┴────┬──────────────┐
            ↓               ↓              ↓              ↓
      DC1 Standby 1   DC1 Standby 2     DC2 DP     S3 WAL Archive
                                           │
                           ┌───────────────┼─────────────┐
                           ↓               ↓             ↓
                     DC2 Standby 1   DC2 Standby 2   (backup)
```
EDB-Managed Clusters:

DC1 Primary Cluster:
- Write operations: Via `postgresql-rw` service (routes to primary instance)
- Read operations (HA): Via `postgresql-ro` service (routes to hot standby replicas)
- Read operations (any): Via `postgresql-r` service (routes to any instance, including the primary)
DC2 Replica Cluster:
- Read operations only: Via `postgresql-replica-ro` service (routes to designated primary or replicas)
- Cannot accept writes unless promoted during failover
- Used for:
  - Read-only analytics queries (offload from DC1)
  - DR testing and validation
  - Backup source (to reduce load on DC1)
Load Balancing:
- EDB operator manages service routing automatically
- Round-robin load balancing across available read replicas
- Health checks ensure only healthy instances receive traffic
Service Behavior During Failover:
- EDB operator automatically updates the `-rw` service to point to the newly promoted primary
- Applications experience seamless redirection without connection string changes
- Read-only services updated to reflect new topology
- Typical service update time: 5-10 seconds
Query routing strategy:

```
Write queries              → -rw service      → Primary instance
Read queries (low latency) → -r service       → Any instance (including primary)
Read queries (HA)          → -ro service      → Hot standby replicas only
Analytics queries          → DC2 -replica-ro  → Offload from production
```
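The service split can be verified from any client pod: `pg_is_in_recovery()` distinguishes the primary from standbys. A sketch (database and user are placeholders):

```bash
# -rw always resolves to the primary: pg_is_in_recovery() returns 'f'
psql "host=postgresql-rw.edb-postgres.svc.cluster.local user=app dbname=app" \
     -c "SELECT pg_is_in_recovery();"

# -ro resolves to hot standby replicas only: returns 't'
psql "host=postgresql-ro.edb-postgres.svc.cluster.local user=app dbname=app" \
     -c "SELECT pg_is_in_recovery();"

# -r may land on any instance, primary included: returns 'f' or 't'
psql "host=postgresql-r.edb-postgres.svc.cluster.local user=app dbname=app" \
     -c "SELECT pg_is_in_recovery();"
```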
EDB-Managed PostgreSQL Backups:
1. Scheduled backup job (initiated by AAP or a CronJob via the EDB operator)
   - Daily full backup: 2:00 AM UTC
   - Hourly incremental backups (optional)
   - Triggered by a `Backup` custom resource (a sample `ScheduledBackup` appears after this list)
2. Backup pod created by the EDB operator
   - Temporary pod spins up with Barman Cloud tools
   - Mounts a persistent volume for staging (if needed)
   - Authenticates to PostgreSQL and S3
3. Database backup streamed to S3/object store (using Barman Cloud)
   - Full backup or incremental based on schedule
   - Compression: gzip (reduces size by ~70%)
   - Encryption: AES-256 (S3 server-side or client-side)
4. WAL files continuously archived to S3 (automatic by the EDB operator)
   - Continuous archiving every 60 seconds or 16MB
   - Parallel upload for high write workloads
   - Checksum validation on upload
5. WAL archiving serves a dual purpose:
   - Point-in-time recovery (PITR): restore to any second within the retention window
   - Fallback replication mechanism: replica clusters can recover from the WAL archive if streaming replication fails
6. Replica clusters recover from the WAL archive when streaming replication fails
   - Automatic fallback when the streaming connection is lost
   - Catch-up from the WAL archive until streaming is restored
   - Alerts sent if relying on the WAL archive for >5 minutes
7. AAP monitors backup completion via operator metrics
   - Prometheus metric: `cnpg_pg_backup_last_succeeded`
   - Grafana dashboard: backup status panel
   - Integration with external monitoring (PagerDuty, Slack)
8. Alerts sent if a backup fails
   - Immediate alert on backup failure
   - Warning alert if the latest backup is >36 hours old
   - Runbook links provided in alerts
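The daily 2:00 AM UTC backup from step 1 could be declared as a `ScheduledBackup`. A hedged sketch (the API group shown is the EDB operator's, and the six-field cron syntax with a leading seconds field is assumed; names are illustrative):

```yaml
# Hypothetical ScheduledBackup matching the 2:00 AM UTC daily full backup.
apiVersion: postgresql.k8s.enterprisedb.io/v1
kind: ScheduledBackup
metadata:
  name: postgresql-daily
  namespace: edb-postgres
spec:
  schedule: "0 0 2 * * *"          # 02:00 UTC every day (seconds field first)
  backupOwnerReference: self
  cluster:
    name: postgresql
```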
Backup Strategy per Datacenter:
DC1 (Primary):
- Full backups daily + continuous WAL archiving
- S3 bucket: `s3://edb-backups-dc1-prod` (primary region: us-east-1)
- Retention: 30 days operational, 365 days compliance (Glacier transition)
- Backup source: Prefer a hot standby replica (reduces load on the primary)

DC2 (Disaster Recovery):
- Independent backups to a separate S3 bucket for redundancy
- S3 bucket: `s3://edb-backups-dc2-dr` (DR region: us-west-2)
- Retention: 30 days
- Backup source: Designated primary (already in read-only mode)
- Cross-region replication from the DC1 S3 bucket (optional)
Backup validation:
- Monthly restore test to verify backup integrity
- Automated via `CronJob` and validation scripts
- Test restores to a separate namespace
- Validation: Data integrity checks, connectivity tests, query execution
Recovery scenarios:
- Recent data loss: PITR from WAL archive (RPO: <60 seconds); see the bootstrap sketch after this list
- Database corruption: Restore from latest full backup + WAL replay
- Datacenter loss: Restore DC1 from DC2 backups or vice versa
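For the PITR scenario, a new cluster can be bootstrapped from the WAL archive and replayed to a target time. A sketch assuming the EDB operator's recovery bootstrap; the `backup-creds` secret, restore timestamp, and names are illustrative:

```yaml
# Hypothetical PITR bootstrap: build a new cluster from the S3 WAL archive,
# replaying to a target time.
apiVersion: postgresql.k8s.enterprisedb.io/v1
kind: Cluster
metadata:
  name: postgresql-restore
  namespace: edb-postgres
spec:
  instances: 3
  storage:
    size: 50Gi
  bootstrap:
    recovery:
      source: postgresql-archive
      recoveryTarget:
        targetTime: "2025-01-15 08:00:00+00"   # restore point (placeholder)
  externalClusters:
    - name: postgresql-archive
      barmanObjectStore:
        destinationPath: s3://edb-backups-dc1-prod/
        s3Credentials:
          accessKeyId:
            name: backup-creds
            key: ACCESS_KEY_ID
          secretAccessKey:
            name: backup-creds
            key: ACCESS_SECRET_KEY
```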
Detailed architecture documentation for AAP on different platforms:
- RHEL AAP Architecture: AAP on RHEL with systemd services
  - Systemd service management
  - HAProxy for load balancing
  - PostgreSQL on bare metal/VMs
  - Manual service orchestration during failover
- OpenShift AAP Architecture: AAP on OpenShift with operator
  - Operator-based lifecycle management
  - Native Kubernetes Services for load balancing
  - CloudNativePG for PostgreSQL
  - Automated pod orchestration during failover
Choosing deployment type:
- Use RHEL if you have existing VM/bare metal infrastructure and prefer traditional management
- Use OpenShift if you want cloud-native orchestration and have Kubernetes expertise
See: EDB Failover Manager Documentation
EFM provides automated database failover detection and orchestration:
Key features:
- Automatic detection of primary database failure
- Promotion of standby to primary within 30-60 seconds
- Virtual IP (VIP) failover for seamless client reconnection
- Integration with AAP scaling scripts
- Email/SNMP notifications
Failover trigger sequence:
1. EFM detects primary database failure (3 consecutive health check failures)
2. EFM promotes the best standby replica to primary
3. EFM calls the AAP orchestration script `efm-orchestrated-failover.sh`
4. The script scales down AAP in DC1 and scales up AAP in DC2
5. GLB health checks detect DC2 AAP as healthy and route traffic to DC2
6. RTO achieved: <5 minutes
Configuration:
```properties
# /etc/edb/efm-4.x/efm.properties
enable.custom.scripts=true
script.post.promotion=/usr/edb/efm-4.x/bin/efm-orchestrated-failover.sh %h %s %a %v
```

See: AAP Cluster Scripts Documentation
Operational scripts:
- `scale-aap-up.sh`: Scale AAP to operational state in the target datacenter
- `scale-aap-down.sh`: Scale AAP to zero in the inactive datacenter
- `efm-orchestrated-failover.sh`: Full DR failover orchestration (a sketch appears after this list)
- `validate-aap-data.sh`: Post-failover data validation
- `monitor-efm-scripts.sh`: EFM integration monitoring
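A minimal sketch of the kind of logic `efm-orchestrated-failover.sh` performs, assuming `oc` kubeconfig contexts `ocp1`/`ocp2` and the deployment names shown (all illustrative; this is not the shipped script):

```bash
#!/usr/bin/env bash
# Illustrative sketch of EFM-driven DR orchestration.
set -euo pipefail

NS=ansible-automation-platform
DC1_CTX=ocp1    # kubeconfig context for the DC1 cluster (assumed)
DC2_CTX=ocp2    # kubeconfig context for the DC2 cluster (assumed)

# 1. Stop AAP in DC1 so nothing writes to the demoted database
oc --context "$DC1_CTX" -n "$NS" scale deployment --all --replicas=0 || true

# 2. Bring AAP up in DC2 (EFM has already promoted the DC2 database)
oc --context "$DC2_CTX" -n "$NS" scale deployment/aap-gateway    --replicas=3
oc --context "$DC2_CTX" -n "$NS" scale deployment/aap-controller --replicas=3
oc --context "$DC2_CTX" -n "$NS" scale deployment/automation-hub --replicas=2

# 3. Wait for readiness; the GLB health checks then cut traffic over to DC2
oc --context "$DC2_CTX" -n "$NS" rollout status deployment/aap-controller --timeout=300s
```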
Runbook:
- AAP Cluster Management Runbook - Step-by-step operational procedures
See: DR Scenarios Documentation
Documented failure scenarios:
1. Single Pod Failure (Database or AAP): automatic Kubernetes restart
   - RTO: <30 seconds
   - RPO: 0 (no data loss)
   - Automation: Kubernetes liveness/readiness probes
2. Database Cluster Failure (DC1): EFM automated failover
   - RTO: <1 minute
   - RPO: <5 seconds
   - Automation: EFM promotion + service updates
3. Complete Datacenter Failure (DC1): manual failover to DC2
   - RTO: <5 minutes
   - RPO: <5 seconds
   - Automation: AAP playbook or manual script execution
4. Data Corruption (Logical): point-in-time recovery
   - RTO: 2-4 hours
   - RPO: <1 minute (depends on backup schedule)
   - Automation: PITR scripts
5. Network Partition (Split-Brain): prevention via database role checks
   - RTO: N/A (prevention measure)
   - RPO: N/A
   - Automation: Pre-startup validation scripts
6. Cascading Failures (Both DCs): recovery from S3 backups
   - RTO: <24 hours
   - RPO: <5 minutes
   - Automation: Disaster recovery runbook
DR Testing:
- Quarterly DR drills: Automated via `CronJob` (see the DR Testing Guide)
- Validation scripts: Data integrity checks post-failover
- RTO/RPO measurement: Automated metrics collection during tests
PostgreSQL (OpenShift):
```yaml
# Edit cluster.yaml
spec:
  instances: 3  # Increase from 2 to 3
```

Apply changes:

```bash
oc apply -k db-deploy/sample-cluster/
```

Benefits:
- Increased read capacity (more read replicas)
- Higher availability (more failover candidates)
- Better resource distribution
Considerations:
- More instances = more replication overhead
- Diminishing returns beyond 3-5 instances per cluster
- Network bandwidth requirements increase
PostgreSQL (OpenShift):
```yaml
# Edit cluster.yaml
spec:
  resources:
    requests:
      cpu: "2"
      memory: "4Gi"
    limits:
      cpu: "4"
      memory: "8Gi"
```

AAP (OpenShift):
```yaml
# Edit ansibleautomationplatform.yaml
spec:
  controller:
    resources:
      requests:
        cpu: "2"
        memory: "4Gi"
      limits:
        cpu: "4"
        memory: "8Gi"
```

Recommendations:
- PostgreSQL: 2-4 CPU cores, 4-8 GB RAM per instance (typical)
- AAP Controller: 2-4 CPU cores, 4-8 GB RAM per replica
- AAP Hub: 1-2 CPU cores, 2-4 GB RAM per replica
- Monitor resource utilization before scaling up
Resize PostgreSQL PVCs:
```bash
# Check current size
oc get pvc -n edb-postgres

# Edit PVC (if the StorageClass supports expansion)
oc edit pvc postgresql-1 -n edb-postgres
# Increase the storage size in spec.resources.requests.storage;
# the operator automatically handles the resize
```

Best practices:
- Plan for 3-6 months of data growth
- Monitor disk usage weekly
- Keep 20% free space minimum
- Use separate volumes for WAL if you have a high write workload (see the sketch below)
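For the separate-WAL-volume recommendation, recent operator versions expose a `walStorage` section in the cluster spec. A sketch (sizes and storage class are placeholders):

```yaml
# Hypothetical cluster storage layout with a dedicated WAL volume.
spec:
  storage:
    size: 100Gi
    storageClass: fast-ssd        # placeholder storage class
  walStorage:
    size: 20Gi
    storageClass: fast-ssd
```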
Multi-region deployment:
- Deploy primary cluster in primary region (DC1)
- Deploy replica cluster in DR region (DC2)
- Configure cross-region replication via OpenShift Routes or VPN
- Set up S3 buckets in both regions for backups
- Configure cross-region S3 replication for backup redundancy
Latency considerations:
- Streaming replication: Works well up to 100ms latency
- High latency (>100ms): Consider asynchronous replication only
- Very high latency (>500ms): Use WAL shipping as the primary method (see the sketch below)
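Under very high latency, DC2 can run as a replica cluster fed from the DC1 WAL archive rather than a streaming connection. A hedged sketch (EDB operator API group, `backup-creds` secret, and names are illustrative):

```yaml
# Hypothetical DC2 replica cluster fed from the DC1 WAL archive.
apiVersion: postgresql.k8s.enterprisedb.io/v1
kind: Cluster
metadata:
  name: postgresql-replica
  namespace: edb-postgres
spec:
  instances: 3
  storage:
    size: 50Gi
  replica:
    enabled: true                 # keep the cluster in continuous recovery
    source: dc1-primary
  bootstrap:
    recovery:
      source: dc1-primary
  externalClusters:
    - name: dc1-primary
      barmanObjectStore:
        destinationPath: s3://edb-backups-dc1-prod/
        s3Credentials:
          accessKeyId:
            name: backup-creds
            key: ACCESS_KEY_ID
          secretAccessKey:
            name: backup-creds
            key: ACCESS_SECRET_KEY
```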
See: OpenShift Installation Guide - Scaling
- Main README - Architecture overview and quick links
- RHEL AAP Architecture - AAP on RHEL detailed architecture
- OpenShift AAP Architecture - AAP on OpenShift detailed architecture
- OpenShift Installation - Detailed OpenShift deployment
- RHEL Installation with TPA - Automated RHEL deployment
- Cross-Cluster Replication - DC1→DC2 replication setup
- DR Scenarios - 6 documented failure scenarios
- DR Testing Guide - Complete testing framework
- EDB Failover Manager - EFM integration
- Split-Brain Prevention - Database role validation
- Operations Runbook - Day-to-day procedures
- Troubleshooting - Common issues and diagnostics
- Scripts Reference - All automation scripts
- AAP Scaling Scripts - scale-aap-up.sh, scale-aap-down.sh
- DR Failover Scripts - efm-orchestrated-failover.sh, dr-failover-test.sh
- Validation Scripts - validate-aap-data.sh, measure-rto-rpo.sh
Architecture Documentation Complete
For questions or improvements, see CONTRIBUTING.md or open an issue on GitHub.
