diff --git a/README.md b/README.md
index 79793d8..7cbdd86 100644
--- a/README.md
+++ b/README.md
@@ -8,50 +8,49 @@
## Table of Contents
- [Overview](#overview)
+- [Quick Links](#quick-links)
- [Installation](#installation)
- - [RHEL / hosts — Trusted Postgres Architect (TPA) (recommended)](docs/install-tpa.md)
- - [OpenShift — EDB operator & manual](docs/install-kubernetes-manual.md)
- - [OpenShift — Kustomize manifests (`db-deploy/`)](db-deploy/README.md)
- - [OpenShift — AAP operator with external Postgres (`aap-deploy/`)](aap-deploy/README.md)
- - [RHEL manual installation](docs/install-rhel-manual.md)
- - [OpenShift manual installation](docs/install-kubernetes-manual.md)
- [Architecture](#architecture)
-- [Component Details](#component-details)
- - [Global Load Balancer](#global-load-balancer)
- - [Ansible Automation Platform (AAP)](#ansible-automation-platform-aap)
-- [Network Connectivity](#network-connectivity)
- - [User to AAP (via Global Load Balancer)](#user-to-aap-via-global-load-balancer)
- - [AAP to PostgreSQL Databases](#aap-to-postgresql-databases)
- - [Inter-Datacenter Replication](#inter-datacenter-replication)
- - [Write Operations (Normal State)](#write-operations-normal-state)
- - [Read Operations](#read-operations)
- - [Backup Flow](#backup-flow)
-- [AAP Deployment Architecture](#aap-deployment-architecture)
- - [RHEL AAP Architecture](docs/rhel-aap-architecture.md)
- - [OpenShift AAP Architecture](docs/openshift-aap-architecture.md)
-- [AAP Cluster Management](#aap-cluster-management)
- - [Integration with EDB EFM (Enterprise Failover Manager)](#integration-with-edb-efm-enterprise-failover-manager)
-- [AAP cluster management — runbook](docs/manual-scripts-doc.md)
- - [AAP cluster scripts (`scripts/README.md`)](scripts/README.md)
-- [EFM Integration (EDB Failover Manager)](docs/enterprisefailovermanager.md)
-- [Troubleshooting](docs/troubleshooting.md)
-- [Disaster Recovery Scenarios](#disaster-recovery-scenarios)
- - [Full scenarios doc](docs/dr-scenarios.md)
-- [Scaling Considerations](#scaling-considerations)
- - [Horizontal & vertical scaling (OpenShift)](docs/install-kubernetes-manual.md#scaling-considerations)
-- [EDB Postgres on OpenShift Architecture](docs/install-kubernetes-manual.md#edb-postgres-on-openshift-architecture)
+- [Operations](#operations)
+- [Contributing](#contributing)
## Overview
-This document describes the architecture of EnterpriseDB Postgres deployed Active/Passive
-across two clusters in different datacenters with in datacenter replication for the
-Ansible Automation Platform (AAP). This will achieve a **NEAR** HA type architecture,
-especially for failover to the databases syncing in region/datacenter.
-
-A DR scenario should be exactly for if there is a catastrophic failure. Failing to an
-in-site database should cause little to no intervention needed at the application layer.
-The main thing to note is for a DR failover any running jobs will be lost, however if
-it fails in site, the jobs should continue to run UNLESS the controller has a failure.
+This repository provides a complete solution for deploying Ansible Automation Platform (AAP) with
+EnterpriseDB PostgreSQL in a multi-datacenter Active/Passive configuration. The architecture
+achieves **near-HA** with automatic failover within datacenters and orchestrated failover across
+datacenters.
+
+**Key Features:**
+- ✅ **Multi-datacenter HA/DR** - Active-Passive across two datacenters
+- ✅ **Automatic failover** - In-datacenter failover <1 minute
+- ✅ **PostgreSQL replication** - Physical streaming + WAL archiving
+- ✅ **AAP orchestration** - Automated scaling during failover
+- ✅ **Comprehensive testing** - Automated DR testing framework
+- ✅ **Production-ready** - Security, monitoring, backup strategies
+
+**Target RTO/RPO:**
+- **In-datacenter failover:** RTO <1 minute, RPO <5 seconds
+- **Cross-datacenter failover:** RTO <5 minutes, RPO <5 seconds
+
+## Quick Links
+
+### Getting Started
+- **[🚀 Quick Start Guide](docs/quick-start-guide.md)** - Deploy in 15-30 minutes
+- **[📚 Documentation Index](docs/INDEX.md)** - Complete documentation organized by topic
+- **[🏗️ Architecture Details](docs/architecture.md)** - Comprehensive architecture documentation
+
+### Deployment
+- **[OpenShift Deployment](docs/install-kubernetes-manual.md)** - Operator-based deployment
+- **[RHEL with TPA](docs/install-tpa.md)** - Automated deployment with Trusted Postgres Architect
+- **[Database Deploy (Kustomize)](db-deploy/README.md)** - GitOps-friendly manifests
+- **[AAP Deploy (Kustomize)](aap-deploy/README.md)** - AAP operator deployment
+
+### Operations
+- **[Operations Runbook](docs/manual-scripts-doc.md)** - Day-to-day operational procedures
+- **[Scripts Reference](scripts/README.md)** - All automation scripts documented
+- **[DR Testing Guide](docs/dr-testing-guide.md)** - Complete DR testing framework
+- **[Troubleshooting](docs/troubleshooting.md)** - Common issues and solutions
## Installation
@@ -62,6 +61,16 @@ from EnterpriseDB for Postgres on **bare metal, cloud instances, or SSH-managed
TPA does **not** deploy the **EDB Postgres on OpenShift** operator; for Postgres **on OpenShift
as pods**, use the operator and manual/GitOps steps in this repo.
+### Installation Quick Reference
+
+| Platform | Time | Guide |
+|----------|------|-------|
+| **OpenShift** | 15 min | [Quick Start - OpenShift](docs/quick-start-guide.md#quick-start-openshift-15-minutes) |
+| **RHEL with TPA** | 20 min | [Quick Start - RHEL](docs/quick-start-guide.md#quick-start-rhel-with-tpa-20-minutes) |
+| **Local CRC** | 30 min | [Quick Start - CRC](docs/quick-start-guide.md#quick-start-local-testing-with-crc-30-minutes) |
+
+### Detailed Installation Guides
+
| Area | Description | Guide |
|------|-------------|--------|
| **RHEL / hosts (TPA)** *(recommended)* | `tpaexec` workflows for supported platforms (bare metal, cloud, Docker for testing) | [TPA install](docs/install-tpa.md)<br>[RHEL / Ansible entry](docs/install-tpa.md#rhel-tpa-ansible)<br>[TPA on GitHub](https://github.com/EnterpriseDB/tpa)<br>[EDB TPA docs](https://www.enterprisedb.com/docs/tpa/latest/) |
@@ -74,147 +83,123 @@ as pods**, use the operator and manual/GitOps steps in this repo.
| **Troubleshooting** | Diagnostics and issue resolution | [Troubleshooting](docs/troubleshooting.md) |
| **AAP cluster scripts & runbook** | Automation and operational procedures | [Scripts](scripts/README.md)<br>[Runbook](docs/manual-scripts-doc.md) |
-## Architecture
+## Architecture
+
+### Architecture Overview
+
+The solution implements a **multi-datacenter Active/Passive architecture** with:
+
+- **Two datacenters:** DC1 (active), DC2 (passive/DR)
+- **PostgreSQL replication:** Physical streaming replication + WAL archiving to S3
+- **AAP deployment:** Separate clusters in each datacenter, scaled based on active/passive state
+- **Failover orchestration:** EDB Failover Manager (EFM) integration with AAP scaling scripts
+- **Global load balancer:** Routes traffic to active datacenter

-## Component Details
-
-### Global Load Balancer
-
-The global load balancer provides a single entry point for AAP access:
-
-- **DNS**: `aap.example.com`
-- **Type**: Active-Passive (DC1 primary, DC2 standby)
-- **Health Checks**: Monitors AAP Controller availability in both datacenters
-- **Failover**: Automatic failover to DC2 if DC1 becomes unavailable
-- **Routing**: Priority-based routing (100% traffic to DC1 when healthy)
-- **Failback**: Automatic or manual failback to DC1 when it recovers
-- **Protocols**: HTTPS (port 443), WebSocket support for real-time job updates
-
-### Ansible Automation Platform (AAP)
-
-**Operator install with external EDB Postgres** (sample namespace / cluster: `edb-postgres` / `postgresql`):
-- See **[`aap-deploy/README.md`](aap-deploy/README.md)** (overview)
-- See **[`aap-deploy/openshift/README.md`](aap-deploy/openshift/README.md)** (subscription + `AnsibleAutomationPlatform` CR)
-
-For OpenShift, AAP is deployed on **separate OpenShift clusters** for high availability and
-geographic distribution. For RHEL you can do a single install across datacenters however you
-**MUST TURN OFF THE SERVICES ON THE SECONDARY SITE**
-
-#### Datacenter 1 - AAP Instance
-- **Namespace**: `ansible-automation-platform`
-- **AAP Gateway**: 3 replicas for HA
-- **AAP Controller**: 3 replicas for HA
-- **Automation Hub**: 2 replicas
-- **Database**: PostgreSQL cluster (1 primary + 2 replicas) managed by EDB operator
-- **Route**: `aap-dc1.apps.ocp1.example.com`
-
-#### Datacenter 2 - AAP Instance (scaled down)
-- **Namespace**: `ansible-automation-platform`
-- **AAP Gateway**: 3 replicas for HA
-- **AAP Controller**: 3 replicas for HA
-- **Automation Hub**: 2 replicas
-- **Database**: PostgreSQL cluster (1 primary + 2 replicas) managed by EDB operator
-- **Route**: `aap-dc2.apps.ocp2.example.com`
-
-#### AAP Database Replication
-
-The AAP databases are replicated from active to passive datacenter:
-- **Method**: PostgreSQL logical replication (Active → Passive) - *Note: AAP's internal database uses logical replication for flexibility*
-- **Direction**: DC1 (Active) → DC2 (Passive)
-- **Mode**: Asynchronous replication with minimal lag
-- **Shared Data**: Job templates, inventory, credentials, execution history
-- **Failover**: DC2 database promoted to read-write during failover
-- **Failback**: Data synchronized back to DC1 when it recovers
-
-#### EDB-Managed PostgreSQL Cluster Replication
-
-EDB-managed application database clusters use physical replication:
-- **Method**: PostgreSQL physical replication via streaming replication and WAL shipping
-- **Primary Method**: Streaming replication from Primary to Designated Primary
-- **Fallback Method**: WAL shipping via S3/object store (continuous WAL archiving)
-- **Within Cluster**: Hot standby replicas use streaming replication from primary/designated primary
-- **Mode**: Asynchronous streaming with optional synchronous mode
-- **Benefits**: Block-level replication, faster failover, exact byte-for-byte replica
-
-## Network Connectivity
-
-### User to AAP (via Global Load Balancer)
-
-Users and automation clients connect to AAP through the global load balancer:
-- **URL**: `https://aap.example.com`
-- **Protocol**: HTTPS/443 with WebSocket support
-- **Load Balancing**: Active-Passive (priority-based)
-- **Active Target**: DC1 AAP (100% traffic when healthy)
-- **Passive Target**: DC2 AAP (standby, only receives traffic during failover)
-- **Health Checks**: Layer 7 health checks to AAP Controller endpoints
-- **Session Affinity**: Sticky sessions for long-running jobs
-- **TLS Termination**: At load balancer or end-to-end encryption
-
-### AAP to PostgreSQL Databases
-
-AAP can only talk to one Read Write(RW) database at a time:
-- **Protocol**: PostgreSQL wire protocol (port 5432)
-- **Access**: Via OpenShift Services (ClusterIP within cluster, Routes/LoadBalancer for remote)
-- **Authentication**: Certificate-based or password authentication
-- **Encryption**: TLS/SSL enforced
-- **Connection Pooling**: PgBouncer for efficient connection management
-
-### Inter-Datacenter Replication
-
-#### EDB-Managed Application Database Replication
-- **Method**: PostgreSQL physical replication (streaming + WAL shipping)
-- **Primary Mechanism**: Streaming replication from Primary to Designated Secondaries
-- **Fallback Mechanism**: WAL shipping via S3/object store
-- **Direction**: DC1 (Primary Cluster) → DC2 (Replica Cluster)
-- **Network**: Encrypted tunnel (VPN/Direct Connect/WAN) for streaming replication
-- **Replication Type**: Asynchronous (default) or synchronous (configurable)
-- **Lag Monitoring**: Both AAP instances monitor replication lag via EDB operator metrics
-- **Alerting**: Alerts triggered if lag exceeds threshold (e.g., 30 seconds)
-- **Automatic Service Updates**: EDB operator automatically updates `-rw` service during failover
-- **Cross-Cluster Limitation**: Automated failover across OpenShift clusters must be handled externally (via AAP or higher-level orchestration)
-
-### Write Operations (Normal State)
-
-**For EDB-Managed Application Databases:**
-1. Application → AAP Controller
-2. AAP Controller → DC1 Primary Database (via `-rw` service)
-3. DC1 Primary → DC1 Hot Standby Replicas (streaming replication within cluster)
-4. DC1 Primary → DC2 Designated Primary (streaming replication across clusters)
-5. DC1 Primary → S3/Object Store (continuous WAL archiving - fallback)
-6. DC2 Designated Primary → DC2 Hot Standby Replicas (streaming replication within cluster)
-
-### Read Operations
-
-**EDB-Managed Clusters:**
-- **DC1 Primary Cluster**:
- - Write operations via `prod-db-rw` service (routes to primary)
- - Read operations via `prod-db-ro` service (routes to hot standby replicas)
- - Read operations via `prod-db-r` service (routes to any instance)
-- **DC2 Replica Cluster**:
- - Read operations only via `prod-db-replica-ro` service (routes to designated primary or replicas)
- - Cannot accept writes unless promoted
-- **Load Balancing**: EDB operator manages service routing automatically
-
-**Service Behavior During Failover:**
-- EDB operator automatically updates `-rw` service to point to newly promoted primary
-- Applications experience seamless redirection without connection string changes
-
-### Backup Flow
-
-**EDB-Managed PostgreSQL Backups:**
-1. Scheduled backup job (initiated by AAP or CronJob via EDB operator)
-2. Backup pod created by EDB operator
-3. Database backup streamed to S3/object store (using Barman Cloud)
-4. WAL files continuously archived to S3 (automatic by EDB operator)
-5. WAL archiving serves dual purpose:
- - Point-in-time recovery (PITR)
- - Fallback replication mechanism for replica clusters
-6. Replica clusters can recover from WAL archive if streaming replication fails
-7. AAP monitors backup completion via operator metrics
-8. Alerts sent if backup fails
-
-**Backup Strategy per Datacenter:**
-- **DC1**: Full backups + continuous WAL archiving to S3 bucket (primary region)
-- **DC2**: Independent backups to separate S3 bucket (DR region) for redundancy
+### Key Components
+
+1. **Global Load Balancer** - Single entry point with health check-based routing
+2. **Ansible Automation Platform (AAP)** - Deployed in both datacenters
+3. **PostgreSQL Clusters** - EDB Postgres Advanced with CloudNativePG operator
+4. **Replication** - Streaming replication DC1→DC2 with S3 WAL archive fallback
+5. **Backup** - Barman Cloud to S3 with 30-day retention and PITR capability
+
+### Architecture Documentation
+
+**📖 [Complete Architecture Documentation](docs/architecture.md)**
+
+Detailed documentation includes:
+- Component details (GLB, AAP, PostgreSQL)
+- Network connectivity and data flow
+- Replication topology and configuration
+- Backup and restore strategies
+- Scaling considerations
+- Deployment architecture for RHEL and OpenShift
+
+**Platform-Specific Architecture:**
+- **[RHEL AAP Architecture](docs/rhel-aap-architecture.md)** - Systemd services, HAProxy, manual orchestration
+- **[OpenShift AAP Architecture](docs/openshift-aap-architecture.md)** - Operators, native services, automated orchestration
+
+## Operations
+
+### Day-to-Day Operations
+
+- **[Operations Runbook](docs/manual-scripts-doc.md)** - Step-by-step operational procedures
+- **[Script Reference](scripts/README.md)** - All automation scripts with usage examples
+- **[Troubleshooting Guide](docs/troubleshooting.md)** - Common issues and diagnostics
+
+### Disaster Recovery
+
+- **[DR Scenarios](docs/dr-scenarios.md)** - 6 documented failure scenarios with procedures
+- **[DR Testing Guide](docs/dr-testing-guide.md)** - Complete testing framework with quarterly drills
+- **[Split-Brain Prevention](docs/split-brain-prevention.md)** - Database role validation and fencing
+- **[EDB Failover Manager](docs/enterprisefailovermanager.md)** - EFM integration and configuration
+
+### Automation Scripts
+
+Located in [`scripts/`](scripts/):
+
+**AAP Management:**
+- `scale-aap-up.sh` - Scale AAP to operational state
+- `scale-aap-down.sh` - Scale AAP to zero (maintenance/DR)
+
+**DR Orchestration:**
+- `efm-orchestrated-failover.sh` - Full automated failover
+- `dr-failover-test.sh` - DR testing automation
+- `validate-aap-data.sh` - Post-failover validation
+- `measure-rto-rpo.sh` - RTO/RPO measurement
+- `generate-dr-report.sh` - Automated DR test reporting
+
+**Pre-commit Hooks:**
+- `hooks/check-script-permissions.sh` - Verify executable permissions
+- `hooks/validate-openshift-manifests.sh` - Validate YAML manifests
+
+## Contributing
+
+We welcome contributions! Please see [CONTRIBUTING.md](CONTRIBUTING.md) for:
+
+- Documentation standards
+- Code standards (shell scripts, YAML)
+- Testing requirements
+- Pull request process
+- Commit message guidelines
+
+### Documentation
+
+All documentation is in [`docs/`](docs/):
+
+- **[Documentation Index](docs/INDEX.md)** - Complete documentation organized by topic
+- **[Quick Start Guide](docs/quick-start-guide.md)** - 15-30 minute deployment paths
+- **[Architecture](docs/architecture.md)** - Comprehensive architecture documentation
+
+### Repository Structure
+
+```
+EDB_Testing/
+├── docs/ # All documentation
+│ ├── INDEX.md # Documentation index
+│ ├── quick-start-guide.md # Quick start (15-30 min)
+│ ├── architecture.md # Architecture details
+│ ├── dr-testing-guide.md # DR testing framework
+│ └── ... # Additional guides
+├── db-deploy/ # PostgreSQL deployment manifests
+│ ├── operator/ # CloudNativePG operator
+│ ├── sample-cluster/ # Base cluster manifests
+│ └── cross-cluster/ # DC1→DC2 replication
+├── aap-deploy/ # AAP deployment
+│ ├── openshift/ # OpenShift manifests
+│ └── edb-bootstrap/ # Database initialization
+├── scripts/ # Automation scripts
+│ ├── scale-aap-*.sh # AAP scaling
+│ ├── dr-*.sh # DR orchestration
+│ └── validate-*.sh # Validation scripts
+├── openshift/ # OpenShift-specific configs
+│ └── dr-testing/ # DR testing CronJob
+└── .github/ # CI/CD workflows
+ └── workflows/ # GitHub Actions
+```
+
+---
+
+**Questions?** See [docs/INDEX.md](docs/INDEX.md) for complete documentation or open an issue.
diff --git a/docs/INDEX.md b/docs/INDEX.md
index f24b00f..a50d91e 100644
--- a/docs/INDEX.md
+++ b/docs/INDEX.md
@@ -52,16 +52,27 @@
**Understanding the system:**
-- **[Main Architecture](../README.md#architecture)** - High-level overview with diagram
-- **[RHEL AAP Architecture](rhel-aap-architecture.md)** - AAP on RHEL with systemd services
-- **[OpenShift AAP Architecture](openshift-aap-architecture.md)** - AAP on OpenShift with operator
+| Document | Description | Read Time |
+|----------|-------------|-----------|
+| **[Architecture Overview](architecture.md)** ⭐ **COMPREHENSIVE** | Complete architecture documentation | 45 min |
+| **[Main README Architecture](../README.md#architecture)** | High-level overview with diagram | 5 min |
+| **[RHEL AAP Architecture](rhel-aap-architecture.md)** | AAP on RHEL with systemd services | 10 min |
+| **[OpenShift AAP Architecture](openshift-aap-architecture.md)** | AAP on OpenShift with operator | 10 min |
+
+**[Architecture Overview](architecture.md)** covers:
+- Component details (GLB, AAP, PostgreSQL clusters)
+- Network connectivity and data flow (writes, reads, backups)
+- Replication topology (streaming + WAL archiving)
+- Datacenter configurations (DC1 active, DC2 passive)
+- Scaling strategies (horizontal, vertical, geographic)
+- Backup and restore architecture
**Architecture Decisions:**
- Active-Passive topology (DC1 primary, DC2 standby)
-- Physical streaming replication + WAL archiving
-- CloudNativePG operator for database lifecycle
-- EDB Failover Manager (EFM) for automated failover
-- Global Load Balancer for traffic management
+- Physical streaming replication + WAL archiving to S3
+- CloudNativePG operator for database lifecycle management
+- EDB Failover Manager (EFM) for automated database failover
+- Global Load Balancer for traffic management and health-based routing
---
diff --git a/docs/architecture.md b/docs/architecture.md
new file mode 100644
index 0000000..2cc3773
--- /dev/null
+++ b/docs/architecture.md
@@ -0,0 +1,676 @@
+# AAP with EDB Postgres Multi-Datacenter Architecture
+
+**Complete architecture documentation for Ansible Automation Platform with EnterpriseDB PostgreSQL**
+
+## Table of Contents
+
+- [Architecture Overview](#architecture-overview)
+- [Component Details](#component-details)
+ - [Global Load Balancer](#global-load-balancer)
+ - [Ansible Automation Platform (AAP)](#ansible-automation-platform-aap)
+ - [AAP Database Replication](#aap-database-replication)
+ - [EDB-Managed PostgreSQL Cluster Replication](#edb-managed-postgresql-cluster-replication)
+- [Network Connectivity](#network-connectivity)
+ - [User to AAP (via Global Load Balancer)](#user-to-aap-via-global-load-balancer)
+ - [AAP to PostgreSQL Databases](#aap-to-postgresql-databases)
+ - [Inter-Datacenter Replication](#inter-datacenter-replication)
+- [Data Flow](#data-flow)
+ - [Write Operations (Normal State)](#write-operations-normal-state)
+ - [Read Operations](#read-operations)
+ - [Backup Flow](#backup-flow)
+- [AAP Deployment Architecture](#aap-deployment-architecture)
+- [AAP Cluster Management](#aap-cluster-management)
+- [Disaster Recovery Scenarios](#disaster-recovery-scenarios)
+- [Scaling Considerations](#scaling-considerations)
+- [Related Architecture Documentation](#related-architecture-documentation)
+
+---
+
+## Architecture Overview
+
+This architecture implements EnterpriseDB Postgres deployed Active/Passive across two clusters in
+different datacenters, with in-datacenter replication, for the Ansible Automation Platform (AAP).
+This achieves a **near-HA** architecture: failover to a replica syncing within the same
+region/datacenter is fast and requires minimal intervention.
+
+**Key characteristics:**
+- **Topology:** Active-Passive multi-datacenter
+- **HA Strategy:** In-datacenter automatic failover, cross-datacenter manual failover
+- **Replication:** Physical streaming replication + WAL archiving
+- **RTO Target:** <1 minute (in-datacenter), <5 minutes (cross-datacenter)
+- **RPO Target:** <5 seconds (streaming replication)
+
+A DR failover should be reserved for catastrophic failures. Failing over to an in-site database
+should require little to no intervention at the application layer. Note that a cross-datacenter DR
+failover loses any running jobs, whereas an in-site failover lets jobs continue running unless the
+controller itself fails.
+
+### Architecture Diagram
+
+
+
+---
+
+## Component Details
+
+### Global Load Balancer
+
+The global load balancer provides a single entry point for AAP access:
+
+- **DNS**: `aap.example.com`
+- **Type**: Active-Passive (DC1 primary, DC2 standby)
+- **Health Checks**: Monitors AAP Controller availability in both datacenters
+- **Failover**: Automatic failover to DC2 if DC1 becomes unavailable
+- **Routing**: Priority-based routing (100% traffic to DC1 when healthy)
+- **Failback**: Automatic or manual failback to DC1 when it recovers
+- **Protocols**: HTTPS (port 443), WebSocket support for real-time job updates
+
+**Implementation options:**
+- **Cloud:** AWS Route 53, Azure Traffic Manager, Google Cloud Load Balancing
+- **On-premises:** F5 BIG-IP, HAProxy, NGINX Plus
+- **Hybrid:** Cloudflare Load Balancing, Akamai Global Traffic Management
+
+---
+
+### Ansible Automation Platform (AAP)
+
+**Operator install with external EDB Postgres** (sample namespace / cluster: `edb-postgres` / `postgresql`):
+- See **[`aap-deploy/README.md`](../aap-deploy/README.md)** (overview)
+- See **[`aap-deploy/openshift/README.md`](../aap-deploy/openshift/README.md)** (subscription + `AnsibleAutomationPlatform` CR)
+
+For OpenShift, AAP is deployed on **separate OpenShift clusters** for high availability and
+geographic distribution. For RHEL, a single install can span both datacenters; however, you
+**must stop the AAP services on the secondary site**.
+
+#### Datacenter 1 - AAP Instance (Active)
+
+- **Namespace**: `ansible-automation-platform`
+- **AAP Gateway**: 3 replicas for HA
+- **AAP Controller**: 3 replicas for HA
+- **Automation Hub**: 2 replicas
+- **Database**: PostgreSQL cluster (1 primary + 2 replicas) managed by EDB operator
+- **Route**: `aap-dc1.apps.ocp1.example.com`
+- **State**: Active, serving production traffic
+
+#### Datacenter 2 - AAP Instance (Passive)
+
+- **Namespace**: `ansible-automation-platform`
+- **AAP Gateway**: Scaled to 0 (or 3 replicas if pre-warmed)
+- **AAP Controller**: Scaled to 0 (or 3 replicas if pre-warmed)
+- **Automation Hub**: Scaled to 0 (or 2 replicas if pre-warmed)
+- **Database**: PostgreSQL cluster (1 designated primary + 2 replicas) managed by EDB operator
+- **Route**: `aap-dc2.apps.ocp2.example.com`
+- **State**: Standby, ready for failover
+
+**Scaling strategy:**
+- **Cold standby:** AAP scaled to 0, database replicating (5-10 min activation time)
+- **Warm standby:** AAP running with 1 replica each, scaled up during failover (2-3 min activation)
+- **Hot standby:** AAP fully scaled, ready for immediate traffic (30 sec activation)
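+
+The warm-standby option can be sketched against the `AnsibleAutomationPlatform` CR; the field
+names below are illustrative assumptions rather than the operator's exact schema:
+
+```yaml
+# Hypothetical CR fragment for a DC2 warm standby; field names are illustrative
+spec:
+  controller:
+    replicas: 1   # scaled to 3 by scale-aap-up.sh during failover
+  hub:
+    replicas: 1
+```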
+
+---
+
+### AAP Database Replication
+
+The AAP databases are replicated from active to passive datacenter:
+
+- **Method**: PostgreSQL logical replication (Active → Passive)
+ - *Note: AAP's internal database uses logical replication for flexibility*
+- **Direction**: DC1 (Active) → DC2 (Passive)
+- **Mode**: Asynchronous replication with minimal lag
+- **Shared Data**:
+ - Job templates
+ - Inventory and host information
+ - Credentials (encrypted)
+ - Execution history and logs
+ - RBAC settings
+ - Workflow definitions
+- **Failover**: DC2 database promoted to read-write during failover
+- **Failback**: Data synchronized back to DC1 when it recovers
+
+**Lag monitoring:**
+- Monitor `pg_stat_replication` for lag metrics
+- Alert if lag exceeds 30 seconds
+- Dashboard display of replication health
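+
+As a sketch, the 30-second threshold can be wired into a Prometheus alerting rule using the
+`cnpg_pg_replication_lag` metric; the `for` duration and severity label are illustrative:
+
+```yaml
+# Prometheus alerting rule sketch; duration and severity are illustrative
+groups:
+  - name: replication
+    rules:
+      - alert: PostgresReplicationLagHigh
+        expr: cnpg_pg_replication_lag > 30
+        for: 2m
+        labels:
+          severity: critical
+        annotations:
+          summary: "PostgreSQL replication lag above 30s"
+```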
+
+---
+
+### EDB-Managed PostgreSQL Cluster Replication
+
+EDB-managed application database clusters use physical replication:
+
+- **Method**: PostgreSQL physical replication via streaming replication and WAL shipping
+- **Primary Method**: Streaming replication from Primary to Designated Primary
+- **Fallback Method**: WAL shipping via S3/object store (continuous WAL archiving)
+- **Within Cluster**: Hot standby replicas use streaming replication from primary/designated primary
+- **Mode**: Asynchronous streaming with optional synchronous mode
+- **Benefits**:
+ - Block-level replication (exact byte-for-byte replica)
+ - Faster failover times
+ - Lower overhead than logical replication
+ - Supports all PostgreSQL features
+
+**Replication topology:**
+```
+DC1 Primary Cluster:
+ postgresql-1 (primary) → postgresql-2 (hot standby)
+ → postgresql-3 (hot standby)
+ → DC2 Designated Primary (streaming)
+ → S3 bucket (WAL archive)
+
+DC2 Replica Cluster:
+ postgresql-replica-1 (designated primary) → postgresql-replica-2 (hot standby)
+ → postgresql-replica-3 (hot standby)
+ → S3 bucket (WAL archive)
+```
+
+---
+
+## Network Connectivity
+
+### User to AAP (via Global Load Balancer)
+
+Users and automation clients connect to AAP through the global load balancer:
+
+- **URL**: `https://aap.example.com`
+- **Protocol**: HTTPS/443 with WebSocket support
+- **Load Balancing**: Active-Passive (priority-based)
+- **Active Target**: DC1 AAP (100% traffic when healthy)
+- **Passive Target**: DC2 AAP (standby, only receives traffic during failover)
+- **Health Checks**:
+ - Layer 7 health checks to AAP Controller `/api/v2/ping/` endpoint
+ - Frequency: Every 10 seconds
+ - Threshold: 3 consecutive failures trigger failover
+- **Session Affinity**: Sticky sessions for long-running jobs
+- **TLS Termination**: At load balancer or end-to-end encryption
+- **Failover Time**: 30-60 seconds (health check detection + DNS propagation)
+
+**Network requirements:**
+- **Bandwidth**: 100 Mbps minimum, 1 Gbps recommended
+- **Latency**: <50ms user-to-GLB, <100ms GLB-to-AAP
+- **Availability**: 99.99% uptime SLA
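+
+For an HAProxy-based on-premises GLB, the health-check parameters above translate roughly to the
+fragment below; hostnames reuse this document's example routes, and the TLS settings are
+illustrative:
+
+```
+# haproxy.cfg sketch: DC1 active, DC2 as backup; intervals match the health checks above
+backend aap_backend
+    option httpchk GET /api/v2/ping/
+    http-check expect status 200
+    default-server inter 10s fall 3 rise 2
+    server dc1 aap-dc1.apps.ocp1.example.com:443 check ssl verify required ca-file /etc/ssl/certs/ca.pem
+    server dc2 aap-dc2.apps.ocp2.example.com:443 check ssl verify required ca-file /etc/ssl/certs/ca.pem backup
+```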
+
+---
+
+### AAP to PostgreSQL Databases
+
+AAP connects to only one read-write (RW) database at a time:
+
+- **Protocol**: PostgreSQL wire protocol (port 5432)
+- **Access**:
+ - **Within OpenShift cluster:** Via ClusterIP Services (`postgresql-rw.edb-postgres.svc.cluster.local`)
+ - **Cross-cluster:** Via OpenShift Routes with TLS passthrough or LoadBalancer services
+- **Authentication**:
+ - Certificate-based (mutual TLS) - recommended
+ - Password authentication (stored in Kubernetes secrets)
+- **Encryption**: TLS/SSL enforced for all connections
+- **Connection Pooling**: PgBouncer for efficient connection management
+ - Pool size: 100 connections per AAP instance
+ - Pool mode: Transaction pooling
+ - Idle timeout: 600 seconds
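+
+Those pooling parameters correspond to a `pgbouncer.ini` fragment like the following (listen and
+database sections omitted):
+
+```ini
+; pgbouncer.ini sketch matching the parameters above
+[pgbouncer]
+pool_mode = transaction
+default_pool_size = 100
+server_idle_timeout = 600
+```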
+
+**Connection failover:**
+- AAP uses `-rw` service which automatically points to current primary
+- During failover, EDB operator updates service endpoints
+- AAP reconnects automatically on connection failure
+- Connection retry logic: 3 attempts with exponential backoff
+
+---
+
+### Inter-Datacenter Replication
+
+#### EDB-Managed Application Database Replication
+
+- **Method**: PostgreSQL physical replication (streaming + WAL shipping)
+- **Primary Mechanism**: Streaming replication from Primary to Designated Secondaries
+- **Fallback Mechanism**: WAL shipping via S3/object store
+- **Direction**: DC1 (Primary Cluster) → DC2 (Replica Cluster)
+- **Network**:
+ - Encrypted tunnel (VPN/Direct Connect/WAN) for streaming replication
+ - HTTPS for S3 WAL archiving
+ - Dedicated VLAN or VPC peering recommended
+- **Replication Type**:
+ - Asynchronous (default) - better performance
+ - Synchronous (optional) - zero data loss guarantee
+- **Lag Monitoring**:
+ - Both AAP instances monitor replication lag via EDB operator metrics
+ - Prometheus metrics: `cnpg_pg_replication_lag`
+ - Grafana dashboards display real-time lag
+- **Alerting**:
+ - Alerts triggered if lag exceeds threshold (e.g., 30 seconds)
+ - PagerDuty integration for critical alerts
+- **Automatic Service Updates**:
+ - EDB operator automatically updates `-rw` service during failover
+ - Service endpoints updated within 5-10 seconds
+- **Cross-Cluster Limitation**:
+ - Automated failover across OpenShift clusters must be handled externally
+ - Integration via AAP automation or EDB Failover Manager (EFM)
+
+**Network requirements for replication:**
+- **Bandwidth**: 10 Mbps minimum, 100 Mbps recommended
+- **Latency**: <100ms for streaming replication
+- **Jitter**: <10ms
+- **Packet loss**: <0.1%
+
+**Replication slot configuration:**
+```yaml
+# In DC1 primary cluster
+spec:
+ replicationSlots:
+ highAvailability:
+ enabled: true
+ slotPrefix: _cnpg_
+ updateInterval: 30
+```
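+
+On the DC2 side, a CloudNativePG-style replica `Cluster` consuming that stream might look like
+this sketch; the cluster name and host are illustrative:
+
+```yaml
+# DC2 replica cluster sketch; names and host are illustrative
+spec:
+  replica:
+    enabled: true
+    source: dc1-cluster
+  externalClusters:
+    - name: dc1-cluster
+      connectionParameters:
+        host: postgresql-dc1.example.com
+        user: streaming_replica
+        sslmode: verify-full
+```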
+
+---
+
+## Data Flow
+
+### Write Operations (Normal State)
+
+**For EDB-Managed Application Databases:**
+
+1. **Application → AAP Controller**
+   - User or API client submits a job/workflow
+   - AAP Controller receives the request
+
+2. **AAP Controller → DC1 Primary Database** (via `-rw` service)
+   - AAP writes job data, inventory updates, credentials
+   - Connection via `postgresql-rw.edb-postgres.svc.cluster.local:5432`
+
+3. **DC1 Primary → DC1 Hot Standby Replicas** (streaming replication within the cluster)
+   - Primary replicates to 2 hot standby instances
+   - Replication lag: <100ms
+   - Used for read-only queries and HA
+
+4. **DC1 Primary → DC2 Designated Primary** (streaming replication across clusters)
+   - Replication via an OpenShift Route with TLS passthrough
+   - Typical lag: 1-5 seconds (depends on WAN latency)
+   - Used for DR failover
+
+5. **DC1 Primary → S3/Object Store** (continuous WAL archiving - fallback)
+   - WAL files uploaded every 60 seconds or 16MB, whichever comes first
+   - Used for PITR and fallback replication
+   - Retention: 30 days
+
+6. **DC2 Designated Primary → DC2 Hot Standby Replicas** (streaming replication within the cluster)
+   - DC2 designated primary replicates to 2 hot standby instances
+   - Ensures DC2 can serve reads and has HA ready for promotion
+
+**Data flow diagram:**
+```
+User/API → GLB → AAP DC1 → PostgreSQL DC1 Primary
+                                      │
+              ┌───────────────┬───────┴───────┬───────────────┐
+              ↓               ↓               ↓               ↓
+        DC1 Standby 1   DC1 Standby 2      DC2 DP      S3 WAL Archive
+                                              │
+                              ┌───────────────┼───────────────┐
+                              ↓               ↓               ↓
+                       DC2 Standby 1   DC2 Standby 2      (backup)
+```
+
+---
+
+### Read Operations
+
+**EDB-Managed Clusters:**
+
+**DC1 Primary Cluster:**
+- **Write operations:** Via `postgresql-rw` service (routes to primary instance)
+- **Read operations (HA):** Via `postgresql-ro` service (routes to hot standby replicas)
+- **Read operations (any):** Via `postgresql-r` service (routes to any instance including primary)
+
+**DC2 Replica Cluster:**
+- **Read operations only:** Via `postgresql-replica-ro` service (routes to designated primary or replicas)
+- **Cannot accept writes** unless promoted during failover
+- Used for:
+  - Read-only analytics queries (offload from DC1)
+  - DR testing and validation
+  - Backup source (to reduce load on DC1)
+
+**Load Balancing:**
+- EDB operator manages service routing automatically
+- Round-robin load balancing across available read replicas
+- Health checks ensure only healthy instances receive traffic
+
+**Service Behavior During Failover:**
+- EDB operator automatically updates `-rw` service to point to newly promoted primary
+- Applications experience seamless redirection without connection string changes
+- Read-only services updated to reflect new topology
+- Typical service update time: 5-10 seconds
+
+**Query routing strategy:**
+```
+Write queries → Always to -rw service → Primary instance
+Read queries (low latency) → -r service → Any instance (including primary)
+Read queries (HA) → -ro service → Hot standby replicas only
+Analytics queries → DC2 -replica-ro → Offload from production
+```
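+
+With the AAP operator, this routing is wired up through the external database secret. A minimal sketch follows, assuming the `-rw` service shown above; the secret name, namespace, database name, and credentials are placeholders:
+
+```yaml
+apiVersion: v1
+kind: Secret
+metadata:
+  name: aap-external-postgres  # placeholder name
+  namespace: aap
+stringData:
+  host: postgresql-rw.edb-postgres.svc.cluster.local
+  port: "5432"
+  database: awx       # placeholder
+  username: awx       # placeholder
+  password: changeme  # replace with a real credential
+  sslmode: prefer
+  type: unmanaged
+```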
+
+---
+
+### Backup Flow
+
+**EDB-Managed PostgreSQL Backups:**
+
+1. **Scheduled backup job** (initiated by AAP or a CronJob via the EDB operator)
+   - Daily full backup: 2:00 AM UTC
+   - Hourly incremental backups (optional)
+   - Triggered by a `Backup` custom resource
+
+2. **Backup pod created by the EDB operator**
+   - Temporary pod spins up with Barman Cloud tools
+   - Mounts a persistent volume for staging (if needed)
+   - Authenticates to PostgreSQL and S3
+
+3. **Database backup streamed to S3/object store** (using Barman Cloud)
+   - Full or incremental backup based on the schedule
+   - Compression: gzip (reduces size by ~70%)
+   - Encryption: AES-256 (S3 server-side or client-side)
+
+4. **WAL files continuously archived to S3** (automatic by the EDB operator)
+   - Continuous archiving every 60 seconds or 16MB, whichever comes first
+   - Parallel upload for high write workloads
+   - Checksum validation on upload
+
+5. **WAL archiving serves a dual purpose:**
+   - **Point-in-time recovery (PITR):** Restore to any second within the retention window
+   - **Fallback replication mechanism:** Replica clusters can recover from the WAL archive if streaming replication fails
+
+6. **Replica clusters recover from the WAL archive** when streaming replication fails
+   - Automatic fallback when the streaming connection is lost
+   - Catch-up from the WAL archive until streaming is restored
+   - Alerts sent if relying on the WAL archive for >5 minutes
+
+7. **AAP monitors backup completion** via operator metrics
+   - Prometheus metrics: `cnpg_pg_backup_last_succeeded`
+   - Grafana dashboard: backup status panel
+   - Integration with external monitoring (PagerDuty, Slack)
+
+8. **Alerts sent if a backup fails**
+   - Immediate alert on backup failure
+   - Warning alert if the latest backup is >36 hours old
+   - Runbook links provided in alerts
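+
+Step 1 above typically maps to a `ScheduledBackup` resource. A minimal sketch — the object name, namespace, and cluster name are assumptions:
+
+```yaml
+apiVersion: postgresql.k8s.enterprisedb.io/v1
+kind: ScheduledBackup
+metadata:
+  name: daily-full-backup
+  namespace: edb-postgres
+spec:
+  schedule: "0 0 2 * * *"  # six-field cron: 2:00 AM UTC daily
+  cluster:
+    name: postgresql
+  target: prefer-standby   # back up from a hot standby where possible
+```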
+
+**Backup Strategy per Datacenter:**
+
+**DC1 (Primary):**
+- Full backups daily + continuous WAL archiving
+- S3 bucket: `s3://edb-backups-dc1-prod` (primary region: us-east-1)
+- Retention: 30 days operational, 365 days compliance (Glacier transition)
+- Backup source: Prefer hot standby replica (reduce load on primary)
+
+**DC2 (Disaster Recovery):**
+- Independent backups to separate S3 bucket for redundancy
+- S3 bucket: `s3://edb-backups-dc2-dr` (DR region: us-west-2)
+- Retention: 30 days
+- Backup source: Designated primary (already in read-only mode)
+- Cross-region replication from DC1 S3 bucket (optional)
+
+**Backup validation:**
+- Monthly restore test to verify backup integrity
+- Automated via `CronJob` and validation scripts
+- Test restores to separate namespace
+- Validation: Data integrity checks, connectivity tests, query execution
+
+**Recovery scenarios:**
+- **Recent data loss:** PITR from WAL archive (RPO: <60 seconds)
+- **Database corruption:** Restore from latest full backup + WAL replay
+- **Datacenter loss:** Restore DC1 from DC2 backups or vice versa
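+
+The PITR path can be sketched as a new Cluster bootstrapped from the archive; object names, bucket, credential secret, and target time below are placeholders:
+
+```yaml
+apiVersion: postgresql.k8s.enterprisedb.io/v1
+kind: Cluster
+metadata:
+  name: postgresql-restore
+  namespace: edb-postgres
+spec:
+  instances: 1
+  storage:
+    size: 20Gi
+  bootstrap:
+    recovery:
+      source: postgresql
+      recoveryTarget:
+        targetTime: "2025-01-15 10:30:00+00"  # placeholder timestamp
+  externalClusters:
+    - name: postgresql
+      barmanObjectStore:
+        destinationPath: s3://edb-backups-dc1-prod
+        s3Credentials:
+          accessKeyId:
+            name: backup-creds
+            key: ACCESS_KEY_ID
+          secretAccessKey:
+            name: backup-creds
+            key: SECRET_ACCESS_KEY
+```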
+
+---
+
+## AAP Deployment Architecture
+
+Detailed architecture documentation for AAP on different platforms:
+
+- **[RHEL AAP Architecture](rhel-aap-architecture.md)** - AAP on RHEL with systemd services
+  - Systemd service management
+  - HAProxy for load balancing
+  - PostgreSQL on bare metal/VMs
+  - Manual service orchestration during failover
+
+- **[OpenShift AAP Architecture](openshift-aap-architecture.md)** - AAP on OpenShift with operator
+  - Operator-based lifecycle management
+  - Native Kubernetes Services for load balancing
+  - CloudNativePG for PostgreSQL
+  - Automated pod orchestration during failover
+
+**Choosing deployment type:**
+- **Use RHEL** if you have existing VM/bare metal infrastructure and prefer traditional management
+- **Use OpenShift** if you want cloud-native orchestration and have Kubernetes expertise
+
+---
+
+## AAP Cluster Management
+
+### Integration with EDB EFM (Enterprise Failover Manager)
+
+**See:** [EDB Failover Manager Documentation](enterprisefailovermanager.md)
+
+EFM provides automated database failover detection and orchestration:
+
+**Key features:**
+- Automatic detection of primary database failure
+- Promotion of standby to primary within 30-60 seconds
+- Virtual IP (VIP) failover for seamless client reconnection
+- Integration with AAP scaling scripts
+- Email/SNMP notifications
+
+**Failover trigger:**
+1. EFM detects primary database failure (3 consecutive health check failures)
+2. EFM promotes best standby replica to primary
+3. EFM calls AAP orchestration script: [`efm-orchestrated-failover.sh`](../scripts/efm-orchestrated-failover.sh)
+4. Script scales down AAP in DC1, scales up AAP in DC2
+5. GLB health checks detect DC2 AAP healthy, route traffic to DC2
+6. RTO achieved: <5 minutes
+
+**Configuration:**
+```bash
+# /etc/edb/efm-4.x/efm.properties
+enable.custom.scripts=true
+script.post.promotion=/usr/edb/efm-4.x/bin/efm-orchestrated-failover.sh %h %s %a %v
+```
+
+### AAP Cluster Scripts
+
+**See:** [AAP Cluster Scripts Documentation](../scripts/README.md)
+
+**Operational scripts:**
+- [`scale-aap-up.sh`](../scripts/scale-aap-up.sh) - Scale AAP to operational state in target datacenter
+- [`scale-aap-down.sh`](../scripts/scale-aap-down.sh) - Scale AAP to zero in inactive datacenter
+- [`efm-orchestrated-failover.sh`](../scripts/efm-orchestrated-failover.sh) - Full DR failover orchestration
+- [`validate-aap-data.sh`](../scripts/validate-aap-data.sh) - Post-failover data validation
+- [`monitor-efm-scripts.sh`](../scripts/monitor-efm-scripts.sh) - EFM integration monitoring
+
+**Runbook:**
+- [AAP Cluster Management Runbook](manual-scripts-doc.md) - Step-by-step operational procedures
+
+---
+
+## Disaster Recovery Scenarios
+
+**See:** [DR Scenarios Documentation](dr-scenarios.md)
+
+**Documented failure scenarios:**
+
+1. **Single Pod Failure (Database or AAP)** - Automatic Kubernetes restart
+   - RTO: <30 seconds
+   - RPO: 0 (no data loss)
+   - Automation: Kubernetes liveness/readiness probes
+
+2. **Database Cluster Failure (DC1)** - EFM automated failover
+   - RTO: <1 minute
+   - RPO: <5 seconds
+   - Automation: EFM promotion + service updates
+
+3. **Complete Datacenter Failure (DC1)** - Manual failover to DC2
+   - RTO: <5 minutes
+   - RPO: <5 seconds
+   - Automation: AAP playbook or manual script execution
+
+4. **Data Corruption (Logical)** - Point-in-time recovery
+   - RTO: 2-4 hours
+   - RPO: <1 minute (depends on backup schedule)
+   - Automation: PITR scripts
+
+5. **Network Partition (Split-Brain)** - Prevention via database role checks
+   - RTO: N/A (prevention measure)
+   - RPO: N/A
+   - Automation: Pre-startup validation scripts
+
+6. **Cascading Failures (Both DCs)** - Recovery from S3 backups
+   - RTO: <24 hours
+   - RPO: <5 minutes
+   - Automation: Disaster recovery runbook
+
+**DR Testing:**
+- **Quarterly DR drills:** Automated via `CronJob` - see [DR Testing Guide](dr-testing-guide.md)
+- **Validation scripts:** Data integrity checks post-failover
+- **RTO/RPO measurement:** Automated metrics collection during tests
+
+---
+
+## Scaling Considerations
+
+### Horizontal Scaling (Adding Instances)
+
+**PostgreSQL (OpenShift):**
+
+```yaml
+# Edit cluster.yaml
+spec:
+  instances: 4  # e.g., increase from 3 (1 primary + 2 standbys) to 4
+```
+
+Apply changes:
+```bash
+oc apply -k db-deploy/sample-cluster/
+```
+
+**Benefits:**
+- Increased read capacity (more read replicas)
+- Higher availability (more failover candidates)
+- Better resource distribution
+
+**Considerations:**
+- More instances = more replication overhead
+- Diminishing returns beyond 3-5 instances per cluster
+- Network bandwidth requirements increase
+
+---
+
+### Vertical Scaling (Resource Limits)
+
+**PostgreSQL (OpenShift):**
+
+```yaml
+# Edit cluster.yaml
+spec:
+  resources:
+    requests:
+      cpu: "2"
+      memory: "4Gi"
+    limits:
+      cpu: "4"
+      memory: "8Gi"
+```
+
+**AAP (OpenShift):**
+
+```yaml
+# Edit ansibleautomationplatform.yaml
+spec:
+  controller:
+    resources:
+      requests:
+        cpu: "2"
+        memory: "4Gi"
+      limits:
+        cpu: "4"
+        memory: "8Gi"
+```
+
+**Recommendations:**
+- **PostgreSQL:** 2-4 CPU cores, 4-8 GB RAM per instance (typical)
+- **AAP Controller:** 2-4 CPU cores, 4-8 GB RAM per replica
+- **AAP Hub:** 1-2 CPU cores, 2-4 GB RAM per replica
+- **Monitor resource utilization** before scaling up
+
+---
+
+### Storage Scaling
+
+**Resize PostgreSQL PVCs:**
+
+```bash
+# Check current size
+oc get pvc -n edb-postgres
+
+# Preferred: raise spec.storage.size in the Cluster resource;
+# the operator expands the PVCs (StorageClass must support expansion)
+oc edit cluster postgresql -n edb-postgres
+
+# Alternatively, edit the PVC directly and increase
+# spec.resources.requests.storage (keep the Cluster spec in sync)
+oc edit pvc postgresql-1 -n edb-postgres
+```
+
+**Best practices:**
+- Plan for 3-6 months of data growth
+- Monitor disk usage weekly
+- Keep 20% free space minimum
+- Use separate volumes for WAL if high write workload
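+
+The last point maps to a dedicated WAL volume in the cluster spec. A sketch — the sizes are illustrative:
+
+```yaml
+spec:
+  storage:
+    size: 100Gi   # data volume
+  walStorage:
+    size: 20Gi    # separate volume for WAL
+```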
+
+---
+
+### Geographic Distribution
+
+**Multi-region deployment:**
+
+1. **Deploy primary cluster** in primary region (DC1)
+2. **Deploy replica cluster** in DR region (DC2)
+3. **Configure cross-region replication** via OpenShift Routes or VPN
+4. **Set up S3 buckets** in both regions for backups
+5. **Configure cross-region S3 replication** for backup redundancy
+
+**Latency considerations:**
+- **Streaming replication:** Works well up to 100ms latency
+- **High latency (>100ms):** Consider asynchronous replication only
+- **Very high latency (>500ms):** Use WAL shipping as primary method
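+
+For the very-high-latency case, the replica cluster can be fed from the WAL archive instead of a streaming connection. A sketch, with the source name, bucket, and credential secret assumed:
+
+```yaml
+spec:
+  replica:
+    enabled: true
+    source: dc1-archive
+  externalClusters:
+    - name: dc1-archive
+      barmanObjectStore:
+        destinationPath: s3://edb-backups-dc1-prod
+        s3Credentials:
+          accessKeyId:
+            name: backup-creds
+            key: ACCESS_KEY_ID
+          secretAccessKey:
+            name: backup-creds
+            key: SECRET_ACCESS_KEY
+```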
+
+**See:** [OpenShift Installation Guide - Scaling](install-kubernetes-manual.md#scaling-considerations)
+
+---
+
+## Related Architecture Documentation
+
+### Core Architecture
+- **[Main README](../README.md)** - Architecture overview and quick links
+- **[RHEL AAP Architecture](rhel-aap-architecture.md)** - AAP on RHEL detailed architecture
+- **[OpenShift AAP Architecture](openshift-aap-architecture.md)** - AAP on OpenShift detailed architecture
+
+### Deployment Guides
+- **[OpenShift Installation](install-kubernetes-manual.md)** - Detailed OpenShift deployment
+- **[RHEL Installation with TPA](install-tpa.md)** - Automated RHEL deployment
+- **[Cross-Cluster Replication](../db-deploy/cross-cluster/README.md)** - DC1→DC2 replication setup
+
+### Operations & DR
+- **[DR Scenarios](dr-scenarios.md)** - 6 documented failure scenarios
+- **[DR Testing Guide](dr-testing-guide.md)** - Complete testing framework
+- **[EDB Failover Manager](enterprisefailovermanager.md)** - EFM integration
+- **[Split-Brain Prevention](split-brain-prevention.md)** - Database role validation
+- **[Operations Runbook](manual-scripts-doc.md)** - Day-to-day procedures
+- **[Troubleshooting](troubleshooting.md)** - Common issues and diagnostics
+
+### Scripts & Automation
+- **[Scripts Reference](../scripts/README.md)** - All automation scripts
+- **[AAP Scaling Scripts](../scripts/)** - scale-aap-up.sh, scale-aap-down.sh
+- **[DR Failover Scripts](../scripts/)** - efm-orchestrated-failover.sh, dr-failover-test.sh
+- **[Validation Scripts](../scripts/)** - validate-aap-data.sh, measure-rto-rpo.sh
+
+---
+
+**Architecture Documentation Complete**
+
+For questions or improvements, see [CONTRIBUTING.md](../CONTRIBUTING.md) or open an issue on GitHub.