345 changes: 165 additions & 180 deletions README.md
@@ -8,50 +8,49 @@
## Table of Contents

- [Overview](#overview)
- [Quick Links](#quick-links)
- [Installation](#installation)
- [RHEL / hosts — Trusted Postgres Architect (TPA) (recommended)](docs/install-tpa.md)
- [OpenShift — EDB operator & manual](docs/install-kubernetes-manual.md)
- [OpenShift — Kustomize manifests (`db-deploy/`)](db-deploy/README.md)
- [OpenShift — AAP operator with external Postgres (`aap-deploy/`)](aap-deploy/README.md)
- [RHEL manual installation](docs/install-rhel-manual.md)
- [OpenShift manual installation](docs/install-kubernetes-manual.md)
- [Architecture](#architecture)
- [Component Details](#component-details)
- [Global Load Balancer](#global-load-balancer)
- [Ansible Automation Platform (AAP)](#ansible-automation-platform-aap)
- [Network Connectivity](#network-connectivity)
- [User to AAP (via Global Load Balancer)](#user-to-aap-via-global-load-balancer)
- [AAP to PostgreSQL Databases](#aap-to-postgresql-databases)
- [Inter-Datacenter Replication](#inter-datacenter-replication)
- [Write Operations (Normal State)](#write-operations-normal-state)
- [Read Operations](#read-operations)
- [Backup Flow](#backup-flow)
- [AAP Deployment Architecture](#aap-deployment-architecture)
- [RHEL AAP Architecture](docs/rhel-aap-architecture.md)
- [OpenShift AAP Architecture](docs/openshift-aap-architecture.md)
- [AAP Cluster Management](#aap-cluster-management)
- [Integration with EDB EFM (Enterprise Failover Manager)](#integration-with-edb-efm-enterprise-failover-manager)
- [AAP cluster management — runbook](docs/manual-scripts-doc.md)
- [AAP cluster scripts (`scripts/README.md`)](scripts/README.md)
- [EFM Integration (EDB Failover Manager)](docs/enterprisefailovermanager.md)
- [Troubleshooting](docs/troubleshooting.md)
- [Disaster Recovery Scenarios](#disaster-recovery-scenarios)
- [Full scenarios doc](docs/dr-scenarios.md)
- [Scaling Considerations](#scaling-considerations)
- [Horizontal & vertical scaling (OpenShift)](docs/install-kubernetes-manual.md#scaling-considerations)
- [EDB Postgres on OpenShift Architecture](docs/install-kubernetes-manual.md#edb-postgres-on-openshift-architecture)
- [Operations](#operations)
- [Contributing](#contributing)

## Overview

This document describes the architecture of EnterpriseDB Postgres deployed Active/Passive
across two clusters in different datacenters, with in-datacenter replication, for the
Ansible Automation Platform (AAP). This achieves a **NEAR** HA architecture,
especially for failover to the databases syncing within the same region/datacenter.

A DR failover is intended only for a catastrophic failure. Failing over to an
in-site database should require little to no intervention at the application layer.
Note that during a DR failover any running jobs are lost; during an in-site failover,
jobs should continue to run UNLESS the controller itself fails.
This repository provides a complete solution for deploying Ansible Automation Platform (AAP) with
EnterpriseDB PostgreSQL in a multi-datacenter Active/Passive configuration. The architecture
achieves **near-HA** with automatic failover within datacenters and orchestrated failover across
datacenters.

**Key Features:**
- ✅ **Multi-datacenter HA/DR** - Active-Passive across two datacenters
- ✅ **Automatic failover** - In-datacenter failover <1 minute
- ✅ **PostgreSQL replication** - Physical streaming + WAL archiving
- ✅ **AAP orchestration** - Automated scaling during failover
- ✅ **Comprehensive testing** - Automated DR testing framework
- ✅ **Production-ready** - Security, monitoring, backup strategies

**Target RTO/RPO:**
- **In-datacenter failover:** RTO <1 minute, RPO <5 seconds
- **Cross-datacenter failover:** RTO <5 minutes, RPO <5 seconds
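
The targets above can be checked mechanically after a drill. The following is a minimal sketch, not the repository's `measure-rto-rpo.sh`: the function name `check_targets` and its argument order are assumptions made for illustration, and it only handles whole seconds.

```shell
# Hypothetical helper: compare a measured failover against the stated targets.
# Usage: check_targets <scope> <rto_seconds> <rpo_seconds>
#   scope is "in-dc" or "cross-dc"; both measurements are whole seconds.
check_targets() {
  local scope="$1" rto="$2" rpo="$3" rto_limit rpo_limit=5
  case "$scope" in
    in-dc)    rto_limit=60 ;;   # target: RTO < 1 minute
    cross-dc) rto_limit=300 ;;  # target: RTO < 5 minutes
    *) echo "unknown scope: $scope" >&2; return 2 ;;
  esac
  # Both limits are strict ("less than"), matching the targets as written.
  if [ "$rto" -lt "$rto_limit" ] && [ "$rpo" -lt "$rpo_limit" ]; then
    echo "PASS"
  else
    echo "FAIL"
    return 1
  fi
}
```

For example, `check_targets cross-dc 240 4` reports a pass, while an in-datacenter failover that took 75 seconds reports a failure and returns a non-zero status.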

## Quick Links

### Getting Started
- **[🚀 Quick Start Guide](docs/quick-start-guide.md)** - Deploy in 15-30 minutes
- **[📚 Documentation Index](docs/INDEX.md)** - Complete documentation organized by topic
- **[🏗️ Architecture Details](docs/architecture.md)** - Comprehensive architecture documentation

### Deployment
- **[OpenShift Deployment](docs/install-kubernetes-manual.md)** - Operator-based deployment
- **[RHEL with TPA](docs/install-tpa.md)** - Automated deployment with Trusted Postgres Architect
- **[Database Deploy (Kustomize)](db-deploy/README.md)** - GitOps-friendly manifests
- **[AAP Deploy (Kustomize)](aap-deploy/README.md)** - AAP operator deployment

### Operations
- **[Operations Runbook](docs/manual-scripts-doc.md)** - Day-to-day operational procedures
- **[Scripts Reference](scripts/README.md)** - All automation scripts documented
- **[DR Testing Guide](docs/dr-testing-guide.md)** - Complete DR testing framework
- **[Troubleshooting](docs/troubleshooting.md)** - Common issues and solutions

## Installation

@@ -62,6 +61,16 @@ from EnterpriseDB for Postgres on **bare metal, cloud instances, or SSH-managed
TPA does **not** deploy the **EDB Postgres on OpenShift** operator; for Postgres **on OpenShift
as pods**, use the operator and manual/GitOps steps in this repo.

### Installation Quick Reference

| Platform | Time | Guide |
|----------|------|-------|
| **OpenShift** | 15 min | [Quick Start - OpenShift](docs/quick-start-guide.md#quick-start-openshift-15-minutes) |
| **RHEL with TPA** | 20 min | [Quick Start - RHEL](docs/quick-start-guide.md#quick-start-rhel-with-tpa-20-minutes) |
| **Local CRC** | 30 min | [Quick Start - CRC](docs/quick-start-guide.md#quick-start-local-testing-with-crc-30-minutes) |

### Detailed Installation Guides

| Area | Description | Guide |
|------|-------------|--------|
| **RHEL / hosts (TPA)** *(recommended)* | `tpaexec` workflows for supported platforms (bare metal, cloud, Docker for testing) | [TPA install](docs/install-tpa.md)<br>[RHEL / Ansible entry](docs/install-tpa.md#rhel-tpa-ansible)<br>[TPA on GitHub](https://github.com/EnterpriseDB/tpa)<br>[EDB TPA docs](https://www.enterprisedb.com/docs/tpa/latest/) |
@@ -74,147 +83,123 @@ as pods**, use the operator and manual/GitOps steps in this repo.
| **Troubleshooting** | Diagnostics and issue resolution | [Troubleshooting](docs/troubleshooting.md) |
| **AAP cluster scripts & runbook** | Automation and operational procedures | [Scripts](scripts/README.md)<br>[Runbook](docs/manual-scripts-doc.md) |

## Architecture
## Architecture

### Architecture Overview

The solution implements a **multi-datacenter Active/Passive architecture** with:

- **Two datacenters:** DC1 (active), DC2 (passive/DR)
- **PostgreSQL replication:** Physical streaming replication + WAL archiving to S3
- **AAP deployment:** Separate clusters in each datacenter, scaled based on active/passive state
- **Failover orchestration:** EDB Failover Manager (EFM) integration with AAP scaling scripts
- **Global load balancer:** Routes traffic to active datacenter

![EDB Postgres Multi-Datacenter Architecture](images/AAP_EDB.drawio.png)

## Component Details

### Global Load Balancer

The global load balancer provides a single entry point for AAP access:

- **DNS**: `aap.example.com`
- **Type**: Active-Passive (DC1 primary, DC2 standby)
- **Health Checks**: Monitors AAP Controller availability in both datacenters
- **Failover**: Automatic failover to DC2 if DC1 becomes unavailable
- **Routing**: Priority-based routing (100% traffic to DC1 when healthy)
- **Failback**: Automatic or manual failback to DC1 when it recovers
- **Protocols**: HTTPS (port 443), WebSocket support for real-time job updates
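
The priority-based routing described above reduces to a small decision: send traffic to DC1 whenever it is healthy, and to DC2 only when DC1 is not. The sketch below illustrates that logic only; a real global load balancer implements it in its own configuration. The function names `probe` and `select_target` are hypothetical, and the health-check path is deliberately left as a placeholder because it depends on the AAP version.

```shell
# probe <url>: succeeds when the URL answers within 5 seconds
# without an HTTP error (curl -f fails on status 400 and above).
probe() {
  curl -fsS --max-time 5 -o /dev/null "$1"
}

# select_target <dc1_state> <dc2_state>: each state is "up" or "down".
# DC1 has priority; DC2 is used only when DC1 is down.
select_target() {
  if [ "$1" = "up" ]; then
    echo "dc1"
  elif [ "$2" = "up" ]; then
    echo "dc2"
  else
    echo "none"
    return 1
  fi
}

# Example wiring (the <health-path> placeholder must be filled in per environment):
#   probe "https://aap-dc1.apps.ocp1.example.com/<health-path>" && dc1=up || dc1=down
#   probe "https://aap-dc2.apps.ocp2.example.com/<health-path>" && dc2=up || dc2=down
#   select_target "$dc1" "$dc2"
```

Keeping the decision separate from the probe makes the failover rule easy to test without network access.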

### Ansible Automation Platform (AAP)

**Operator install with external EDB Postgres** (sample namespace / cluster: `edb-postgres` / `postgresql`):
- See **[`aap-deploy/README.md`](aap-deploy/README.md)** (overview)
- See **[`aap-deploy/openshift/README.md`](aap-deploy/openshift/README.md)** (subscription + `AnsibleAutomationPlatform` CR)

For OpenShift, AAP is deployed on **separate OpenShift clusters** for high availability and
geographic distribution. For RHEL, you can do a single install across datacenters; however, you
**MUST TURN OFF THE SERVICES ON THE SECONDARY SITE**.

#### Datacenter 1 - AAP Instance
- **Namespace**: `ansible-automation-platform`
- **AAP Gateway**: 3 replicas for HA
- **AAP Controller**: 3 replicas for HA
- **Automation Hub**: 2 replicas
- **Database**: PostgreSQL cluster (1 primary + 2 replicas) managed by EDB operator
- **Route**: `aap-dc1.apps.ocp1.example.com`

#### Datacenter 2 - AAP Instance (scaled down)
- **Namespace**: `ansible-automation-platform`
- **AAP Gateway**: 3 replicas for HA
- **AAP Controller**: 3 replicas for HA
- **Automation Hub**: 2 replicas
- **Database**: PostgreSQL cluster (1 primary + 2 replicas) managed by EDB operator
- **Route**: `aap-dc2.apps.ocp2.example.com`

#### AAP Database Replication

The AAP databases are replicated from active to passive datacenter:
- **Method**: PostgreSQL logical replication (Active → Passive) - *Note: AAP's internal database uses logical replication for flexibility*
- **Direction**: DC1 (Active) → DC2 (Passive)
- **Mode**: Asynchronous replication with minimal lag
- **Shared Data**: Job templates, inventory, credentials, execution history
- **Failover**: DC2 database promoted to read-write during failover
- **Failback**: Data synchronized back to DC1 when it recovers

#### EDB-Managed PostgreSQL Cluster Replication

EDB-managed application database clusters use physical replication:
- **Method**: PostgreSQL physical replication via streaming replication and WAL shipping
- **Primary Method**: Streaming replication from Primary to Designated Primary
- **Fallback Method**: WAL shipping via S3/object store (continuous WAL archiving)
- **Within Cluster**: Hot standby replicas use streaming replication from primary/designated primary
- **Mode**: Asynchronous streaming with optional synchronous mode
- **Benefits**: Block-level replication, faster failover, exact byte-for-byte replica

## Network Connectivity

### User to AAP (via Global Load Balancer)

Users and automation clients connect to AAP through the global load balancer:
- **URL**: `https://aap.example.com`
- **Protocol**: HTTPS/443 with WebSocket support
- **Load Balancing**: Active-Passive (priority-based)
- **Active Target**: DC1 AAP (100% traffic when healthy)
- **Passive Target**: DC2 AAP (standby, only receives traffic during failover)
- **Health Checks**: Layer 7 health checks to AAP Controller endpoints
- **Session Affinity**: Sticky sessions for long-running jobs
- **TLS Termination**: At load balancer or end-to-end encryption

### AAP to PostgreSQL Databases

AAP can connect to only one read-write (RW) database at a time:
- **Protocol**: PostgreSQL wire protocol (port 5432)
- **Access**: Via OpenShift Services (ClusterIP within cluster, Routes/LoadBalancer for remote)
- **Authentication**: Certificate-based or password authentication
- **Encryption**: TLS/SSL enforced
- **Connection Pooling**: PgBouncer for efficient connection management
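
As an illustration of the settings above, a connection string for the read-write service might be assembled as follows. This is a sketch under assumptions: the helper name `pg_rw_uri` is hypothetical, the host follows the usual Kubernetes `<service>.<namespace>.svc` form, and the database and user names in the usage line are placeholders rather than values from this repository.

```shell
# pg_rw_uri <cluster> <namespace> <database> <user>
# Prints a libpq connection URI for the cluster's -rw service on port 5432,
# with TLS required (sslmode=require encrypts but does not verify the server).
pg_rw_uri() {
  printf 'postgresql://%s@%s-rw.%s.svc:5432/%s?sslmode=require\n' \
    "$4" "$1" "$2" "$3"
}

# Example with the sample cluster/namespace named in this README
# (database "awx" and user "aap" are placeholders):
#   psql "$(pg_rw_uri postgresql edb-postgres awx aap)"
```

Where the server CA is available to the client, `sslmode=verify-full` is the stricter choice because it also validates the certificate and hostname.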

### Inter-Datacenter Replication

#### EDB-Managed Application Database Replication
- **Method**: PostgreSQL physical replication (streaming + WAL shipping)
- **Primary Mechanism**: Streaming replication from Primary to Designated Primary
- **Fallback Mechanism**: WAL shipping via S3/object store
- **Direction**: DC1 (Primary Cluster) → DC2 (Replica Cluster)
- **Network**: Encrypted tunnel (VPN/Direct Connect/WAN) for streaming replication
- **Replication Type**: Asynchronous (default) or synchronous (configurable)
- **Lag Monitoring**: Both AAP instances monitor replication lag via EDB operator metrics
- **Alerting**: Alerts triggered if lag exceeds threshold (e.g., 30 seconds)
- **Automatic Service Updates**: EDB operator automatically updates `-rw` service during failover
- **Cross-Cluster Limitation**: Automated failover across OpenShift clusters must be handled externally (via AAP or higher-level orchestration)
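
The lag threshold above can be sketched as a simple check. The README states that lag is monitored through operator metrics; the SQL here is a swapped-in alternative that asks a standby directly, using the built-in `pg_last_xact_replay_timestamp()` function. The names `LAG_SQL` and `lag_status` are hypothetical, and the 30-second default mirrors the example threshold in the list.

```shell
# Query a standby can run to estimate replay lag in seconds.
# COALESCE covers the NULL returned when no transaction has been replayed
# (for example, when run on a primary).
LAG_SQL="SELECT COALESCE(EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp()), 0);"

# lag_status <lag_seconds> [threshold_seconds]
# Prints OK or ALERT; returns non-zero on ALERT. Accepts fractional seconds.
lag_status() {
  awk -v lag="$1" -v max="${2:-30}" \
    'BEGIN { if (lag + 0 > max + 0) { print "ALERT"; exit 1 } print "OK" }'
}

# Example wiring (connection details omitted; -At prints the bare value):
#   lag_status "$(psql -At -c "$LAG_SQL")"
```

One caveat with this estimate: when the primary is idle, the last replayed transaction grows older even though the standby is fully caught up, so the value can overstate real lag.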

### Write Operations (Normal State)

**For EDB-Managed Application Databases:**
1. Application → AAP Controller
2. AAP Controller → DC1 Primary Database (via `-rw` service)
3. DC1 Primary → DC1 Hot Standby Replicas (streaming replication within cluster)
4. DC1 Primary → DC2 Designated Primary (streaming replication across clusters)
5. DC1 Primary → S3/Object Store (continuous WAL archiving - fallback)
6. DC2 Designated Primary → DC2 Hot Standby Replicas (streaming replication within cluster)

### Read Operations

**EDB-Managed Clusters:**
- **DC1 Primary Cluster**:
  - Write operations via `prod-db-rw` service (routes to primary)
  - Read operations via `prod-db-ro` service (routes to hot standby replicas)
  - Read operations via `prod-db-r` service (routes to any instance)
- **DC2 Replica Cluster**:
  - Read operations only via `prod-db-replica-ro` service (routes to designated primary or replicas)
  - Cannot accept writes unless promoted
- **Load Balancing**: EDB operator manages service routing automatically
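
The service naming above follows a fixed suffix pattern, so a client-side helper can derive the service name from the kind of access it needs. This is an illustrative sketch; `service_for` is a hypothetical name, not a script in this repository.

```shell
# service_for <cluster> <mode>
#   rw: primary only (writes)
#   ro: hot standby replicas only (reads)
#   r:  any instance (reads)
service_for() {
  case "$2" in
    rw|ro|r) echo "$1-$2" ;;
    *) echo "unknown mode: $2" >&2; return 2 ;;
  esac
}
```

For example, `service_for prod-db ro` yields `prod-db-ro`, and an unrecognized mode fails rather than silently falling back to the read-write service.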

**Service Behavior During Failover:**
- EDB operator automatically updates `-rw` service to point to newly promoted primary
- Applications experience seamless redirection without connection string changes

### Backup Flow

**EDB-Managed PostgreSQL Backups:**
1. Scheduled backup job (initiated by AAP or CronJob via EDB operator)
2. Backup pod created by EDB operator
3. Database backup streamed to S3/object store (using Barman Cloud)
4. WAL files continuously archived to S3 (automatic by EDB operator)
5. WAL archiving serves dual purpose:
   - Point-in-time recovery (PITR)
   - Fallback replication mechanism for replica clusters
6. Replica clusters can recover from WAL archive if streaming replication fails
7. AAP monitors backup completion via operator metrics
8. Alerts sent if backup fails

**Backup Strategy per Datacenter:**
- **DC1**: Full backups + continuous WAL archiving to S3 bucket (primary region)
- **DC2**: Independent backups to separate S3 bucket (DR region) for redundancy
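
When validating either bucket, it can help to confirm that the newest backup still falls inside the retention window. The sketch below assumes the 30-day retention listed under Key Components as its default; the function name `backup_is_current` and the use of Unix epoch timestamps are illustrative choices, not part of the repository's scripts.

```shell
# backup_is_current <backup_epoch> <now_epoch> [retention_days]
# Succeeds when the backup is not in the future and no older than the
# retention window (default: 30 days).
backup_is_current() {
  local age=$(( $2 - $1 ))
  local limit=$(( ${3:-30} * 86400 ))
  [ "$age" -ge 0 ] && [ "$age" -le "$limit" ]
}

# Example wiring (how the backup timestamp is obtained is left to the caller):
#   backup_is_current "$last_backup_epoch" "$(date +%s)" || echo "backup outside retention"
```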
### Key Components

1. **Global Load Balancer** - Single entry point with health check-based routing
2. **Ansible Automation Platform (AAP)** - Deployed in both datacenters
3. **PostgreSQL Clusters** - EDB Postgres Advanced with CloudNativePG operator
4. **Replication** - Streaming replication DC1→DC2 with S3 WAL archive fallback
5. **Backup** - Barman Cloud to S3 with 30-day retention and PITR capability

### Architecture Documentation

**📖 [Complete Architecture Documentation](docs/architecture.md)**

Detailed documentation includes:
- Component details (GLB, AAP, PostgreSQL)
- Network connectivity and data flow
- Replication topology and configuration
- Backup and restore strategies
- Scaling considerations
- Deployment architecture for RHEL and OpenShift

**Platform-Specific Architecture:**
- **[RHEL AAP Architecture](docs/rhel-aap-architecture.md)** - Systemd services, HAProxy, manual orchestration
- **[OpenShift AAP Architecture](docs/openshift-aap-architecture.md)** - Operators, native services, automated orchestration

## Operations

### Day-to-Day Operations

- **[Operations Runbook](docs/manual-scripts-doc.md)** - Step-by-step operational procedures
- **[Script Reference](scripts/README.md)** - All automation scripts with usage examples
- **[Troubleshooting Guide](docs/troubleshooting.md)** - Common issues and diagnostics

### Disaster Recovery

- **[DR Scenarios](docs/dr-scenarios.md)** - 6 documented failure scenarios with procedures
- **[DR Testing Guide](docs/dr-testing-guide.md)** - Complete testing framework with quarterly drills
- **[Split-Brain Prevention](docs/split-brain-prevention.md)** - Database role validation and fencing
- **[EDB Failover Manager](docs/enterprisefailovermanager.md)** - EFM integration and configuration

### Automation Scripts

Located in [`scripts/`](scripts/):

**AAP Management:**
- `scale-aap-up.sh` - Scale AAP to operational state
- `scale-aap-down.sh` - Scale AAP to zero (maintenance/DR)

**DR Orchestration:**
- `efm-orchestrated-failover.sh` - Full automated failover
- `dr-failover-test.sh` - DR testing automation
- `validate-aap-data.sh` - Post-failover validation
- `measure-rto-rpo.sh` - RTO/RPO measurement
- `generate-dr-report.sh` - Automated DR test reporting

**Pre-commit Hooks:**
- `hooks/check-script-permissions.sh` - Verify executable permissions
- `hooks/validate-openshift-manifests.sh` - Validate YAML manifests

## Contributing

We welcome contributions! Please see [CONTRIBUTING.md](CONTRIBUTING.md) for:

- Documentation standards
- Code standards (shell scripts, YAML)
- Testing requirements
- Pull request process
- Commit message guidelines

### Documentation

All documentation is in [`docs/`](docs/):

- **[Documentation Index](docs/INDEX.md)** - Complete documentation organized by topic
- **[Quick Start Guide](docs/quick-start-guide.md)** - 15-30 minute deployment paths
- **[Architecture](docs/architecture.md)** - Comprehensive architecture documentation

### Repository Structure

```
EDB_Testing/
├── docs/ # All documentation
│ ├── INDEX.md # Documentation index
│ ├── quick-start-guide.md # Quick start (15-30 min)
│ ├── architecture.md # Architecture details
│ ├── dr-testing-guide.md # DR testing framework
│ └── ... # Additional guides
├── db-deploy/ # PostgreSQL deployment manifests
│ ├── operator/ # CloudNativePG operator
│ ├── sample-cluster/ # Base cluster manifests
│ └── cross-cluster/ # DC1→DC2 replication
├── aap-deploy/ # AAP deployment
│ ├── openshift/ # OpenShift manifests
│ └── edb-bootstrap/ # Database initialization
├── scripts/ # Automation scripts
│ ├── scale-aap-*.sh # AAP scaling
│ ├── dr-*.sh # DR orchestration
│ └── validate-*.sh # Validation scripts
├── openshift/ # OpenShift-specific configs
│ └── dr-testing/ # DR testing CronJob
└── .github/ # CI/CD workflows
└── workflows/ # GitHub Actions
```

---

**Questions?** See [docs/INDEX.md](docs/INDEX.md) for complete documentation or open an issue.