diff --git a/.github/actions/spelling/allow.txt b/.github/actions/spelling/allow.txt
index dd505930..dd67ce14 100644
--- a/.github/actions/spelling/allow.txt
+++ b/.github/actions/spelling/allow.txt
@@ -694,6 +694,12 @@ JVM
 analyticdb
 ttl
 XUANWU
+Grafana
+Nginx
+pingtime
+reconfig
+RTO
+wtimeout
 bfd
 currentopmetrics
 daguozb
diff --git a/docs/images/disaster_recovery_architecture.png b/docs/images/disaster_recovery_architecture.png
new file mode 100644
index 00000000..75602c5b
Binary files /dev/null and b/docs/images/disaster_recovery_architecture.png differ
diff --git a/docs/platform-ops/production-deploy/disaster-recovery.md b/docs/platform-ops/production-deploy/disaster-recovery.md
new file mode 100644
index 00000000..9a1050de
--- /dev/null
+++ b/docs/platform-ops/production-deploy/disaster-recovery.md
@@ -0,0 +1,338 @@
+# TapData Cross-Data Center Disaster Recovery Guide
+
+This guide outlines TapData's disaster recovery (DR) solution for cross-data center or availability zone failures, intended for production setups where an entire site outage could disrupt operations. Using a "primary cluster + cold standby services + hot data" model, it provides real-time data replication across regions and fast service failover, minimizing downtime during regional disruptions.
+
+:::tip
+For high availability within a single cluster (to handle individual node failures), see [Deploying High-Availability TapData Enterprise (Three Nodes)](install-tapdata-ha-with-3-node.md).
+:::
+
+## 1. Overview and Solution Design
+
+TapData uses a distributed architecture with key components including TapData Management (task orchestration), TapData Engine (data processing), TapData Agent (data collection), and MongoDB (metadata storage). While intra-cluster high availability protects against single-point failures, a full data center outage can still halt business operations.
+
+To improve resilience and data continuity, TapData offers a cross-data center/availability zone DR approach suitable for scenarios such as:
+
+- **Multi-site deployments**: Primary and standby sites in separate data centers or geographic regions.
+- **Regional outage protection**: Handling power failures, network blackouts, natural disasters, or other site-wide issues.
+- **Stringent continuity needs**: Meeting low Recovery Time Objective (RTO) and Recovery Point Objective (RPO) targets.
+- **Compliance standards**: Fulfilling regulatory demands for robust DR capabilities.
+
+![Disaster Recovery Architecture](../../images/disaster_recovery_architecture.png)
+
+The solution employs a dual-site setup, combining TapData's stateless service design with a MongoDB cross-region replica set for high-availability DR:
+
+- **Zone A (Primary Cluster)**: Hosts the full TapData stack (3 nodes) + MongoDB replica set (3 nodes), handling all production workloads.
+- **Zone S (Standby Cluster)**: Pre-configured TapData services (stopped, 3 nodes) + MongoDB replica set (2 nodes active + 1 pre-provisioned), dedicated solely to data replication during normal operations.
+
+:::tip Data Replication Mechanism
+MongoDB's cross-region replica set handles real-time syncing of tasks, metadata, and system states, keeping the standby site consistent with the primary. For replica set setup and failover details, refer to the sections below.
+:::
+
+## 2. Host Setup and Deployment Requirements
+
+Based on the architecture above, the following is an example host configuration together with deployment guidelines:
+
+### 2.1 Host Configuration Details
+
+| Hostname | IP Address | Zone | MongoDB Role | TapData Services | Notes |
+|-----------|--------------|--------|--------------------------------------|------------------|----------------------------------------|
+| tapdata-a | 192.168.1.10 | Zone A | Primary | **Running** | Primary node |
+| tapdata-b | 192.168.1.11 | Zone A | Secondary | **Running** | Secondary node |
+| tapdata-c | 192.168.1.12 | Zone A | Secondary | **Running** | Secondary node |
+| tapdata-d | 192.168.2.10 | Zone S | Secondary (priority 0) | **Stopped** | Standby node |
+| tapdata-e | 192.168.2.11 | Zone S | Secondary (priority 0) | **Stopped** | Standby node |
+| tapdata-f | 192.168.2.12 | Zone S | Pre-provisioned (not in replica set) | **Stopped** | Activated only during a Zone A failure |
+
+:::tip Network Latency Recommendations
+Use dedicated lines for cross-region connectivity, aiming for latency under **100ms** to maintain MongoDB replication performance and timely failure detection. Ensure required ports are open, such as MongoDB (27017), TapData Management (3030), TapData Engine (3080), and SSH (22).
+:::
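+Before deploying, the connectivity and port requirements above can be verified from a Zone A host with a short script. This is a minimal sketch for illustration only; the host list and ports mirror the example table, so adjust them to your environment.
+
+```bash
+#!/usr/bin/env bash
+# Pre-deployment check (example): cross-zone latency and port reachability.
+ZONE_S_HOSTS="192.168.2.10 192.168.2.11 192.168.2.12"
+PORTS="22 27017 3030 3080"
+
+for host in $ZONE_S_HOSTS; do
+  # Average round-trip time over 5 ICMP probes; should stay well under 100 ms
+  rtt=$(ping -c 5 -q "$host" | awk -F'/' '/^rtt|^round-trip/ {print $5}')
+  echo "$host avg RTT: ${rtt:-unreachable} ms"
+
+  for port in $PORTS; do
+    # timeout + /dev/tcp avoids depending on nc/nmap being installed
+    if timeout 3 bash -c "</dev/tcp/$host/$port" 2>/dev/null; then
+      echo "  port $port: open"
+    else
+      echo "  port $port: closed or filtered"
+    fi
+  done
+done
+```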
+### 2.2 MongoDB Cross-Region Replica Set Configuration
+
+MongoDB serves as TapData's core storage for task configurations, metadata, and system states. This setup uses a 5-node cross-region replica set: Zone A (3 nodes) + Zone S (2 nodes), with priorities favoring Zone A for the primary role.
+
+**Key Benefits:**
+
+- **Automatic Syncing**: Data replicates from Zone A to Zone S in real time.
+- **Failover Automation**: Built-in elections handle primary node failures within Zone A.
+- **Redundancy**: Multiple copies prevent data loss from a single failure.
+
+**Recommended Replica Set Config:**
+
+```javascript
+rs.initiate({
+  _id: "tapdata-rs",  // Must match the replicaSet name used in the connection string
+  members: [
+    { _id: 0, host: "192.168.1.10:27017", priority: 2 }, // Zone A primary
+    { _id: 1, host: "192.168.1.11:27017", priority: 1 }, // Zone A secondary
+    { _id: 2, host: "192.168.1.12:27017", priority: 1 }, // Zone A secondary
+    { _id: 3, host: "192.168.2.10:27017", priority: 0 }, // Zone S secondary (never elected primary)
+    { _id: 4, host: "192.168.2.11:27017", priority: 0 }  // Zone S secondary (never elected primary)
+  ]
+})
+```
+
+**Recommended Write Concern Settings:**
+
+```javascript
+// TapData's suggested MongoDB write concern for reliability
+db.adminCommand({
+  setDefaultRWConcern: 1,
+  defaultWriteConcern: {
+    w: "majority",   // Acknowledge only after a majority of nodes confirm the write
+    j: true,         // Ensure the write is committed to the journal
+    wtimeout: 5000   // 5-second timeout
+  }
+})
+```
+
+This ensures writes are acknowledged only after reaching at least 3 nodes (a majority of the 5-node set), guaranteeing consistency. Writes normally occur in Zone A and replicate automatically to Zone S.
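+Replication health between the two zones can be spot-checked at any time. The snippet below is a minimal sketch using the same `mongo` shell as the examples above; it assumes the replica set is reachable from the host running it.
+
+```bash
+# Print each member's state and how far (in seconds) it lags behind the primary
+mongo --host 192.168.1.10:27017 --quiet --eval '
+  var s = rs.status();
+  var primary = s.members.filter(function (m) { return m.stateStr === "PRIMARY"; })[0];
+  s.members.forEach(function (m) {
+    var lag = (primary.optimeDate - m.optimeDate) / 1000;
+    print(m.name + "  " + m.stateStr + "  lag: " + lag + "s");
+  });
+'
+```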
+### 2.3 TapData Service Configuration
+
+**Connection String Setup:**
+
+All TapData nodes share a unified MongoDB replica set connection string:
+
+```bash
+# Replace {databaseName} with your TapData database name
+mongodb://192.168.1.10:27017,192.168.1.11:27017,192.168.1.12:27017,192.168.2.10:27017,192.168.2.11:27017,192.168.2.12:27017/{databaseName}?replicaSet=tapdata-rs
+```
+
+Note that the seed list includes tapdata-f (192.168.2.12) even though it joins the replica set only during failover; MongoDB drivers simply skip unreachable seed hosts, so clients can still discover the set after a DR switchover.
+
+**Deployment Strategy:**
+
+- **Zone A**: All TapData services run actively for production.
+- **Zone S**: Services are installed and configured but remain stopped as cold standby.
+- **Config Sync**: Regularly mirror configs from Zone A to Zone S.
+
+**Key Config Files to Sync:**
+
+- **`application.yml`**: Main TapData service config.
+- **`agent.yml`**: TapData Agent config.
+
+**Sync Command Example:**
+
+```bash
+# Copy configs from Zone A to Zone S (adjust based on your TapData install path)
+scp /conf/{application.yml,agent.yml} root@192.168.2.10:/conf/
+scp /conf/{application.yml,agent.yml} root@192.168.2.11:/conf/
+scp /conf/{application.yml,agent.yml} root@192.168.2.12:/conf/
+```
+
+## 3. Disaster Recovery Procedures
+
+### 3.1 Recovery Targets
+
+#### 3.1.1 Recovery Time Objective (RTO)
+
+- **Detection**: 1-2 minutes (alerts + manual confirmation).
+- **MongoDB Failover**: 30 seconds to 1 minute (replica set reconfiguration and election).
+- **TapData Startup**: 2-3 minutes (service launch + health checks).
+- **Load Balancer Switch**: 1-2 minutes (DNS/LB updates).
+- **Total RTO**: 5-8 minutes.
+
+#### 3.1.2 Recovery Point Objective (RPO)
+
+- **Real-Time Sync**: Handled by replica set replication.
+- **Theoretical RPO**: ≤ 30 seconds (accounting for network latency).
+- **Practical RPO**: ≤ 1 minute (factoring in potential jitter).
+
+### 3.2 Failover (Zone A to Zone S)
+
+:::tip Automated Detection
+Set up monitoring and alerts for automatic failure detection; see [Monitoring and Automation](#41-monitoring-and-automation) for details.
+:::
+
+#### 3.2.1 Pre-Failover Checks
+
+**Confirm Outage:**
+
+1. **Multi-Path Verification**: Use multiple network routes and monitoring tools to verify that Zone A is inaccessible.
+2. **Scope Assessment**: Determine whether the entire Zone A has failed or only isolated nodes.
+
+**Standby Readiness:**
+
+1. **Zone S MongoDB Health**: Ensure tapdata-d and tapdata-e are running with no replication lag.
+2. **TapData Configs**: Confirm configs are synced and dependencies are available.
+3. **Connectivity**: Verify Zone S can reach external systems.
+
+#### 3.2.2 Failover Steps
+
+1. **Start MongoDB on Zone S**
+
+   Log in to tapdata-f and run:
+   ```bash
+   systemctl start mongod
+   ```
+
+2. **Reconfigure the Replica Set**
+
+   With all Zone A members down, the surviving Zone S members cannot elect a primary on their own, so the replica set must be force-reconfigured from a secondary. Log in to tapdata-d or tapdata-e and execute:
+   ```javascript
+   // mongo --host tapdata-d:27017  (or tapdata-e:27017)
+
+   // Force a new 3-node config containing only the Zone S members.
+   // Note: rs.add() cannot be used here because it requires a reachable primary.
+   var conf = rs.conf()
+   conf.members = [
+     { _id: 3, host: "192.168.2.10:27017", priority: 1 }, // tapdata-d
+     { _id: 4, host: "192.168.2.11:27017", priority: 0 }, // tapdata-e
+     { _id: 5, host: "192.168.2.12:27017", priority: 0 }  // tapdata-f
+   ]
+   rs.reconfig(conf, { force: true })
+   ```
+
+3. **Update Task States**
+
+   Log in to the new primary (usually tapdata-d) and run:
+   ```javascript
+   // mongo --host tapdata-d:27017  (or the current primary)
+   // Replace {databaseName} with your TapData database name
+   use {databaseName}
+   db.Task.updateMany({}, { $set: { pingtime: -1 } })
+   ```
+
+4. **Launch TapData Services**
+
+   On each Zone S node (tapdata-d, tapdata-e, tapdata-f), run:
+   ```bash
+   systemctl start tapdata
+   ```
+
+5. **Redirect the Load Balancer** to Zone S
+   > Update your access layer (e.g., Nginx, HAProxy, F5) to route client traffic from Zone A to Zone S TapData services.
+
+6. **Validate Recovery**:
+   - **Task Health**: Ensure sync tasks are running without errors.
+   - **Data Integrity**: Run sample source-to-target comparisons for completeness.
+   - **Service Availability**: Test via the web UI or API.
+   - **Performance Metrics**: Monitor CPU, memory, and network usage.
+   - **Logs**: Review for anomalies.
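+Once these steps have been rehearsed manually, they can be collected into a single helper script. The following is an illustrative sketch only, assuming passwordless SSH as root to the Zone S hosts, the systemd unit names used in this guide, and a `DB_NAME` you must set yourself; validate it in a drill before trusting it in production.
+
+```bash
+#!/usr/bin/env bash
+# Example Zone A -> Zone S failover helper following steps 1-4 above.
+# A human should confirm the Zone A outage before running this.
+set -euo pipefail
+
+DB_NAME="tapdata"   # Replace with your TapData database name
+
+# Step 1: start the pre-provisioned MongoDB on tapdata-f
+ssh root@192.168.2.12 "systemctl start mongod"
+
+# Step 2: force a Zone S-only replica set config from a surviving secondary
+mongo --host 192.168.2.10:27017 --quiet --eval '
+  var conf = rs.conf();
+  conf.members = [
+    { _id: 3, host: "192.168.2.10:27017", priority: 1 },
+    { _id: 4, host: "192.168.2.11:27017", priority: 0 },
+    { _id: 5, host: "192.168.2.12:27017", priority: 0 }
+  ];
+  rs.reconfig(conf, { force: true });
+'
+
+# Wait until a new primary has been elected before touching task state
+until mongo --host 192.168.2.10:27017 --quiet --eval 'print(rs.isMaster().ismaster)' | grep -q true; do
+  sleep 5
+done
+
+# Step 3: mark all tasks for re-scheduling on the new engines
+mongo 192.168.2.10:27017/"$DB_NAME" --quiet --eval 'db.Task.updateMany({}, { $set: { pingtime: -1 } })'
+
+# Step 4: start TapData services on all Zone S nodes
+for host in 192.168.2.10 192.168.2.11 192.168.2.12; do
+  ssh root@"$host" "systemctl start tapdata"
+done
+```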
+### 3.3 Recovery (Zone S to Zone A)
+
+#### 3.3.1 Pre-Recovery Checks
+
+**Zone A Readiness:**
+
+1. **Infrastructure**: Confirm hardware, network, and storage are fully operational.
+2. **Connectivity**: Test links between Zone A, external systems, and Zone S.
+3. **Dependencies**: Verify related services (e.g., databases, queues) are up.
+
+**Data Sync Validation:**
+
+1. **Zone S Integrity**: Check current data for completeness.
+2. **Lag Check**: Ensure no replication delays across the set.
+3. **Monitoring**: Confirm alert systems are functional for the switchback.
+
+#### 3.3.2 Recovery Steps
+
+1. **Start MongoDB on Zone A**
+
+   On each Zone A node (tapdata-a, tapdata-b, tapdata-c), run:
+   ```bash
+   systemctl start mongod
+   ```
+
+2. **Reconfigure the 5-Node Replica Set**
+
+   This restores the original 5-node configuration and removes tapdata-f from the set. Log in to the current primary (in Zone S, e.g., tapdata-d) and execute:
+   ```javascript
+   // mongo --host tapdata-d:27017  (or the current primary)
+   var conf = rs.conf();
+   conf.members = [
+     { _id: 0, host: "192.168.1.10:27017", priority: 2 }, // tapdata-a
+     { _id: 1, host: "192.168.1.11:27017", priority: 1 }, // tapdata-b
+     { _id: 2, host: "192.168.1.12:27017", priority: 1 }, // tapdata-c
+     { _id: 3, host: "192.168.2.10:27017", priority: 0 }, // tapdata-d
+     { _id: 4, host: "192.168.2.11:27017", priority: 0 }  // tapdata-e
+   ]
+   rs.reconfig(conf, { force: true })
+   ```
+
+3. **Check Replica Set Status**
+
+   Log in to any node (preferably tapdata-a) and run:
+   ```javascript
+   // mongo --host tapdata-a:27017
+   rs.status()
+   ```
+
+4. **Transfer the Primary** (If Needed)
+
+   **Condition**: The primary is still in Zone S rather than Zone A.
+
+   Log in to the current primary (assuming tapdata-d) and run:
+   ```javascript
+   // mongo --host tapdata-d:27017
+
+   // Step down to trigger a re-election; the higher-priority Zone A
+   // members win once they have caught up
+   rs.stepDown()
+
+   // Confirm the new primary is in Zone A
+   rs.status()
+   ```
+
+5. **Stop Zone S Services**
+
+   On tapdata-d and tapdata-e, run:
+   ```bash
+   systemctl stop tapdata
+   ```
+
+   On tapdata-f, run:
+   ```bash
+   systemctl stop tapdata
+   systemctl stop mongod  # Also stop the pre-provisioned MongoDB
+   ```
+
+6. **Update Tasks and Start Zone A Services**
+
+   Log in to Zone A's primary (usually tapdata-a) and run:
+   ```javascript
+   // mongo --host tapdata-a:27017  (or the current primary)
+   // Replace {databaseName} with your TapData database name
+   use {databaseName}
+   db.Task.updateMany({}, { $set: { pingtime: -1 } })
+   ```
+
+   On each Zone A node (tapdata-a, tapdata-b, tapdata-c), run:
+   ```bash
+   systemctl start tapdata
+   ```
+
+7. **Validate and Gradually Shift Traffic** back to the primary site.
+
+   Perform the same validation checks as in the failover procedure from any Zone A or management node.
+
+## 4. Operations Best Practices
+
+### 4.1 Monitoring and Automation
+
+#### 4.1.1 Failure Detection and Alerts
+
+**Detection Intervals:**
+
+- **Connectivity**: TCP port checks (MongoDB 27017, TapData 3030) every 30 seconds.
+- **Service Health**: Replica set status and TapData API checks every 1 minute.
+- **Business Metrics**: Task states and replication lag every 5 minutes.
+
+**Alert Rules:**
+
+- MongoDB primary unresponsive for 2 minutes → Critical.
+- TapData service down for 3 minutes → Critical.
+- Replication lag > 5 minutes → Warning.
+- More than 50% of Zone A nodes failed → Auto-trigger failover.
+
+#### 4.1.2 Automation Scripts
+
+**Health Monitoring:**
+
+- Integrate tools like Prometheus + Grafana for real-time node and performance tracking.
+
+**Automated Failover:**
+
+- Trigger scripts on failure conditions to handle replica set reconfiguration, service startup, and validation.
+- Test thoroughly in staging before production rollout.
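+For teams without an existing probe framework, the detection rules above can be prototyped with a small scheduled script before graduating to a full Prometheus + Grafana setup. A minimal sketch, assuming the example IPs from this guide and a `send_alert` hook you wire up yourself:
+
+```bash
+#!/usr/bin/env bash
+# Example connectivity probe for the Zone A site.
+# Run every 30 seconds from a third location (not Zone A), e.g. via a systemd timer.
+
+ZONE_A_HOSTS="192.168.1.10 192.168.1.11 192.168.1.12"
+PORTS="27017 3030"        # MongoDB + TapData Management
+FAILED=0
+TOTAL=0
+
+send_alert() {            # Placeholder: connect this to your paging/IM system
+  echo "[ALERT] $1" >&2
+}
+
+for host in $ZONE_A_HOSTS; do
+  for port in $PORTS; do
+    TOTAL=$((TOTAL + 1))
+    if ! timeout 3 bash -c "</dev/tcp/$host/$port" 2>/dev/null; then
+      FAILED=$((FAILED + 1))
+      send_alert "$host:$port unreachable"
+    fi
+  done
+done
+
+# Mirror the "more than 50% of Zone A nodes failed" rule; a human (or a
+# well-tested automation hook) then decides whether to run the failover
+# procedure from section 3.2.2
+if [ "$FAILED" -gt $((TOTAL / 2)) ]; then
+  send_alert "More than half of Zone A checks failed ($FAILED/$TOTAL); consider DR failover"
+fi
+```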
+### 4.2 Regular Drills and Monitoring
+
+- **Drill Schedule**: Conduct full DR simulations quarterly, covering detection, failover, validation, and recovery.
+- **Key Metrics**: Track uptime, sync lag, and resource utilization.
+- **Alerting**: Use tiered notifications for prompt responses.
+- **Documentation**: Log drill outcomes, issues, and improvements.
+
+Consistent drills and monitoring refinements are essential for reliable DR. Tailor these strategies to your business needs and keep improving them over time.
+
+### 4.3 Documentation and Security
+
+- **Procedures**: Keep DR docs current, including emergency contacts.
+- **Change Logs**: Record all architecture and config updates.
+- **Access Controls**: Limit permissions for standby environments.
+- **Data Security**: Secure cross-region transfers and routinely verify backup integrity.
\ No newline at end of file
diff --git a/sidebars.js b/sidebars.js
index 0d9db546..6be7a1c6 100644
--- a/sidebars.js
+++ b/sidebars.js
@@ -408,6 +408,7 @@ const sidebars = {
         'platform-ops/production-deploy/capacity-planning',
         'platform-ops/production-deploy/install-tapdata-ha',
         'platform-ops/production-deploy/install-tapdata-ha-with-3-node',
+        'platform-ops/production-deploy/disaster-recovery',
         'platform-ops/production-deploy/install-replica-mongodb',
       ]
     },