From 36389f5872291f4a93c094d692ea5aea1cc8ed51 Mon Sep 17 00:00:00 2001
From: IsaccoSordo
Date: Thu, 15 May 2025 16:21:48 +0200
Subject: [PATCH 1/4] feat: postmortem 5

---
 sidebars.js               |  8 +++-
 src/docs/post-mortem-5.md | 78 +++++++++++++++++++++++++++++++++++++++
 2 files changed, 85 insertions(+), 1 deletion(-)
 create mode 100644 src/docs/post-mortem-5.md

diff --git a/sidebars.js b/sidebars.js
index edbd14e9..b2ebb078 100644
--- a/sidebars.js
+++ b/sidebars.js
@@ -99,7 +99,13 @@ module.exports = {
     {
       type: "category",
       label: "Post Mortem",
-      items: ["post-mortem", "post-mortem-2", "post-mortem-3", "post-mortem-4"],
+      items: [
+        "post-mortem",
+        "post-mortem-2",
+        "post-mortem-3",
+        "post-mortem-4",
+        "post-mortem-5",
+      ],
       collapsed: true,
     },
     {
diff --git a/src/docs/post-mortem-5.md b/src/docs/post-mortem-5.md
new file mode 100644
index 00000000..d33291ae
--- /dev/null
+++ b/src/docs/post-mortem-5.md
@@ -0,0 +1,78 @@
+---
+title: Postmortem Incident 5
+slug: /post-mortem-5
+---
+
+**Date:** 2025-05-15
+
+**Authors:** Isacco Sordo
+
+**Status:** Complete, action items in progress
+
+**Summary:**
+An improper restart of the node `beacon-node-1.beacon-server-1.papers.tech` caused a partial operational failure of the Synapse instance, resulting in connectivity issues for US-based users.
+
+**Impact:**
+US-based users experienced connectivity disruptions because they were consistently routed to the faulty node, causing intermittent network failures and degraded performance.
+
+**Root Causes:**
+The incident originated from an improper restart of the node `beacon-node-1.beacon-server-1.papers.tech`. The following factors contributed to the incident's severity:
+
+- **Improper Node Restart:** The node was restarted improperly, leaving it in a partially operational state rather than a fully operational one.
+- **Faulty Node Selection Algorithm:** The SDK's algorithm for selecting the node with the lowest latency was flawed, leading US users to repeatedly connect to the malfunctioning node.
+
+**Affected Systems:**
+
+- **Synapse Instance:** Partial operational failure impacting user connectivity.
+- **SDK Network Node Selection:** Users relying on the SDK for optimal node selection were persistently directed to the faulty node.
+
+**Resolution:**
+Immediate corrective steps involved:
+
+1. **Node Restart:**
+
+   - Properly restarted the node to restore complete operational status.
+
+2. **Algorithm Update:**
+
+   - Initiated modifications to the node-selection algorithm, changing its strategy from sequential latency checks to parallel checks.
+
+3. **Enhanced Monitoring:**
+
+   - Implemented monitoring checks on the Synapse instance.
+   - Adjusted monitoring intervals to every 10 minutes for proactive detection.
+
+**Detection:**
+The issue was initially identified by SDK users who reported connectivity issues. Monitoring systems subsequently validated the Synapse instance's degraded performance.
+
+**Action Items:**
+
+| Action Item                                                    | Owner  | State       |
+| -------------------------------------------------------------- | ------ | ----------- |
+| Update node selection algorithm for parallel latency detection | Isacco | IN PROGRESS |
+| Configure monitoring to perform checks every 10 minutes        | Lukas  | COMPLETE    |
+| Integrate Synapse instance monitoring                          | Lukas  | COMPLETE    |
+
+## Conclusion
+
+This incident highlighted two critical issues: the necessity for proper procedures when restarting nodes and robust node selection algorithms. By addressing these factors, future connectivity disruptions can be significantly reduced or avoided entirely.
+
+## Learnings
+
+- **Parallelization in Node Selection:**
+  Implementing parallel node latency checks ensures faster, more reliable network selection, minimizing the impact of node-specific outages.
+
+- **Proactive Monitoring:**
+  Enhanced monitoring frequency and additional checks provide early warning signals and reduce downtime duration.
+
+## Timeline
+
+- **09:30:** Node restart executed improperly; Synapse enters partially operational state.
+- **09:35:** SDK begins routing US-based users to the faulty node due to latency detection flaw.
+- **09:50:** Initial user reports of connectivity problems begin surfacing.
+- **10:00:** Monitoring systems confirm the Synapse instance operational issue.
+- **10:15:** Team identifies root cause related to improper node restart.
+- **10:30:** Node properly restarted, restoring normal operations.
+- **11:00:** Monitoring intervals adjusted to 10-minute checks.
+- **11:30:** Algorithm changes initiated for parallel latency detection.
+- **12:00:** Synapse instance added to comprehensive monitoring.

From 5a58bcf12dfd4843d9a778b3309a3212f3560158 Mon Sep 17 00:00:00 2001
From: IsaccoSordo
Date: Thu, 15 May 2025 16:44:44 +0200
Subject: [PATCH 2/4] fix: text

---
 src/docs/post-mortem-5.md | 25 +++++++++++--------------
 1 file changed, 11 insertions(+), 14 deletions(-)

diff --git a/src/docs/post-mortem-5.md b/src/docs/post-mortem-5.md
index d33291ae..45b100fc 100644
--- a/src/docs/post-mortem-5.md
+++ b/src/docs/post-mortem-5.md
@@ -47,11 +47,11 @@ The issue was initially identified by SDK users who reported connectivity issues
 
 **Action Items:**
 
-| Action Item                                                    | Owner  | State       |
-| -------------------------------------------------------------- | ------ | ----------- |
-| Update node selection algorithm for parallel latency detection | Isacco | IN PROGRESS |
-| Configure monitoring to perform checks every 10 minutes        | Lukas  | COMPLETE    |
-| Integrate Synapse instance monitoring                          | Lukas  | COMPLETE    |
+| Action Item                                                                      | Owner  | State       |
+| -------------------------------------------------------------------------------- | ------ | ----------- |
+| Update node selection algorithm for parallel latency detection                   | Isacco | IN PROGRESS |
+| Configure monitoring to trigger Pager Duty incidents repeatedly every 10 minutes | Lukas  | COMPLETE    |
+| Integrate Synapse instance monitoring                                            | Lukas  | COMPLETE    |
 
 ## Conclusion
 
@@ -67,12 +67,9 @@ This incident highlighted two critical issues: the necessity for proper procedur
 
 ## Timeline
 
-- **09:30:** Node restart executed improperly; Synapse enters partially operational state.
-- **09:35:** SDK begins routing US-based users to the faulty node due to latency detection flaw.
-- **09:50:** Initial user reports of connectivity problems begin surfacing.
-- **10:00:** Monitoring systems confirm the Synapse instance operational issue.
-- **10:15:** Team identifies root cause related to improper node restart.
-- **10:30:** Node properly restarted, restoring normal operations.
-- **11:00:** Monitoring intervals adjusted to 10-minute checks.
-- **11:30:** Algorithm changes initiated for parallel latency detection.
-- **12:00:** Synapse instance added to comprehensive monitoring.
+- 08:56 - Node restart due to maintenance
+- 08:57 - Pager Duty call dismissed as expected because of the restart
+- 08:58 - Node uptime check (Synapse): beacon-node-1.beacon-server-1.papers.tech returns 200 OK
+- 20:47 - Kukai Team notifies the Beacon team in Slack about an increase in user reports
+- 00:16 - Kukai Team notices the Matrix part of the node is not working properly
+- 08:00 - Service recovered

From 31e19a98b64147162b2eece5e625bbe08aa15f50 Mon Sep 17 00:00:00 2001
From: IsaccoSordo
Date: Thu, 15 May 2025 16:54:52 +0200
Subject: [PATCH 3/4] fix: text

---
 src/docs/post-mortem-5.md | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/src/docs/post-mortem-5.md b/src/docs/post-mortem-5.md
index 45b100fc..035580df 100644
--- a/src/docs/post-mortem-5.md
+++ b/src/docs/post-mortem-5.md
@@ -62,8 +62,7 @@ This incident highlighted two critical issues: the necessity for proper procedur
 - **Parallelization in Node Selection:**
   Implementing parallel node latency checks ensures faster, more reliable network selection, minimizing the impact of node-specific outages.
 
-- **Proactive Monitoring:**
-  Enhanced monitoring frequency and additional checks provide early warning signals and reduce downtime duration.
+- **Recurring Pager Duty alerts:** the event went unnoticed because the initial alert was expected due to the maintenance restart, so the resulting downtime was missed. Recurring Pager Duty calls during a service interruption will avoid this scenario in the future.
 
 ## Timeline
 

From 33330f51970d3a46cbeae2223985bdce6494a05a Mon Sep 17 00:00:00 2001
From: IsaccoSordo
Date: Thu, 15 May 2025 16:57:40 +0200
Subject: [PATCH 4/4] fix: capital letter

---
 src/docs/post-mortem-5.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/docs/post-mortem-5.md b/src/docs/post-mortem-5.md
index 035580df..bc8f87fa 100644
--- a/src/docs/post-mortem-5.md
+++ b/src/docs/post-mortem-5.md
@@ -62,7 +62,7 @@ This incident highlighted two critical issues: the necessity for proper procedur
 - **Parallelization in Node Selection:**
   Implementing parallel node latency checks ensures faster, more reliable network selection, minimizing the impact of node-specific outages.
 
-- **Recurring Pager Duty alerts:** the event went unnoticed because the initial alert was expected due to the maintenance restart, so the resulting downtime was missed. Recurring Pager Duty calls during a service interruption will avoid this scenario in the future.
+- **Recurring Pager Duty alerts:** The event went unnoticed because the initial alert was expected due to the maintenance restart, so the resulting downtime was missed. Recurring Pager Duty calls during a service interruption will avoid this scenario in the future.
 
 ## Timeline
 
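The action items above describe switching the SDK's node selection from sequential latency checks to parallel ones, but this patch series only updates the docs, not the SDK itself. A minimal sketch of the intended technique (hypothetical code, not the actual Beacon SDK implementation; `selectFastestHealthyNode` and the caller-supplied `probe` are invented names):

```javascript
// Sketch of parallel node selection: probe every node at once and pick
// the fastest healthy responder. A faulty node (e.g. a partially
// operational beacon-node-1) rejects its probe and can never win,
// unlike a sequential walk that sticks with the first node to answer.
async function selectFastestHealthyNode(nodes, probe) {
  const attempts = nodes.map(async (node) => {
    const start = Date.now();
    await probe(node); // rejects if the node is unhealthy
    return { node, latency: Date.now() - start };
  });
  // Promise.any resolves with the first *fulfilled* probe (the fastest
  // healthy node) and rejects with an AggregateError only if all fail.
  return Promise.any(attempts);
}
```

With this shape a slow node is only selected when every faster one is down, which is the behavior the "parallel latency detection" action item aims for.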
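The "recurring Pager Duty alerts" learning can likewise be sketched as a monitoring loop that pages on every failed check rather than only on the first failure, so a dismissed initial page cannot hide an ongoing outage. This is an illustrative sketch only: `checkNode` and `page` are hypothetical hooks, and the 10-minute default mirrors the interval named in the postmortem.

```javascript
// Sketch of recurring alerting: re-page on *every* failing cycle
// (default every 10 minutes). `cycles` exists so the loop is testable.
async function monitor(checkNode, page, intervalMs = 10 * 60 * 1000, cycles = Infinity) {
  for (let i = 0; i < cycles; i += 1) {
    try {
      await checkNode(); // resolves if the node is healthy
    } catch (err) {
      // Page again on each failed check, not just the first one.
      await page(`node check failed: ${err.message}`);
    }
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}
```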