Add Health Check, Auto-Recovery, Telegram Alerts, and Prometheus Metrics Exporter#20
Clawue884 wants to merge 16 commits into PiCoreTeam:master
Implement an auto-recovery script that checks node health and restarts services if unhealthy.
Add a script to check the status of Horizon and Stellar-core services.
This script checks the health of the Horizon and stellar-core services by making HTTP requests and reporting their status.
Add health check script and configure Docker healthcheck.
This script continuously checks the health of services and restarts them if they are not healthy.
Ze0ro99 left a comment
Here is a comprehensive upgrade plan and set of production-ready features, suitable for both mainnet withdrawal readiness and enterprise-level node operations. It is a superset of the changes in PR #20, and you can use it as the technical basis for a follow-up PR or a revised submission.
🚀 Enterprise-Grade Monitoring & Auto-Recovery Proposal for Pi Node Docker
1. Advanced Health Monitoring (health/healthcheck.sh)
Features:
- Checks all critical services (Horizon, Stellar-Core, PostgreSQL)
- Confirms blockchain sync and API health
- Checks disk space, resource use, and logs health outcomes
- Returns unique status codes and logs with timestamps
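As a rough illustration of the "unique status codes" idea, the checks could map each failure class to its own exit code. This is only a sketch: the exit-code values and the function names `free_disk_gb`, `disk_status`, and `http_status` are assumptions made for illustration, not part of PR #20.

```shell
#!/usr/bin/env bash
# Hypothetical exit codes; the proposal does not specify actual values.
EXIT_OK=0
EXIT_SERVICE_DOWN=10
EXIT_DISK_LOW=20

# free_disk_gb PATH: print whole gigabytes available on the filesystem holding PATH.
free_disk_gb() {
    # df -Pk prints POSIX-format output in 1024-byte blocks; field 4 is "Available".
    df -Pk "$1" | awk 'NR == 2 { print int($4 / 1024 / 1024) }'
}

# disk_status FREE_GB MIN_GB: print the exit code for a given amount of free space.
disk_status() {
    if [ "$1" -lt "$2" ]; then
        echo "$EXIT_DISK_LOW"
    else
        echo "$EXIT_OK"
    fi
}

# http_status CODE: map an HTTP status code (for example from
# curl -s -o /dev/null -w '%{http_code}' URL) to an exit code.
http_status() {
    case "$1" in
        2??) echo "$EXIT_OK" ;;
        *)   echo "$EXIT_SERVICE_DOWN" ;;
    esac
}
```

A health check could then end with something like `exit "$(disk_status "$(free_disk_gb /)" 10)"`, so that Docker or SupervisorD can tell a full disk apart from a dead service.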
2. Advanced Auto-Recovery System (health/auto_recover.sh)
Features:
- Exponential backoff, max retry & wait logic
- Sends alerts on each failure/recovery step
- Tracks recovery event history in separate log file
- Graceful degradation + disables repeated flapping
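The backoff and max-retry logic could be sketched roughly as follows. The function names and the argument order are hypothetical, and the check and restart commands are passed in as plain command names so that nothing here assumes how the node's services are actually restarted.

```shell
#!/usr/bin/env bash
# backoff_delay ATTEMPT BASE_SECONDS MAX_SECONDS
# Print the wait before retry ATTEMPT (1-based): BASE * 2^(ATTEMPT-1), capped at MAX.
backoff_delay() {
    local attempt="$1" delay="$2" max="$3" i=1
    while [ "$i" -lt "$attempt" ]; do
        delay=$((delay * 2))
        if [ "$delay" -ge "$max" ]; then
            break
        fi
        i=$((i + 1))
    done
    if [ "$delay" -gt "$max" ]; then
        delay="$max"
    fi
    echo "$delay"
}

# recover_with_backoff MAX_RETRIES BASE MAX CHECK_CMD RESTART_CMD
# Restart and wait with growing delays until CHECK_CMD succeeds or retries run out.
# Returns 0 if the service is healthy at the end, nonzero if recovery gave up.
recover_with_backoff() {
    local max_retries="$1" base="$2" cap="$3" check_cmd="$4" restart_cmd="$5"
    local attempt=1
    while [ "$attempt" -le "$max_retries" ]; do
        if $check_cmd; then
            return 0
        fi
        $restart_cmd || true
        sleep "$(backoff_delay "$attempt" "$base" "$cap")"
        attempt=$((attempt + 1))
    done
    # Final verdict after the last restart attempt.
    $check_cmd
}
```

Giving up after `MAX_RETRIES` rather than looping forever is what keeps a flapping service from being restarted indefinitely; the caller can send a final alert when the function returns nonzero.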
3. Professional Multi-Channel Alerting (health/alert_manager.sh)
Features:
- Supports Telegram, Email, Discord, Slack
- Customizable severity, rate-limiting, deduplication
- Configurable via env file (alert_config.env)
- Alert history, status, and example template included
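Rate limiting and deduplication could be handled with one timestamp file per alert key. The sketch below is an assumption about how that might work, not the proposal's actual design: `should_send` and the default state directory `/tmp/alert_state` are hypothetical names.

```shell
#!/usr/bin/env bash
# should_send KEY WINDOW_SECONDS [STATE_DIR]
# Succeeds (exit 0) and records the current time if no alert with this KEY was
# recorded within the window; fails (exit 1) if the alert should be suppressed.
should_send() {
    local key="$1" window="$2" state_dir="${3:-/tmp/alert_state}"
    local now last stamp_file
    mkdir -p "$state_dir"
    # Hash the key so arbitrary alert text maps to a safe file name.
    stamp_file="$state_dir/$(printf '%s' "$key" | cksum | cut -d ' ' -f 1)"
    now=$(date +%s)
    last=0
    if [ -f "$stamp_file" ]; then
        last=$(cat "$stamp_file")
    fi
    if [ $((now - last)) -lt "$window" ]; then
        return 1
    fi
    echo "$now" > "$stamp_file"
    return 0
}
```

A caller would wrap each channel send in a guard such as `if should_send "horizon down" 300; then ...; fi`, so identical alerts within five minutes collapse into one.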
4. Production Metrics and Monitoring (metrics/node_metrics.sh)
Features:
- Prometheus-formatted output with service, resource, ledger, and custom node metrics
- Exposes hardware usage (CPU/Memory/Disk) and blockchain indicators for dashboards and alerting
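For reference, Prometheus-formatted output is plain text with `# HELP` and `# TYPE` lines followed by the sample. A minimal sketch of emitting a gauge from a shell script might look like this; the helper names `emit_gauge` and `up_value` are hypothetical, while the metric name `pi_node_horizon_up` is the one used in the alert rules in this comment.

```shell
#!/usr/bin/env bash
# emit_gauge NAME HELP VALUE: print one gauge in the Prometheus text exposition format.
emit_gauge() {
    printf '# HELP %s %s\n# TYPE %s gauge\n%s %s\n' "$1" "$2" "$1" "$1" "$3"
}

# up_value COMMAND...: print 1 if the command succeeds, 0 otherwise.
up_value() {
    if "$@" >/dev/null 2>&1; then
        echo 1
    else
        echo 0
    fi
}
```

For example, `emit_gauge pi_node_horizon_up "Horizon reachable (1) or not (0)" "$(up_value curl -fsS http://localhost:8000/)"` would print a three-line metric block, assuming Horizon listens on port 8000 in this image. One option for exposing the output is to write it to a `.prom` file picked up by Node Exporter's textfile collector instead of running a separate HTTP server.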
5. Hardened Docker Compose Files (docker-compose.production.yml)
Features:
- Integrated stack: node, Prometheus, Grafana, Node Exporter for host metrics, alert config volumes
- Port mapping and healthcheck embedded
- Logging & resource control
6. Comprehensive Prometheus & Alert Rules (monitoring/prometheus.yml, alerts.yml)
Features:
- Prometheus scrapes all key services
- Sample alert rules (down, high CPU/RAM/Disk, etc.)
7. Grafana Dashboard Provisioning
Features:
- Ready-to-import dashboard JSON file for full overview (service status, sync, resource, alerts, events)
8. New Monitoring Documentation (MONITORING.md)
Features:
- All setup, config, and troubleshooting steps with withdrawal-readiness checklist
9. SupervisorD Updates
Features:
- Priority, log rotation, auto-recovery and metric server startup with logging and environment
10. Automated Install Script (setup-monitoring.sh)
Features:
- Prepares all directories, permissions, envs, example configs
- Guides user post-install for secure and correct startup
Example: Enhanced Health Check Script (health/healthcheck.sh)
#!/usr/bin/env bash
set -e
LOG_FILE="/var/log/healthcheck.log"
MIN_DISK_SPACE_GB=10
# Write a timestamped line to stdout and append it to the log file.
log() {
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] [HEALTH] $*" | tee -a "$LOG_FILE"
}
# ...checks for all critical services, disk, sync...
# return nonzero exit for any failure, log status
main() {
    # Check all, log, and return status
    :  # no-op so the function body is valid bash until the checks are filled in
}
main
Example: Alert Manager (health/alert_manager.sh)
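#!/usr/bin/env bash
# Load channel credentials and settings.
source /opt/stellar/alert_config.env
log_alert() { echo "[$(date '+%Y-%m-%d %H:%M:%S')] [ALERT] $*" >> /var/log/alerts.log; }
# One sender per channel.
send_telegram() { ... }
send_email() { ... }
send_discord() { ... }
send_slack() { ... }
send_alert() {
    # call enabled channels, apply rate limit
    :  # no-op so the function body is valid bash until the dispatch is filled in
}
# Usage: alert_manager.sh MESSAGE [SEVERITY]; severity defaults to "info".
if [ -n "$1" ]; then send_alert "$1" "${2:-info}"; fi
Example: Monitoring Compose (docker-compose.production.yml)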
services:
  pi-node:
    image: pinetwork/pi-node-docker:organization_mainnet-v1.3-p19.6
    # ...see above in detailed plan...
  prometheus:
    # ...
  grafana:
    # ...
  node-exporter:
    # ...
volumes:
  prometheus-data:
  grafana-data:
Example: Alerting Rule (monitoring/alerts.yml)
groups:
  - name: pi_node_alerts
    rules:
      - alert: NodeDown
        expr: pi_node_horizon_up == 0 or pi_node_core_up == 0
      - alert: HighCPUUsage
        expr: pi_node_cpu_usage_percent > 90
      # etc.
Monitoring Quick Start (MONITORING.md excerpt)
1. cp alert_config.env.example alert_config.env # edit your alert info
2. docker compose -f docker-compose.production.yml up -d
3. Access Grafana (localhost:3000), Prometheus (localhost:9090), and Node Metrics (localhost:9105)
...
Upgrade Recommendations & Next Steps
How to turn this into a PR:
- Copy these features (as code/scripts/configs above) into your repo under the right paths (health/, metrics/, monitoring/, etc.)
- Document all new features in MONITORING.md and link it from the README
- Test locally (and in a testnet) using realistic failure scenarios
- Submit a new PR with a title like:
Enterprise-Ready Monitoring, Recovery, and Alert System for Pi Node Docker
- In PR body, list all enhancements (as above) and a withdrawal-readiness checklist
Thanks a lot for the detailed and thoughtful review, @Ze0ro99 🙏
Implement a health check script to monitor services and disk space.
Thank you for your response. I intend to help and submit withdrawal requests. It would also be beneficial to contribute more broadly.
Thank you, @Ze0ro99, I really appreciate your support and willingness to contribute further. Yes, absolutely. PR #20 is meant as a foundational step toward production-grade monitoring, and your enterprise upgrade proposal is now being used as the technical blueprint for the next phase. That’s why I’ve started a follow-up PR (#21) to evolve this into a full enterprise-ready monitoring, recovery, and alerting stack. Your input is directly shaping the direction of this project.
This PR adds production-grade monitoring:
– Docker health check
– Auto-recovery supervisor service
– Telegram real-time alerts
– Prometheus metrics exporter + Grafana ready
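For readers unfamiliar with the Telegram side, an alert is a single POST to the Bot API's `sendMessage` method. The following is a minimal sketch rather than the script in this PR: the environment variable names `TELEGRAM_BOT_TOKEN` and `TELEGRAM_CHAT_ID`, the `CURL` override, and the function names are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# telegram_url TOKEN: print the Bot API sendMessage endpoint for a bot token.
telegram_url() {
    printf 'https://api.telegram.org/bot%s/sendMessage' "$1"
}

# send_telegram TEXT: post TEXT to the chat configured in the environment.
# CURL can be overridden (for example CURL=echo) for a dry run without network access.
send_telegram() {
    # Abort with a message if either setting is missing.
    : "${TELEGRAM_BOT_TOKEN:?set TELEGRAM_BOT_TOKEN}"
    : "${TELEGRAM_CHAT_ID:?set TELEGRAM_CHAT_ID}"
    ${CURL:-curl} -fsS --max-time 10 \
        --data-urlencode "chat_id=${TELEGRAM_CHAT_ID}" \
        --data-urlencode "text=$1" \
        "$(telegram_url "$TELEGRAM_BOT_TOKEN")"
}
```

With the two variables exported, `send_telegram "Horizon is down on $(hostname)"` sends one message. The `--max-time` limit keeps a slow Telegram endpoint from stalling the health check loop.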