Add Health Check, Auto-Recovery, Telegram Alerts, and Prometheus Metrics Exporter#20

Open
Clawue884 wants to merge 16 commits into PiCoreTeam:master from Clawue884:master

Conversation

@Clawue884

This PR adds production-grade monitoring:
– Docker health check
– Auto-recovery supervisor service
– Telegram real-time alerts
– Prometheus metrics exporter + Grafana ready

@Ze0ro99 Ze0ro99 left a comment

Here is a comprehensive, professional upgrade plan and set of production-ready features—fully detailed and suitable for both mainnet withdrawal readiness and enterprise-level node operations. This is a superset of the changes in PR #20, and you can use this as the technical basis for a follow-up PR or revised submission.


🚀 Enterprise-Grade Monitoring & Auto-Recovery Proposal for Pi Node Docker

1. Advanced Health Monitoring (health/healthcheck.sh)

Features:

  • Checks all critical services (Horizon, Stellar-Core, PostgreSQL)
  • Confirms blockchain sync and API health
  • Checks disk space and resource use, and logs the outcome of each check
  • Returns unique status codes and logs with timestamps
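A minimal sketch of what "unique status codes" could mean in practice. The code numbers and the `status_for` and `first_failure` helpers are hypothetical; the proposal does not define them.

```shell
#!/usr/bin/env bash
# Hypothetical mapping from a failed check to a distinct exit code.
# The numbers are illustrative only; the proposal does not specify them.
status_for() {
  case "$1" in
    horizon)  echo 10 ;;
    core)     echo 11 ;;
    postgres) echo 12 ;;
    disk)     echo 13 ;;
    *)        echo 1 ;;   # unknown check: generic failure
  esac
}

# first_failure NAME=RESULT ...
# Given check results ("ok" or anything else), print the status code of the
# first failed check, or 0 if every check passed.
first_failure() {
  local pair
  for pair in "$@"; do
    if [ "${pair#*=}" != "ok" ]; then
      status_for "${pair%%=*}"
      return
    fi
  done
  echo 0
}
```

A health check could then end with something like `exit "$(first_failure horizon=ok core=fail disk=ok)"`, so that Docker or a supervisor can tell which component failed from the exit code alone.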

2. Advanced Auto-Recovery System (health/auto_recover.sh)

Features:

  • Exponential backoff with a maximum retry count and a maximum wait time
  • Sends alerts on each failure/recovery step
  • Tracks recovery event history in a separate log file
  • Degrades gracefully and stops retrying when a service keeps flapping
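The backoff and retry-limit behaviour listed above can be sketched as follows. The function names, the 5-second base, and the 300-second cap are assumptions chosen for illustration, not values taken from the PR.

```shell
#!/usr/bin/env bash
# backoff_delay ATTEMPT [BASE] [MAX]
# Seconds to wait after failed attempt ATTEMPT (1-based): BASE * 2^(ATTEMPT-1),
# capped at MAX. Defaults (5 s base, 300 s cap) are illustrative assumptions.
backoff_delay() {
  local attempt=$1 base=${2:-5} max=${3:-300}
  local delay=$(( base * (1 << (attempt - 1)) ))
  if [ "$delay" -gt "$max" ]; then delay=$max; fi
  echo "$delay"
}

# recover_with_backoff MAX_RETRIES COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_RETRIES attempts have been used,
# sleeping with exponential backoff between attempts.
recover_with_backoff() {
  local max_retries=$1; shift
  local attempt
  for (( attempt = 1; attempt <= max_retries; attempt++ )); do
    if "$@"; then return 0; fi
    if [ "$attempt" -lt "$max_retries" ]; then
      sleep "$(backoff_delay "$attempt" "${BACKOFF_BASE:-5}" "${BACKOFF_MAX:-300}")"
    fi
  done
  return 1   # give up; the caller can alert and stop retrying
}
```

For example, `recover_with_backoff 5 supervisorctl restart horizon` would try up to five restarts, where `horizon` is an assumed Supervisor program name. Returning nonzero after the last attempt gives the caller a natural place to raise an alert instead of restarting a flapping service forever.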

3. Professional Multi-Channel Alerting (health/alert_manager.sh)

Features:

  • Supports Telegram, Email, Discord, Slack
  • Customizable severity, rate-limiting, deduplication
  • Configurable via env file (alert_config.env)
  • Alert history, status, and example template included
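Rate-limiting and deduplication could be handled with one timestamp file per alert key, as in this sketch. The `should_send` name, its arguments, and the state directory are hypothetical.

```shell
#!/usr/bin/env bash
# should_send KEY WINDOW_SECONDS STATE_DIR [NOW]
# Returns 0 (send) if no alert with the same KEY was recorded within the last
# WINDOW_SECONDS, and records the current time; returns 1 (suppress) otherwise.
# NOW defaults to the current epoch time and is a parameter only to ease testing.
should_send() {
  local key=$1 window=$2 state_dir=$3 now=${4:-$(date +%s)}
  local stamp_file last
  # Use a checksum of the key so arbitrary alert text maps to a safe file name.
  stamp_file="$state_dir/$(printf '%s' "$key" | cksum | cut -d' ' -f1)"
  if [ -f "$stamp_file" ]; then
    last=$(cat "$stamp_file")
    if [ $(( now - last )) -lt "$window" ]; then return 1; fi
  fi
  mkdir -p "$state_dir"
  printf '%s\n' "$now" > "$stamp_file"
  return 0
}
```

A caller might write `should_send "horizon down" 300 /var/tmp/alert_state && send_alert "horizon down" critical`, with the state path being an assumption. A CRC checksum can collide for different keys, so a stronger hash would be safer if alert keys are numerous.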

4. Production Metrics and Monitoring (metrics/node_metrics.sh)

Features:

  • Prometheus-formatted output with service, resource, ledger, and custom node metrics
  • Exposes hardware usage (CPU/Memory/Disk) and blockchain indicators for dashboards and alerting
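For reference, "Prometheus-formatted output" means the plain-text exposition format: optional `# HELP` and `# TYPE` lines followed by `name value` samples. A small sketch, with `emit_metric` and `disk_used_percent` as hypothetical helper names:

```shell
#!/usr/bin/env bash
# emit_metric NAME TYPE HELP VALUE
# Print a single unlabelled metric in the Prometheus text exposition format.
# HELP must not contain raw newlines or backslashes (they would need escaping).
emit_metric() {
  local name=$1 type=$2 help=$3 value=$4
  printf '# HELP %s %s\n' "$name" "$help"
  printf '# TYPE %s %s\n' "$name" "$type"
  printf '%s %s\n' "$name" "$value"
}

# disk_used_percent [PATH]
# Percentage of used space on the filesystem that holds PATH (default /).
disk_used_percent() {
  df -P "${1:-/}" | awk 'NR==2 {gsub("%", "", $5); print $5}'
}
```

Calling `emit_metric pi_node_horizon_up gauge "Whether Horizon responds" 1` prints three lines that Prometheus can scrape once they are served over HTTP. The metric name `pi_node_horizon_up` is the one used in the alert rule example in this comment; the help text is made up.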

5. Hardened Docker Compose Files (docker-compose.production.yml)

Features:

  • Integrated stack: node, Prometheus, Grafana, Node Exporter for host metrics, alert config volumes
  • Port mapping and healthcheck embedded
  • Logging & resource control

6. Comprehensive Prometheus & Alert Rules (monitoring/prometheus.yml, alerts.yml)

Features:

  • Prometheus scrapes all key services
  • Sample alert rules (down, high CPU/RAM/Disk, etc.)

7. Grafana Dashboard Provisioning

Features:

  • Ready-to-import dashboard JSON file for full overview (service status, sync, resource, alerts, events)

8. New Monitoring Documentation (MONITORING.md)

Features:

  • All setup, config, and troubleshooting steps with withdrawal-readiness checklist

9. SupervisorD Updates

Features:

  • Sets process priorities and log rotation, and starts the auto-recovery and metrics-server processes with their own logging and environment settings

10. Automated Install Script (setup-monitoring.sh)

Features:

  • Prepares all directories, permissions, envs, example configs
  • Guides user post-install for secure and correct startup

Example: Enhanced Health Check Script (health/healthcheck.sh)

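#!/usr/bin/env bash
set -e
LOG_FILE="/var/log/healthcheck.log"
MIN_DISK_SPACE_GB=10

log() {
  echo "[$(date '+%Y-%m-%d %H:%M:%S')] [HEALTH] $*" | tee -a "$LOG_FILE"
}

# Disk check: free space on / (an assumed mount point) in whole GB
check_disk() {
  local free_gb
  free_gb=$(df -Pk / | awk 'NR==2 {print int($4 / 1048576)}')
  [ "$free_gb" -ge "$MIN_DISK_SPACE_GB" ]
}
# ...checks for the other critical services and sync go here...

main() {
  # Run every check, log each result, and return nonzero if any failed
  local status=0
  if check_disk; then log "disk OK"; else log "disk below ${MIN_DISK_SPACE_GB} GB"; status=1; fi
  return "$status"
}
main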
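The sync check left out of the skeleton above might compare two ledger numbers from Horizon's root endpoint. This sketch assumes the root JSON contains the `core_latest_ledger` and `history_latest_ledger` fields, as upstream Stellar Horizon does; whether the Pi build exposes the same fields is an assumption, and `json_int` and `sync_lag` are hypothetical helpers.

```shell
#!/usr/bin/env bash
# json_int FIELD
# Read JSON on stdin and print the integer value of FIELD (first match only).
# A deliberately simple grep-based extractor; jq would be more robust.
json_int() {
  grep -o "\"$1\"[[:space:]]*:[[:space:]]*[0-9][0-9]*" | head -n 1 | grep -o '[0-9][0-9]*$'
}

# sync_lag JSON
# Print how many ledgers Horizon's history trails Stellar-Core by.
# Returns 2 if either field is missing from the JSON.
sync_lag() {
  local core history
  core=$(printf '%s' "$1" | json_int core_latest_ledger)
  history=$(printf '%s' "$1" | json_int history_latest_ledger)
  [ -n "$core" ] && [ -n "$history" ] || return 2
  echo $(( core - history ))
}
```

A check could then run `lag=$(sync_lag "$(curl -fsS http://localhost:8000/)")` and fail when `lag` exceeds a chosen threshold. Port 8000 and the threshold are assumptions to adjust for the actual deployment.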

Example: Alert Manager (health/alert_manager.sh)

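#!/usr/bin/env bash
# Load channel settings from the env file
source /opt/stellar/alert_config.env
log_alert() { echo "[$(date '+%Y-%m-%d %H:%M:%S')] [ALERT] $*" >> /var/log/alerts.log; }
# Channel senders (bodies omitted here; no-op placeholders keep the script valid)
send_telegram() { :; }
send_email() { :; }
send_discord() { :; }
send_slack() { :; }
send_alert() {
    # call enabled channels, apply rate limit
    log_alert "[$2] $1"
}
# Usage: alert_manager.sh MESSAGE [SEVERITY]
if [ -n "${1:-}" ]; then send_alert "$1" "${2:-info}"; fi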
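As an illustration of one channel, a Telegram sender could call the Bot API `sendMessage` method with `curl`. The variable names `TELEGRAM_BOT_TOKEN` and `TELEGRAM_CHAT_ID` and the helper functions are assumptions; the actual keys in `alert_config.env` are not shown in this thread.

```shell
#!/usr/bin/env bash
# telegram_url TOKEN
# sendMessage endpoint of the Telegram Bot API for the given bot token.
telegram_url() {
  printf 'https://api.telegram.org/bot%s/sendMessage' "$1"
}

# format_alert SEVERITY MESSAGE
# One-line alert text with an upper-cased severity tag.
format_alert() {
  printf '[%s] %s' "$(printf '%s' "$1" | tr '[:lower:]' '[:upper:]')" "$2"
}

# send_telegram MESSAGE [SEVERITY]
# Posts the alert; returns 1 without calling the API if the (assumed)
# TELEGRAM_BOT_TOKEN or TELEGRAM_CHAT_ID variables are unset.
send_telegram() {
  [ -n "${TELEGRAM_BOT_TOKEN:-}" ] && [ -n "${TELEGRAM_CHAT_ID:-}" ] || return 1
  curl -fsS --max-time 10 \
    --data-urlencode "chat_id=${TELEGRAM_CHAT_ID}" \
    --data-urlencode "text=$(format_alert "${2:-info}" "$1")" \
    "$(telegram_url "$TELEGRAM_BOT_TOKEN")" > /dev/null
}
```

The `--max-time` limit keeps a slow Telegram response from blocking the health or recovery loop that raised the alert.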

Example: Monitoring Compose (docker-compose.production.yml)

services:
  pi-node:
    image: pinetwork/pi-node-docker:organization_mainnet-v1.3-p19.6
    # ...see above in detailed plan...
  prometheus:
    # ...
  grafana:
    # ...
  node-exporter:
    # ...
volumes:
  prometheus-data:
  grafana-data:

Example: Alerting Rule (monitoring/alerts.yml)

groups:
  - name: pi_node_alerts
    rules:
      - alert: NodeDown
        expr: pi_node_horizon_up == 0 or pi_node_core_up == 0
      - alert: HighCPUUsage
        expr: pi_node_cpu_usage_percent > 90
      # etc.

Monitoring Quick Start (MONITORING.md excerpt)

1. cp alert_config.env.example alert_config.env  # edit your alert info
2. docker compose -f docker-compose.production.yml up -d
3. Access Grafana (localhost:3000), Prometheus (localhost:9090), and Node Metrics (localhost:9105)
...

Upgrade Recommendations & Next Steps

How to turn this into a PR:

  • Copy these features (as code/scripts/configs above) into your repo under the right paths (health/, metrics/, monitoring/, etc.)
  • Document all new features in MONITORING.md and link it from the README
  • Test locally (and in a testnet) using realistic failure scenarios
  • Submit a new PR with a title like:
    • Enterprise-Ready Monitoring, Recovery, and Alert System for Pi Node Docker
  • In PR body, list all enhancements (as above) and a withdrawal-readiness checklist

@Clawue884
Author

Thanks a lot for the detailed and thoughtful review, @Ze0ro99 🙏
I really appreciate the enterprise-grade direction you outlined. Your proposal goes far beyond basic monitoring and moves Pi Node Docker toward true production / mainnet-withdrawal readiness.
This PR (#20) was intended as a foundational step: introducing health checks, auto-recovery, real-time alerting, and Prometheus-compatible metrics in a lightweight and easily adoptable way.
Your suggestions around:
• advanced health logic
• multi-channel alert manager
• Prometheus alert rules
• Grafana dashboard provisioning
• hardened compose files
• documentation + install automation
are excellent and align perfectly with the long-term vision of running Pi Nodes at enterprise / infrastructure-grade standards.
I plan to prepare a follow-up PR that incorporates these ideas in a structured way (health/, metrics/, monitoring/, docs, provisioning, etc.) so the repo can evolve toward a full production-ready monitoring & recovery stack.
Thanks again for the high-quality technical feedback — it’s exactly the kind of engineering discussion this project needs. 🚀

@Ze0ro99

Ze0ro99 commented Feb 6, 2026

Thank you for your response. I intend to help and submit withdrawal requests. It would also be beneficial to contribute more broadly.

@Clawue884
Author

Thank you, @Ze0ro99 — really appreciate your support and willingness to contribute further.

Yes, absolutely. PR #20 is meant as a foundational step toward production-grade monitoring, and your enterprise upgrade proposal is now being used as the technical blueprint for the next phase.

That’s why I’ve started a follow-up PR (#21) to evolve this into a full enterprise-ready monitoring, recovery, and alerting stack:
– structured health/metrics/monitoring layout
– Prometheus + Grafana provisioning
– alert rules + install automation
– hardened compose for production

Your input is directly shaping the direction of this project.
Looking forward to collaborating more closely on making Pi Node Docker truly mainnet / withdrawal-ready at an infrastructure level 🚀
