Add Health Check, Auto-Recovery, Telegram Alerts, and Prometheus Metrics Exporter by Clawue884 · Pull Request #20 · PiCoreTeam/pi-node-docker

Clawue884 · 2026-02-05T18:14:32Z

This PR adds production-grade monitoring:
– Docker health check
– Auto-recovery supervisor service
– Telegram real-time alerts
– Prometheus metrics exporter (Grafana-ready)

Ze0ro99

Here is a comprehensive, professional upgrade plan and set of production-ready features—fully detailed and suitable for both mainnet withdrawal readiness and enterprise-level node operations. This is a superset of the changes in PR #20, and you can use this as the technical basis for a follow-up PR or revised submission.

🚀 Enterprise-Grade Monitoring & Auto-Recovery Proposal for Pi Node Docker

1. Advanced Health Monitoring (`health/healthcheck.sh`)

Features:

Checks all critical services (Horizon, Stellar-Core, PostgreSQL)
Confirms blockchain sync and API health
Checks disk space, resource use, and logs health outcomes
Returns unique status codes and logs with timestamps

2. Advanced Auto-Recovery System (`health/auto_recover.sh`)

Features:

Exponential backoff, max retry & wait logic
Sends alerts on each failure/recovery step
Tracks recovery event history in separate log file
Graceful degradation + disables repeated flapping
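
The backoff and max-retry behaviour listed above could be sketched roughly as follows. This is only an illustration under stated assumptions: `BASE_DELAY`, `MAX_DELAY`, `MAX_RETRIES`, `backoff_delay`, and `recover_with_backoff` are hypothetical names, not taken from the PR.

```shell
#!/usr/bin/env bash
# Sketch of retry-with-exponential-backoff for an auto-recovery loop.
# All variable and function names here are hypothetical.

BASE_DELAY="${BASE_DELAY:-2}"     # seconds to wait before the first retry
MAX_DELAY="${MAX_DELAY:-300}"     # upper bound on any single wait
MAX_RETRIES="${MAX_RETRIES:-5}"   # give up after this many attempts

# Print the wait in seconds before retry number $1 (0-based):
# BASE_DELAY * 2^attempt, capped at MAX_DELAY.
backoff_delay() {
  local attempt="$1" delay
  delay=$(( BASE_DELAY * (1 << attempt) ))
  if [ "$delay" -gt "$MAX_DELAY" ]; then
    delay="$MAX_DELAY"
  fi
  echo "$delay"
}

# Run the given command until it succeeds or MAX_RETRIES is reached.
# Returns 0 on success, 1 once the retries are exhausted.
recover_with_backoff() {
  local attempt=0
  while [ "$attempt" -lt "$MAX_RETRIES" ]; do
    if "$@"; then
      return 0
    fi
    sleep "$(backoff_delay "$attempt")"
    attempt=$(( attempt + 1 ))
  done
  return 1
}
```

A caller might wrap a restart command, for example `recover_with_backoff supervisorctl restart horizon`, where the supervisord program name `horizon` is an assumption. Capping the delay keeps a long outage from pushing the next attempt arbitrarily far out.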

3. Professional Multi-Channel Alerting (`health/alert_manager.sh`)

Features:

Supports Telegram, Email, Discord, Slack
Customizable severity, rate-limiting, deduplication
Configurable via env file (alert_config.env)
Alert history, status, and example template included
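
Rate-limiting and deduplication could be approached with one timestamp file per alert key, as in the sketch below. `ALERT_STATE_DIR`, `ALERT_MIN_INTERVAL`, and `should_send` are hypothetical names chosen for illustration; the proposal does not specify how it tracks alert state.

```shell
#!/usr/bin/env bash
# Sketch of alert deduplication with a minimum interval per alert key.
# All names here are hypothetical.

ALERT_STATE_DIR="${ALERT_STATE_DIR:-/tmp/pi-node-alerts}"
ALERT_MIN_INTERVAL="${ALERT_MIN_INTERVAL:-300}"  # seconds between identical alerts

# Return 0 if an alert with this key may be sent now, 1 if it should be
# suppressed because the same key was sent less than ALERT_MIN_INTERVAL ago.
should_send() {
  local key="$1" now last stamp_file
  mkdir -p "$ALERT_STATE_DIR"
  # Checksum the key so arbitrary message text maps to a safe file name.
  stamp_file="$ALERT_STATE_DIR/$(printf '%s' "$key" | cksum | cut -d' ' -f1)"
  now=$(date +%s)
  if [ -f "$stamp_file" ]; then
    last=$(cat "$stamp_file")
    if [ $(( now - last )) -lt "$ALERT_MIN_INTERVAL" ]; then
      return 1
    fi
  fi
  # Record the send time only when the alert is allowed through.
  echo "$now" > "$stamp_file"
  return 0
}
```

A sender would then guard each channel call with something like `should_send "$message" && send_telegram "$message"`.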

4. Production Metrics and Monitoring (`metrics/node_metrics.sh`)

Features:

Prometheus-formatted output with service, resource, ledger, and custom node metrics
Exposes hardware usage (CPU/Memory/Disk) and blockchain indicators for dashboards and alerting
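
As a rough sketch of what Prometheus-formatted output from a shell script can look like: the metric names `pi_node_horizon_up` and `pi_node_core_up` appear in the alert rule example in this proposal, while `pi_node_disk_used_percent`, the function names, and the default URLs (Horizon on port 8000, stellar-core on port 11626) are assumptions.

```shell
#!/usr/bin/env bash
# Sketch of a metrics script that prints the Prometheus text exposition format.

# emit_gauge NAME HELP VALUE: print one gauge with its HELP and TYPE lines.
emit_gauge() {
  printf '# HELP %s %s\n# TYPE %s gauge\n%s %s\n' "$1" "$2" "$1" "$1" "$3"
}

# probe URL: print 1 if the URL answers with an HTTP success code, otherwise 0.
probe() {
  if curl -fsS --max-time 5 -o /dev/null "$1" 2>/dev/null; then
    echo 1
  else
    echo 0
  fi
}

# Percentage of the root filesystem in use, read from POSIX-format df output.
disk_used_percent() {
  df -P / | awk 'NR==2 { gsub("%", "", $5); print $5 }'
}

# Print all metrics; the URLs are assumed defaults and can be overridden.
collect_metrics() {
  emit_gauge pi_node_horizon_up "1 if Horizon responds over HTTP" \
    "$(probe "${HORIZON_URL:-http://localhost:8000}")"
  emit_gauge pi_node_core_up "1 if stellar-core responds over HTTP" \
    "$(probe "${CORE_URL:-http://localhost:11626/info}")"
  emit_gauge pi_node_disk_used_percent "Root filesystem usage in percent" \
    "$(disk_used_percent)"
}
```

Calling `collect_metrics` prints three lines per metric, which a small HTTP server or a textfile collector could then expose to Prometheus.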

5. Hardened Docker Compose Files (`docker-compose.production.yml`)

Features:

Integrated stack: node, Prometheus, Grafana, Node Exporter for host metrics, alert config volumes
Port mapping and healthcheck embedded
Logging & resource control

6. Comprehensive Prometheus & Alert Rules (`monitoring/prometheus.yml`, `alerts.yml`)

Features:

Prometheus scrapes all key services
Sample alert rules (down, high CPU/RAM/Disk, etc.)

7. Grafana Dashboard Provisioning

Features:

Ready-to-import dashboard JSON file for full overview (service status, sync, resource, alerts, events)

8. New Monitoring Documentation (`MONITORING.md`)

Features:

All setup, config, and troubleshooting steps with withdrawal-readiness checklist

9. SupervisorD Updates

Features:

Priority, log rotation, auto-recovery and metric server startup with logging and environment

10. Automated Install Script (`setup-monitoring.sh`)

Features:

Prepares all directories, permissions, envs, example configs
Guides user post-install for secure and correct startup
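
The directory and permission preparation described above might look something like this sketch. The `health/`, `metrics/`, and `monitoring/` paths and the `alert_config.env.example` file name come from this proposal; the function name `setup_monitoring` and the exact permission choices are assumptions.

```shell
#!/usr/bin/env bash
# Sketch of an install helper; setup_monitoring is a hypothetical name.

# setup_monitoring ROOT: create the monitoring layout under ROOT.
setup_monitoring() {
  local root="$1" dir
  for dir in health metrics monitoring; do
    mkdir -p "$root/$dir"
  done
  # Seed the alert config from its example without overwriting an existing file.
  if [ -f "$root/alert_config.env.example" ] && [ ! -f "$root/alert_config.env" ]; then
    cp "$root/alert_config.env.example" "$root/alert_config.env"
  fi
  # The env file is expected to hold tokens, so restrict it to the owner.
  if [ -f "$root/alert_config.env" ]; then
    chmod 600 "$root/alert_config.env"
  fi
  # Make any scripts already placed in health/ and metrics/ executable.
  find "$root/health" "$root/metrics" -type f -name '*.sh' -exec chmod +x {} +
}
```

Running it twice is harmless: existing directories are left alone and an edited `alert_config.env` is never replaced.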

Example: Enhanced Health Check Script (health/healthcheck.sh)

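#!/usr/bin/env bash
set -e
LOG_FILE="/var/log/healthcheck.log"
MIN_DISK_SPACE_GB=10

# Print a timestamped message and append it to the log file.
log() {
  echo "[$(date '+%Y-%m-%d %H:%M:%S')] [HEALTH] $*" | tee -a "$LOG_FILE"
}
# ...checks for all critical services, disk, sync...
# return nonzero exit for any failure, log status

main() {
  local status=0
  # Check all, log, and set status to nonzero on any failure
  log "Health check finished with status $status"
  return "$status"
}
main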
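
The checks elided in the script above could be filled in along these lines. This is a sketch under assumptions, not the proposal's implementation: the exit codes (1 for Horizon, 2 for stellar-core, 3 for disk), the function names, and the default URLs (Horizon on port 8000, stellar-core on port 11626) are all illustrative.

```shell
#!/usr/bin/env bash
# Sketch of health checks with one exit code per failure type.
# Codes, names, and URLs are assumptions made for illustration.

MIN_DISK_SPACE_GB="${MIN_DISK_SPACE_GB:-10}"

# Succeed only if the URL answers with an HTTP success code within 5 seconds.
http_ok() {
  curl -fsS --max-time 5 -o /dev/null "$1" 2>/dev/null
}

# Free space on the root filesystem in whole GiB (df -Pk reports 1024-byte blocks).
free_disk_gb() {
  df -Pk / | awk 'NR==2 { print int($4 / 1024 / 1024) }'
}

# Return 0 when healthy, otherwise the code of the first failing check.
run_checks() {
  http_ok "${HORIZON_URL:-http://localhost:8000}" || return 1
  http_ok "${CORE_URL:-http://localhost:11626/info}" || return 2
  [ "$(free_disk_gb)" -ge "$MIN_DISK_SPACE_GB" ] || return 3
  return 0
}
```

Distinct codes let a caller such as an auto-recovery loop decide which service to restart without parsing log text.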

Example: Alert Manager (health/alert_manager.sh)

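#!/usr/bin/env bash
# Load channel settings (tokens, webhooks, enabled channels).
source /opt/stellar/alert_config.env
log_alert() { echo "[$(date '+%Y-%m-%d %H:%M:%S')] [ALERT] $*" >> /var/log/alerts.log; }
# Channel senders; the bodies are elided in this proposal.
send_telegram() { :; }  # ...
send_email() { :; }     # ...
send_discord() { :; }   # ...
send_slack() { :; }     # ...
send_alert() {
    # call enabled channels, apply rate limit
    log_alert "[$2] $1"
}
# Usage: alert_manager.sh MESSAGE [SEVERITY]; severity defaults to "info".
if [ -n "${1:-}" ]; then send_alert "$1" "${2:-info}"; fi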
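
One of the elided senders, `send_telegram`, could be sketched with the Telegram Bot API `sendMessage` method. The variable names `TELEGRAM_BOT_TOKEN` and `TELEGRAM_CHAT_ID` and the helper `telegram_url` are assumptions; the proposal does not show what `alert_config.env` contains.

```shell
#!/usr/bin/env bash
# Sketch of a Telegram sender; variable and helper names are hypothetical.

# Build the sendMessage endpoint URL for the given bot token.
telegram_url() {
  printf 'https://api.telegram.org/bot%s/sendMessage' "$1"
}

# send_telegram TEXT: post TEXT to the configured chat.
# Returns 2 if the configuration is missing, otherwise curl's exit status.
send_telegram() {
  local text="$1"
  if [ -z "${TELEGRAM_BOT_TOKEN:-}" ] || [ -z "${TELEGRAM_CHAT_ID:-}" ]; then
    echo "telegram: TELEGRAM_BOT_TOKEN or TELEGRAM_CHAT_ID is not set" >&2
    return 2
  fi
  curl -fsS --max-time 10 -o /dev/null \
    --data-urlencode "chat_id=${TELEGRAM_CHAT_ID}" \
    --data-urlencode "text=${text}" \
    "$(telegram_url "$TELEGRAM_BOT_TOKEN")"
}
```

Using `--data-urlencode` keeps alert text with spaces or special characters intact, and the timeout stops a slow network from stalling the alert path.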

Example: Monitoring Compose (`docker-compose.production.yml`)

services:
  pi-node:
    image: pinetwork/pi-node-docker:organization_mainnet-v1.3-p19.6
    # ...see above in detailed plan...
  prometheus:
    # ...
  grafana:
    # ...
  node-exporter:
    # ...
volumes:
  prometheus-data:
  grafana-data:

Example: Alerting Rule (monitoring/alerts.yml)

groups:
  - name: pi_node_alerts
    rules:
      - alert: NodeDown
        expr: pi_node_horizon_up == 0 or pi_node_core_up == 0
      - alert: HighCPUUsage
        expr: pi_node_cpu_usage_percent > 90
      # etc.

Monitoring Quick Start (MONITORING.md excerpt)

1. cp alert_config.env.example alert_config.env  # edit your alert info
2. docker compose -f docker-compose.production.yml up -d
3. Access Grafana (localhost:3000), Prometheus (9090), Node Metrics (9105)
...

Upgrade Recommendations & Next Steps

How to turn this into a PR:

- Copy these features (as code/scripts/configs above) into your repo under the right paths (health/, metrics/, monitoring/, etc.)
- Document all new features in MONITORING.md and link it from the README
- Test locally (and on a testnet) using realistic failure scenarios
- Submit a new PR with a title like "Enterprise-Ready Monitoring, Recovery, and Alert System for Pi Node Docker"
- In the PR body, list all enhancements (as above) and a withdrawal-readiness checklist

Clawue884 · 2026-02-06T01:32:32Z

Thanks a lot for the detailed and thoughtful review, @Ze0ro99 🙏
I really appreciate the enterprise-grade direction you outlined. Your proposal goes far beyond basic monitoring and moves Pi Node Docker toward true production / mainnet-withdrawal readiness.
This PR (#20) was intended as a foundational step: introducing health checks, auto-recovery, real-time alerting, and Prometheus-compatible metrics in a lightweight and easily adoptable way.
Your suggestions around:
• advanced health logic
• multi-channel alert manager
• Prometheus alert rules
• Grafana dashboard provisioning
• hardened compose files
• documentation + install automation
are excellent and align perfectly with the long-term vision of running Pi Nodes at enterprise / infrastructure-grade standards.
I plan to prepare a follow-up PR that incorporates these ideas in a structured way (health/, metrics/, monitoring/, docs, provisioning, etc.) so the repo can evolve toward a full production-ready monitoring & recovery stack.
Thanks again for the high-quality technical feedback — it’s exactly the kind of engineering discussion this project needs. 🚀

Ze0ro99 · 2026-02-06T18:45:52Z

Thank you for your response. I intend to help and submit withdrawal requests. It would also be beneficial to contribute more broadly.

Clawue884 · 2026-02-08T04:05:49Z

Thank you, @Ze0ro99 — really appreciate your support and willingness to contribute further.

Yes, absolutely. PR #20 is meant as a foundational step toward production-grade monitoring, and your enterprise upgrade proposal is now being used as the technical blueprint for the next phase.

That’s why I’ve started a follow-up PR (#21) to evolve this into a full enterprise-ready monitoring, recovery, and alerting stack:
– structured health/metrics/monitoring layout
– Prometheus + Grafana provisioning
– alert rules + install automation
– hardened compose for production

Your input is directly shaping the direction of this project.
Looking forward to collaborating more closely on making Pi Node Docker truly mainnet / withdrawal-ready at an infrastructure level 🚀

Clawue884 added 13 commits February 6, 2026 01:06

Add Telegram alert script for notifications

927af54

Add auto_recover.sh for health monitoring

013aaeb

Implement an auto-recovery script that checks node health and restarts services if unhealthy.

Add supervisord configuration for auto_recover program

bce730e

Create node_metrics.sh for service health checks

3e4469f

Add a script to check the status of Horizon and Stellar-core services.

Add metrics server script to serve metrics

36117a9

Add supervisord configuration for metrics server

cb6eb36

Add Prometheus configuration for pi-node monitoring

fd7e2ea

Add Docker Compose configuration for monitoring services

cc2f2aa

Add healthcheck script for Horizon and stellar-core

2b9f339

This script checks the health of the Horizon and stellar-core services by making HTTP requests and reporting their status.

Add health check support to Dockerfile

89e8c4e

Add health check script and configure Docker healthcheck.

Add Docker Compose configuration for mainnet service

d739fba

Add auto-recovery script for service monitoring

79d056a

This script continuously checks the health of services and restarts them if they are not healthy.

Add auto_recover program configuration to supervisord

25b815e

Ze0ro99 suggested changes Feb 5, 2026

Clawue884 added 3 commits February 6, 2026 09:33

Add healthcheck_v2.sh for service and disk monitoring

605486d

Implement a health check script to monitor services and disk space.

Add healthcheck auto-recovery script

71b1e5a

Add alert_manager.sh for health alert management

7104a58

Clawue884 mentioned this pull request Feb 6, 2026

Enterprise-Ready Monitoring, Recovery & Alerting Stack for Pi Node Docker #21

Open

This was referenced Feb 8, 2026

Enterprise Monitoring Stack: Prometheus + Grafana + Auto-Recovery for Pi Node Docker #24

Open

Enterprise Monitoring Stack: Prometheus + Grafana + Auto-Recovery for Pi Node Docker #25

Open

Clawue884 wants to merge 16 commits into PiCoreTeam:master from Clawue884:master

2 participants
