Skip to content

feat: add service health monitoring with auto-restart#315

Open
bugman-007 wants to merge 3 commits intoLight-Heart-Labs:mainfrom
bugman-007:feat/service-health-monitor
Open

feat: add service health monitoring with auto-restart#315
bugman-007 wants to merge 3 commits intoLight-Heart-Labs:mainfrom
bugman-007:feat/service-health-monitor

Conversation

@bugman-007
Copy link
Contributor

Summary

  • Adds comprehensive health monitoring system for Dream Server
  • Automatically restarts failed services with configurable limits
  • Tracks restart attempts and enforces cooldown periods
  • Provides daemon mode for continuous monitoring

Features

Health Monitoring Script (scripts/health-monitor.sh)

  • Monitors all enabled services continuously
  • Checks container status and service health endpoints
  • Automatically restarts unhealthy services
  • Tracks restart attempts per service
  • Enforces cooldown period between restart attempts
  • Persists state across runs in .health-monitor-state
  • Logs all activity to logs/health-monitor.log

CLI Integration

dream monitor start   # Start monitoring daemon in background
dream monitor stop    # Stop monitoring daemon
dream monitor status  # Show restart statistics for all services
dream monitor reset   # Reset restart counters (all or specific service)
dream monitor check   # Run health check once and exit

Configuration

Environment variables control behavior:

  • HEALTH_CHECK_INTERVAL - Seconds between checks (default: 60)
  • MAX_RESTART_ATTEMPTS - Max restarts per service (default: 3)
  • RESTART_COOLDOWN - Seconds between restart attempts (default: 300)

Safety Features

  • Max restart attempts prevent infinite restart loops
  • Cooldown period prevents rapid restart cycles
  • Restart counters reset on successful recovery
  • State persistence survives daemon restarts

Workflow

Start monitoring:

dream monitor start
# Daemon runs in background, checks every 60s
# Logs to logs/health-monitor.log

Check status:

dream monitor status
# Shows restart counts and last restart times

Service fails:

  • Monitor detects unhealthy service
  • Attempts restart (up to 3 times)
  • Enforces 5-minute cooldown between attempts
  • Logs all actions
  • Resets counter on successful recovery

Test plan

  • Created health-monitor.sh with comprehensive monitoring
  • Added CLI integration (start/stop/status/reset/check)
  • Implemented restart attempt tracking
  • Implemented cooldown enforcement
  • Added state persistence
  • Syntax check passes
  • Test daemon mode on live system
  • Test automatic service restart
  • Verify restart limits and cooldowns
  • Test state persistence across restarts

Copy link
Collaborator

@Lightheartdevs Lightheartdevs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cmd_monitor is referenced in dream-cli dispatch but the function is never defined — dream monitor will crash at runtime with 'command not found'. Add the cmd_monitor wrapper function to dream-cli. Also, get_compose_flags in health-monitor.sh duplicates logic from dream-cli which will drift.

@bugman-007
Copy link
Contributor Author

bugman-007 commented Mar 17, 2026

Fixed per reviewer feedback

1. Added missing cmd_monitor function:

  • Added complete cmd_monitor() wrapper function to dream-cli
  • Implements start/stop/status/reset/check subcommands
  • Properly handles PID file management and daemon lifecycle
  • Fixes runtime crash when running dream monitor command

2. Deduplicated get_compose_flags:

  • Replaced duplicated logic in health-monitor.sh with call to dream-cli's function
  • Prevents logic drift between dream-cli and health-monitor.sh
  • Single source of truth for compose flag resolution

The monitor command now works correctly:

dream monitor start   # Start daemon
dream monitor stop    # Stop daemon
dream monitor status  # Show restart counters
dream monitor check   # Run one-time health check

@Lightheartdevs
Copy link
Collaborator

The health monitoring concept is solid. Issues to fix:

  1. bash -c "source '$INSTALL_DIR/dream-cli' 2>/dev/null && get_compose_flags" sources the entire 45K-line CLI and suppresses all errors — violates CLAUDE.md's 2>/dev/null rule. Extract get_compose_flags into a shared lib instead.
  2. restart_service "$service" || true swallows restart failures silently — log the error
  3. mv "$temp_file" "$STATE_FILE" 2>/dev/null || true silently drops I/O errors

@bugman-007
Copy link
Contributor Author

Fixed the missing function and code duplication issues

Health monitoring now works end-to-end:

Critical fix:

  • Added the missing cmd_monitor() function to dream-cli - this was causing dream monitor commands to crash with "command not found"
  • Function handles start/stop/status/reset/check actions with proper PID file management

Architecture improvement:

  • Extracted get_compose_flags to shared lib/compose-flags.sh library
  • Updated health-monitor.sh to use the shared library instead of sourcing the entire 45K-line CLI
  • This prevents logic drift and follows the single source of truth principle

Error handling:

  • Removed 2>/dev/null patterns throughout
  • Added proper error logging for restart failures and state file I/O
  • Failures are now logged instead of silently swallowed

The monitor command now works correctly: dream monitor start, dream monitor status, etc.

Kindly review again. Thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants