Open
Labels: enhancement (New feature or request)
Description
We need hands-off alerting for node/training failures.
Scope:
- Add an alert sender with a pluggable backend. A Discord webhook is preferred (free tier); a Slack webhook or SMTP email are free alternatives.
- Trigger on critical self-check failures (S3 down, missing env, vendor outage), and on repeated job failures.
- Integrate into node loop and training loop; keep alerts rate-limited and deduplicated.
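A minimal sketch of the deduplication and rate limiting mentioned in the scope, assuming each alert carries a stable string key (for example `s3_down`). The class name `AlertThrottle`, the method names, and the injectable clock are hypothetical and not part of the issue; the cooldown value is up to the caller.

```python
import time
from typing import Callable, Dict


class AlertThrottle:
    """Suppress repeats of the same alert key inside a cooldown window."""

    def __init__(
        self,
        cooldown_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.cooldown_seconds = cooldown_seconds
        self._clock = clock  # injectable so tests can fake the passage of time
        self._last_sent: Dict[str, float] = {}

    def should_send(self, key: str) -> bool:
        """Return True and record the send time if the key is not in cooldown."""
        now = self._clock()
        last = self._last_sent.get(key)
        if last is not None and now - last < self.cooldown_seconds:
            return False  # duplicate within the window: drop it
        self._last_sent[key] = now
        return True

    def clear(self, key: str) -> None:
        """Forget a key, e.g. once the underlying problem is resolved."""
        self._last_sent.pop(key, None)
```

The node loop and training loop could share one instance and call `should_send(key)` before each alert. A cooldown of 86400 seconds would give the once-per-day behavior for warnings.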
Proposed approach:
- Implement ops/alerts.py with send_alert(title, body) and backends: discord_webhook, slack_webhook, email(smtp).
- Configure via .env: ALERT_BACKEND, DISCORD_WEBHOOK_URL, SLACK_WEBHOOK_URL, SMTP_*.
- Add a small circuit breaker to avoid alert storms.
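The approach above could look roughly like the following sketch of ops/alerts.py. It uses only the standard library and covers the two webhook backends; the SMTP backend is left out for brevity. Beyond `send_alert(title, body)`, the backend names, and the environment variables listed in the issue, everything is an assumption: the `backends` and `breaker` parameters, the `AlertCircuitBreaker` class, the payload helpers, and the `User-Agent` value are all hypothetical.

```python
import collections
import json
import logging
import os
import time
import urllib.request
from typing import Callable, Deque, Dict, Optional

logger = logging.getLogger("ops.alerts")
Backend = Callable[[str, str], None]


def discord_payload(title: str, body: str) -> dict:
    # Discord webhooks take a JSON "content" field limited to 2000 characters.
    return {"content": f"**{title}**\n{body}"[:2000]}


def slack_payload(title: str, body: str) -> dict:
    # Slack incoming webhooks take a JSON "text" field.
    return {"text": f"*{title}*\n{body}"}


def _post_json(url: str, payload: dict, timeout: float = 10.0) -> None:
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        # Explicit User-Agent (assumption: the default urllib one may be rejected).
        headers={"Content-Type": "application/json", "User-Agent": "ops-alerts/0.1"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout):
        pass  # urlopen raises HTTPError on 4xx/5xx responses


def discord_webhook(title: str, body: str) -> None:
    _post_json(os.environ["DISCORD_WEBHOOK_URL"], discord_payload(title, body))


def slack_webhook(title: str, body: str) -> None:
    _post_json(os.environ["SLACK_WEBHOOK_URL"], slack_payload(title, body))


BACKENDS: Dict[str, Backend] = {
    "discord_webhook": discord_webhook,
    "slack_webhook": slack_webhook,
}


class AlertCircuitBreaker:
    """Reject alerts once max_alerts were already sent within window_seconds."""

    def __init__(self, max_alerts: int, window_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_alerts = max_alerts
        self.window_seconds = window_seconds
        self._clock = clock
        self._sent: Deque[float] = collections.deque()

    def allow(self) -> bool:
        now = self._clock()
        # Drop timestamps that have aged out of the sliding window.
        while self._sent and now - self._sent[0] >= self.window_seconds:
            self._sent.popleft()
        if len(self._sent) >= self.max_alerts:
            return False
        self._sent.append(now)
        return True


def send_alert(title: str, body: str, *,
               backends: Optional[Dict[str, Backend]] = None,
               breaker: Optional[AlertCircuitBreaker] = None) -> bool:
    """Send one alert through the backend named by ALERT_BACKEND.

    Returns True if the backend call succeeded, False otherwise.
    """
    registry = BACKENDS if backends is None else backends
    name = os.environ.get("ALERT_BACKEND", "discord_webhook")
    backend = registry.get(name)
    if backend is None:
        logger.error("Unknown ALERT_BACKEND %r; alert dropped: %s", name, title)
        return False
    if breaker is not None and not breaker.allow():
        logger.warning("Alert suppressed by circuit breaker: %s", title)
        return False
    try:
        backend(title, body)
    except Exception:
        # Alerting must never crash the node or training loop.
        logger.exception("Alert backend %r failed for: %s", name, title)
        return False
    logger.info("Alert sent via %s: %s", name, title)
    return True
```

Keeping payload construction separate from the HTTP call lets the formatting be tested without network access, and the `backends` argument allows a fake backend to be injected in tests.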
Acceptance:
- Simulated S3 outage -> one alert created and logged, retries after backoff without spamming.
- Missing vendor keys -> a warning is sent once per day until resolved.
- Documentation in README with setup instructions.
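One reading of the S3 outage criterion is that the alert fires only when a check changes from passing to failing, while the check itself is retried on an exponential backoff. The sketch below illustrates that reading. The names `backoff_delays` and `CheckMonitor` and the delay values (30 s base, doubling, 15 min cap) are assumptions, not requirements from the issue.

```python
from typing import Callable, Iterator


def backoff_delays(base: float = 30.0, factor: float = 2.0,
                   cap: float = 900.0) -> Iterator[float]:
    """Yield exponentially growing retry delays, never exceeding cap."""
    delay = base
    while True:
        yield min(delay, cap)
        delay = min(delay * factor, cap)


class CheckMonitor:
    """Alert once when a check starts failing; stay quiet while it keeps failing."""

    def __init__(self, name: str, alert: Callable[[str, str], object]) -> None:
        self.name = name
        self._alert = alert  # e.g. send_alert(title, body)
        self._failing = False
        self._delays = backoff_delays()

    def record(self, ok: bool) -> float:
        """Record a check result; return the backoff delay before the next retry.

        Returns 0.0 when the check passed, meaning no backoff is needed.
        """
        if ok:
            # Recovery resets the state, so a later outage alerts again.
            self._failing = False
            self._delays = backoff_delays()
            return 0.0
        if not self._failing:
            self._failing = True
            self._alert(f"{self.name} failing", "Self-check failed; retrying with backoff.")
        return next(self._delays)
```

In a simulated outage, repeated calls to `record(False)` produce a single alert and growing delays; a later `record(True)` resets the monitor so the next outage alerts again.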
Notes:
- Start with Discord Webhook (free) as default backend; others optional.