Description
Allow users to configure outgoing webhooks and fine-grained alert rules, so that the platform can notify external systems (Slack, PagerDuty, custom HTTP endpoints, etc.) when specified events occur (e.g. workflow failures, long runtimes, success thresholds).
Motivation
- Operational visibility – Teams need near-real-time awareness of critical events without polling the dashboard.
- Incident response – Fast, automated routing of failures to on-call systems reduces MTTR.
- Ecosystem integration – Webhooks unlock custom tooling and reporting beyond the core product.
Goals
- UI & API for creating, listing, updating, and deleting webhook endpoints.
- Rule-based alert definitions (trigger = event filter + optional condition expression).
- Delivery with retry & exponential back-off, including dead-letter logging.
- Audit trail: who created/changed/deleted each webhook or rule.
- Tenant isolation: alerts fire only for resources the tenant owns.
- Minimum viable built-in events:
  - Workflow failed
  - Workflow succeeded
  - Workflow exceeded duration X (configurable threshold)
Non-Goals
- In-app email/SMS notifications (tracked separately).
- Rich templating of payloads (v1 payload is fixed JSON schema).
Proposed Design
Data Model
WebhookEndpoint
id, name, target_url, secret, headers, is_active, created_by, created_at, …
AlertRule
id, name, event_type, condition, webhook_endpoint_id, is_active, …
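As a rough illustration of the two entities above, the sketch below models them as Python dataclasses. The field names come from the lists above; the field types, defaults, and UUID-based ids are assumptions, and the elided fields (`…`) are deliberately left out.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, Optional
from uuid import uuid4


@dataclass
class WebhookEndpoint:
    name: str
    target_url: str
    secret: str        # assumption: held in plaintext here; encrypted at rest in practice
    created_by: str
    headers: Dict[str, str] = field(default_factory=dict)
    is_active: bool = True
    # Assumption: ids are UUID strings and timestamps are UTC.
    id: str = field(default_factory=lambda: str(uuid4()))
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


@dataclass
class AlertRule:
    name: str
    event_type: str                   # e.g. "WorkflowFailed"
    webhook_endpoint_id: str          # references WebhookEndpoint.id
    condition: Optional[str] = None   # optional condition expression
    is_active: bool = True
    id: str = field(default_factory=lambda: str(uuid4()))


# Hypothetical example values, for illustration only.
endpoint = WebhookEndpoint(
    name="on-call",
    target_url="https://hooks.example.invalid/alerts",
    secret="example-secret",
    created_by="user-1",
)
rule = AlertRule(
    name="notify on failure",
    event_type="WorkflowFailed",
    webhook_endpoint_id=endpoint.id,
)
```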
Control Flow (high-level)
flowchart TD
A["Event emitted - e.g. WorkflowFailed"] --> B["Alert matcher"]
subgraph Alert_Engine["Alert Engine"]
B --> C{"Rules matching event_type?"}
C -->|yes| D["Condition eval (optional)"]
D -->|true| E["Enqueue delivery"]
end
E --> F["Delivery worker (retries, DLQ)"]
F --> G["Webhook target (external)"]
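The matching step in the flow above could be sketched as follows. This is a minimal in-memory illustration, not the proposed implementation: the `Event` and `Rule` classes, the `match_and_enqueue` helper, and the choice to model the condition as a Python callable (rather than a parsed expression string) are all assumptions made for the example.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Deque, List, Optional, Tuple


@dataclass
class Event:
    event_type: str
    payload: dict


@dataclass
class Rule:
    id: str
    event_type: str
    webhook_endpoint_id: str
    # Assumption: the condition is a callable on the payload, not an expression string.
    condition: Optional[Callable[[dict], bool]] = None
    is_active: bool = True


def match_and_enqueue(event: Event, rules: List[Rule],
                      queue: Deque[Tuple[str, str, dict]]) -> int:
    """Enqueue one delivery per active rule matching the event; return the count."""
    enqueued = 0
    for rule in rules:
        # Step C in the diagram: filter by event_type (inactive rules are skipped).
        if not rule.is_active or rule.event_type != event.event_type:
            continue
        # Step D: evaluate the optional condition.
        if rule.condition is not None and not rule.condition(event.payload):
            continue
        # Step E: hand the delivery to the worker queue.
        queue.append((rule.id, rule.webhook_endpoint_id, event.payload))
        enqueued += 1
    return enqueued


rules = [
    Rule("r1", "WorkflowFailed", "ep1"),
    Rule("r2", "WorkflowFailed", "ep2",
         condition=lambda p: p.get("duration_s", 0) > 60),
    Rule("r3", "WorkflowSucceeded", "ep1"),
]
queue: Deque[Tuple[str, str, dict]] = deque()
event = Event("WorkflowFailed", {"workflow_id": "wf-1", "duration_s": 30})
count = match_and_enqueue(event, rules, queue)
```

In this example only `r1` is enqueued: `r2` matches the event type but its condition is false for a 30-second run, and `r3` listens for a different event type.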
Delivery Semantics
- At-least-once delivery with up to N automatic retries (configurable per rule).
- HMAC-SHA256 signature header (`X-Exosphere-Signature`) using stored secret.
- 2xx = success, 4xx/5xx = retry (except 410/404 ⇒ disable rule).
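The signing and status-handling rules above might look roughly like this. The hex encoding of the signature, the function names, and the handling of status codes outside 2xx/4xx/5xx are assumptions; only the HMAC-SHA256 algorithm and the 2xx / retry / 404-410 split come from the list above.

```python
import hashlib
import hmac
import json


def sign_payload(secret: str, body: bytes) -> str:
    """Return the HMAC-SHA256 of the raw body (assumption: hex-encoded)."""
    return hmac.new(secret.encode("utf-8"), body, hashlib.sha256).hexdigest()


def verify_signature(secret: str, body: bytes, signature: str) -> bool:
    """Receiver-side check, using a constant-time comparison."""
    return hmac.compare_digest(sign_payload(secret, body), signature)


def classify_response(status: int) -> str:
    """Map an HTTP status code to the delivery outcome described above."""
    if 200 <= status < 300:
        return "success"
    if status in (404, 410):
        return "disable_rule"
    # 4xx/5xx are retried; treating any other status as retryable is an assumption.
    return "retry"


# Hypothetical payload; the v1 payload schema is not defined in this issue.
body = json.dumps({"event_type": "WorkflowFailed", "workflow_id": "wf-1"},
                  sort_keys=True).encode("utf-8")
signature = sign_payload("example-secret", body)
request_headers = {"X-Exosphere-Signature": signature}
```

Signing the exact bytes that are sent (rather than a re-serialized object) keeps the receiver's verification independent of JSON key ordering or whitespace.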
API Endpoints (REST)
| Method | Path | Description |
|---|---|---|
| POST | /v1/webhooks | Create endpoint |
| GET | /v1/webhooks | List endpoints |
| PATCH | /v1/webhooks/{id} | Update endpoint |
| DELETE | /v1/webhooks/{id} | Delete (soft) |
| CRUD | /v1/alert-rules | Same pattern for alert rules |
Note: OpenAPI spec will be updated in parallel.
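To make the create call concrete, here is a sketch that builds (but does not send) a `POST /v1/webhooks` request with the standard library. The host, the bearer-token authentication, and the exact request body are assumptions; the body simply reuses field names from the data model, and the real schema would be whatever the OpenAPI spec ends up defining.

```python
import json
from urllib.request import Request

BASE_URL = "https://api.example.invalid"  # hypothetical host


def build_create_webhook_request(name: str, target_url: str,
                                 secret: str, token: str) -> Request:
    """Build a POST /v1/webhooks request; the body shape is an assumption."""
    body = json.dumps({
        "name": name,
        "target_url": target_url,
        "secret": secret,
        "is_active": True,
    }).encode("utf-8")
    return Request(
        url=f"{BASE_URL}/v1/webhooks",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Assumption: bearer-token auth; the issue does not specify a scheme.
            "Authorization": f"Bearer {token}",
        },
    )


req = build_create_webhook_request(
    name="on-call",
    target_url="https://hooks.example.invalid/alerts",
    secret="example-secret",
    token="example-token",
)
```

Passing `req` to `urllib.request.urlopen` would send it; that step is omitted because the host is a placeholder.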
Acceptance Criteria
- Users can add a webhook via Dashboard → Settings → Alerts.
- Users can define a rule “on workflow failure send POST to X”.
- Failing a test workflow triggers exactly one POST with correct JSON body (assuming the target acknowledges the first attempt with a 2xx, since delivery is at-least-once).
- Dashboard shows delivery status/history per rule.
- Secrets are encrypted at rest.
- Unit tests (≥90 % coverage for alert engine).
- Integration tests covering happy path & retry logic.
- Documentation page `docs/exosphere/alerts.md` explains setup & security.
Engineering Tasks
- DB migrations for `webhook_endpoints` & `alert_rules` tables.
- Backend: alert matcher service & delivery worker (state-manager package).
- Dashboard: UI components for endpoints & rules.
- Terraform/Helm: env vars & queues for delivery workers.
- Docs & example webhooks in `integration-tests/`.
- Rollout plan: feature flag off → beta tenants → GA.
Risks & Mitigations
| Risk | Mitigation |
|---|---|
| Flood of events overwhelms targets | Global & per-rule rate limits |
| Sensitive data leaked in payloads | Only include IDs & summary fields by default |
| Retry storms | Exponential back-off with jitter, max retry cap |
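The retry-storm mitigation could be sketched as below: exponential back-off with full jitter, a retry cap, and a dead-letter record once the cap is exhausted. The function names, the default base and cap values, and the in-memory dead-letter list are assumptions for illustration; the 404/410 "disable rule" branch and rate limiting are left out to keep the example short.

```python
import random
from typing import Callable, List, Optional


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0,
                  rng: Optional[random.Random] = None) -> float:
    """Full-jitter delay for a 0-based attempt: uniform in [0, min(cap, base * 2**attempt)]."""
    rng = rng or random.Random()
    return rng.uniform(0.0, min(cap, base * (2 ** attempt)))


def deliver_with_retries(send: Callable[[], int], max_retries: int,
                         sleep: Callable[[float], None],
                         dead_letters: List[dict]) -> bool:
    """Call send() until it returns a 2xx status or the retry cap is reached."""
    status = -1
    for attempt in range(max_retries + 1):
        status = send()
        if 200 <= status < 300:
            return True
        if attempt < max_retries:
            sleep(backoff_delay(attempt))
    # Retries exhausted: record the failure instead of retrying forever.
    dead_letters.append({"last_status": status, "attempts": max_retries + 1})
    return False


# Simulated target: fails twice, then succeeds. Delays are recorded, not slept.
responses = iter([500, 503, 200])
delays: List[float] = []
dlq: List[dict] = []
ok = deliver_with_retries(lambda: next(responses), max_retries=3,
                          sleep=delays.append, dead_letters=dlq)

# Simulated target that never recovers: ends up in the dead-letter list.
failed_dlq: List[dict] = []
failed = deliver_with_retries(lambda: 500, max_retries=2,
                              sleep=lambda _: None, dead_letters=failed_dlq)
```

Injecting `sleep` keeps the worker testable without real waiting; a production worker would more likely reschedule the job on the queue than block a thread.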
Additional Notes
- Aligns with feature-plan docs #634 (feat: Pass traces of failures to dashboard) and #636 (feat: Add UI based workflow builder) on trace visibility and the workflow builder; alerts should reference the same event bus.
- Future iterations may add templated payloads and out-of-the-box Slack formatting.