feat: Add support for webhooks #633

@NiveditJain

Description

Allow users to configure outgoing webhooks and fine-grained alert rules, so that the platform can notify external systems (Slack, PagerDuty, custom HTTP endpoints, etc.) when specified events occur (e.g. workflow failures, long runtimes, success thresholds).

Motivation

  1. Operational visibility – Teams need near-real-time awareness of critical events without polling the dashboard.
  2. Incident response – Fast, automated routing of failures to on-call systems reduces MTTR.
  3. Ecosystem integration – Webhooks unlock custom tooling and reporting beyond the core product.

Goals

  • UI & API for creating, listing, updating, and deleting webhook endpoints.
  • Rule-based alert definitions (trigger = event filter + optional condition expression).
  • Delivery with retry & exponential back-off, including dead-letter logging.
  • Audit trail: who created/changed/deleted each webhook or rule.
  • Tenant isolation: alerts fire only for resources the tenant owns.
  • Minimum viable built-in events:
    • Workflow failed
    • Workflow succeeded
    • Workflow exceeded duration X (configurable threshold)

Non-Goals

  • In-app email/SMS notifications (tracked separately).
  • Rich templating of payloads (the v1 payload uses a fixed JSON schema).

Proposed Design

Data Model

WebhookEndpoint

  • id, name, target_url, secret, headers, is_active, created_by, created_at, …

AlertRule

  • id, name, event_type, condition, webhook_endpoint_id, is_active, …
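A minimal sketch of the two records as Python dataclasses, covering only the fields named above. The types, defaults, and UUID-based ids are assumptions for illustration; the elided fields ("…") are deliberately left out, and the real schema would live in the DB migrations.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional
import uuid


def _new_id() -> str:
    # Assumption: ids are UUID strings; the issue does not specify the id type.
    return str(uuid.uuid4())


@dataclass
class WebhookEndpoint:
    name: str
    target_url: str
    secret: str  # would be encrypted at rest in a real system
    created_by: str
    headers: dict[str, str] = field(default_factory=dict)
    is_active: bool = True
    id: str = field(default_factory=_new_id)
    created_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


@dataclass
class AlertRule:
    name: str
    event_type: str
    webhook_endpoint_id: str
    condition: Optional[str] = None  # optional condition expression
    is_active: bool = True
    id: str = field(default_factory=_new_id)


# Example: one endpoint with one rule pointing at it.
endpoint = WebhookEndpoint(
    name="on-call",
    target_url="https://hooks.example.com/alerts",  # hypothetical URL
    secret="s3cr3t",
    created_by="user-1",
)
rule = AlertRule(
    name="notify on failure",
    event_type="WorkflowFailed",
    webhook_endpoint_id=endpoint.id,
)
```

Each AlertRule references exactly one WebhookEndpoint through webhook_endpoint_id, which matches the field list above.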

Control Flow (high-level)

flowchart TD
    A["Event emitted - e.g. WorkflowFailed"] --> B["Alert matcher"]
    subgraph Alert_Engine["Alert Engine"]
        B --> C{"Rules matching event_type?"}
        C -->|yes| D["Condition eval (optional)"]
        D -->|true| E["Enqueue delivery"]
    end
    E --> F["Delivery worker (retries, DLQ)"]
    F --> G["Webhook target (external)"]
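The matcher step in the flow above could look roughly like the following sketch. It assumes events are plain dicts with a "type" key, that a rule's condition is modelled as a Python callable (the issue only says "condition expression"), and that the delivery queue is an in-memory deque; all of these names are hypothetical stand-ins.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Rule:
    id: str
    event_type: str
    webhook_endpoint_id: str
    # Assumption: the condition expression is represented as a callable.
    condition: Optional[Callable[[dict], bool]] = None
    is_active: bool = True


def match_and_enqueue(event: dict, rules: list, queue: deque) -> int:
    """Enqueue one delivery per active rule that matches the event.

    Returns the number of deliveries enqueued.
    """
    enqueued = 0
    for rule in rules:
        # Step 1: filter on event_type (and skip disabled rules).
        if not rule.is_active or rule.event_type != event["type"]:
            continue
        # Step 2: evaluate the optional condition.
        if rule.condition is not None and not rule.condition(event):
            continue
        # Step 3: hand off to the delivery worker via the queue.
        queue.append((rule.webhook_endpoint_id, event))
        enqueued += 1
    return enqueued


rules = [
    Rule("r1", "WorkflowFailed", "ep1"),
    Rule("r2", "WorkflowFailed", "ep2",
         condition=lambda e: e.get("duration_s", 0) > 600),
    Rule("r3", "WorkflowSucceeded", "ep3"),
    Rule("r4", "WorkflowFailed", "ep4", is_active=False),
]
queue = deque()
event = {"type": "WorkflowFailed", "duration_s": 30}
count = match_and_enqueue(event, rules, queue)
```

Here only r1 produces a delivery: r2 fails its condition, r3 has a different event type, and r4 is inactive.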

Delivery Semantics

  1. At-least-once delivery with up to N automatic retries (configurable per rule).
  2. HMAC-SHA256 signature header (X-Exosphere-Signature) using stored secret.
  3. A 2xx response counts as success; 4xx/5xx responses trigger a retry, except 404 and 410, which disable the rule.
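The signature and status-code rules above can be sketched with the Python standard library. The hex encoding of the digest, signing the raw request body, and treating statuses outside 2xx/4xx/5xx as retryable are assumptions; the issue does not specify them.

```python
import hashlib
import hmac

SIGNATURE_HEADER = "X-Exosphere-Signature"


def sign_payload(secret: str, body: bytes) -> str:
    """Return the HMAC-SHA256 of the body as a hex string (encoding assumed)."""
    return hmac.new(secret.encode("utf-8"), body, hashlib.sha256).hexdigest()


def verify_signature(secret: str, body: bytes, received: str) -> bool:
    """Receiver-side check; compare_digest avoids timing side channels."""
    expected = sign_payload(secret, body)
    return hmac.compare_digest(expected, received)


def classify_response(status: int) -> str:
    """Map an HTTP status code to a delivery outcome."""
    if 200 <= status < 300:
        return "success"
    if status in (404, 410):
        return "disable_rule"
    # 4xx/5xx retry per the list above; other codes retry by assumption.
    return "retry"


body = b'{"event_type": "WorkflowFailed"}'
signature = sign_payload("s3cr3t", body)
request_headers = {SIGNATURE_HEADER: signature}
```

A receiver would recompute the HMAC over the raw bytes it received and reject the request if verify_signature returns False.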

API Endpoints (REST)

| Method | Path | Description |
| --- | --- | --- |
| POST | /v1/webhooks | Create endpoint |
| GET | /v1/webhooks | List endpoints |
| PATCH | /v1/webhooks/{id} | Update endpoint |
| DELETE | /v1/webhooks/{id} | Delete (soft) |
| CRUD | /v1/alert-rules | Same pattern for alert rules |

Note: OpenAPI spec will be updated in parallel.
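As an illustration of the create call, the sketch below builds (but does not send) a POST /v1/webhooks request with urllib. The base URL, the bearer-token auth scheme, and the request body shape are assumptions, since the OpenAPI spec is not yet available; only the path and method come from the table above.

```python
import json
import urllib.request

BASE_URL = "https://api.example.com"  # hypothetical host


def build_create_webhook_request(
    name: str, target_url: str, secret: str, token: str
) -> urllib.request.Request:
    """Build a POST /v1/webhooks request; the body fields are assumed."""
    body = json.dumps(
        {"name": name, "target_url": target_url, "secret": secret}
    ).encode("utf-8")
    return urllib.request.Request(
        url=f"{BASE_URL}/v1/webhooks",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Assumption: bearer-token auth; the issue does not define auth.
            "Authorization": f"Bearer {token}",
        },
    )


req = build_create_webhook_request(
    name="on-call",
    target_url="https://hooks.example.com/alerts",
    secret="s3cr3t",
    token="test-token",
)
# To actually send it: urllib.request.urlopen(req)
```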

Acceptance Criteria

  • Users can add a webhook via Dashboard → Settings → Alerts.
  • Users can define a rule “on workflow failure send POST to X”.
  • Failing a test workflow triggers exactly one POST with correct JSON body.
  • Dashboard shows delivery status/history per rule.
  • Secrets are encrypted at rest.
  • Unit tests with ≥ 90% coverage for the alert engine.
  • Integration tests covering happy path & retry logic.
  • Documentation page docs/exosphere/alerts.md explains setup & security.

Engineering Tasks

  1. DB migrations for webhook_endpoints & alert_rules tables.
  2. Backend: alert matcher service & delivery worker (state-manager package).
  3. Dashboard: UI components for endpoints & rules.
  4. Terraform/Helm: env vars & queues for delivery workers.
  5. Docs & example webhooks in integration-tests/.
  6. Rollout plan: feature flag off → beta tenants → GA.

Risks & Mitigations

| Risk | Mitigation |
| --- | --- |
| Flood of events overwhelms targets | Global & per-rule rate limits |
| Sensitive data leaked in payloads | Only include IDs & summary fields by default |
| Retry storms | Exponential back-off with jitter, max retry cap |
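One way to realise "exponential back-off with jitter, max retry cap" is the "full jitter" variant sketched below, where each delay is drawn uniformly between zero and an exponentially growing ceiling. The base delay, the 60-second ceiling, and the function name are illustrative assumptions, not values from this proposal.

```python
import random
from typing import Optional


def backoff_delays(
    max_retries: int,
    base: float = 1.0,
    cap: float = 60.0,
    rng: Optional[random.Random] = None,
) -> list:
    """Return one sleep duration (seconds) per retry attempt.

    Attempt i waits a random time in [0, min(cap, base * 2**i)], so
    retries from many failing deliveries spread out instead of
    hitting the target at the same instant.
    """
    if rng is None:
        rng = random.Random()
    delays = []
    for attempt in range(max_retries):
        ceiling = min(cap, base * (2 ** attempt))
        delays.append(rng.uniform(0.0, ceiling))
    return delays


# A seeded generator keeps the example reproducible.
delays = backoff_delays(max_retries=8, rng=random.Random(42))
```

After max_retries attempts the delivery worker would stop retrying and record the event in the dead-letter log.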
