feat(v0.12): Diagnostic Event Catalog — surface structured OpenClaw events as UI components #262

Open
vivekchand wants to merge 1 commit into main from fix/gh-clawmetry-36-diagnostic-events

@vivekchand
Owner

Closes #36

What

Implements all 5 missing diagnostic event handlers from the catalog spec, surfacing them as visible UI components on the Flow tab.

How

Backend (dashboard.py)

  • metrics_store additions: 4 new buckets (sessions, gw_health, retries, dequeues)
  • OTLP metric handlers in _process_otlp_metrics:
    • openclaw.session.state → tracks session state changes
    • openclaw.session.stuck → flags stuck sessions (clears with timestamp)
    • openclaw.diagnostic.heartbeat → gateway health pulse (healthy if < 120s ago)
    • openclaw.run.attempt → per-session retry counter
  • New endpoint: GET /api/diagnostic-events — returns gateway health, stuck sessions, session states, retry counts, and queue wait time in one response
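
As a rough illustration of the handler routing and the 120-second heartbeat rule described above, here is a minimal sketch. It is not the code from `dashboard.py`: the bucket layout, the `handle_metric` and `gateway_healthy` helpers, and the `session_id` and `state` attribute names are all assumptions made for the example.

```python
import time

# From the PR description: the gateway is healthy if the last heartbeat is < 120s old.
HEARTBEAT_MAX_AGE_S = 120

# Hypothetical layout for the four new buckets; the real metrics_store may differ.
metrics_store = {
    "sessions": {},
    "gw_health": {"last_heartbeat": None},
    "retries": {},
    "dequeues": [],
}


def handle_metric(name, attrs, now=None):
    """Route one metric data point into its bucket (hypothetical helper)."""
    now = time.time() if now is None else now
    session = attrs.get("session_id", "unknown")  # attribute name is an assumption
    if name == "openclaw.session.state":
        entry = metrics_store["sessions"].setdefault(session, {})
        entry.update({"state": attrs.get("state"), "ts": now})
    elif name == "openclaw.session.stuck":
        # Remember when the session was flagged as stuck.
        metrics_store["sessions"].setdefault(session, {})["stuck_ts"] = now
    elif name == "openclaw.diagnostic.heartbeat":
        metrics_store["gw_health"]["last_heartbeat"] = now
    elif name == "openclaw.run.attempt":
        # Per-session retry counter.
        metrics_store["retries"][session] = metrics_store["retries"].get(session, 0) + 1


def gateway_healthy(now=None):
    """True only if a heartbeat was seen less than HEARTBEAT_MAX_AGE_S seconds ago."""
    now = time.time() if now is None else now
    last = metrics_store["gw_health"]["last_heartbeat"]
    return last is not None and (now - last) < HEARTBEAT_MAX_AGE_S
```

Passing `now` explicitly keeps the health check deterministic in tests; in the dashboard the wall clock would presumably be used instead.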

Frontend (JS log tail + UI)

  • lane dequeue detection: records enqueue timestamp at lane enqueue, computes wait-time delta at lane dequeue, displays in flow feed
  • run.attempt detection: parses retry number from log line, bumps retry badge
  • session.stuck detection: shows dismissable alert banner on Flow tab
  • Gateway Health stat: new flow-stats card, polls /api/diagnostic-events every 15 events, shows ✅ Healthy / ⚠️ Degraded based on heartbeat age
  • Queue Wait stat: shows average enqueue→dequeue wait (hidden when no data)
  • Retry count badge: shows retry count (hidden when zero)
  • Stuck session banner: auto-dismisses after 30s, shows timestamp
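
The enqueue→dequeue bookkeeping behind the Queue Wait stat can be sketched as follows. The PR implements this in the frontend JavaScript; the logic is shown here in Python only to keep the examples in one language. The `QueueWaitTracker` class and its per-lane keying are assumptions for illustration, not the PR's actual structure.

```python
class QueueWaitTracker:
    """Pairs lane enqueue/dequeue timestamps and averages the wait (illustrative)."""

    def __init__(self):
        self._enqueue_ts = {}  # lane -> timestamp (ms) of the pending enqueue
        self._waits_ms = []    # completed enqueue→dequeue deltas

    def on_enqueue(self, lane, ts_ms):
        # Record when the lane was enqueued.
        self._enqueue_ts[lane] = ts_ms

    def on_dequeue(self, lane, ts_ms):
        # Compute the wait-time delta; ignore a dequeue with no matching enqueue.
        start = self._enqueue_ts.pop(lane, None)
        if start is None:
            return None
        wait = ts_ms - start
        self._waits_ms.append(wait)
        return wait

    def average_ms(self):
        # None means "no data", which maps to hiding the stat in the UI.
        if not self._waits_ms:
            return None
        return sum(self._waits_ms) / len(self._waits_ms)
```

Returning `None` rather than `0` when nothing has been measured keeps "no data" distinguishable from a zero wait, which matches the hide-when-empty behaviour listed above.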

Acceptance criteria (from issue #36)

  • openclaw.queue.lane.depth parsed and shown in queue depth panel (pre-existing)
  • openclaw.session.stuck surfaces stuck-session alert
  • openclaw.diagnostic.heartbeat drives gateway health indicator in header
  • Queue wait time (enqueue→dequeue) computed from log tail
  • Retry count visible on session cards

…n.stuck, diagnostic.heartbeat, run.attempt, lane dequeue as UI components (closes #36)

- Add session.state and session.stuck OTLP metric handlers to metrics_store
- Add openclaw.diagnostic.heartbeat handler → gateway health pulse tracking
- Add openclaw.run.attempt handler → per-session retry count tracking
- Add /api/diagnostic-events endpoint exposing all 5 event types
- Flow tab: Gateway Health stat (green/amber/unknown via OTLP heartbeat)
- Flow tab: Queue Wait stat showing enqueue→dequeue delta from log tail
- Flow tab: Retry count badge when run.attempt events detected
- Flow tab: Stuck session alert banner (dismissable, auto-hides in 30s)
- JS log handler: lane dequeue detection with wait-time computation
- JS log handler: run.attempt retry badge from log tail
- JS log handler: session.stuck alert from log tail
- JS: polls /api/diagnostic-events every 15 events to update indicators
Owner Author

@vivekchand vivekchand left a comment

Good implementation overall -- all 5 event types are wired up and the new /api/diagnostic-events endpoint is clean.

One gap worth a follow-up: the dequeues bucket in metrics_store is never written to by the OTLP path. There is no handler for openclaw.lane.dequeue in _process_otlp_metrics, so queue_avg_wait_ms in the API response will always be 0 when data arrives via OTLP. Queue wait currently only works from the JS frontend (window._laneEnqueueTs set on log-tail lane enqueue lines). If the gateway emits openclaw.lane.dequeue as a metric, a matching elif name == 'openclaw.lane.dequeue' block in _process_otlp_metrics (storing enqueue_ts from an attribute) would close the loop.
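
A minimal sketch of what that follow-up could look like, written as a standalone helper rather than the actual `elif` branch. It assumes the gateway emits `openclaw.lane.dequeue` with an `enqueue_ts` attribute in milliseconds. The attribute name, its unit, and the `record_dequeue` and `queue_avg_wait_ms` helpers are all hypothetical.

```python
import time

# Stand-in for the real store; only the dequeues bucket matters here.
metrics_store = {"dequeues": []}


def record_dequeue(attrs, now_ms=None):
    """What an `openclaw.lane.dequeue` branch might do (hypothetical)."""
    now_ms = time.time() * 1000 if now_ms is None else now_ms
    enqueue_ts = attrs.get("enqueue_ts")  # attribute name and ms unit are assumptions
    if enqueue_ts is None:
        return  # nothing to measure without an enqueue timestamp
    wait_ms = max(0.0, now_ms - float(enqueue_ts))  # clamp against clock skew
    metrics_store["dequeues"].append({"wait_ms": wait_ms, "ts": now_ms})


def queue_avg_wait_ms():
    """Average wait across recorded dequeues; 0 when the bucket is empty."""
    waits = [d["wait_ms"] for d in metrics_store["dequeues"]]
    return sum(waits) / len(waits) if waits else 0
```

With a writer like this in place, `queue_avg_wait_ms` would stop being stuck at 0 for OTLP-sourced data.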

Everything else looks good -- the stuck-session banner, gateway health card, and retry badge all follow the existing patterns correctly.

Development

Successfully merging this pull request may close these issues.

feat(v0.11): Diagnostic Event Catalog — surface structured OpenClaw events as UI components