feat: Pass traces of failures to dashboard #634

@NiveditJain

When a workflow or task fails today, the dashboard surfaces only the error message that bubbled up from the runtime. Full exception traces are streamed to the backend logs but are not accessible from the UI. This forces developers to leave the dashboard and dig through log aggregators to diagnose problems.

Goal

Expose the complete trace associated with a failed run directly in the dashboard so that developers can download it with a single click.

Current Behaviour

  1. The backend returns a JSON payload for failed runs that contains only the error_message string.
  2. The dashboard lists each run with a red status chip and the error message (truncated after ~120 chars).
  3. There is no UI affordance to retrieve the trace.

Proposed Solution

Backend

  • Augment failure payloads with a field that resolves to a stored trace (e.g. S3 object, database blob).
  • Expose GET /api/runs/{run_id}/trace that returns
    • Content-Disposition: attachment; filename="{run_id}_trace.txt"
    • Raw text stack-trace in the body.
  • Keep the response under 10 MB, either by gzip-compressing it or by truncating middle frames.
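
A minimal sketch of how the backend bullets above could fit together, using only the Python standard library. The function names (truncate_middle_frames, build_trace_response) and the accept_gzip flag are hypothetical and not part of the existing codebase. The sketch assumes run_id is already a safe token (e.g. a UUID) and that max_bytes is comfortably larger than the one-line "omitted" marker.

```python
import gzip

MAX_TRACE_BYTES = 10 * 1024 * 1024  # the "<10 MB" budget from the proposal


def truncate_middle_frames(trace: str, max_bytes: int = MAX_TRACE_BYTES) -> str:
    """Keep the first and last lines of a trace and drop the middle.

    Lines are taken alternately from the top and the bottom until the byte
    budget is used up, so both the entry point and the raising frame survive.
    """
    if len(trace.encode("utf-8")) <= max_bytes:
        return trace

    lines = trace.splitlines()
    budget = max_bytes - 64  # reserve room for the marker line (assumption)
    head, tail = [], []
    lo, hi = 0, len(lines) - 1
    take_head = True
    while lo <= hi:
        line = lines[lo] if take_head else lines[hi]
        cost = len(line.encode("utf-8")) + 1  # +1 for the joining newline
        if cost > budget:
            break
        budget -= cost
        if take_head:
            head.append(line)
            lo += 1
        else:
            tail.append(line)
            hi -= 1
        take_head = not take_head

    marker = f"... [{hi - lo + 1} lines omitted] ..."
    return "\n".join(head + [marker] + tail[::-1])


def build_trace_response(run_id: str, trace: str, accept_gzip: bool = False):
    """Return (headers, body) for the proposed GET /api/runs/{run_id}/trace."""
    body = truncate_middle_frames(trace).encode("utf-8")
    headers = {
        "Content-Type": "text/plain; charset=utf-8",
        # Filename format taken from the proposal; run_id is assumed safe.
        "Content-Disposition": f'attachment; filename="{run_id}_trace.txt"',
    }
    if accept_gzip:
        body = gzip.compress(body)
        headers["Content-Encoding"] = "gzip"
    return headers, body
```

Truncation is applied before compression here so that the 10 MB limit refers to the text a developer ends up opening. Whether the limit should instead apply to the compressed bytes on the wire is a design choice this sketch does not settle.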

Dashboard UI

  1. Replace the static error-message cell with an interactive chip.
  2. On hover / click, open a popover showing the full error message and a "View Trace" button.
  3. Clicking the button downloads {run_id}_trace.txt (the filename set by the endpoint's Content-Disposition header) using the native browser download flow.
  4. No in-browser rendering required; developers can open in their IDE of choice.

Acceptance Criteria

  • Failed run rows display a clickable element that opens the popover.
  • Popover contains the full error message without truncation.
  • "View Trace" button triggers a file download of the trace.
  • If a trace is unavailable, the button is disabled and a tooltip explains why.
  • API endpoint is authenticated and honours RBAC (same as runs endpoint).
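
The last two criteria could translate into status codes roughly as follows. This is a sketch only: the "runs:read" permission name, the get_trace function, and the in-memory traces mapping are placeholders for whatever the real runs endpoint and trace store use.

```python
from typing import Dict, Optional, Set, Tuple


def get_trace(
    user_permissions: Optional[Set[str]],
    run_id: str,
    traces: Dict[str, str],
) -> Tuple[int, str]:
    """Return (status_code, body) for a trace request.

    user_permissions is None for an unauthenticated caller. "runs:read" is a
    hypothetical permission standing in for the check the runs endpoint uses.
    """
    if user_permissions is None:
        return 401, "authentication required"
    if "runs:read" not in user_permissions:
        return 403, "forbidden"
    trace = traces.get(run_id)
    if trace is None:
        # The dashboard can map this case to the disabled button and tooltip.
        return 404, "trace unavailable"
    return 200, trace
```

Checking authorization before looking up the trace means a caller without access cannot tell whether a trace exists for a given run.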

Open Questions

  1. Storage location – Persist traces in S3 vs database? Expected retention?
  2. Large traces – Impose max file size or stream?
  3. Security – Do we need additional masking (e.g. secrets) before exposing traces?
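
To make question 3 concrete, here is one possible shape for a masking pass, assuming secrets appear as key=value or key: value pairs or as Bearer tokens. The patterns and the mask_secrets name are illustrative assumptions. A regex pass like this is best-effort and will miss secrets in other formats.

```python
import re

# Illustrative patterns only; a real deployment would need its own list.
_SECRET_PATTERNS = [
    # Sensitive-looking key followed by "=" or ":" and a value on the same line.
    re.compile(
        r"(?i)(password|passwd|secret|token|api[_-]?key)([ \t]*[=:][ \t]*)(\S+)"
    ),
    # HTTP bearer credentials, e.g. in a logged Authorization header.
    re.compile(r"(?i)(bearer)([ \t]+)([A-Za-z0-9._~+/=-]+)"),
]


def mask_secrets(trace: str) -> str:
    """Replace the value part of each matched secret with '***'."""
    for pattern in _SECRET_PATTERNS:
        trace = pattern.sub(lambda m: f"{m.group(1)}{m.group(2)}***", trace)
    return trace
```

If masking is adopted, running it when the trace is persisted rather than at download time would keep unmasked secrets out of storage as well. That choice interacts with question 1.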

Definition of Done

  • Backend PR merged exposing the new endpoint and trace persistence logic.
  • Dashboard PR merged implementing the UI changes.
  • E2E test: trigger a synthetic failure and assert that the downloaded trace matches backend logs.
