Skip to content

[Bug] Failed resume attempts permanently poison orchestrator sessions #229

@srausser

Description

@srausser

Summary

If a resume attempt fails due to a tooling/runtime integration error, Ouroboros records the orchestrator session as failed and then refuses to resume it ever again. A transient resume bug therefore permanently burns the session.

Impact

This turns a recoverable runtime failure into irreversible session loss. Users hit one bad resume attempt and then cannot continue the run even after the underlying bug is fixed.

Environment

  • Ouroboros version: 0.26.0b7+sam.rawqa1
  • OS: macOS 15.7.3
  • Python version: 3.14.1
  • Installation method: local wheel/dev install
  • Model/provider: Codex CLI
  • Codex CLI version: 0.116.0
  • Relevant flags/config: ouroboros run workflow <seed> --debug --resume <session_id>

Steps to reproduce

  1. Start a workflow run and obtain a session id.
  2. Trigger a resume failure caused by an external/runtime integration issue rather than a true completed run. One example is the bad Codex resume command ordering bug.
  3. Retry ouroboros run workflow <seed_file> --debug --resume <session_id> after the underlying runtime problem has been fixed.

Expected behavior

A session that failed while trying to resume because of a recoverable runtime/tooling problem should either:

  • remain resumable, or
  • provide an explicit recovery path for retrying the resume.

At minimum, a transient resume integration bug should not permanently force a fresh run.

Actual behavior

The failed resume attempt writes a terminal failed state, and later resume attempts are rejected with:

Orchestrator error: Session is in terminal state failed, cannot resume

Logs or error output

2026-03-26T16:21:18.279754Z [error    ] orchestrator.session.failed    error="error: unexpected argument '-C' found ..."

2026-03-26T16:24:04.712903Z [info     ] orchestrator.session.reconstructed messages_processed=1138 session_id=orch_fda915674180 status=failed
Orchestrator error: Session is in terminal state failed, cannot resume (details: {'session_id': 'orch_fda915674180', 'status': 'failed'})

Minimal reproduction

  1. Run any workflow.
  2. Cause resume_session() to fail before useful forward progress because of a transient runtime integration error.
  3. Retry resume.

Acceptance criteria for fix

  • A failed resume attempt caused by recoverable runtime/tooling issues does not permanently make the session unresumable.
  • Users have a supported retry path after fixing the underlying resume/runtime problem.
  • Terminal-state enforcement distinguishes true terminal completion/failure from retryable resume/transport/runtime failures.
  • Tests cover the retry behavior for a failed resume path.

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugReproducible defect or broken behavior

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions