Jobs with successful retries still shown as failed #62

@yfarjoun

Description

Problem

When a Snakemake rule uses the retries directive and fails on an early attempt but succeeds on a later retry, snakesee still reports the job as failed. The failed_jobs count is never cleared when a retry succeeds.

This means after a successful retry, the workflow state can show completed_jobs=1 AND failed_jobs=1 simultaneously for the same job, which is contradictory.
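The contradictory state can be demonstrated with a few lines of Python. This is a minimal sketch of the naive bookkeeping, not snakesee's actual internals; the parsing logic and variable names here are illustrative:

```python
# Snakemake log fragment for a job that failed once, then succeeded on retry.
log_lines = [
    "Error in rule align:",
    "    jobid: 0",
    "Finished job 0.",
]

failed_jobs = set()      # (rule, jobid) pairs collected from error blocks
completed_jobs = set()   # jobids collected from "Finished job N." lines
pending_rule = None

for line in log_lines:
    if line.startswith("Error in rule"):
        pending_rule = line.split()[-1].rstrip(":")
    elif line.strip().startswith("jobid:"):
        failed_jobs.add((pending_rule, line.split()[-1]))
    elif line.startswith("Finished job"):
        completed_jobs.add(line.split()[-1].rstrip("."))

# The same job ends up counted as both failed and completed:
assert ("align", "0") in failed_jobs
assert "0" in completed_jobs
```

Because nothing removes the `("align", "0")` entry when `Finished job 0.` is seen, both counters end up at 1 for a single job.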

Root cause

parse_failed_jobs_from_log() collects every Error in rule X: block it encounters. It deduplicates by (rule, jobid), but it never removes a failure when the same job later emits a Finished job X line. Similarly, _reconcile_job_lists() removes failed jobs from the running list but does not clear them from the failed list on completion.
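If option 1 below (clear on success) were chosen, the reconciliation step could be as small as a set comprehension. This is a hedged sketch of the idea only; the function name and data shapes are assumptions, not snakesee's real API:

```python
def clear_recovered_failures(failed, finished_jobids):
    """Drop (rule, jobid) failures whose jobid later finished.

    `failed` is a set of (rule, jobid) tuples as collected from
    "Error in rule X:" blocks; `finished_jobids` is the set of jobids
    seen in "Finished job N." lines. (Illustrative names/shapes.)
    """
    return {(rule, jid) for rule, jid in failed if jid not in finished_jobids}


# A job that failed but eventually finished is no longer reported as failed:
remaining = clear_recovered_failures({("align", "0"), ("sort", "2")}, {"0"})
assert remaining == {("sort", "2")}
```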

Reproduction

Unit tests in TestRetryBehavior in tests/test_parser.py (branch yf_dynamic-table-rows) document the current behavior.

Design question: what should the expected behavior be?

A few options:

  1. Clear on success — If a job eventually finishes, remove it from failed_jobs entirely. Simple, but loses visibility that it had trouble.

  2. New state (e.g., "errored" / "retried") — A job that failed but was retried gets a distinct state. If the retry succeeds, show it as "completed (with errors)" or similar. If retries are exhausted, it becomes "failed". Preserves history but adds model complexity.

  3. Track attempt count — Show retried-then-succeeded jobs as normal completions but annotate them (e.g., align ✓ (2 attempts)). Reserve "failed" for jobs that never succeeded.

Input from maintainers welcome on which approach fits best.
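For what it's worth, option 3 seems to need very little model change. A rough sketch of what per-job attempt tracking could look like (the class, fields, and label format are all hypothetical, including the "align ✓ (2 attempts)" rendering):

```python
from dataclasses import dataclass


@dataclass
class JobRecord:
    """Hypothetical per-job record tracking retry attempts."""
    rule: str
    jobid: str
    attempts: int = 1
    succeeded: bool = False

    def label(self) -> str:
        # Reserve "failed" styling for jobs that never succeeded;
        # annotate successes that needed more than one attempt.
        if self.succeeded and self.attempts > 1:
            return f"{self.rule} \u2713 ({self.attempts} attempts)"
        if self.succeeded:
            return f"{self.rule} \u2713"
        return f"{self.rule} \u2717"


assert JobRecord("align", "0", attempts=2, succeeded=True).label() == "align \u2713 (2 attempts)"
assert JobRecord("align", "0", succeeded=True).label() == "align \u2713"
assert JobRecord("align", "0", attempts=3).label() == "align \u2717"
```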
