Problem
When a Snakemake rule uses the retries directive and fails on an early attempt but succeeds on a later retry, snakesee still reports the job as failed. The failed_jobs count is never cleared when a retry succeeds.
This means after a successful retry, the workflow state can show completed_jobs=1 AND failed_jobs=1 simultaneously for the same job, which is contradictory.
Root cause
parse_failed_jobs_from_log() collects every "Error in rule X:" block it encounters. It deduplicates by (rule, jobid), but never removes a failure when the same job later produces a "Finished job X" line. Similarly, _reconcile_job_lists() removes failed jobs from the running list but does not clear them from the failed list once they complete.
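To make the missing step concrete, here is a minimal sketch of a parser that drops a recorded failure once the same job later finishes. It is not snakesee's actual code: the function name failed_jobs_after_retries, the regexes, and the simplified log layout (an "Error in rule X:" line followed by an indented "jobid: N" line, and "Finished job N." on success) are all assumptions for illustration.

```python
import re

# Assumed, simplified log patterns; real Snakemake logs may differ by version.
ERROR_RE = re.compile(r"^Error in rule (?P<rule>\w+):")
JOBID_RE = re.compile(r"^\s*jobid:\s*(?P<jobid>\d+)")
FINISHED_RE = re.compile(r"^Finished job (?P<jobid>\d+)\.")


def failed_jobs_after_retries(log_lines):
    """Return {jobid: rule} for jobs whose most recent outcome is a failure."""
    failed = {}
    pending_rule = None  # rule name of an error block still waiting for its jobid
    for line in log_lines:
        match = ERROR_RE.match(line)
        if match:
            pending_rule = match.group("rule")
            continue
        match = JOBID_RE.match(line)
        if match and pending_rule is not None:
            # Keying by jobid deduplicates repeated failures of the same job.
            failed[int(match.group("jobid"))] = pending_rule
            pending_rule = None
            continue
        match = FINISHED_RE.match(line)
        if match:
            # The step the current parser lacks: a later success clears the failure.
            failed.pop(int(match.group("jobid")), None)
    return failed


log = [
    "Error in rule align:",
    "    jobid: 3",
    "    output: out.bam",
    "Finished job 3.",
    "Error in rule sort:",
    "    jobid: 4",
]
remaining = failed_jobs_after_retries(log)
```

In this example job 3 fails once and then finishes, so only job 4 remains in the result. Whether the failure should be dropped outright or kept in some other form is the design question below.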
Reproduction
Unit tests in TestRetryBehavior in tests/test_parser.py (branch yf_dynamic-table-rows) document the current behavior.
Design question: what should the expected behavior be?
A few options:
- Clear on success: If a job eventually finishes, remove it from failed_jobs entirely. Simple, but loses visibility that it had trouble.
- New state (e.g., "errored" / "retried"): A job that failed but was retried gets a distinct state. If the retry succeeds, show it as "completed (with errors)" or similar. If retries are exhausted, it becomes "failed". Preserves history but adds model complexity.
- Track attempt count: Show retried-then-succeeded jobs as normal completions but annotate them (e.g., align ✓ (2 attempts)). Reserve "failed" for jobs that never succeeded.
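As a rough illustration of the attempt-count option, the sketch below counts attempts per job and derives a display label. Everything here is hypothetical: the JobOutcome class, the summarize helper, the (jobid, rule, ok) event tuples, and the ✗ marker for failures are invented for this example and do not reflect snakesee's data model. It also assumes the events describe a finished workflow, so a job whose last attempt failed has exhausted its retries.

```python
from dataclasses import dataclass


@dataclass
class JobOutcome:
    """Hypothetical per-job record; not part of snakesee."""
    rule: str
    attempts: int = 0
    succeeded: bool = False

    def label(self) -> str:
        mark = "✓" if self.succeeded else "✗"
        if self.attempts > 1:
            return f"{self.rule} {mark} ({self.attempts} attempts)"
        return f"{self.rule} {mark}"


def summarize(events):
    """Fold (jobid, rule, ok) attempt events into one JobOutcome per job."""
    outcomes = {}
    for jobid, rule, ok in events:
        outcome = outcomes.setdefault(jobid, JobOutcome(rule))
        outcome.attempts += 1
        # The latest attempt decides the final state.
        outcome.succeeded = ok
    return outcomes


events = [
    (3, "align", False),  # first attempt fails
    (3, "align", True),   # retry succeeds
    (4, "sort", True),
    (5, "call", False),   # never succeeds
]
outcomes = summarize(events)
labels = [outcome.label() for outcome in outcomes.values()]
```

With this shape, completed and failed counts can be derived from the succeeded flag alone, so a job is never counted in both.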
Input from maintainers welcome on which approach fits best.