Resumable Pipeline & UX Improvements for Long-Running Jobs #6

@cytokineking

First, thank you for releasing PXDesign.

After running several multi-hour jobs on multi-GPU setups, we encountered some pain points with the current pipeline:

  • No resume capability. If a job crashes, gets preempted, or is terminated (common on shared HPC clusters), all progress is lost. For jobs generating the recommended 10,000+ designs, which can run for dozens of hours, this is a significant issue.
  • Results only emitted at completion. CIF files are held in memory until the entire diffusion stage finishes. Users have no intermediate artifacts to inspect, and a late-stage failure means losing everything.
  • Difficult progress monitoring. It's hard to tell how far along a job is, especially in distributed multi-GPU runs.
  • Cluttered output structure. The current output layout mixes intermediate artifacts with final results, making it harder to find what matters.

We've implemented solutions for all of the above in this fork: https://github.com/cytokineking/PXDesign

Key changes:

  • World-size agnostic resume: Deterministic design_id partitioning with durable state tracking, so a job can resume on any GPU count
  • Incremental CIF streaming: Atomic writes after each diffusion chunk completes, so finished chunks survive a crash or preemption
  • Heartbeat monitoring: Standardized status.json files for real-time progress visibility
  • Versioned results output: Clean results/ snapshots that don't overwrite on re-ranking, with internal runs/ structure for reproducibility
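
To make the first three points concrete, here is a minimal sketch of the general idea, not the code in the fork. All names in it (`atomic_write_text`, `completed_ids`, `ids_for_rank`, `run_rank`, the `design_000042.cif` naming, and the `status_rank0.json` layout) are assumptions made up for illustration. Each design_id is assigned to a rank by `design_id % world_size`, finished IDs are recovered from the files already on disk, and every file is written to a temporary name and then renamed so a reader never sees a partial CIF or status file.

```python
import json
import os
import tempfile
import time
from pathlib import Path
from typing import List, Optional, Set


def atomic_write_text(path: Path, text: str) -> None:
    """Write text so readers see either the old file or the complete new one."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(text)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_name, path)  # atomic when source and target share a filesystem
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


def completed_ids(out_dir: Path) -> Set[int]:
    """Recover finished design IDs from files on disk (hypothetical naming scheme)."""
    return {int(p.stem.split("_")[1]) for p in out_dir.glob("design_*.cif")}


def ids_for_rank(total: int, done: Set[int], rank: int, world_size: int) -> List[int]:
    """IDs owned by this rank (design_id % world_size) that are not finished yet."""
    return [i for i in range(total) if i % world_size == rank and i not in done]


def run_rank(out_dir: Path, total: int, rank: int, world_size: int,
             limit: Optional[int] = None) -> int:
    """Process this rank's pending designs; `limit` simulates an early stop."""
    out_dir.mkdir(parents=True, exist_ok=True)
    todo = ids_for_rank(total, completed_ids(out_dir), rank, world_size)
    if limit is not None:
        todo = todo[:limit]
    for n, design_id in enumerate(todo, start=1):
        cif_text = f"data_design_{design_id}\n"  # placeholder for real diffusion output
        atomic_write_text(out_dir / f"design_{design_id:06d}.cif", cif_text)
        status = {
            "rank": rank,
            "world_size": world_size,
            "completed_this_run": n,
            "last_design_id": design_id,
            "updated_at": time.time(),
        }
        # Heartbeat: one small JSON file per rank, rewritten after every design.
        atomic_write_text(out_dir / f"status_rank{rank}.json", json.dumps(status))
    return len(todo)
```

Because ownership depends only on `design_id` and the current `world_size`, and the set of finished IDs is read back from disk, a run interrupted on 2 GPUs can be restarted on 3 without recomputing or skipping designs. The trade-off in this sketch is that the remaining work may be unevenly balanced across ranks after a resume.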

All changes have been tested on single-GPU and multi-GPU configurations with resume scenarios at various pipeline stages.
Feel free to use any of this work if it's helpful. We're happy to discuss the approach or prepare a formal PR if there's interest.
