Open
Labels
enhancement: New feature or request
Description
First, thank you for releasing PXDesign.
After running several multi-hour jobs on multi-GPU setups, we encountered some pain points with the current pipeline:
- No resume capability. If a job crashes, is preempted, or is terminated (common on shared HPC clusters), all progress is lost. For jobs that generate the recommended 10,000+ designs and can run for dozens of hours, this is a significant issue.
- Results only emitted at completion. CIF files are held in memory until the entire diffusion stage finishes. Users have no intermediate artifacts to inspect, and a late-stage failure means losing everything.
- Difficult progress monitoring. It's hard to tell how far along a job is, especially in distributed multi-GPU runs.
- Cluttered output structure. The current output layout mixes intermediate artifacts with final results, making it harder to find what matters.
We've implemented solutions for all of the above in this fork: https://github.com/cytokineking/PXDesign
Key changes:
- World-size agnostic resume: Deterministic design_id partitioning with durable state tracking, so a job can resume on any GPU count
- Incremental CIF streaming: Atomic writes after each diffusion chunk completes, so at most the in-progress chunk is lost on failure
- Heartbeat monitoring: Standardized status.json files for real-time progress visibility
- Versioned results output: Clean results/ snapshots that don't overwrite on re-ranking, with internal runs/ structure for reproducibility
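To illustrate the idea behind world-size agnostic resume, here is a minimal sketch, not the fork's actual code. It assumes the set of completed design IDs can be read back from durable state, and that per-design seeds depend only on the design ID. The names `assigned_ids` and `seed_for` are hypothetical.

```python
import hashlib


def assigned_ids(total: int, rank: int, world_size: int, done: set[int]) -> list[int]:
    """Return the pending design IDs this rank should generate.

    Hypothetical helper: `done` is assumed to come from durable state on disk.
    Pending IDs are dealt round-robin, so the split works for any GPU count.
    """
    if not 0 <= rank < world_size:
        raise ValueError("rank must be in [0, world_size)")
    pending = [i for i in range(total) if i not in done]
    return pending[rank::world_size]


def seed_for(design_id: int, base_seed: int) -> int:
    """Derive a 32-bit seed from the design ID alone (assumed scheme).

    Because the seed ignores rank and world size, a design comes out the
    same no matter which GPU ends up producing it after a resume.
    """
    digest = hashlib.sha256(f"{base_seed}:{design_id}".encode()).digest()
    return int.from_bytes(digest[:4], "big")


if __name__ == "__main__":
    finished = {0, 1, 2, 5}  # e.g. written before a preemption
    # The original run may have used 2 GPUs; the resumed run uses 3.
    for rank in range(3):
        print(rank, assigned_ids(10, rank, 3, finished))
```

Every rank computes the same pending list from the same durable state, so the shards are disjoint and cover all remaining work without any coordination between processes.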
All changes have been tested on single-GPU and multi-GPU configurations with resume scenarios at various pipeline stages.
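For readers unfamiliar with the atomic-write and heartbeat patterns mentioned in the key changes, the following is a rough sketch under stated assumptions, not the fork's implementation. The helper names, the `rank_<n>/status.json` layout, and the status fields are all hypothetical. The only technique relied on is writing to a temporary file and renaming it with `os.replace`.

```python
import json
import os
import tempfile
import time
from pathlib import Path


def atomic_write_text(path: Path, text: str) -> None:
    """Write `text` so readers see either the old file or the complete new one."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # The temp file must live in the target directory so the rename stays
    # on one filesystem.
    fd, tmp_name = tempfile.mkstemp(
        dir=str(path.parent), prefix=path.name + ".", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(text)
            handle.flush()
            os.fsync(handle.fileno())  # push the data to disk before renaming
        os.replace(tmp_name, path)
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


def write_heartbeat(run_dir: Path, rank: int, done: int, total: int) -> None:
    """Record per-rank progress in a status.json file (hypothetical schema)."""
    status = {"rank": rank, "done": done, "total": total, "updated_at": time.time()}
    atomic_write_text(run_dir / f"rank_{rank}" / "status.json", json.dumps(status))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        run_dir = Path(tmp)
        atomic_write_text(run_dir / "designs" / "design_000003.cif", "data_example\n")
        write_heartbeat(run_dir, rank=0, done=1, total=10)
        print((run_dir / "rank_0" / "status.json").read_text())
```

A monitoring script can then poll the status files and treat a stale `updated_at` value as a sign that a rank has stalled.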
Feel free to use any of this work if it's helpful. We're happy to discuss the approach or prepare a formal PR if there's interest.