By integrating NVRX, we can use process-level restarts on Slurm. SageMaker HyperPod and ParallelCluster/PCS rely on a managed Slurm architecture that includes infrastructure-level monitoring and auto-resume when an infrastructure component fails. However, that architecture lacks process-level restart, which would avoid interrupting the job when one or more processes die.
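
To make the distinction concrete, here is a minimal sketch of the general idea behind a process-level restart: a supervisor relaunches a failed process in place instead of letting the whole job end. This is not NVRX's API or implementation. The function `run_with_restarts` and its `max_restarts` parameter are hypothetical names chosen for illustration, and the sketch supervises a single local process using only the Python standard library.

```python
import subprocess


def run_with_restarts(cmd, max_restarts=3):
    """Run `cmd`, relaunching it in place whenever it exits non-zero.

    Illustrative only: hypothetical helper, not part of NVRX or Slurm.
    Returns a tuple (final_exit_code, restarts_used).
    """
    restarts = 0
    while True:
        # Block until the supervised process exits, then inspect its exit code.
        returncode = subprocess.run(cmd).returncode
        # Stop when the process succeeds or the restart budget is exhausted.
        if returncode == 0 or restarts >= max_restarts:
            return returncode, restarts
        # The process failed: restart it without ending the surrounding job.
        restarts += 1
```

A call such as `run_with_restarts(["python", "train.py"], max_restarts=3)` (with `train.py` standing in for any worker script) would keep the supervising job alive across up to three worker failures. An infrastructure-level auto-resume, by contrast, reacts to a failed component by resuming the job as a whole.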