Skip to content
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion runners/launch_h200-dgxc-slurm.sh
Original file line number Diff line number Diff line change
Expand Up @@ -228,7 +228,7 @@ EOF
else

HF_HUB_CACHE_MOUNT="/models/gharunners/hf-hub-cache"
SQUASH_FILE="/data/containers/$(echo "$IMAGE" | sed 's/[\/:@#]/_/g').sqsh"
SQUASH_FILE="/data/gharunners/containers/$(echo "$IMAGE" | sed 's/[\/:@#]/_/g').sqsh"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 The PR updates the single-node SQUASH_FILE path from /data/containers/ to /data/gharunners/containers/ (line 231), but the multinode branch still references the old /data/containers/ path at lines 63, 67, and 72. Since both branches target the same H200 DGXC cluster and the PR title says "update directories on h200 dgxc cluster," these multinode paths likely need the same update or multinode jobs will fail to find the squash files.

Extended reasoning...

What the bug is

This PR is titled "fix: update directories on h200 dgxc cluster" and the description states "Updates were made to H200 DGXC physical cluster. This PR fixes accordingly." The change updates the single-node branch's SQUASH_FILE path from /data/containers/ to /data/gharunners/containers/ at line 231. However, three references to the old /data/containers/ path in the multinode (IS_MULTINODE=true) branch are left unchanged:

  • Line 63: NGINX_SQUASH_FILE="/data/containers/nginx+1.27.4.sqsh"
  • Line 67: SQUASH_FILE="/data/containers/$(echo "$IMAGE" | sed 's/[\/:@#]/+/g').sqsh"
  • Line 72: SQUASH_FILE="/data/containers/$(echo "$IMAGE" | sed 's|nvcr.io/||' | sed 's/[\/:@#]/+/g').sqsh"

How it manifests

Both branches operate on the same physical H200 DGXC Slurm cluster (same SLURM_PARTITION="main", same SLURM_ACCOUNT="sa-shared"). If the physical cluster directories were reorganized — which is what the PR title and description explicitly state — then the old /data/containers/ path may no longer exist or may not contain the expected squash files. Any multinode job launched after this change would fail when srtctl tries to map container images to squash files that don't exist at the old paths.

Step-by-step proof

  1. The PR changes line 231 from SQUASH_FILE="/data/containers/..." to SQUASH_FILE="/data/gharunners/containers/..." in the single-node (else) branch.
  2. A multinode job is submitted with IS_MULTINODE=true on the same cluster.
  3. The script reaches lines 63-72 and sets NGINX_SQUASH_FILE and SQUASH_FILE using the old /data/containers/ prefix.
  4. These paths are written into srtslurm.yaml (lines 97-101) under the containers: section.
  5. When srtctl apply runs, it references squash files at /data/containers/ which may no longer exist on the updated cluster.
  6. The multinode job fails because the container squash files cannot be found.

Addressing the counterargument

One reviewer noted that the multinode branch uses a different toolchain (srt-slurm/srtctl) with different conventions (e.g., + as sed separator vs _), and that the HF cache path at line 13 also lacks the gharunners prefix (/models/ vs /models/gharunners/). This is a fair observation — it's possible srt-slurm has its own directory layout. However, the squash files are stored on the same physical filesystem regardless of which orchestration tool accesses them. The sed separator difference (+ vs _) is about filename formatting, not directory structure. The /models/ path at line 13 is a different mount entirely and doesn't speak to the /data/containers/ vs /data/gharunners/containers/ question.

Impact and recommendation

If the directory migration is cluster-wide, multinode jobs will be broken. Even if there is a legitimate reason these paths should differ, the inconsistency should be explicitly addressed. The PR author should confirm whether lines 63, 67, and 72 also need updating to /data/gharunners/containers/.


# Convert pyxis image format (nvcr.io#path) to docker format (nvcr.io/path) for enroot import
DOCKER_IMAGE=$(echo "$IMAGE" | sed 's/#/\//g')
Expand Down
Loading