fix: update directories on h200 dgxc cluster #902
Conversation
@claude Test this PR by using runner model sweep on e2e test workflow on this runner
Claude finished @cquil11's task in 13m 1s: Testing PR with E2E Workflow
Results
Both runs failed due to a SLURM cluster infrastructure issue, not related to this PR's change. The PR change itself (updating …) … However, all jobs failed at the … This error occurs at …
Summary
Recommendation: The SLURM GRES configuration on the H200 DGXC cluster may need to be investigated. The …
```diff
  HF_HUB_CACHE_MOUNT="/models/gharunners/hf-hub-cache"
- SQUASH_FILE="/data/containers/$(echo "$IMAGE" | sed 's/[\/:@#]/_/g').sqsh"
+ SQUASH_FILE="/data/gharunners/containers/$(echo "$IMAGE" | sed 's/[\/:@#]/_/g').sqsh"
```
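For context, the single-node branch derives the squash filename from the container image reference by replacing path-unsafe characters (`/`, `:`, `@`, `#`) with underscores. A minimal runnable sketch, using a hypothetical image reference (the real `$IMAGE` value is supplied by the workflow):

```shell
#!/bin/sh
# Hypothetical image reference for illustration only; the real value
# comes from the workflow environment.
IMAGE="nvcr.io/nvidia/pytorch:24.05-py3"

# Single-node naming: replace /, :, @, # with _ and append .sqsh,
# now under the updated /data/gharunners/containers/ prefix.
SQUASH_FILE="/data/gharunners/containers/$(echo "$IMAGE" | sed 's/[\/:@#]/_/g').sqsh"
echo "$SQUASH_FILE"
```

With this input the computed path is `/data/gharunners/containers/nvcr.io_nvidia_pytorch_24.05-py3.sqsh`.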
🔴 The PR updates the single-node SQUASH_FILE path from /data/containers/ to /data/gharunners/containers/ (line 231), but the multinode branch still references the old /data/containers/ path at lines 63, 67, and 72. Since both branches target the same H200 DGXC cluster and the PR title says "update directories on h200 dgxc cluster," these multinode paths likely need the same update or multinode jobs will fail to find the squash files.
Extended reasoning
What the bug is
This PR is titled "fix: update directories on h200 dgxc cluster" and the description states "Updates were made to H200 DGXC physical cluster. This PR fixes accordingly." The change updates the single-node branch's SQUASH_FILE path from /data/containers/ to /data/gharunners/containers/ at line 231. However, three references to the old /data/containers/ path in the multinode (IS_MULTINODE=true) branch are left unchanged:
- Line 63: `NGINX_SQUASH_FILE="/data/containers/nginx+1.27.4.sqsh"`
- Line 67: `SQUASH_FILE="/data/containers/$(echo "$IMAGE" | sed 's/[\/:@#]/+/g').sqsh"`
- Line 72: `SQUASH_FILE="/data/containers/$(echo "$IMAGE" | sed 's|nvcr.io/||' | sed 's/[\/:@#]/+/g').sqsh"`
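The multinode naming differs from the single-node scheme in two ways: the line 72 variant first strips the `nvcr.io/` registry prefix, and both multinode variants use `+` rather than `_` as the separator. A sketch with the same hypothetical image reference as above:

```shell
#!/bin/sh
# Hypothetical image reference for illustration only.
IMAGE="nvcr.io/nvidia/pytorch:24.05-py3"

# Multinode naming (line 72 style): drop the registry prefix, then
# replace /, :, @, # with + instead of _. Note the still-old prefix.
SQUASH_FILE="/data/containers/$(echo "$IMAGE" | sed 's|nvcr.io/||' | sed 's/[\/:@#]/+/g').sqsh"
echo "$SQUASH_FILE"
```

Here the computed path is `/data/containers/nvidia+pytorch+24.05-py3.sqsh`, still under the old prefix.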
How it manifests
Both branches operate on the same physical H200 DGXC Slurm cluster (same SLURM_PARTITION="main", same SLURM_ACCOUNT="sa-shared"). If the physical cluster directories were reorganized — which is what the PR title and description explicitly state — then the old /data/containers/ path may no longer exist or may not contain the expected squash files. Any multinode job launched after this change would fail when srtctl tries to map container images to squash files that don't exist at the old paths.
Step-by-step proof
- The PR changes line 231 from `SQUASH_FILE="/data/containers/..."` to `SQUASH_FILE="/data/gharunners/containers/..."` in the single-node (else) branch.
- A multinode job is submitted with `IS_MULTINODE=true` on the same cluster.
- The script reaches lines 63-72 and sets `NGINX_SQUASH_FILE` and `SQUASH_FILE` using the old `/data/containers/` prefix.
- These paths are written into `srtslurm.yaml` (lines 97-101) under the `containers:` section.
- When `srtctl apply` runs, it references squash files at `/data/containers/`, which may no longer exist on the updated cluster.
- The multinode job fails because the container squash files cannot be found.
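A cheap way to turn that final failure into an actionable error is to verify the squash file exists before the job is submitted. A sketch of such a guard (the function name and call site are hypothetical, not part of the actual workflow script):

```shell
#!/bin/sh
# check_squash FILE: fail with a clear message if FILE is missing, so a
# stale /data/containers/ path surfaces before srtctl is ever invoked.
check_squash() {
    if [ ! -f "$1" ]; then
        echo "error: squash file not found: $1" >&2
        return 1
    fi
}

# Example: this path is hypothetical and will normally not exist.
check_squash "/data/containers/nginx+1.27.4.sqsh" || echo "would abort here"
```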
Addressing the counterargument
One reviewer noted that the multinode branch uses a different toolchain (srt-slurm/`srtctl`) with different conventions (e.g., `+` as the sed separator vs `_`), and that the HF cache path at line 13 also lacks the gharunners prefix (`/models/` vs `/models/gharunners/`). This is a fair observation; it's possible srt-slurm has its own directory layout. However, the squash files are stored on the same physical filesystem regardless of which orchestration tool accesses them. The sed separator difference (`+` vs `_`) is about filename formatting, not directory structure. The `/models/` path at line 13 is a different mount entirely and doesn't speak to the `/data/containers/` vs `/data/gharunners/containers/` question.
Impact and recommendation
If the directory migration is cluster-wide, multinode jobs will be broken. Even if there is a legitimate reason these paths should differ, the inconsistency should be explicitly addressed. The PR author should confirm whether lines 63, 67, and 72 also need updating to /data/gharunners/containers/.
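To confirm nothing else was missed, the workflow file can be audited for leftover references to the old prefix. The sketch below writes a two-line stand-in file (the real workflow script is not reproduced here) and counts stale matches; note that the updated `/data/gharunners/containers/` path does not match the old pattern:

```shell
#!/bin/sh
# Stand-in for the workflow script, for illustration only.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
NGINX_SQUASH_FILE="/data/containers/nginx+1.27.4.sqsh"
SQUASH_FILE="/data/gharunners/containers/example.sqsh"
EOF

# Count lines still using the old prefix (here: just the nginx line).
stale_count=$(grep -c '/data/containers/' "$tmp")
echo "stale references: $stale_count"
rm -f "$tmp"
```

Running the same `grep` against the actual workflow file would flag lines 63, 67, and 72.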
Updates were made to the H200 DGXC physical cluster. This PR updates the workflow directories accordingly.