-
Notifications
You must be signed in to change notification settings - Fork 99
fix: update directories on h200 dgxc cluster #902
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Merged
Changes from all commits
Commits
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🔴 The PR updates the single-node SQUASH_FILE path from
/data/containers/to/data/gharunners/containers/(line 231), but the multinode branch still references the old/data/containers/path at lines 63, 67, and 72. Since both branches target the same H200 DGXC cluster and the PR title says "update directories on h200 dgxc cluster," these multinode paths likely need the same update or multinode jobs will fail to find the squash files.Extended reasoning...
What the bug is
This PR is titled "fix: update directories on h200 dgxc cluster" and the description states "Updates were made to H200 DGXC physical cluster. This PR fixes accordingly." The change updates the single-node branch's
SQUASH_FILEpath from/data/containers/to/data/gharunners/containers/at line 231. However, three references to the old/data/containers/path in the multinode (IS_MULTINODE=true) branch are left unchanged:NGINX_SQUASH_FILE="/data/containers/nginx+1.27.4.sqsh"SQUASH_FILE="/data/containers/$(echo "$IMAGE" | sed 's/[\/:@#]/+/g').sqsh"SQUASH_FILE="/data/containers/$(echo "$IMAGE" | sed 's|nvcr.io/||' | sed 's/[\/:@#]/+/g').sqsh"How it manifests
Both branches operate on the same physical H200 DGXC Slurm cluster (same
SLURM_PARTITION="main", sameSLURM_ACCOUNT="sa-shared"). If the physical cluster directories were reorganized — which is what the PR title and description explicitly state — then the old/data/containers/path may no longer exist or may not contain the expected squash files. Any multinode job launched after this change would fail whensrtctltries to map container images to squash files that don't exist at the old paths.Step-by-step proof
SQUASH_FILE="/data/containers/..."toSQUASH_FILE="/data/gharunners/containers/..."in the single-node (else) branch.IS_MULTINODE=trueon the same cluster.NGINX_SQUASH_FILEandSQUASH_FILEusing the old/data/containers/prefix.srtslurm.yaml(lines 97-101) under thecontainers:section.srtctl applyruns, it references squash files at/data/containers/which may no longer exist on the updated cluster.Addressing the counterargument
One reviewer noted that the multinode branch uses a different toolchain (
srt-slurm/srtctl) with different conventions (e.g.,+as sed separator vs_), and that the HF cache path at line 13 also lacks thegharunnersprefix (/models/vs/models/gharunners/). This is a fair observation — it's possiblesrt-slurmhas its own directory layout. However, the squash files are stored on the same physical filesystem regardless of which orchestration tool accesses them. The sed separator difference (+vs_) is about filename formatting, not directory structure. The/models/path at line 13 is a different mount entirely and doesn't speak to the/data/containers/vs/data/gharunners/containers/question.Impact and recommendation
If the directory migration is cluster-wide, multinode jobs will be broken. Even if there is a legitimate reason these paths should differ, the inconsistency should be explicitly addressed. The PR author should confirm whether lines 63, 67, and 72 also need updating to
/data/gharunners/containers/.