
fix: update directories on h200 dgxc cluster #902

Merged
cquil11 merged 1 commit into main from fix/change-h200-dgxc-data-dir
Mar 10, 2026

Conversation

@cquil11
Collaborator

@cquil11 cquil11 commented Mar 10, 2026

Updates were made to H200 DGXC physical cluster. This PR fixes accordingly.

@cquil11 cquil11 requested a review from a team March 10, 2026 16:41
@cquil11
Collaborator Author

cquil11 commented Mar 10, 2026

@claude Test this PR by using runner model sweep on e2e test workflow on this runner

@Klaud-Cold
Collaborator

Klaud-Cold commented Mar 10, 2026

Claude finished @cquil11's task in 13m 1s


Testing PR with E2E Workflow

  • Gather context and understand PR changes
  • Determine correct runner-model-sweep parameters for H200 DGXC
  • Trigger e2e test workflow
  • Monitor workflow execution
  • Report results

Results

Both runs failed due to a SLURM cluster infrastructure issue, not related to this PR's change.

The PR change itself (updating SQUASH_FILE path from /data/containers/ to /data/gharunners/containers/) is confirmed to be applied correctly in the runner script — the logs show the new path being used:

SQUASH_FILE=/data/gharunners/containers/lmsysorg_sglang_v0.5.9-cu129-amd64.sqsh
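
To make the path construction concrete, here is a minimal sketch of how the runner script derives that squash filename; the `IMAGE` value shown is inferred from the log line above, and the sed expression is the one from the script:

```shell
# Derive the .sqsh filename from a container image reference by
# replacing the characters /, :, @, # with underscores, then prepend
# the new /data/gharunners/containers/ directory from this PR.
IMAGE="lmsysorg/sglang:v0.5.9-cu129-amd64"   # inferred from the log above
SQUASH_FILE="/data/gharunners/containers/$(echo "$IMAGE" | sed 's/[\/:@#]/_/g').sqsh"
echo "$SQUASH_FILE"
```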

However, all jobs failed at the salloc step with:

salloc: error: Invalid generic resource (gres) specification

This error occurs at salloc --gres=gpu:8 (line 237 of launch_h200-dgxc-slurm.sh) before the squash file path is ever used, meaning the directory path change in this PR could not be validated end-to-end. The GRES error is a SLURM cluster configuration issue (the GPU resource specification format may have changed on the H200 DGXC cluster).

Summary

| Run | Status | Details |
| --- | --- | --- |
| #22913704019 | Cancelled | PR branch was merged/deleted before runners picked up jobs |
| #22913859708 | Failed | `salloc: error: Invalid generic resource (gres) specification` on all DGXC nodes |

Recommendation: The SLURM GRES configuration on the H200 DGXC cluster may need to be investigated. The --gres=gpu:8 flag in launch_h200-dgxc-slurm.sh:237 may need to be updated to match the current cluster configuration (e.g., --gres=gpu:h200:8 or similar).
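
One way to narrow this down is sketched below with standard SLURM commands; the typed-GRES name `h200` is an assumption until it is confirmed against what the cluster actually advertises:

```shell
# Diagnose an "Invalid generic resource (gres) specification" error.
# Step 1 (run on the cluster): list the GRES each node advertises.
#   sinfo --Node -o "%N %G"
# Step 2: if nodes report a typed GRES such as "gpu:h200:8", the
# allocation request must name the type as well.
OLD_FLAG="--gres=gpu:8"       # current flag at launch_h200-dgxc-slurm.sh:237
NEW_FLAG="--gres=gpu:h200:8"  # hypothetical typed form; verify against sinfo
echo "salloc $NEW_FLAG"
```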


@cquil11 cquil11 merged commit 2537ff8 into main Mar 10, 2026
5 of 117 checks passed
@cquil11 cquil11 deleted the fix/change-h200-dgxc-data-dir branch March 10, 2026 16:47

```diff
 HF_HUB_CACHE_MOUNT="/models/gharunners/hf-hub-cache"
-SQUASH_FILE="/data/containers/$(echo "$IMAGE" | sed 's/[\/:@#]/_/g').sqsh"
+SQUASH_FILE="/data/gharunners/containers/$(echo "$IMAGE" | sed 's/[\/:@#]/_/g').sqsh"
```
Contributor


🔴 The PR updates the single-node SQUASH_FILE path from /data/containers/ to /data/gharunners/containers/ (line 231), but the multinode branch still references the old /data/containers/ path at lines 63, 67, and 72. Since both branches target the same H200 DGXC cluster and the PR title says "update directories on h200 dgxc cluster," these multinode paths likely need the same update or multinode jobs will fail to find the squash files.

Extended reasoning...

What the bug is

This PR is titled "fix: update directories on h200 dgxc cluster" and the description states "Updates were made to H200 DGXC physical cluster. This PR fixes accordingly." The change updates the single-node branch's SQUASH_FILE path from /data/containers/ to /data/gharunners/containers/ at line 231. However, three references to the old /data/containers/ path in the multinode (IS_MULTINODE=true) branch are left unchanged:

  • Line 63: NGINX_SQUASH_FILE="/data/containers/nginx+1.27.4.sqsh"
  • Line 67: SQUASH_FILE="/data/containers/$(echo "$IMAGE" | sed 's/[\/:@#]/+/g').sqsh"
  • Line 72: SQUASH_FILE="/data/containers/$(echo "$IMAGE" | sed 's|nvcr.io/||' | sed 's/[\/:@#]/+/g').sqsh"
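
To see how the multinode branch's `+` separator behaves, a minimal sketch reproducing the unchanged line 72 logic (the `nvcr.io` image name here is hypothetical):

```shell
# Multinode variant: strip the nvcr.io/ registry prefix, then replace
# /, :, @, # with '+' — note '+' rather than the single-node '_'.
# The directory prefix is still the old /data/containers/ path.
IMAGE="nvcr.io/nvidia/pytorch:24.01-py3"   # hypothetical example image
SQUASH_FILE="/data/containers/$(echo "$IMAGE" | sed 's|nvcr.io/||' | sed 's/[\/:@#]/+/g').sqsh"
echo "$SQUASH_FILE"
```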

How it manifests

Both branches operate on the same physical H200 DGXC Slurm cluster (same SLURM_PARTITION="main", same SLURM_ACCOUNT="sa-shared"). If the physical cluster directories were reorganized — which is what the PR title and description explicitly state — then the old /data/containers/ path may no longer exist or may not contain the expected squash files. Any multinode job launched after this change would fail when srtctl tries to map container images to squash files that don't exist at the old paths.

Step-by-step proof

  1. The PR changes line 231 from SQUASH_FILE="/data/containers/..." to SQUASH_FILE="/data/gharunners/containers/..." in the single-node (else) branch.
  2. A multinode job is submitted with IS_MULTINODE=true on the same cluster.
  3. The script reaches lines 63-72 and sets NGINX_SQUASH_FILE and SQUASH_FILE using the old /data/containers/ prefix.
  4. These paths are written into srtslurm.yaml (lines 97-101) under the containers: section.
  5. When srtctl apply runs, it references squash files at /data/containers/ which may no longer exist on the updated cluster.
  6. The multinode job fails because the container squash files cannot be found.

Addressing the counterargument

One reviewer noted that the multinode branch uses a different toolchain (srt-slurm/srtctl) with different conventions (e.g., + as sed separator vs _), and that the HF cache path at line 13 also lacks the gharunners prefix (/models/ vs /models/gharunners/). This is a fair observation — it's possible srt-slurm has its own directory layout. However, the squash files are stored on the same physical filesystem regardless of which orchestration tool accesses them. The sed separator difference (+ vs _) is about filename formatting, not directory structure. The /models/ path at line 13 is a different mount entirely and doesn't speak to the /data/containers/ vs /data/gharunners/containers/ question.

Impact and recommendation

If the directory migration is cluster-wide, multinode jobs will be broken. Even if there is a legitimate reason these paths should differ, the inconsistency should be explicitly addressed. The PR author should confirm whether lines 63, 67, and 72 also need updating to /data/gharunners/containers/.
