fix: update directories on h200 dgxc cluster #902
Conversation
@claude Test this PR by using runner model sweep on e2e test workflow on this runner
Claude finished @cquil11's task in 13m 1s: Testing PR with E2E Workflow
Results
Both runs failed due to a SLURM cluster infrastructure issue, not related to this PR's change. The PR change itself (updating …) … However, all jobs failed at the … This error occurs at …
Summary
Recommendation: The SLURM GRES configuration on the H200 DGXC cluster may need to be investigated. The …
```diff
  HF_HUB_CACHE_MOUNT="/models/gharunners/hf-hub-cache"
- SQUASH_FILE="/data/containers/$(echo "$IMAGE" | sed 's/[\/:@#]/_/g').sqsh"
+ SQUASH_FILE="/data/gharunners/containers/$(echo "$IMAGE" | sed 's/[\/:@#]/_/g').sqsh"
```
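For context, the single-node branch derives the squash filename from the container image reference by replacing path-unsafe characters (`/`, `:`, `@`, `#`) with underscores. A minimal runnable sketch, using a hypothetical image reference (the real `$IMAGE` value is supplied by the workflow):

```shell
#!/bin/sh
# Hypothetical image reference for illustration only; the real value
# comes from the workflow environment.
IMAGE="nvcr.io/nvidia/pytorch:24.05-py3"

# Single-node naming: replace /, :, @, # with _ and append .sqsh,
# now under the updated /data/gharunners/containers/ prefix.
SQUASH_FILE="/data/gharunners/containers/$(echo "$IMAGE" | sed 's/[\/:@#]/_/g').sqsh"
echo "$SQUASH_FILE"
```

With this input the computed path is `/data/gharunners/containers/nvcr.io_nvidia_pytorch_24.05-py3.sqsh`.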
🔴 The PR updates the single-node SQUASH_FILE path from /data/containers/ to /data/gharunners/containers/ (line 231), but the multinode branch still references the old /data/containers/ path at lines 63, 67, and 72. Since both branches target the same H200 DGXC cluster and the PR title says "update directories on h200 dgxc cluster," these multinode paths likely need the same update or multinode jobs will fail to find the squash files.
Extended reasoning
What the bug is
This PR is titled "fix: update directories on h200 dgxc cluster" and the description states "Updates were made to H200 DGXC physical cluster. This PR fixes accordingly." The change updates the single-node branch's SQUASH_FILE path from /data/containers/ to /data/gharunners/containers/ at line 231. However, three references to the old /data/containers/ path in the multinode (IS_MULTINODE=true) branch are left unchanged:
- Line 63: `NGINX_SQUASH_FILE="/data/containers/nginx+1.27.4.sqsh"`
- Line 67: `SQUASH_FILE="/data/containers/$(echo "$IMAGE" | sed 's/[\/:@#]/+/g').sqsh"`
- Line 72: `SQUASH_FILE="/data/containers/$(echo "$IMAGE" | sed 's|nvcr.io/||' | sed 's/[\/:@#]/+/g').sqsh"`
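The multinode naming differs from the single-node scheme in two ways: the line 72 variant first strips the `nvcr.io/` registry prefix, and both multinode variants use `+` rather than `_` as the separator. A sketch with the same hypothetical image reference as above:

```shell
#!/bin/sh
# Hypothetical image reference for illustration only.
IMAGE="nvcr.io/nvidia/pytorch:24.05-py3"

# Multinode naming (line 72 style): drop the registry prefix, then
# replace /, :, @, # with + instead of _. Note the still-old prefix.
SQUASH_FILE="/data/containers/$(echo "$IMAGE" | sed 's|nvcr.io/||' | sed 's/[\/:@#]/+/g').sqsh"
echo "$SQUASH_FILE"
```

Here the computed path is `/data/containers/nvidia+pytorch+24.05-py3.sqsh`, still under the old prefix.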
How it manifests
Both branches operate on the same physical H200 DGXC Slurm cluster (same SLURM_PARTITION="main", same SLURM_ACCOUNT="sa-shared"). If the physical cluster directories were reorganized — which is what the PR title and description explicitly state — then the old /data/containers/ path may no longer exist or may not contain the expected squash files. Any multinode job launched after this change would fail when srtctl tries to map container images to squash files that don't exist at the old paths.
Step-by-step proof
- The PR changes line 231 from `SQUASH_FILE="/data/containers/..."` to `SQUASH_FILE="/data/gharunners/containers/..."` in the single-node (else) branch.
- A multinode job is submitted with `IS_MULTINODE=true` on the same cluster.
- The script reaches lines 63-72 and sets `NGINX_SQUASH_FILE` and `SQUASH_FILE` using the old `/data/containers/` prefix.
- These paths are written into `srtslurm.yaml` (lines 97-101) under the `containers:` section.
- When `srtctl apply` runs, it references squash files at `/data/containers/`, which may no longer exist on the updated cluster.
- The multinode job fails because the container squash files cannot be found.
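A cheap way to turn that final failure into an actionable error is to verify the squash file exists before the job is submitted. A sketch of such a guard (the function name and call site are hypothetical, not part of the actual workflow script):

```shell
#!/bin/sh
# check_squash FILE: fail with a clear message if FILE is missing, so a
# stale /data/containers/ path surfaces before srtctl is ever invoked.
check_squash() {
    if [ ! -f "$1" ]; then
        echo "error: squash file not found: $1" >&2
        return 1
    fi
}

# Example: this path is hypothetical and will normally not exist.
check_squash "/data/containers/nginx+1.27.4.sqsh" || echo "would abort here"
```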
Addressing the counterargument
One reviewer noted that the multinode branch uses a different toolchain (srt-slurm/`srtctl`) with different conventions (e.g., `+` as the sed separator vs `_`), and that the HF cache path at line 13 also lacks the gharunners prefix (`/models/` vs `/models/gharunners/`). This is a fair observation; it's possible srt-slurm has its own directory layout. However, the squash files are stored on the same physical filesystem regardless of which orchestration tool accesses them. The sed separator difference (`+` vs `_`) is about filename formatting, not directory structure. The `/models/` path at line 13 is a different mount entirely and doesn't speak to the `/data/containers/` vs `/data/gharunners/containers/` question.
Impact and recommendation
If the directory migration is cluster-wide, multinode jobs will be broken. Even if there is a legitimate reason these paths should differ, the inconsistency should be explicitly addressed. The PR author should confirm whether lines 63, 67, and 72 also need updating to /data/gharunners/containers/.
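To confirm nothing else was missed, the workflow file can be audited for leftover references to the old prefix. The sketch below writes a two-line stand-in file (the real workflow script is not reproduced here) and counts stale matches; note that the updated `/data/gharunners/containers/` path does not match the old pattern:

```shell
#!/bin/sh
# Stand-in for the workflow script, for illustration only.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
NGINX_SQUASH_FILE="/data/containers/nginx+1.27.4.sqsh"
SQUASH_FILE="/data/gharunners/containers/example.sqsh"
EOF

# Count lines still using the old prefix (here: just the nginx line).
stale_count=$(grep -c '/data/containers/' "$tmp")
echo "stale references: $stale_count"
rm -f "$tmp"
```

Running the same `grep` against the actual workflow file would flag lines 63, 67, and 72.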
Updates were made to the H200 DGXC physical cluster. This PR updates the workflow directories accordingly.