
feat(validator): add --node-selector and --toleration flags for validation workload scheduling#444

Merged
mchmarny merged 11 commits into main from feat/validator-node-selector on Apr 2, 2026

Conversation

atif1996 (Contributor) commented Mar 19, 2026

Summary

Adds --node-selector and --toleration flags to aicr validate, mirroring how the snapshot agent already handles scheduling. These flags control where the validation workloads (NCCL benchmark workers, conformance test pods) are scheduled, ensuring they land on the correct GPU nodes; this is critical for clusters with non-standard GPU node labels.

Also adds NCCL test support for B200 and GB200 GPUs on self-managed clusters (service=any).

Changes

Validation Workload Scheduling (--node-selector, --toleration)

  • --node-selector key=val overrides platform-specific GPU node selectors (e.g. cloud.google.com/gke-accelerator, node.kubernetes.io/instance-type) on inner workloads
  • --toleration key=val:Effect controls which taints the validation workloads tolerate
  • Flags are serialized as env vars (AICR_NODE_SELECTOR, AICR_TOLERATIONS) on the orchestrator Job and parsed in the validator container's Context
  • NCCL worker pods, gang scheduling pods, and DRA test pods all respect these overrides
  • Waits for Trainer controller-manager readiness after install before creating TrainingRuntime resources

B200 / GB200 Self-Managed Cluster Support

  • New runtime templates (testdata/b200/any/runtime.yaml, testdata/gb200/any/runtime.yaml) for InfiniBand/RDMA
  • New recipe overlays with NCCL bandwidth thresholds: B200 >= 350 GB/s, GB200 >= 720 GB/s
  • Fail-fast check requiring --node-selector for service=any (no default GPU label convention)

Test Results

| Cluster | Type | Bandwidth | Threshold | Result |
| --- | --- | --- | --- | --- |
| aicr-cuj1 | AWS EKS H100 | 487.80 GB/s | >= 300 GB/s | PASS |
| aicr-demo4 | GCP GKE H100 | 337.72 GB/s | >= 250 GB/s | PASS |
| kxphepuz | GCP GKE H100 (2 nodes) | 335.48 GB/s | >= 250 GB/s | PASS |

Test plan

  • go test ./pkg/validator/job/... ./validators/... ./validators/performance/...
  • NCCL validation on AWS EKS H100 (aicr-cuj1) — PASS
  • NCCL validation on GCP GKE H100 (aicr-demo4, kxphepuz) — PASS

github-actions bot commented Mar 19, 2026

Coverage Report ✅

| Metric | Value |
| --- | --- |
| Coverage | 74.0% |
| Threshold | 70% |
| Status | Pass |
Coverage Badge
![Coverage](https://img.shields.io/badge/coverage-74.0%25-green)

Merging this branch changes coverage in 4 packages (2 decrease, 2 increase)

| Impacted Packages | Coverage Δ | 🤖 |
| --- | --- | --- |
| github.com/NVIDIA/aicr/pkg/cli | 33.69% (-0.16%) | 👎 |
| github.com/NVIDIA/aicr/pkg/defaults | 100.00% (ø) | |
| github.com/NVIDIA/aicr/pkg/snapshotter | 51.12% (+0.67%) | 👍 |
| github.com/NVIDIA/aicr/pkg/validator | 34.47% (-0.34%) | 👎 |
| github.com/NVIDIA/aicr/pkg/validator/job | 69.48% (+2.18%) | 👍 |
| github.com/NVIDIA/aicr/validators | 0.00% (ø) | |
| github.com/NVIDIA/aicr/validators/conformance | 0.00% (ø) | |
| github.com/NVIDIA/aicr/validators/performance | 0.00% (ø) | |

Coverage by file

Changed files (no unit tests)

| Changed File | Coverage Δ | Total | Covered | Missed | 🤖 |
| --- | --- | --- | --- | --- | --- |
| github.com/NVIDIA/aicr/pkg/cli/validate.go | 20.65% (-0.41%) | 155 (+3) | 32 | 123 (+3) | 👎 |
| github.com/NVIDIA/aicr/pkg/defaults/timeouts.go | 0.00% (ø) | 0 | 0 | 0 | |
| github.com/NVIDIA/aicr/pkg/snapshotter/agent.go | 39.22% (+1.22%) | 153 (+3) | 60 (+3) | 93 | 👍 |
| github.com/NVIDIA/aicr/pkg/validator/job/deployer.go | 59.32% (+4.45%) | 118 (+5) | 70 (+8) | 48 (-3) | 👍 |
| github.com/NVIDIA/aicr/pkg/validator/options.go | 87.50% (-12.50%) | 16 (+2) | 14 | 2 (+2) | 💀 |
| github.com/NVIDIA/aicr/pkg/validator/types.go | 0.00% (ø) | 0 | 0 | 0 | |
| github.com/NVIDIA/aicr/pkg/validator/validator.go | 30.00% (ø) | 190 | 57 | 133 | |
| github.com/NVIDIA/aicr/validators/conformance/dra_support_check.go | 0.00% (ø) | 0 | 0 | 0 | |
| github.com/NVIDIA/aicr/validators/conformance/gang_scheduling_check.go | 0.00% (ø) | 0 | 0 | 0 | |
| github.com/NVIDIA/aicr/validators/conformance/secure_access_check.go | 0.00% (ø) | 0 | 0 | 0 | |
| github.com/NVIDIA/aicr/validators/context.go | 0.00% (ø) | 0 | 0 | 0 | |
| github.com/NVIDIA/aicr/validators/performance/nccl_all_reduce_bw_constraint.go | 0.00% (ø) | 0 | 0 | 0 | |
| github.com/NVIDIA/aicr/validators/performance/trainer_lifecycle.go | 0.00% (ø) | 0 | 0 | 0 | |

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

@atif1996 atif1996 marked this pull request as draft March 20, 2026 02:38
@atif1996 atif1996 force-pushed the feat/validator-node-selector branch from 9bf3329 to 99d77dd Compare March 21, 2026 21:42
@atif1996 atif1996 changed the title from "feat(validator): pass node selector to validation Jobs for scheduling parity" to "feat(validator): add --node-selector and --toleration flags for validation workload scheduling" Mar 23, 2026
@atif1996 atif1996 marked this pull request as ready for review April 1, 2026 22:25
@atif1996 atif1996 requested a review from a team as a code owner April 1, 2026 22:25
atif1996 added 8 commits April 1, 2026 18:29
… parity

Validation Jobs previously ignored the --node-selector CLI flag, creating
friction for users operating in constrained environments with specific node
placement requirements. AICR recipes and bundles already expose nodeSelector,
tolerations, and affinity for workload scheduling, but validators did not
honor the same inputs.

This change threads --node-selector through the full validation pipeline:
CLI → Validator → Deployer → K8s Job PodSpec. When set, validator Jobs are
constrained to nodes matching the specified labels, achieving configuration
parity between recipe/bundle deployment and validation phases.

Changes:
- Add NodeSelector field to Validator struct and WithNodeSelector option
- Accept nodeSelector in job.NewDeployer and apply via PodSpec.NodeSelector
- Wire --node-selector through validationConfig to validator construction
- Move --node-selector and --toleration flags to "Scheduling" category
- Update flag description to clarify it applies to both agent and validators
- Add TestDeployJobNodeSelector and TestDeployJobNodeSelectorEmpty tests

Closes #443

…scheduling

The --node-selector and --toleration flags now target the inner validation
workloads (NCCL benchmark workers, conformance test pods) rather than the
validator orchestrator Job itself.

This is the correct layer: the orchestrator is a lightweight CPU pod already
handled by preferCPUNodeAffinityApply(). The real need is for clusters with
non-standard GPU node labels where hardcoded platform selectors don't match.

Data flow:
  CLI flags → Validator.NodeSelector/Tolerations
            → AICR_NODE_SELECTOR / AICR_TOLERATIONS env vars on orchestrator pod
            → validators.Context.NodeSelector / .Tolerations (parsed in LoadContext)
            → override nodeSelector/tolerations on NCCL worker pods and
              conformance test pods (gang scheduling, DRA)

Changes:
- pkg/validator/job/deployer: remove nodeSelector from orchestrator pod spec;
  serialize both fields as AICR_NODE_SELECTOR and AICR_TOLERATIONS env vars
- validators/context: add NodeSelector and Tolerations fields, parse from env vars
- validators/performance/nccl: post-process TrainingRuntime unstructured object to
  replace worker nodeSelector/tolerations when ctx overrides are set; refactor
  applyYAMLWithDynamicClient into parseYAMLTemplate + createUnstructured helpers
- validators/conformance: thread tolerations through deployGangTestResources and
  deployDRATestResources to buildGangTestPod and buildDRATestPod
- Tests: replace pod-spec nodeSelector tests with env var assertions; add
  context env var parsing tests; add overrideNCCLWorkerScheduling unit tests
- Docs: rewrite --node-selector and --toleration descriptions in cli-reference.md
  and validate.go; add Workload Scheduling section; document AICR_NODE_SELECTOR
  and AICR_TOLERATIONS env vars and ctx fields in contributor/validator.md

Pull platform-specific nodeSelectors and tolerations (EKS instance-type,
GKE gke-accelerator+gpu taint) out of runtime.yaml templates and into
platformWorkerScheduling() in Go. The effective scheduling (platform default
or user override) is now always applied via applyNCCLWorkerScheduling,
eliminating the conditional post-processing path.

Also removes ${INSTANCE_TYPE} from EKS templateData since it's no longer
needed in the YAML template.

The default "tolerate all" toleration ({Operator: Exists}) was serialized
as ":" which failed to parse in the validator container due to an empty
taint effect string. Use "*" as a sentinel value that both sides recognize.

After installing the Kubeflow Trainer operator, the ValidatingWebhookConfiguration
is registered before the controller-manager pod is ready to serve admission
requests. Poll the controller-manager Deployment until at least one replica is
ready before creating TrainingRuntime resources.
@atif1996 atif1996 force-pushed the feat/validator-node-selector branch from 9fd3352 to b73b135 Compare April 1, 2026 22:59
atif1996 and others added 3 commits April 1, 2026 19:00
- buildPodSpecApply: hardcode tolerate-all on orchestrator Job so user
  --toleration flags (targeting inner workloads) cannot prevent the
  orchestrator from scheduling on CPU nodes; remove buildTolerationsApply
- serializeNodeSelector: sort keys for deterministic AICR_NODE_SELECTOR output
- orchestratorEnvCount: update capacity hint from 6 to 8
- applyNCCLWorkerScheduling: return ErrCodeInternal when "node" replicatedJob
  is not found instead of silently succeeding
- applyNCCLResources: use ctx.NodeSelector != nil / ctx.Tolerations != nil
  instead of len() > 0 for consistency with conformance validators
- trainer_lifecycle.go: fix stdlib import ordering (time after strings)
- agent_test.go: add wildcard "*" test cases for ParseTolerations
@mchmarny mchmarny enabled auto-merge (squash) April 2, 2026 17:12
@mchmarny mchmarny merged commit 306cb9b into main Apr 2, 2026
39 of 42 checks passed
@mchmarny mchmarny deleted the feat/validator-node-selector branch April 2, 2026 17:20
