
feat(validator): add --node-selector and --toleration flags for validation workload scheduling#444

Merged
mchmarny merged 11 commits into main from feat/validator-node-selector on Apr 2, 2026

Conversation

atif1996 (Contributor) commented Mar 19, 2026

Summary

Adds --node-selector and --toleration flags to aicr validate, mirroring how the snapshot agent already handles scheduling. These flags control where the validation workloads (NCCL benchmark workers, conformance test pods) are scheduled, ensuring they land on the correct GPU nodes; this is critical for clusters with non-standard GPU node labels.

Also adds NCCL test support for B200 and GB200 GPUs on self-managed clusters (service=any).

Changes

Validation Workload Scheduling (--node-selector, --toleration)

  • --node-selector key=val overrides platform-specific GPU node selectors (e.g. cloud.google.com/gke-accelerator, node.kubernetes.io/instance-type) on inner workloads
  • --toleration key=val:Effect controls which taints the validation workloads tolerate
  • Flags are serialized as env vars (AICR_NODE_SELECTOR, AICR_TOLERATIONS) on the orchestrator Job and parsed in the validator container's Context
  • NCCL worker pods, gang scheduling pods, and DRA test pods all respect these overrides
  • Waits for Trainer controller-manager readiness after install before creating TrainingRuntime resources

B200 / GB200 Self-Managed Cluster Support

  • New runtime templates (testdata/b200/any/runtime.yaml, testdata/gb200/any/runtime.yaml) for InfiniBand/RDMA
  • New recipe overlays with NCCL bandwidth thresholds: B200 >= 350 GB/s, GB200 >= 720 GB/s
  • Fail-fast check requiring --node-selector for service=any (no default GPU label convention)

Test Results

| Cluster | Type | Bandwidth | Threshold | Result |
| --- | --- | --- | --- | --- |
| aicr-cuj1 | AWS EKS H100 | 487.80 GB/s | >= 300 GB/s | PASS |
| aicr-demo4 | GCP GKE H100 | 337.72 GB/s | >= 250 GB/s | PASS |
| kxphepuz | GCP GKE H100 (2 nodes) | 335.48 GB/s | >= 250 GB/s | PASS |

Test plan

  • go test ./pkg/validator/job/... ./validators/... ./validators/performance/...
  • NCCL validation on AWS EKS H100 (aicr-cuj1) — PASS
  • NCCL validation on GCP GKE H100 (aicr-demo4, kxphepuz) — PASS

github-actions bot commented Mar 19, 2026

Coverage Report ✅

| Metric | Value |
| --- | --- |
| Coverage | 74.0% |
| Threshold | 70% |
| Status | Pass |
Coverage Badge
![Coverage](https://img.shields.io/badge/coverage-74.0%25-green)

Merging this branch changes coverage in 4 packages (2 decrease, 2 increase)

| Impacted Packages | Coverage Δ | 🤖 |
| --- | --- | --- |
| github.com/NVIDIA/aicr/pkg/cli | 33.69% (-0.16%) | 👎 |
| github.com/NVIDIA/aicr/pkg/defaults | 100.00% (ø) | |
| github.com/NVIDIA/aicr/pkg/snapshotter | 51.12% (+0.67%) | 👍 |
| github.com/NVIDIA/aicr/pkg/validator | 34.47% (-0.34%) | 👎 |
| github.com/NVIDIA/aicr/pkg/validator/job | 69.48% (+2.18%) | 👍 |
| github.com/NVIDIA/aicr/validators | 0.00% (ø) | |
| github.com/NVIDIA/aicr/validators/conformance | 0.00% (ø) | |
| github.com/NVIDIA/aicr/validators/performance | 0.00% (ø) | |

Coverage by file

Changed files (no unit tests)

| Changed File | Coverage Δ | Total | Covered | Missed | 🤖 |
| --- | --- | --- | --- | --- | --- |
| github.com/NVIDIA/aicr/pkg/cli/validate.go | 20.65% (-0.41%) | 155 (+3) | 32 | 123 (+3) | 👎 |
| github.com/NVIDIA/aicr/pkg/defaults/timeouts.go | 0.00% (ø) | 0 | 0 | 0 | |
| github.com/NVIDIA/aicr/pkg/snapshotter/agent.go | 39.22% (+1.22%) | 153 (+3) | 60 (+3) | 93 | 👍 |
| github.com/NVIDIA/aicr/pkg/validator/job/deployer.go | 59.32% (+4.45%) | 118 (+5) | 70 (+8) | 48 (-3) | 👍 |
| github.com/NVIDIA/aicr/pkg/validator/options.go | 87.50% (-12.50%) | 16 (+2) | 14 | 2 (+2) | 💀 |
| github.com/NVIDIA/aicr/pkg/validator/types.go | 0.00% (ø) | 0 | 0 | 0 | |
| github.com/NVIDIA/aicr/pkg/validator/validator.go | 30.00% (ø) | 190 | 57 | 133 | |
| github.com/NVIDIA/aicr/validators/conformance/dra_support_check.go | 0.00% (ø) | 0 | 0 | 0 | |
| github.com/NVIDIA/aicr/validators/conformance/gang_scheduling_check.go | 0.00% (ø) | 0 | 0 | 0 | |
| github.com/NVIDIA/aicr/validators/conformance/secure_access_check.go | 0.00% (ø) | 0 | 0 | 0 | |
| github.com/NVIDIA/aicr/validators/context.go | 0.00% (ø) | 0 | 0 | 0 | |
| github.com/NVIDIA/aicr/validators/performance/nccl_all_reduce_bw_constraint.go | 0.00% (ø) | 0 | 0 | 0 | |
| github.com/NVIDIA/aicr/validators/performance/trainer_lifecycle.go | 0.00% (ø) | 0 | 0 | 0 | |

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

@atif1996 atif1996 marked this pull request as draft March 20, 2026 02:38
@atif1996 atif1996 force-pushed the feat/validator-node-selector branch from 9bf3329 to 99d77dd Compare March 21, 2026 21:42
@atif1996 atif1996 changed the title from "feat(validator): pass node selector to validation Jobs for scheduling parity" to "feat(validator): add --node-selector and --toleration flags for validation workload scheduling" Mar 23, 2026
@atif1996 atif1996 marked this pull request as ready for review April 1, 2026 22:25
@atif1996 atif1996 requested a review from a team as a code owner April 1, 2026 22:25
atif1996 added 8 commits April 1, 2026 18:29
… parity

Validation Jobs previously ignored the --node-selector CLI flag, creating
friction for users operating in constrained environments with specific node
placement requirements. AICR recipes and bundles already expose nodeSelector,
tolerations, and affinity for workload scheduling, but validators did not
honor the same inputs.

This change threads --node-selector through the full validation pipeline:
CLI → Validator → Deployer → K8s Job PodSpec. When set, validator Jobs are
constrained to nodes matching the specified labels, achieving configuration
parity between recipe/bundle deployment and validation phases.

Changes:
- Add NodeSelector field to Validator struct and WithNodeSelector option
- Accept nodeSelector in job.NewDeployer and apply via PodSpec.NodeSelector
- Wire --node-selector through validationConfig to validator construction
- Move --node-selector and --toleration flags to "Scheduling" category
- Update flag description to clarify it applies to both agent and validators
- Add TestDeployJobNodeSelector and TestDeployJobNodeSelectorEmpty tests

Closes #443

…scheduling

The --node-selector and --toleration flags now target the inner validation
workloads (NCCL benchmark workers, conformance test pods) rather than the
validator orchestrator Job itself.

This is the correct layer: the orchestrator is a lightweight CPU pod already
handled by preferCPUNodeAffinityApply(). The real need is for clusters with
non-standard GPU node labels where hardcoded platform selectors don't match.

Data flow:
  CLI flags → Validator.NodeSelector/Tolerations
            → AICR_NODE_SELECTOR / AICR_TOLERATIONS env vars on orchestrator pod
            → validators.Context.NodeSelector / .Tolerations (parsed in LoadContext)
            → override nodeSelector/tolerations on NCCL worker pods and
              conformance test pods (gang scheduling, DRA)

Changes:
- pkg/validator/job/deployer: remove nodeSelector from orchestrator pod spec;
  serialize both fields as AICR_NODE_SELECTOR and AICR_TOLERATIONS env vars
- validators/context: add NodeSelector and Tolerations fields, parse from env vars
- validators/performance/nccl: post-process TrainingRuntime unstructured object to
  replace worker nodeSelector/tolerations when ctx overrides are set; refactor
  applyYAMLWithDynamicClient into parseYAMLTemplate + createUnstructured helpers
- validators/conformance: thread tolerations through deployGangTestResources and
  deployDRATestResources to buildGangTestPod and buildDRATestPod
- Tests: replace pod-spec nodeSelector tests with env var assertions; add
  context env var parsing tests; add overrideNCCLWorkerScheduling unit tests
- Docs: rewrite --node-selector and --toleration descriptions in cli-reference.md
  and validate.go; add Workload Scheduling section; document AICR_NODE_SELECTOR
  and AICR_TOLERATIONS env vars and ctx fields in contributor/validator.md

Pull platform-specific nodeSelectors and tolerations (EKS instance-type,
GKE gke-accelerator+gpu taint) out of runtime.yaml templates and into
platformWorkerScheduling() in Go. The effective scheduling (platform default
or user override) is now always applied via applyNCCLWorkerScheduling,
eliminating the conditional post-processing path.

Also removes ${INSTANCE_TYPE} from EKS templateData since it's no longer
needed in the YAML template.

The default "tolerate all" toleration ({Operator: Exists}) was serialized
as ":" which failed to parse in the validator container due to an empty
taint effect string. Use "*" as a sentinel value that both sides recognize.

After installing the Kubeflow Trainer operator, the ValidatingWebhookConfiguration
is registered before the controller-manager pod is ready to serve admission
requests. Poll the controller-manager Deployment until at least one replica is
ready before creating TrainingRuntime resources.
@atif1996 atif1996 force-pushed the feat/validator-node-selector branch from 9fd3352 to b73b135 Compare April 1, 2026 22:59
atif1996 and others added 3 commits April 1, 2026 19:00
- buildPodSpecApply: hardcode tolerate-all on orchestrator Job so user
  --toleration flags (targeting inner workloads) cannot prevent the
  orchestrator from scheduling on CPU nodes; remove buildTolerationsApply
- serializeNodeSelector: sort keys for deterministic AICR_NODE_SELECTOR output
- orchestratorEnvCount: update capacity hint from 6 to 8
- applyNCCLWorkerScheduling: return ErrCodeInternal when "node" replicatedJob
  is not found instead of silently succeeding
- applyNCCLResources: use ctx.NodeSelector != nil / ctx.Tolerations != nil
  instead of len() > 0 for consistency with conformance validators
- trainer_lifecycle.go: fix stdlib import ordering (time after strings)
- agent_test.go: add wildcard "*" test cases for ParseTolerations
@mchmarny mchmarny enabled auto-merge (squash) April 2, 2026 17:12
@mchmarny mchmarny merged commit 306cb9b into main Apr 2, 2026
39 of 42 checks passed
@mchmarny mchmarny deleted the feat/validator-node-selector branch April 2, 2026 17:20
