Conversation
Coverage Report ✅
Merging this branch changes the coverage (2 decrease, 2 increase).
Coverage by file: changed files (no unit tests).
Please note that the "Total", "Covered", and "Missed" counts above refer to code statements rather than lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.
Force-pushed from 9bf3329 to 99d77dd
… parity

Validation Jobs previously ignored the --node-selector CLI flag, creating friction for users operating in constrained environments with specific node placement requirements. AICR recipes and bundles already expose nodeSelector, tolerations, and affinity for workload scheduling, but validators did not honor the same inputs.

This change threads --node-selector through the full validation pipeline: CLI → Validator → Deployer → K8s Job PodSpec. When set, validator Jobs are constrained to nodes matching the specified labels, achieving configuration parity between recipe/bundle deployment and validation phases.

Changes:
- Add NodeSelector field to Validator struct and WithNodeSelector option
- Accept nodeSelector in job.NewDeployer and apply via PodSpec.NodeSelector
- Wire --node-selector through validationConfig to validator construction
- Move --node-selector and --toleration flags to "Scheduling" category
- Update flag description to clarify it applies to both agent and validators
- Add TestDeployJobNodeSelector and TestDeployJobNodeSelectorEmpty tests

Closes #443
…scheduling
The --node-selector and --toleration flags now target the inner validation
workloads (NCCL benchmark workers, conformance test pods) rather than the
validator orchestrator Job itself.
This is the correct layer: the orchestrator is a lightweight CPU pod already
handled by preferCPUNodeAffinityApply(). The real need is for clusters with
non-standard GPU node labels where hardcoded platform selectors don't match.
Data flow:
CLI flags → Validator.NodeSelector/Tolerations
→ AICR_NODE_SELECTOR / AICR_TOLERATIONS env vars on orchestrator pod
→ validators.Context.NodeSelector / .Tolerations (parsed in LoadContext)
→ override nodeSelector/tolerations on NCCL worker pods and
conformance test pods (gang scheduling, DRA)
Changes:
- pkg/validator/job/deployer: remove nodeSelector from orchestrator pod spec;
serialize both fields as AICR_NODE_SELECTOR and AICR_TOLERATIONS env vars
- validators/context: add NodeSelector and Tolerations fields, parse from env vars
- validators/performance/nccl: post-process TrainingRuntime unstructured object to
replace worker nodeSelector/tolerations when ctx overrides are set; refactor
applyYAMLWithDynamicClient into parseYAMLTemplate + createUnstructured helpers
- validators/conformance: thread tolerations through deployGangTestResources and
deployDRATestResources to buildGangTestPod and buildDRATestPod
- Tests: replace pod-spec nodeSelector tests with env var assertions; add
context env var parsing tests; add overrideNCCLWorkerScheduling unit tests
- Docs: rewrite --node-selector and --toleration descriptions in cli-reference.md
and validate.go; add Workload Scheduling section; document AICR_NODE_SELECTOR
and AICR_TOLERATIONS env vars and ctx fields in contributor/validator.md
Pull platform-specific nodeSelectors and tolerations (EKS instance-type,
GKE gke-accelerator+gpu taint) out of runtime.yaml templates and into
platformWorkerScheduling() in Go. The effective scheduling (platform default
or user override) is now always applied via applyNCCLWorkerScheduling,
eliminating the conditional post-processing path.
Also removes ${INSTANCE_TYPE} from EKS templateData since it's no longer
needed in the YAML template.
The default "tolerate all" toleration ({Operator: Exists}) was serialized
as ":" which failed to parse in the validator container due to an empty
taint effect string. Use "*" as a sentinel value that both sides recognize.
After installing the Kubeflow Trainer operator, the ValidatingWebhookConfiguration is registered before the controller-manager pod is ready to serve admission requests. Poll the controller-manager Deployment until at least one replica is ready before creating TrainingRuntime resources.
Force-pushed from 9fd3352 to b73b135
- buildPodSpecApply: hardcode tolerate-all on the orchestrator Job so user --toleration flags (targeting inner workloads) cannot prevent the orchestrator from scheduling on CPU nodes; remove buildTolerationsApply
- serializeNodeSelector: sort keys for deterministic AICR_NODE_SELECTOR output
- orchestratorEnvCount: update capacity hint from 6 to 8
- applyNCCLWorkerScheduling: return ErrCodeInternal when the "node" replicatedJob is not found instead of silently succeeding
- applyNCCLResources: use ctx.NodeSelector != nil / ctx.Tolerations != nil instead of len() > 0 for consistency with conformance validators
- trainer_lifecycle.go: fix stdlib import ordering (time after strings)
- agent_test.go: add wildcard "*" test cases for ParseTolerations
mchmarny
approved these changes
Apr 2, 2026
Summary

Adds --node-selector and --toleration flags to aicr validate, mirroring how the snapshot agent already handles scheduling. These flags now control the scheduling of validation workloads (NCCL benchmark workers, conformance test pods) onto the correct GPU nodes, which is critical for clusters with non-standard GPU node labels. Also adds NCCL test support for B200 and GB200 GPUs on self-managed clusters (service=any).

Changes

Validation Workload Scheduling (--node-selector, --toleration)
- --node-selector key=val overrides platform-specific GPU node selectors (e.g. cloud.google.com/gke-accelerator, node.kubernetes.io/instance-type) on inner workloads
- --toleration key=val:Effect controls which taints the validation workloads tolerate
- Both are serialized as env vars (AICR_NODE_SELECTOR, AICR_TOLERATIONS) on the orchestrator Job and parsed in the validator container's Context

B200 / GB200 Self-Managed Cluster Support
- New runtime templates (testdata/b200/any/runtime.yaml, testdata/gb200/any/runtime.yaml) for InfiniBand/RDMA
- Requires an explicit --node-selector for service=any (no default GPU label convention)

Test Results
Test plan
go test ./pkg/validator/job/... ./validators/... ./validators/performance/...