feat: Add AKS H100 NCCL All-Reduce performance runtime #442
Open
Labels
- P2: Minor defects; minor implications (no SLA commitment)
- enhancement: New feature or request
- size/M
Description
Problem
The h100-aks-ubuntu-training overlay references nccl-all-reduce-bw in its validation performance checks, added in #415, but there is no AKS TrainingRuntime template at validators/performance/testdata/h100/aks/runtime.yaml. The NCCL validator would fail at resource application time because it cannot find a service-specific runtime for AKS.
Currently only EKS and GKE have runtime templates:
- validators/performance/testdata/h100/eks/runtime.yaml (EFA)
- validators/performance/testdata/h100/gke/runtime.yaml (TCPXO/FastRak)
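The failure mode described above can be sketched as a per-service template lookup. This is only an illustration, not the validator's actual code: `availableRuntimes`, `runtimePath`, and `resolveRuntime` are hypothetical names, and the only facts taken from this issue are the directory layout and the set of services that currently have a template.

```go
package main

import (
	"fmt"
	"path"
)

// availableRuntimes lists the services that ship a runtime template today,
// per this issue. The map itself is a hypothetical stand-in for a file check.
var availableRuntimes = map[string]bool{"eks": true, "gke": true}

// runtimePath builds the template path using the layout named in this issue.
func runtimePath(accelerator, service string) string {
	return path.Join("validators", "performance", "testdata", accelerator, service, "runtime.yaml")
}

// resolveRuntime returns the template path, or an error when the service
// has no template (the AKS case described above).
func resolveRuntime(accelerator, service string) (string, error) {
	p := runtimePath(accelerator, service)
	if !availableRuntimes[service] {
		return "", fmt.Errorf("no runtime template for service %q at %s", service, p)
	}
	return p, nil
}

func main() {
	for _, svc := range []string{"eks", "gke", "aks"} {
		p, err := resolveRuntime("h100", svc)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println("found:", p)
	}
}
```

In this sketch, adding an "aks" entry (that is, shipping the missing template) is the only change needed for the lookup to succeed.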
What's needed
AKS H100 (ND H100 v5 / ND H200 v5) uses InfiniBand for GPU-to-GPU communication. The AKS runtime needs:
- IB-specific NCCL environment variables (NCCL_IB_HCA, NCCL_IB_GID_INDEX, etc.)
- No TCPXO sidecar (IB is kernel-native, no userspace daemon needed)
- AKS-specific node selector and tolerations
- Appropriate MPI network interface configuration for IB
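As a rough sketch of the environment such a runtime might carry, the snippet below assembles the IB-related variables and renders them as `NAME=value` lines. The helper names `ibEnv` and `renderEnv` are hypothetical, the values (`mlx5`, `0`, `eth0`) are placeholders rather than tested AKS settings, and `NCCL_SOCKET_IFNAME` and `OMPI_MCA_btl_tcp_if_include` are assumptions about what "MPI network interface configuration" could mean here; only `NCCL_IB_HCA` and `NCCL_IB_GID_INDEX` are named in this issue.

```go
package main

import (
	"fmt"
	"sort"
)

// ibEnv returns a hypothetical set of NCCL/MPI settings for an InfiniBand fabric.
// All values are caller-supplied placeholders, not validated AKS defaults.
func ibEnv(hcaPrefix, gidIndex, iface string) map[string]string {
	return map[string]string{
		"NCCL_IB_HCA":                 hcaPrefix, // named in this issue
		"NCCL_IB_GID_INDEX":           gidIndex,  // named in this issue
		"NCCL_SOCKET_IFNAME":          iface,     // assumption: NCCL bootstrap interface
		"OMPI_MCA_btl_tcp_if_include": iface,     // assumption: Open MPI TCP interface
	}
}

// renderEnv returns the variables as sorted NAME=value lines so the
// output is deterministic regardless of map iteration order.
func renderEnv(env map[string]string) []string {
	keys := make([]string, 0, len(env))
	for k := range env {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	lines := make([]string, 0, len(keys))
	for _, k := range keys {
		lines = append(lines, k+"="+env[k])
	}
	return lines
}

func main() {
	for _, line := range renderEnv(ibEnv("mlx5", "0", "eth0")) {
		fmt.Println(line)
	}
}
```

The real values would need to be confirmed on ND H100 v5 nodes before being written into the template.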
Affected overlay
References
- AKS recipe overlays PR: feat: Add AKS (Azure Kubernetes Service) H100 recipe overlays #415
- EKS runtime (EFA): validators/performance/testdata/h100/eks/runtime.yaml
- GKE runtime (TCPXO/FastRak): validators/performance/testdata/h100/gke/runtime.yaml
- NCCL constraint implementation: validators/performance/nccl_all_reduce_bw_constraint.go