feat: Add AKS H100 NCCL All-Reduce performance runtime #442

@xdu31

Problem

The h100-aks-ubuntu-training overlay references nccl-all-reduce-bw in its validation performance checks (added in #415), but no AKS TrainingRuntime template exists at validators/performance/testdata/h100/aks/runtime.yaml. As a result, the NCCL validator would fail at resource application time because it cannot find a service-specific runtime for AKS.

Currently only EKS and GKE have runtime templates:

  • validators/performance/testdata/h100/eks/runtime.yaml (EFA)
  • validators/performance/testdata/h100/gke/runtime.yaml (TCPXO/FasTrak)

What's needed

AKS H100 (ND H100 v5 / ND H200 v5) uses InfiniBand for GPU-to-GPU communication. The AKS runtime needs:

  • IB-specific NCCL environment variables (NCCL_IB_HCA, NCCL_IB_GID_INDEX, etc.)
  • No TCPXO sidecar (IB is kernel-native, no userspace daemon needed)
  • AKS-specific node selector and tolerations
  • Appropriate MPI network interface configuration for IB
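
As a rough illustration of the four requirements above, the pod-level portion of an AKS runtime might look something like the fragment below. This is only a sketch, not the contents of any existing file: the node pool name, taint key, container name, and every value are assumptions that would need to be confirmed on real ND H100 v5 nodes. The fragment omits the surrounding TrainingRuntime structure and the container image. NCCL_SOCKET_IFNAME and OMPI_MCA_btl_tcp_if_include are not named in this issue; they are included as one plausible way to pin the NCCL bootstrap and Open MPI TCP traffic to a specific interface.

```yaml
# Hypothetical pod spec fragment for an AKS H100 runtime (illustrative only).
spec:
  nodeSelector:
    kubernetes.azure.com/agentpool: gpupool   # assumed GPU node pool name
  tolerations:
    - key: nvidia.com/gpu                     # assumed taint key on the GPU pool
      operator: Exists
      effect: NoSchedule
  containers:
    - name: nccl-test                         # assumed container name
      env:
        # InfiniBand device selection for NCCL (placeholder values)
        - name: NCCL_IB_HCA
          value: "mlx5"
        - name: NCCL_IB_GID_INDEX
          value: "0"
        # Interface for NCCL bootstrap and MPI TCP traffic (assumed to be eth0)
        - name: NCCL_SOCKET_IFNAME
          value: "eth0"
        - name: OMPI_MCA_btl_tcp_if_include
          value: "eth0"
      # No TCPXO sidecar container is listed, unlike the GKE runtime.
```

The existing EKS and GKE templates are the better reference for the overall file layout; only the fabric-specific parts should differ for AKS.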

Affected overlay

  • h100-aks-ubuntu-training

References

  • AKS recipe overlays PR: feat: Add AKS (Azure Kubernetes Service) H100 recipe overlays #415
  • EKS runtime (EFA): validators/performance/testdata/h100/eks/runtime.yaml
  • GKE runtime (TCPXO/FasTrak): validators/performance/testdata/h100/gke/runtime.yaml
  • NCCL constraint implementation: validators/performance/nccl_all_reduce_bw_constraint.go

Metadata

Labels

  • P2 (minor defects; minor implications; no SLA commitment)
  • enhancement (new feature or request)
  • size/M