RFC: Dynamic GPU Orchestration Layer — From Day-Zero Recipes to Day-N Runtime Intelligence #464
Summary
AICR's Recipe → Bundle → Deploy → Validate workflow solves day-zero cluster configuration. This RFC proposes extending AICR into day-N operations with three capabilities that build directly on the existing architecture: continuous drift detection, per-node recipe compliance validation, and fleet-wide observability.
These are features I intend to build as a contributor. Feedback on approach and scope is welcome.
Upstream Context
This proposal builds on existing upstream work and roadmap items:
- ROADMAP.md P2 — Configuration Drift Detection: listed as backlog (`aicr diff`, scheduled CronJob, alerting). This RFC provides a concrete design.
- #448 — Inference performance validation with AIPerf: continuous observability (Feature 3 below) would extend point-in-time AIPerf benchmarks into ongoing regression detection.
- #442 — AKS NCCL performance runtime: Drift detection on AKS depends on complete platform runtime coverage.
- NVIDIA DRA donation to CNCF (KubeCon Europe 2026): DRA changes how GPUs are allocated in K8s. Drift detection needs to account for DRA-based ResourceClaims alongside traditional device plugin resources.
- KAI Scheduler: NVIDIA's open-source GPU scheduler already handles workload placement, gang scheduling, topology-aware scheduling, and fair-sharing. This RFC intentionally does not propose workload scheduling — KAI covers that. Instead, these features complement KAI by ensuring the underlying cluster stays correctly configured.
Feature 1: Continuous Drift Detection (aicr diff + Controller)
What
A Kubernetes controller that periodically captures snapshots and validates them against the deployed recipe, reporting and optionally remediating drift.
Why
AICR validates at deploy time. But clusters drift: operators get upgraded out-of-band, kernel parameters change after node replacements, Helm values diverge from recipe-specified state. Meta's LLaMA 3.1 training on 16,384 H100s experienced 419 unexpected failures over 54 days — about half GPU/HBM3 related. OpenAI's December 2024 outage was caused by a single telemetry deployment change cascading through Kubernetes. Configuration drift in GPU clusters causes silent performance degradation and expensive failures.
AICR already has all the building blocks — the snapshot collectors (pkg/collector/), constraint evaluation (pkg/constraints/), and four-phase validation (pkg/validator/). A drift detection controller is a reconciliation loop over these existing primitives.
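To make the "reconciliation loop over existing primitives" concrete, here is a minimal sketch of its shape: snapshot, evaluate, report, repeated on a schedule. The `snapshot`/`evaluate` closures stand in for AICR's `pkg/collector/` and `pkg/constraints/`; all names and the example driver version are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// reconcile runs one drift-detection pass: capture a snapshot, then
// evaluate it against recipe constraints, returning any drift findings.
func reconcile(snapshot func() map[string]string, evaluate func(map[string]string) []string) []string {
	return evaluate(snapshot())
}

func main() {
	// Stand-in for pkg/collector/: returns observed node/cluster state.
	snapshot := func() map[string]string {
		return map[string]string{"gpu.driver.version": "535.104.05"}
	}
	// Stand-in for constraints.Evaluate(): compares observed state to the
	// recipe-pinned value (illustrative version string).
	evaluate := func(s map[string]string) []string {
		var drift []string
		if s["gpu.driver.version"] != "550.54.15" {
			drift = append(drift, "gpu.driver.version mismatch")
		}
		return drift
	}
	ticker := time.NewTicker(10 * time.Millisecond) // production: recipe-defined cron schedule
	defer ticker.Stop()
	for i := 0; i < 2; i++ {
		<-ticker.C
		fmt.Println(reconcile(snapshot, evaluate))
	}
}
```

The controller proposed in Phase 1b is essentially this loop plus Kubernetes plumbing (CRD-driven schedule, Events, metrics).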
How
Phase 1a — aicr diff CLI command:
- Two modes: recipe-vs-snapshot (evaluate constraints) and snapshot-vs-snapshot (field-level comparison)
- Recipe mode evaluates top-level constraints, component drift (via the `K8s.image` subtype), and per-phase validation constraints
- Uses the existing `constraints.Evaluate()` — the same code path as `validator.checkReadiness()`
- Reports severity (error/warning) and remediation guidance from the recipe
- `--fail-on-drift` for CI/CD integration (non-zero exit on drift)
- Output formats: YAML, JSON, table (custom human-readable format)
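A sketch of the proposed `--fail-on-drift` semantics, so the CI/CD contract is unambiguous. The `ConstraintResult` type and `exitCode` helper are illustrative stand-ins, not AICR's real types:

```go
package main

import "fmt"

// ConstraintResult is a hypothetical, simplified view of one recipe
// constraint evaluated against a snapshot (the real types live in
// AICR's pkg/constraints).
type ConstraintResult struct {
	Name     string
	Severity string // "error" or "warning"
	Passed   bool
}

// exitCode models --fail-on-drift: exit 0 when every constraint passes,
// non-zero when any drift is detected and the flag is set.
func exitCode(results []ConstraintResult, failOnDrift bool) int {
	for _, r := range results {
		if !r.Passed && failOnDrift {
			return 1
		}
	}
	return 0
}

func main() {
	results := []ConstraintResult{
		{Name: "gpu.driver.version", Severity: "error", Passed: true},
		{Name: "os.kernel.param.hugepages", Severity: "warning", Passed: false},
	}
	// One failed constraint with --fail-on-drift set -> non-zero exit.
	fmt.Println(exitCode(results, true))
}
```

An open design detail is whether warnings should also trigger a non-zero exit, or only error-severity drift.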
Phase 1b — AICRDriftWatch CRD + controller (future work):
- CRD references a recipe and a schedule (cron expression)
- Controller runs `aicr snapshot` → `aicr diff --recipe` on schedule
- Stores drift reports as Kubernetes Events and an optional ConfigMap
- Integrates with Prometheus AlertManager for alerting
- Starts conservative: alert-only, no auto-remediation in v1
Extension points used:
- `pkg/snapshotter/agent.go` — agent deployment for periodic snapshots
- `pkg/constraints/evaluate.go` — constraint evaluation against snapshots
- `pkg/validator/phases.go` — phase-based validation execution
- `pkg/k8s/agent/deployer.go` — RBAC and Job patterns for the controller
Feature 2: Per-Node Recipe Compliance (aicr node-validate)
What
A command that validates individual GPU nodes against recipe constraints, labeling nodes with compliance status to gate workload scheduling. Designed to run as a DaemonSet alongside Skyhook, ensuring autoscaled nodes are recipe-compliant before accepting GPU workloads.
Why
When Cluster Autoscaler or Karpenter adds GPU nodes, there's no guarantee those nodes match the recipe's required driver version, GPU Operator config, or kernel parameters. Research shows GPU clusters achieve only 30-55% of theoretical performance, partly because scaled-out nodes often aren't configured identically to the original validated set.
AICR already validates node-level configuration via snapshot constraints. The gap is that this validation happens after deployment as a one-time check, not continuously as nodes join and change.
Why not a webhook?
AICR is a recipe-to-bundle translation system (aicrd is a stateless HTTP API with /v1/recipe, /v1/query, /v1/bundle endpoints). It has no webhook infrastructure, no TLS certificate management, and no admission controller framework. Adding a Kubernetes validating webhook would require entirely new infrastructure that doesn't fit AICR's architecture.
Instead, this uses the same pattern AICR already follows: agent-based validation. The snapshot agent (pkg/snapshotter/agent.go) already runs as a Kubernetes Job on GPU nodes. node-validate extends this pattern to continuous per-node compliance checking.
How
aicr node-validate command:
- Runs on a GPU node (designed for DaemonSet or init container execution)
- Captures a local node snapshot using existing collectors (`pkg/collector/` — K8s, GPU, OS, SystemD, topology)
- Evaluates recipe readiness constraints against the node snapshot (reuses `constraints.Evaluate()`)
- Labels nodes with `aicr.nvidia.com/recipe-compliant=true|false` via the K8s API (strategic merge patch)
- `--interval` flag for continuous loop mode (DaemonSet); `--fail-on-drift` for one-shot mode (init container)
- `--metrics-port` exposes a Prometheus `/metrics` endpoint in loop mode
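As a sketch of the labeling step above: the strategic merge patch body is just a nested `metadata.labels` document sent via the K8s API (through `pkg/k8s/client/` in the real code). The helper names here are hypothetical; the label key is the one proposed in this RFC:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// complianceLabelValue maps the number of failed constraints to the
// label value node-validate would set on the node.
func complianceLabelValue(failed int) string {
	if failed == 0 {
		return "true"
	}
	return "false"
}

// compliancePatch builds the strategic-merge-patch body that would be
// sent as PATCH to the Node object (content-type
// application/strategic-merge-patch+json).
func compliancePatch(failed int) ([]byte, error) {
	body := map[string]any{
		"metadata": map[string]any{
			"labels": map[string]string{
				"aicr.nvidia.com/recipe-compliant": complianceLabelValue(failed),
			},
		},
	}
	return json.Marshal(body)
}

func main() {
	patch, _ := compliancePatch(2) // two failed constraints -> non-compliant
	fmt.Println(string(patch))
}
```

Because a merge patch only touches the listed label, it cannot clobber labels set by other controllers such as the GPU Operator or Skyhook.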
Workload gating via labels (not taints):
`node-validate` uses node labels rather than direct taint management. This is a deliberate design choice — direct taint manipulation from `node-validate` would conflict with Skyhook's own `runtimeRequiredTaint` management. Instead:
- `node-validate` sets `aicr.nvidia.com/recipe-compliant=true|false` on each node
- GPU workloads gate on this label via `nodeSelector: {aicr.nvidia.com/recipe-compliant: "true"}`
- Or via Kueue ResourceFlavors that target compliant nodes only
- Skyhook's workload gate taint operates independently — `node-validate` tolerates it (runs before taint removal) and provides a complementary compliance signal
Autoscaler integration flow:
Autoscaler adds node → GPU Operator installs drivers → Skyhook applies gate taint
→ node-validate DaemonSet pod starts → evaluates recipe constraints
→ labels node with aicr.nvidia.com/recipe-compliant
→ if compliant: workloads with matching nodeSelector can schedule
→ if non-compliant: label blocks workloads → logs (slog) and metrics surface remediation guidance
Extension points used:
- `pkg/collector/` — local node data collection (GPU, OS, SystemD, K8s, topology)
- `pkg/constraints/evaluate.go` — constraint evaluation (same as `aicr diff` and `validator.checkReadiness`)
- `pkg/snapshotter/agent.go` — agent mode execution pattern
- `pkg/k8s/client/` — node patching via the K8s API
This does NOT replace KAI Scheduler or Kueue for workload scheduling. It ensures nodes are recipe-compliant before any scheduler places work on them.
Feature 3: Fleet-Wide Observability
What
Prometheus metrics for recipe compliance and drift detection, enabling cross-cluster monitoring via federation.
Why
AICR supports EKS, AKS, GKE, and self-managed clusters, but each cluster is validated independently. Organizations running GPU workloads across multiple clusters have no unified view of recipe compliance, drift status, or validation health. Kueue's MultiKueue enables multi-cluster job dispatching, but there's no multi-cluster configuration health view.
DCGM provides per-node GPU telemetry. AICR's kube-prometheus-stack and nvsentinel components are per-cluster. The gap is fleet-level aggregation of configuration health alongside GPU operational metrics.
How
Drift detection metrics (recorded by `aicr diff` and `aicr node-validate`):
- `aicr_recipe_constraint_status{name, severity}` — per-constraint gauge (1=pass, 0=fail, -1=error)
- `aicr_component_drift_status{component, namespace}` — per-component gauge (1=ok, 0=mismatch, -1=not-observed)
- `aicr_drift_check_total{mode, result}` — counter of drift check runs
- `aicr_drift_last_check_timestamp_seconds` — unix timestamp of the last check
- `aicr_drift_check_duration_seconds` — histogram of check durations
- `aicr_drift_constraints_passed` / `aicr_drift_constraints_failed` — summary gauges
- `aicr_drift_components_ok` / `aicr_drift_components_drifted` — summary gauges
Per-node validation metrics (recorded by `aicr node-validate`):
- `aicr_node_compliant{node}` — 1=compliant, 0=non-compliant
- `aicr_node_constraints_passed{node}` / `aicr_node_constraints_failed{node}` — per-node counts
- `aicr_node_validation_duration_seconds{node}` — histogram
- `aicr_node_validation_total{node, result}` — counter (compliant/non-compliant/error)
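For reference, this is roughly what a scrape of the proposed `/metrics` endpoint would return for the headline per-node gauge, rendered in the Prometheus text exposition format. The `gaugeLine` helper is an illustrative stdlib-only sketch; the real implementation would use `promauto` from client_golang, as elsewhere in AICR:

```go
package main

import "fmt"

// gaugeLine renders one sample in the Prometheus text exposition format:
// name{label="value",...} value
func gaugeLine(name string, labels map[string]string, value float64) string {
	out := name
	if len(labels) > 0 {
		out += "{"
		first := true
		for k, v := range labels {
			if !first {
				out += ","
			}
			out += fmt.Sprintf("%s=%q", k, v)
			first = false
		}
		out += "}"
	}
	return fmt.Sprintf("%s %g", out, value)
}

func main() {
	// A compliant node reports 1; the node name here is illustrative.
	fmt.Println(gaugeLine("aicr_node_compliant", map[string]string{"node": "gpu-node-1"}, 1))
}
```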
Integration:
- Uses the same `promauto` pattern and `aicr_` prefix as existing AICR metrics (11 metrics in `pkg/server/metrics.go`, `pkg/snapshotter/metrics.go`, `pkg/recipe/metrics.go`)
- `--metrics-port` on `node-validate` exposes a `/metrics` endpoint for Prometheus scraping
- ServiceMonitor included for auto-discovery by `kube-prometheus-stack` (configured with `serviceMonitorSelectorNilUsesHelmValues: false`)
- Grafana dashboard provided for fleet compliance visualization
- Cross-cluster federation: add `external_labels: {cluster: "..."}` in each cluster's Prometheus config; query across clusters via Thanos or Prometheus federation
Example fleet-wide queries:
- `count(aicr_node_compliant == 0)` — non-compliant nodes across the fleet
- `aicr_drift_constraints_failed > 0` — alert on constraint violations
- `rate(aicr_drift_check_total{result="drift"}[1h])` — drift frequency
Implementation Plan
| Phase | Scope | Builds On |
|---|---|---|
| 1a | `aicr diff` CLI command | `pkg/constraints/`, `pkg/snapshotter/` |
| 1b | Drift detection controller + CRD (future work) | Phase 1a + `pkg/k8s/agent/` |
| 2 | `aicr node-validate` per-node compliance | `pkg/collector/`, `pkg/constraints/`, `pkg/k8s/client/` |
| 3 | Fleet observability Prometheus metrics | Phases 1-2 + `kube-prometheus-stack` |
Each phase is independently useful and shippable.
What This RFC Intentionally Does NOT Propose
- Workload scheduling or placement — KAI Scheduler handles this (topology-aware scheduling, gang scheduling, fair-sharing)
- GPU virtualization or fractions — Run:ai / DRA / HAMi cover this
- Multi-cloud job dispatching — SkyPilot and Kueue MultiKueue handle this
- Cost optimization or FinOps — out of scope; this is configuration health, not billing
- Kubernetes admission webhooks — AICR is a recipe/bundle system, not an operator; per-node validation via DaemonSet fits AICR's agent-based architecture
- Direct taint management — node-validate uses labels for compliance signaling; taint lifecycle is managed by Skyhook to avoid conflicts
Open Questions
- For the drift controller (Phase 1b), should drift reports be stored as CRDs (queryable via kubectl) or ConfigMaps (simpler)?
- Should `node-validate` emit Kubernetes Events for constraint failures, in addition to labels and metrics?
References
- AICR ROADMAP.md — P2 Drift Detection
- Meta LLaMA 3.1 training failures — 419 interruptions over 54 days
- OpenAI Kubernetes outage — December 2024
- GPU cluster performance: 30-55% of theoretical (PEARC 2025)
- NVIDIA DRA Driver donated to CNCF — KubeCon Europe 2026
- NVIDIA KAI Scheduler
- Kueue Topology-Aware Scheduling
- AICR Design Principles