RFC: Dynamic GPU Orchestration Layer — From Day-Zero Recipes to Day-N Runtime Intelligence #464

@sanjeevrg89

Summary

AICR's Recipe → Bundle → Deploy → Validate workflow solves day-zero cluster configuration. This RFC proposes extending AICR into day-N operations with three capabilities that build directly on the existing architecture: continuous drift detection, per-node recipe compliance validation, and fleet-wide observability.

These are features I intend to build as a contributor. Feedback on approach and scope is welcome.


Upstream Context

This proposal builds on existing upstream work and roadmap items:

  • ROADMAP.md P2 — Configuration Drift Detection: Listed as backlog (aicr diff, scheduled CronJob, alerting). This RFC provides a concrete design.
  • #448 — Inference performance validation with AIPerf: Continuous observability (Feature 3 below) would extend point-in-time AIPerf benchmarks into ongoing regression detection.
  • #442 — AKS NCCL performance runtime: Drift detection on AKS depends on complete platform runtime coverage.
  • NVIDIA DRA donation to CNCF (KubeCon Europe 2026): DRA changes how GPUs are allocated in K8s. Drift detection needs to account for DRA-based ResourceClaims alongside traditional device plugin resources.
  • KAI Scheduler: NVIDIA's open-source GPU scheduler already handles workload placement, gang scheduling, topology-aware scheduling, and fair-sharing. This RFC intentionally does not propose workload scheduling — KAI covers that. Instead, these features complement KAI by ensuring the underlying cluster stays correctly configured.

Feature 1: Continuous Drift Detection (aicr diff + Controller)

What

A Kubernetes controller that periodically captures snapshots and validates them against the deployed recipe, reporting and optionally remediating drift.

Why

AICR validates at deploy time. But clusters drift: operators get upgraded out-of-band, kernel parameters change after node replacements, Helm values diverge from recipe-specified state. Meta's LLaMA 3.1 training on 16,384 H100s experienced 419 unexpected failures over 54 days — about half GPU/HBM3 related. OpenAI's December 2024 outage was caused by a single telemetry deployment change cascading through Kubernetes. Configuration drift in GPU clusters causes silent performance degradation and expensive failures.

AICR already has all the building blocks — the snapshot collectors (pkg/collector/), constraint evaluation (pkg/constraints/), and four-phase validation (pkg/validator/). A drift detection controller is a reconciliation loop over these existing primitives.

How

Phase 1a — aicr diff CLI command:

  • Two modes: recipe-vs-snapshot (evaluate constraints) and snapshot-vs-snapshot (field-level comparison)
  • Recipe mode evaluates top-level constraints, component drift (via K8s.image subtype), and per-phase validation constraints
  • Uses existing constraints.Evaluate() — same code path as validator.checkReadiness()
  • Reports severity (error/warning) and remediation guidance from recipe
  • --fail-on-drift for CI/CD integration (non-zero exit on drift)
  • Output formats: YAML, JSON, table (custom human-readable format)
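
The snapshot-vs-snapshot mode is at heart a field-level comparison. A minimal Go sketch of the idea — the type names and flattened field paths below are illustrative, not AICR's actual pkg/constraints types:

```go
package main

import "fmt"

// Drift records one field whose value differs between two snapshots.
type Drift struct {
	Field    string
	Expected string
	Observed string
}

// diffSnapshots compares two flattened snapshots (field path -> value)
// and returns every field that drifted or disappeared.
func diffSnapshots(baseline, current map[string]string) []Drift {
	var drifts []Drift
	for field, want := range baseline {
		got, ok := current[field]
		if !ok {
			drifts = append(drifts, Drift{field, want, "<missing>"})
			continue
		}
		if got != want {
			drifts = append(drifts, Drift{field, want, got})
		}
	}
	return drifts
}

func main() {
	baseline := map[string]string{
		"gpu.driver.version":            "570.86.15",
		"os.kernel.param.vm.swappiness": "0",
	}
	current := map[string]string{
		"gpu.driver.version":            "560.35.03", // upgraded out-of-band
		"os.kernel.param.vm.swappiness": "0",
	}
	for _, d := range diffSnapshots(baseline, current) {
		fmt.Printf("%s: expected %s, observed %s\n", d.Field, d.Expected, d.Observed)
	}
}
```

Recipe mode would replace the string equality with constraints.Evaluate() over the same snapshot data.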

Phase 1b — AICRDriftWatch CRD + controller (future work):

  • CRD references a recipe and a schedule (cron expression)
  • Controller runs aicr snapshot followed by aicr diff --recipe on schedule
  • Stores drift reports as Kubernetes Events and optional ConfigMap
  • Integrates with Prometheus AlertManager for alerting
  • Starts conservative: alert-only, no auto-remediation in v1
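
As a sketch of what an AICRDriftWatch instance might look like — the schema below is illustrative and none of the field names are final:

```yaml
apiVersion: aicr.nvidia.com/v1alpha1
kind: AICRDriftWatch
metadata:
  name: training-cluster-drift
spec:
  recipeRef:
    name: dgx-h100-training    # recipe the cluster was deployed from
  schedule: "*/30 * * * *"     # run aicr snapshot + aicr diff every 30 minutes
  report:
    events: true               # emit Kubernetes Events on drift
    configMap: aicr-drift-report
  alerting:
    severityThreshold: warning # forward warning-and-above findings to AlertManager
```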

Extension points used:

  • pkg/snapshotter/agent.go — Agent deployment for periodic snapshots
  • pkg/constraints/evaluate.go — Constraint evaluation against snapshots
  • pkg/validator/phases.go — Phase-based validation execution
  • pkg/k8s/agent/deployer.go — RBAC and Job patterns for controller

Feature 2: Per-Node Recipe Compliance (aicr node-validate)

What

A command that validates individual GPU nodes against recipe constraints, labeling nodes with compliance status to gate workload scheduling. Designed to run as a DaemonSet alongside Skyhook, ensuring autoscaled nodes are recipe-compliant before accepting GPU workloads.

Why

When Cluster Autoscaler or Karpenter adds GPU nodes, there's no guarantee those nodes match the recipe's required driver version, GPU Operator config, or kernel parameters. Research shows GPU clusters achieve only 30-55% of theoretical performance, partly because scaled-out nodes often aren't configured identically to the original validated set.

AICR already validates node-level configuration via snapshot constraints. The gap is that this validation happens after deployment as a one-time check, not continuously as nodes join and change.

Why not a webhook?

AICR is a recipe-to-bundle translation system (aicrd is a stateless HTTP API with /v1/recipe, /v1/query, /v1/bundle endpoints). It has no webhook infrastructure, no TLS certificate management, and no admission controller framework. Adding a Kubernetes validating webhook would require entirely new infrastructure that doesn't fit AICR's architecture.

Instead, this uses the same pattern AICR already follows: agent-based validation. The snapshot agent (pkg/snapshotter/agent.go) already runs as a Kubernetes Job on GPU nodes. node-validate extends this pattern to continuous per-node compliance checking.

How

aicr node-validate command:

  • Runs on a GPU node (designed for DaemonSet or init container execution)
  • Captures a local node snapshot using existing collectors (pkg/collector/ — K8s, GPU, OS, SystemD, topology)
  • Evaluates recipe readiness constraints against the node snapshot (reuses constraints.Evaluate())
  • Labels nodes with aicr.nvidia.com/recipe-compliant=true|false via K8s API (strategic merge patch)
  • --interval flag for continuous loop mode (DaemonSet); --fail-on-drift for one-shot mode (init container)
  • --metrics-port exposes Prometheus /metrics endpoint in loop mode
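
The label update is a single strategic merge patch against the Node object. A sketch of how node-validate might build the patch body — the helper is hypothetical; only the label key comes from this RFC:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// compliancePatch builds the strategic-merge-patch body that node-validate
// would PATCH onto the Node object via the K8s API.
func compliancePatch(compliant bool) ([]byte, error) {
	patch := map[string]any{
		"metadata": map[string]any{
			"labels": map[string]string{
				"aicr.nvidia.com/recipe-compliant": fmt.Sprintf("%t", compliant),
			},
		},
	}
	return json.Marshal(patch)
}

func main() {
	body, err := compliancePatch(true)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body))
	// In the real command this body would be sent with the
	// strategic-merge-patch content type through pkg/k8s/client/.
}
```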

Workload gating via labels (not taints):

node-validate uses node labels rather than direct taint management. This is a deliberate design choice — direct taint manipulation from node-validate would conflict with Skyhook's own runtimeRequiredTaint management. Instead:

  • node-validate sets aicr.nvidia.com/recipe-compliant=true|false on each node
  • GPU workloads gate on this label via nodeSelector:
    nodeSelector:
      aicr.nvidia.com/recipe-compliant: "true"
  • Or via Kueue ResourceFlavors that target compliant nodes only
  • Skyhook's workload gate taint operates independently — node-validate tolerates it (runs before taint removal) and provides a complementary compliance signal
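
For the Kueue path, a ResourceFlavor that admits only compliant nodes might look like this — a sketch using Kueue's v1beta1 API, with an illustrative flavor name:

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: gpu-recipe-compliant
spec:
  nodeLabels:
    aicr.nvidia.com/recipe-compliant: "true"
```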

Autoscaler integration flow:

Autoscaler adds node → GPU Operator installs drivers → Skyhook applies gate taint
  → node-validate DaemonSet pod starts → evaluates recipe constraints
  → labels node with aicr.nvidia.com/recipe-compliant
  → if compliant: workloads with matching nodeSelector can schedule
  → if non-compliant: label keeps workloads off the node → logs and metrics surface remediation guidance

Extension points used:

  • pkg/collector/ — local node data collection (GPU, OS, SystemD, K8s, topology)
  • pkg/constraints/evaluate.go — constraint evaluation (same as aicr diff and validator.checkReadiness)
  • pkg/snapshotter/agent.go — agent mode execution pattern
  • pkg/k8s/client/ — node patching via K8s API

This does NOT replace KAI Scheduler or Kueue for workload scheduling. It ensures nodes are recipe-compliant before any scheduler places work on them.


Feature 3: Fleet-Wide Observability

What

Prometheus metrics for recipe compliance and drift detection, enabling cross-cluster monitoring via federation.

Why

AICR supports EKS, AKS, GKE, and self-managed clusters, but each cluster is validated independently. Organizations running GPU workloads across multiple clusters have no unified view of recipe compliance, drift status, or validation health. Kueue's MultiKueue enables multi-cluster job dispatching, but there's no multi-cluster configuration health view.

DCGM provides per-node GPU telemetry. AICR's kube-prometheus-stack and nvsentinel components are per-cluster. The gap is fleet-level aggregation of configuration health alongside GPU operational metrics.

How

Drift detection metrics (recorded by aicr diff and aicr node-validate):

  • aicr_recipe_constraint_status{name, severity} — per-constraint gauge (1=pass, 0=fail, -1=error)
  • aicr_component_drift_status{component, namespace} — per-component gauge (1=ok, 0=mismatch, -1=not-observed)
  • aicr_drift_check_total{mode, result} — counter of drift check runs
  • aicr_drift_last_check_timestamp_seconds — unix timestamp of last check
  • aicr_drift_check_duration_seconds — histogram of check durations
  • aicr_drift_constraints_passed / aicr_drift_constraints_failed — summary gauges
  • aicr_drift_components_ok / aicr_drift_components_drifted — summary gauges

Per-node validation metrics (recorded by aicr node-validate):

  • aicr_node_compliant{node} — 1=compliant, 0=non-compliant
  • aicr_node_constraints_passed{node} / aicr_node_constraints_failed{node} — per-node counts
  • aicr_node_validation_duration_seconds{node} — histogram
  • aicr_node_validation_total{node, result} — counter (compliant/non-compliant/error)

Integration:

  • Uses same promauto pattern and aicr_ prefix as existing AICR metrics (11 metrics in pkg/server/metrics.go, pkg/snapshotter/metrics.go, pkg/recipe/metrics.go)
  • --metrics-port on node-validate exposes /metrics endpoint for Prometheus scraping
  • ServiceMonitor included for auto-discovery by kube-prometheus-stack (configured with serviceMonitorSelectorNilUsesHelmValues: false)
  • Grafana dashboard provided for fleet compliance visualization
  • Cross-cluster federation: add external_labels: {cluster: "..."} in each cluster's Prometheus config; query across clusters via Thanos or Prometheus federation
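
The federation bullet amounts to a one-line addition to each cluster's Prometheus configuration, for example (cluster name is illustrative):

```yaml
# Per-cluster Prometheus config fragment: tag every series with the
# cluster name so fleet-level queries can group and filter by it.
global:
  external_labels:
    cluster: prod-eks-us-west-2
```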

Example fleet-wide queries:

  • count(aicr_node_compliant == 0) — non-compliant nodes across fleet
  • aicr_drift_constraints_failed > 0 — alert on constraint violations
  • rate(aicr_drift_check_total{result="drift"}[1h]) — drift frequency
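
These queries translate directly into alert rules. A sketch of a PrometheusRule for kube-prometheus-stack — rule names and thresholds are illustrative:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: aicr-drift-alerts
spec:
  groups:
    - name: aicr-compliance
      rules:
        - alert: AICRNodeNonCompliant
          expr: aicr_node_compliant == 0
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "Node {{ $labels.node }} is not recipe-compliant"
        - alert: AICRDriftCheckStale
          expr: time() - aicr_drift_last_check_timestamp_seconds > 3600
          labels:
            severity: warning
          annotations:
            summary: "No drift check has completed in the last hour"
```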

Implementation Plan

| Phase | Scope | Builds On |
| --- | --- | --- |
| 1a | aicr diff CLI command | pkg/constraints/, pkg/snapshotter/ |
| 1b | Drift detection controller + CRD (future work) | Phase 1a + pkg/k8s/agent/ |
| 2 | aicr node-validate per-node compliance | pkg/collector/, pkg/constraints/, pkg/k8s/client/ |
| 3 | Fleet observability Prometheus metrics | Phases 1-2 + kube-prometheus-stack |

Each phase is independently useful and shippable.


What This RFC Intentionally Does NOT Propose

  • Workload scheduling or placement — KAI Scheduler handles this (topology-aware scheduling, gang scheduling, fair-sharing)
  • GPU virtualization or fractions — Run:ai / DRA / HAMi cover this
  • Multi-cloud job dispatching — SkyPilot and Kueue MultiKueue handle this
  • Cost optimization or FinOps — out of scope; this is configuration health, not billing
  • Kubernetes admission webhooks — AICR is a recipe/bundle system, not an operator; per-node validation via DaemonSet fits AICR's agent-based architecture
  • Direct taint management — node-validate uses labels for compliance signaling; taint lifecycle is managed by Skyhook to avoid conflicts

Open Questions

  1. For the drift controller (Phase 1b), should drift reports be stored as CRDs (queryable via kubectl) or ConfigMaps (simpler)?
  2. Should node-validate emit Kubernetes Events for constraint failures in addition to labels and metrics?
