RFC: Dynamic GPU Orchestration Layer — From Day-Zero Recipes to Day-N Runtime Intelligence #464

@sanjeevrg89

Summary

AICR's Recipe → Bundle → Deploy → Validate workflow solves day-zero cluster configuration. This RFC proposes extending AICR into day-N operations with three capabilities that build directly on the existing architecture: continuous drift detection, per-node recipe compliance validation, and fleet-wide observability.

These are features I intend to build as a contributor. Feedback on approach and scope is welcome.


Upstream Context

This proposal builds on existing upstream work and roadmap items:

  • ROADMAP.md P2 — Configuration Drift Detection: Listed as backlog (aicr diff, scheduled CronJob, alerting). This RFC provides a concrete design.
  • #448 — Inference performance validation with AIPerf: Continuous observability (Feature 3 below) would extend point-in-time AIPerf benchmarks into ongoing regression detection.
  • #442 — AKS NCCL performance runtime: Drift detection on AKS depends on complete platform runtime coverage.
  • NVIDIA DRA donation to CNCF (KubeCon Europe 2026): DRA changes how GPUs are allocated in K8s. Drift detection needs to account for DRA-based ResourceClaims alongside traditional device plugin resources.
  • KAI Scheduler: NVIDIA's open-source GPU scheduler already handles workload placement, gang scheduling, topology-aware scheduling, and fair-sharing. This RFC intentionally does not propose workload scheduling — KAI covers that. Instead, these features complement KAI by ensuring the underlying cluster stays correctly configured.

Feature 1: Continuous Drift Detection (aicr diff + Controller)

What

A Kubernetes controller that periodically captures snapshots and validates them against the deployed recipe, reporting and optionally remediating drift.

Why

AICR validates at deploy time. But clusters drift: operators get upgraded out-of-band, kernel parameters change after node replacements, Helm values diverge from recipe-specified state. Meta's LLaMA 3.1 training on 16,384 H100s experienced 419 unexpected failures over 54 days — about half GPU/HBM3 related. OpenAI's December 2024 outage was caused by a single telemetry deployment change cascading through Kubernetes. Configuration drift in GPU clusters causes silent performance degradation and expensive failures.

AICR already has all the building blocks — the snapshot collectors (pkg/collector/), constraint evaluation (pkg/constraints/), and four-phase validation (pkg/validator/). A drift detection controller is a reconciliation loop over these existing primitives.

How

Phase 1a — aicr diff CLI command:

  • Two modes: recipe-vs-snapshot (evaluate constraints) and snapshot-vs-snapshot (field-level comparison)
  • Recipe mode evaluates top-level constraints, component drift (via K8s.image subtype), and per-phase validation constraints
  • Uses existing constraints.Evaluate() — same code path as validator.checkReadiness()
  • Reports severity (error/warning) and remediation guidance from recipe
  • --fail-on-drift for CI/CD integration (non-zero exit on drift)
  • Output formats: YAML, JSON, table (custom human-readable format)
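
The snapshot-vs-snapshot mode is at heart a field-level comparison. A minimal Go sketch of the idea — the type names and flattened field paths below are illustrative, not AICR's actual pkg/constraints types:

```go
package main

import "fmt"

// Drift records one field whose value differs between two snapshots.
type Drift struct {
	Field    string
	Expected string
	Observed string
}

// diffSnapshots compares two flattened snapshots (field path -> value)
// and returns every field that drifted or disappeared.
func diffSnapshots(baseline, current map[string]string) []Drift {
	var drifts []Drift
	for field, want := range baseline {
		got, ok := current[field]
		if !ok {
			drifts = append(drifts, Drift{field, want, "<missing>"})
			continue
		}
		if got != want {
			drifts = append(drifts, Drift{field, want, got})
		}
	}
	return drifts
}

func main() {
	baseline := map[string]string{
		"gpu.driver.version":            "570.86.15",
		"os.kernel.param.vm.swappiness": "0",
	}
	current := map[string]string{
		"gpu.driver.version":            "560.35.03", // upgraded out-of-band
		"os.kernel.param.vm.swappiness": "0",
	}
	for _, d := range diffSnapshots(baseline, current) {
		fmt.Printf("%s: expected %s, observed %s\n", d.Field, d.Expected, d.Observed)
	}
}
```

Recipe mode would replace the string equality with constraints.Evaluate() over the same snapshot data.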

Phase 1b — AICRDriftWatch CRD + controller (future work):

  • CRD references a recipe and a schedule (cron expression)
  • Controller runs aicr snapshot followed by aicr diff --recipe on schedule
  • Stores drift reports as Kubernetes Events and optional ConfigMap
  • Integrates with Prometheus AlertManager for alerting
  • Starts conservative: alert-only, no auto-remediation in v1
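
As a sketch of what an AICRDriftWatch instance might look like — the schema below is illustrative and none of the field names are final:

```yaml
apiVersion: aicr.nvidia.com/v1alpha1
kind: AICRDriftWatch
metadata:
  name: training-cluster-drift
spec:
  recipeRef:
    name: dgx-h100-training    # recipe the cluster was deployed from
  schedule: "*/30 * * * *"     # run aicr snapshot + aicr diff every 30 minutes
  report:
    events: true               # emit Kubernetes Events on drift
    configMap: aicr-drift-report
  alerting:
    severityThreshold: warning # forward warning-and-above findings to AlertManager
```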

Extension points used:

  • pkg/snapshotter/agent.go — Agent deployment for periodic snapshots
  • pkg/constraints/evaluate.go — Constraint evaluation against snapshots
  • pkg/validator/phases.go — Phase-based validation execution
  • pkg/k8s/agent/deployer.go — RBAC and Job patterns for controller

Feature 2: Per-Node Recipe Compliance (aicr node-validate)

What

A command that validates individual GPU nodes against recipe constraints, labeling nodes with compliance status to gate workload scheduling. Designed to run as a DaemonSet alongside Skyhook, ensuring autoscaled nodes are recipe-compliant before accepting GPU workloads.

Why

When Cluster Autoscaler or Karpenter adds GPU nodes, there's no guarantee those nodes match the recipe's required driver version, GPU Operator config, or kernel parameters. Research shows GPU clusters achieve only 30-55% of theoretical performance, partly because scaled-out nodes often aren't configured identically to the original validated set.

AICR already validates node-level configuration via snapshot constraints. The gap is that this validation happens after deployment as a one-time check, not continuously as nodes join and change.

Why not a webhook?

AICR is a recipe-to-bundle translation system (aicrd is a stateless HTTP API with /v1/recipe, /v1/query, /v1/bundle endpoints). It has no webhook infrastructure, no TLS certificate management, and no admission controller framework. Adding a Kubernetes validating webhook would require entirely new infrastructure that doesn't fit AICR's architecture.

Instead, this uses the same pattern AICR already follows: agent-based validation. The snapshot agent (pkg/snapshotter/agent.go) already runs as a Kubernetes Job on GPU nodes. node-validate extends this pattern to continuous per-node compliance checking.

How

aicr node-validate command:

  • Runs on a GPU node (designed for DaemonSet or init container execution)
  • Captures a local node snapshot using existing collectors (pkg/collector/ — K8s, GPU, OS, SystemD, topology)
  • Evaluates recipe readiness constraints against the node snapshot (reuses constraints.Evaluate())
  • Labels nodes with aicr.nvidia.com/recipe-compliant=true|false via K8s API (strategic merge patch)
  • --interval flag for continuous loop mode (DaemonSet); --fail-on-drift for one-shot mode (init container)
  • --metrics-port exposes Prometheus /metrics endpoint in loop mode
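
The label update is a single strategic merge patch against the Node object. A sketch of how node-validate might build the patch body — the helper is hypothetical; only the label key comes from this RFC:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// compliancePatch builds the strategic-merge-patch body that node-validate
// would PATCH onto the Node object via the K8s API.
func compliancePatch(compliant bool) ([]byte, error) {
	patch := map[string]any{
		"metadata": map[string]any{
			"labels": map[string]string{
				"aicr.nvidia.com/recipe-compliant": fmt.Sprintf("%t", compliant),
			},
		},
	}
	return json.Marshal(patch)
}

func main() {
	body, err := compliancePatch(true)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body))
	// In the real command this body would be sent with the
	// strategic-merge-patch content type through pkg/k8s/client/.
}
```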

Workload gating via labels (not taints):

node-validate uses node labels rather than direct taint management. This is a deliberate design choice — direct taint manipulation from node-validate would conflict with Skyhook's own runtimeRequiredTaint management. Instead:

  • node-validate sets aicr.nvidia.com/recipe-compliant=true|false on each node
  • GPU workloads gate on this label via nodeSelector:
    nodeSelector:
      aicr.nvidia.com/recipe-compliant: "true"
  • Or via Kueue ResourceFlavors that target compliant nodes only
  • Skyhook's workload gate taint operates independently — node-validate tolerates it (runs before taint removal) and provides a complementary compliance signal
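
For the Kueue path, a ResourceFlavor that admits only compliant nodes might look like this — a sketch using Kueue's v1beta1 API, with an illustrative flavor name:

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: gpu-recipe-compliant
spec:
  nodeLabels:
    aicr.nvidia.com/recipe-compliant: "true"
```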

Autoscaler integration flow:

Autoscaler adds node → GPU Operator installs drivers → Skyhook applies gate taint
  → node-validate DaemonSet pod starts → evaluates recipe constraints
  → labels node with aicr.nvidia.com/recipe-compliant
  → if compliant: workloads with matching nodeSelector can schedule
  → if non-compliant: label keeps workloads off the node → logs and metrics surface remediation guidance

Extension points used:

  • pkg/collector/ — local node data collection (GPU, OS, SystemD, K8s, topology)
  • pkg/constraints/evaluate.go — constraint evaluation (same as aicr diff and validator.checkReadiness)
  • pkg/snapshotter/agent.go — agent mode execution pattern
  • pkg/k8s/client/ — node patching via K8s API

This does NOT replace KAI Scheduler or Kueue for workload scheduling. It ensures nodes are recipe-compliant before any scheduler places work on them.


Feature 3: Fleet-Wide Observability

What

Prometheus metrics for recipe compliance and drift detection, enabling cross-cluster monitoring via federation.

Why

AICR supports EKS, AKS, GKE, and self-managed clusters, but each cluster is validated independently. Organizations running GPU workloads across multiple clusters have no unified view of recipe compliance, drift status, or validation health. Kueue's MultiKueue enables multi-cluster job dispatching, but there's no multi-cluster configuration health view.

DCGM provides per-node GPU telemetry. AICR's kube-prometheus-stack and nvsentinel components are per-cluster. The gap is fleet-level aggregation of configuration health alongside GPU operational metrics.

How

Drift detection metrics (recorded by aicr diff and aicr node-validate):

  • aicr_recipe_constraint_status{name, severity} — per-constraint gauge (1=pass, 0=fail, -1=error)
  • aicr_component_drift_status{component, namespace} — per-component gauge (1=ok, 0=mismatch, -1=not-observed)
  • aicr_drift_check_total{mode, result} — counter of drift check runs
  • aicr_drift_last_check_timestamp_seconds — unix timestamp of last check
  • aicr_drift_check_duration_seconds — histogram of check durations
  • aicr_drift_constraints_passed / aicr_drift_constraints_failed — summary gauges
  • aicr_drift_components_ok / aicr_drift_components_drifted — summary gauges

Per-node validation metrics (recorded by aicr node-validate):

  • aicr_node_compliant{node} — 1=compliant, 0=non-compliant
  • aicr_node_constraints_passed{node} / aicr_node_constraints_failed{node} — per-node counts
  • aicr_node_validation_duration_seconds{node} — histogram
  • aicr_node_validation_total{node, result} — counter (compliant/non-compliant/error)

Integration:

  • Uses same promauto pattern and aicr_ prefix as existing AICR metrics (11 metrics in pkg/server/metrics.go, pkg/snapshotter/metrics.go, pkg/recipe/metrics.go)
  • --metrics-port on node-validate exposes /metrics endpoint for Prometheus scraping
  • ServiceMonitor included for auto-discovery by kube-prometheus-stack (configured with serviceMonitorSelectorNilUsesHelmValues: false)
  • Grafana dashboard provided for fleet compliance visualization
  • Cross-cluster federation: add external_labels: {cluster: "..."} in each cluster's Prometheus config; query across clusters via Thanos or Prometheus federation
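
The federation bullet amounts to a one-line addition to each cluster's Prometheus configuration, for example (cluster name is illustrative):

```yaml
# Per-cluster Prometheus config fragment: tag every series with the
# cluster name so fleet-level queries can group and filter by it.
global:
  external_labels:
    cluster: prod-eks-us-west-2
```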

Example fleet-wide queries:

  • count(aicr_node_compliant == 0) — non-compliant nodes across fleet
  • aicr_drift_constraints_failed > 0 — alert on constraint violations
  • rate(aicr_drift_check_total{result="drift"}[1h]) — drift frequency
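
These queries translate directly into alert rules. A sketch of a PrometheusRule for kube-prometheus-stack — rule names and thresholds are illustrative:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: aicr-drift-alerts
spec:
  groups:
    - name: aicr-compliance
      rules:
        - alert: AICRNodeNonCompliant
          expr: aicr_node_compliant == 0
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "Node {{ $labels.node }} is not recipe-compliant"
        - alert: AICRDriftCheckStale
          expr: time() - aicr_drift_last_check_timestamp_seconds > 3600
          labels:
            severity: warning
          annotations:
            summary: "No drift check has completed in the last hour"
```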

Implementation Plan

| Phase | Scope | Builds On |
| --- | --- | --- |
| 1a | aicr diff CLI command | pkg/constraints/, pkg/snapshotter/ |
| 1b | Drift detection controller + CRD (future work) | Phase 1a + pkg/k8s/agent/ |
| 2 | aicr node-validate per-node compliance | pkg/collector/, pkg/constraints/, pkg/k8s/client/ |
| 3 | Fleet observability Prometheus metrics | Phases 1-2 + kube-prometheus-stack |

Each phase is independently useful and shippable.


What This RFC Intentionally Does NOT Propose

  • Workload scheduling or placement — KAI Scheduler handles this (topology-aware scheduling, gang scheduling, fair-sharing)
  • GPU virtualization or fractions — Run:ai / DRA / HAMi cover this
  • Multi-cloud job dispatching — SkyPilot and Kueue MultiKueue handle this
  • Cost optimization or FinOps — out of scope; this is configuration health, not billing
  • Kubernetes admission webhooks — AICR is a recipe/bundle system, not an operator; per-node validation via DaemonSet fits AICR's agent-based architecture
  • Direct taint management — node-validate uses labels for compliance signaling; taint lifecycle is managed by Skyhook to avoid conflicts

Open Questions

  1. For the drift controller (Phase 1b), should drift reports be stored as CRDs (queryable via kubectl) or ConfigMaps (simpler)?
  2. Should node-validate emit Kubernetes Events for constraint failures in addition to labels and metrics?
