GPU partitioning orchestration #334

@salexo

Description

GPU Partitioning: Kubernetes-native Orchestration for AMD GPUs

Current AMD GPU partitioning docs:

Problem & Outcome

AMD’s Device Config Manager (DCM) can’t partition GPUs in-place. Today, partitioning requires multiple manual kubectl commands and active monitoring of node status. To encourage the use of partitioning, and to later support dynamic partitioning, we need a Kubernetes-native controller stack that:

  • lets users declare desired partitioning (profiles + plans),
  • cordons/taints/drains nodes safely,
  • applies DCM profiles,
  • verifies readiness and resources,
  • and reflects observable status at plan and node levels.

Scope

  • State machine: Pending → Draining → Applying → WaitingOperator → Verifying → Succeeded
  • Cordon/uncordon
  • Apply/remove taint amd-dcm=up:NoExecute
  • Basic drain via Eviction API (respecting PDBs by default)
  • Apply DCM profile (node label + ConfigMap update if needed)
  • Basic verification (ready label + allocatable resources)
  • Update status + conditions + history
  • Enforce maxParallel / maxUnavailable to limit the number of unavailable nodes
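The phase sequence above can be sketched as a simple transition table. This is an illustrative sketch, not controller code; the `Failed` phase and the transition-validation helper are assumptions not spelled out in the issue.

```python
# Sketch of the NodePartitioning phase machine described in Scope.
# Phase names come from the issue; Failed handling is an assumption.

PHASES = ["Pending", "Draining", "Applying", "WaitingOperator",
          "Verifying", "Succeeded"]

# Each phase may advance to the next; any non-terminal phase may fail.
ALLOWED = {p: {n} for p, n in zip(PHASES, PHASES[1:])}
ALLOWED["Succeeded"] = set()
for p in PHASES[:-1]:
    ALLOWED[p].add("Failed")

def transition(current: str, target: str) -> str:
    """Validate and perform a phase transition; reject illegal jumps."""
    if target not in ALLOWED.get(current, set()):
        raise ValueError(f"illegal transition {current} -> {target}")
    return target
```

Modeling the transitions explicitly lets the Node Controller reject out-of-order updates (e.g., a stale reconcile trying to jump from Pending straight to Succeeded).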

Architecture

CRDs

Add a subgroup for infrastructure-related CRDs, infrastructure.silogen.ai, to keep separation of concerns clear.

  1. PartitioningProfile (cluster-scoped): reusable GPU partition “recipes” (mapping to DCM ConfigMaps)
  2. PartitioningPlan (cluster-scoped): maps selectors → profile; owns NodePartitioning
  3. NodePartitioning (cluster-scoped): per-node work item/state machine

Controllers

  • Plan Controller: resolves node sets, owns NodePartitioning, aggregates status
  • Node Controller: executes node ops (cordon/taint/drain → DCM apply → verify → uncordon)
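A minimal sketch of how the Plan Controller might gate rollout under maxParallel / maxUnavailable when picking the next NodePartitionings to start. The in-memory item shape and field names here are illustrative assumptions, not the CRD schema:

```python
def pick_next_nodes(items, max_parallel, max_unavailable):
    """Select which Pending NodePartitioning items may start this reconcile.

    items: list of dicts with 'name', 'phase', and 'unavailable'
    (cordoned/draining nodes count as unavailable) -- assumed shapes.
    """
    in_progress = [i for i in items
                   if i["phase"] not in ("Pending", "Succeeded", "Failed")]
    unavailable = sum(1 for i in items if i["unavailable"])
    # Budget is bounded by both limits; whichever is tighter wins.
    budget = min(max_parallel - len(in_progress),
                 max_unavailable - unavailable)
    if budget <= 0:
        return []
    pending = [i["name"] for i in items if i["phase"] == "Pending"]
    return pending[:budget]
```

Recomputing the budget on every reconcile (rather than tracking it statefully) keeps the controller level-triggered, in the usual Kubernetes style.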

CR Samples

PartitioningProfile for MI300X CPX+NPS4

apiVersion: infrastructure.silogen.ai/v1alpha1
kind: PartitioningProfile
metadata:
  name: mi300x-cpx-nps4
spec:
  displayName: "MI300X CPX+NPS4 (4 partitions, ~16GB VRAM each)"
  targetSelector:
    matchLabels:
      gpu.vendor: amd
      gpu.family: mi300x
  expectedResources:
    - name: "amd.com/gpu"
      count: 63
  verification:
    readyLabel: "amd.com/gpu.ready"
    timeoutSeconds: 600
  operatorPayload:
    kind: ConfigMap
    name: dcm-profile-cpx-nps4
    namespace: kube-amd-gpu

PartitioningProfile for MI300X SPX+NPS1 (no partitioning)

apiVersion: infrastructure.silogen.ai/v1alpha1
kind: PartitioningProfile
metadata:
  name: mi300x-spx-nps1
spec:
  displayName: "MI300X SPX+NPS1 (full GPUs, no partitioning)"
  targetSelector:
    matchLabels:
      gpu.vendor: amd
      gpu.family: mi300x
  expectedResources:
    - name: "amd.com/gpu"
      count: 8
  verification:
    readyLabel: "amd.com/gpu.ready"
    timeoutSeconds: 600
  operatorPayload:
    kind: ConfigMap
    name: dcm-profile-spx-nps1
    namespace: kube-amd-gpu
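The basic verification described in these profiles (ready label plus allocatable resources) could look roughly like the following. The node dict mirrors the fields a controller would read from the Node object; its shape, and the "true" label value, are assumptions:

```python
def verify_node(node, ready_label, expected_resources):
    """Check the GPU-ready label and that allocatable matches expectations.

    node: {'labels': {...}, 'allocatable': {'amd.com/gpu': 8, ...}}
    expected_resources: [{'name': 'amd.com/gpu', 'count': 8}]
    Returns (ok, reason) so the controller can surface the reason
    in a status condition.
    """
    if node["labels"].get(ready_label) != "true":
        return False, f"label {ready_label} not true"
    for res in expected_resources:
        have = node["allocatable"].get(res["name"], 0)
        if have != res["count"]:
            return False, f"{res['name']}: have {have}, want {res['count']}"
    return True, "ok"
```

The controller would poll this check until it passes or the profile's timeoutSeconds elapses, then record the result as a condition on the NodePartitioning.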

PartitioningPlan for training/inference pools

apiVersion: infrastructure.silogen.ai/v1alpha1
kind: PartitioningPlan
metadata:
  name: training-pool-partition
spec:
  paused: false
  dryRun: false

  rollout:
    maxParallel: 2
    maxUnavailable: 1
    drainPolicy:
      enabled: true
      timeoutSeconds: 1200
      evictionKind: Eviction
      respectPDB: true

  verification:
    gpuReadyLabel: "amd.com/gpu.ready"
    requirePluginResources:
      - "amd.com/gpu"
    timeoutSeconds: 900

  partitionings:
    - name: "training-mi300x-to-cpx-nps4"
      selector:
        matchLabels:
          nodepool: training-gpu
          gpu.vendor: amd
          gpu.family: mi300x
      exclude:
        matchLabels:
          maintenance: "true"
      profileRef:
        kind: PartitioningProfile
        name: mi300x-cpx-nps4

    - name: "inference-mi300x-keep-full"
      selector:
        matchLabels:
          nodepool: inference-gpu
          gpu.vendor: amd
          gpu.family: mi300x
      profileRef:
        kind: PartitioningProfile
        name: mi300x-spx-nps1

NodePartitioning: per-node work item (owned by the plan)

apiVersion: infrastructure.silogen.ai/v1alpha1
kind: NodePartitioning
metadata:
  name: training-pool-partition-gpu-node-02
  ownerReferences:
    - apiVersion: infrastructure.silogen.ai/v1alpha1
      kind: PartitioningPlan
      name: training-pool-partition
      uid: 12345678-1234-1234-1234-123456789abc
      controller: true
      blockOwnerDeletion: true
spec:
  planRef:
    name: training-pool-partition
    uid: 12345678-1234-1234-1234-123456789abc
  nodeName: gpu-node-02
  desiredHash: "sha256:a1b2c3d4e5f6..."  # profile + drain + verification
  profileRef:
    kind: PartitioningProfile
    name: mi300x-cpx-nps4
  drainPolicy:
    enabled: true
    timeoutSeconds: 1200
    evictionKind: Eviction
    respectPDB: true
  verification:
    gpuReadyLabel: "amd.com/gpu.ready"
    requirePluginResources:
      - "amd.com/gpu"
    timeoutSeconds: 900
status:
  phase: Pending
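The desiredHash noted in the sample (profile + drain + verification) could be computed over a canonical JSON encoding of exactly those fields. A sketch, assuming sorted-key compact JSON as the canonical form:

```python
import hashlib
import json

def desired_hash(profile_spec, drain_policy, verification):
    """sha256 over a canonical (sorted-key, compact) JSON encoding of
    the fields that define the desired node state. Any change to the
    profile, drain policy, or verification settings changes the hash,
    which is what triggers re-processing of the node."""
    payload = json.dumps(
        {"profile": profile_spec,
         "drain": drain_policy,
         "verification": verification},
        sort_keys=True, separators=(",", ":"),
    )
    return "sha256:" + hashlib.sha256(payload.encode()).hexdigest()
```

Canonicalizing before hashing matters: two semantically identical specs serialized with different key order must produce the same hash, or the controller would churn.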

Prerequisites & Assumptions

  • AMD GPU Operator + DCM installed in kube-amd-gpu
  • DCM ConfigMap exists/managed (controller may update it)
  • AMD device plugin DaemonSet running
  • System pods tolerate amd-dcm=up:NoExecute (handled by a one-time ops playbook; not automated in Phase 1)
  • Control plane nodes excluded via labels/selectors (no partitioning there)

Follow-ups (tracked in later issues)

  • Canary deployment sets
  • Change windows (e.g., apply only during the night)
  • DRA integration
