GPU partitioning orchestration #334

@salexo

Description

GPU Partitioning: Kubernetes-native Orchestration for AMD GPUs

Current AMD GPU partitioning docs:

Problem & Outcome

AMD’s Device Config Manager (DCM) can’t partition GPUs in-place. Today, partitioning requires multiple manual kubectl commands and active monitoring of node status. To encourage the use of partitioning, and to later support dynamic partitioning, we need a Kubernetes-native controller stack that:

  • lets users declare desired partitioning (profiles + plans),
  • cordons/taints/drains nodes safely,
  • applies DCM profiles,
  • verifies readiness and resources,
  • and reflects observable status at plan and node levels.

Scope

  • State machine: Pending → Draining → Applying → WaitingOperator → Verifying → Succeeded
  • Cordon/uncordon
  • Apply/remove taint amd-dcm=up:NoExecute
  • Basic drain via Eviction API (respecting PDBs by default)
  • Apply DCM profile (node label + ConfigMap update if needed)
  • Basic verification (ready label + allocatable resources)
  • Update status + conditions + history
  • Enforce maxParallel / maxUnavailable to limit the number of unavailable nodes
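The phase sequence above can be sketched as a simple transition table. This is an illustrative sketch, not controller code; the `Failed` phase and the transition-validation helper are assumptions not spelled out in the issue.

```python
# Sketch of the NodePartitioning phase machine described in Scope.
# Phase names come from the issue; Failed handling is an assumption.

PHASES = ["Pending", "Draining", "Applying", "WaitingOperator",
          "Verifying", "Succeeded"]

# Each phase may advance to the next; any non-terminal phase may fail.
ALLOWED = {p: {n} for p, n in zip(PHASES, PHASES[1:])}
ALLOWED["Succeeded"] = set()
for p in PHASES[:-1]:
    ALLOWED[p].add("Failed")

def transition(current: str, target: str) -> str:
    """Validate and perform a phase transition; reject illegal jumps."""
    if target not in ALLOWED.get(current, set()):
        raise ValueError(f"illegal transition {current} -> {target}")
    return target
```

Modeling the transitions explicitly lets the Node Controller reject out-of-order updates (e.g., a stale reconcile trying to jump from Pending straight to Succeeded).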

Architecture

CRDs

Add a subgroup for infrastructure-related CRDs, infrastructure.silogen.ai, to keep separation of concerns clear.

  1. PartitioningProfile (cluster-scoped): reusable GPU partition “recipes” (mapping to DCM ConfigMaps)
  2. PartitioningPlan (cluster-scoped): maps selectors → profile; owns NodePartitioning
  3. NodePartitioning (cluster-scoped): per-node work item/state machine

Controllers

  • Plan Controller: resolves node sets, owns NodePartitioning, aggregates status
  • Node Controller: executes node ops (cordon/taint/drain → DCM apply → verify → uncordon)
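A minimal sketch of how the Plan Controller might gate rollout under maxParallel / maxUnavailable when picking the next NodePartitionings to start. The in-memory item shape and field names here are illustrative assumptions, not the CRD schema:

```python
def pick_next_nodes(items, max_parallel, max_unavailable):
    """Select which Pending NodePartitioning items may start this reconcile.

    items: list of dicts with 'name', 'phase', and 'unavailable'
    (cordoned/draining nodes count as unavailable) -- assumed shapes.
    """
    in_progress = [i for i in items
                   if i["phase"] not in ("Pending", "Succeeded", "Failed")]
    unavailable = sum(1 for i in items if i["unavailable"])
    # Budget is bounded by both limits; whichever is tighter wins.
    budget = min(max_parallel - len(in_progress),
                 max_unavailable - unavailable)
    if budget <= 0:
        return []
    pending = [i["name"] for i in items if i["phase"] == "Pending"]
    return pending[:budget]
```

Recomputing the budget on every reconcile (rather than tracking it statefully) keeps the controller level-triggered, in the usual Kubernetes style.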

CR Samples

PartitioningProfile for MI300X CPX+NPS4

apiVersion: infrastructure.silogen.ai/v1alpha1
kind: PartitioningProfile
metadata:
  name: mi300x-cpx-nps4
spec:
  displayName: "MI300X CPX+NPS4 (4 partitions, ~16GB VRAM each)"
  targetSelector:
    matchLabels:
      gpu.vendor: amd
      gpu.family: mi300x
  expectedResources:
    - name: "amd.com/gpu"
      count: 63
  verification:
    readyLabel: "amd.com/gpu.ready"
    timeoutSeconds: 600
  operatorPayload:
    kind: ConfigMap
    name: dcm-profile-cpx-nps4
    namespace: kube-amd-gpu

PartitioningProfile for MI300X SPX+NPS1 (no partitioning)

apiVersion: infrastructure.silogen.ai/v1alpha1
kind: PartitioningProfile
metadata:
  name: mi300x-spx-nps1
spec:
  displayName: "MI300X SPX+NPS1 (full GPUs, no partitioning)"
  targetSelector:
    matchLabels:
      gpu.vendor: amd
      gpu.family: mi300x
  expectedResources:
    - name: "amd.com/gpu"
      count: 8
  verification:
    readyLabel: "amd.com/gpu.ready"
    timeoutSeconds: 600
  operatorPayload:
    kind: ConfigMap
    name: dcm-profile-spx-nps1
    namespace: kube-amd-gpu
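The basic verification described in these profiles (ready label plus allocatable resources) could look roughly like the following. The node dict mirrors the fields a controller would read from the Node object; its shape, and the "true" label value, are assumptions:

```python
def verify_node(node, ready_label, expected_resources):
    """Check the GPU-ready label and that allocatable matches expectations.

    node: {'labels': {...}, 'allocatable': {'amd.com/gpu': 8, ...}}
    expected_resources: [{'name': 'amd.com/gpu', 'count': 8}]
    Returns (ok, reason) so the controller can surface the reason
    in a status condition.
    """
    if node["labels"].get(ready_label) != "true":
        return False, f"label {ready_label} not true"
    for res in expected_resources:
        have = node["allocatable"].get(res["name"], 0)
        if have != res["count"]:
            return False, f"{res['name']}: have {have}, want {res['count']}"
    return True, "ok"
```

The controller would poll this check until it passes or the profile's timeoutSeconds elapses, then record the result as a condition on the NodePartitioning.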

PartitioningPlan for training/inference pools

apiVersion: infrastructure.silogen.ai/v1alpha1
kind: PartitioningPlan
metadata:
  name: training-pool-partition
spec:
  paused: false
  dryRun: false

  rollout:
    maxParallel: 2
    maxUnavailable: 1
    drainPolicy:
      enabled: true
      timeoutSeconds: 1200
      evictionKind: Eviction
      respectPDB: true

  verification:
    gpuReadyLabel: "amd.com/gpu.ready"
    requirePluginResources:
      - "amd.com/gpu"
    timeoutSeconds: 900

  partitionings:
    - name: "training-mi300x-to-cpx-nps4"
      selector:
        matchLabels:
          nodepool: training-gpu
          gpu.vendor: amd
          gpu.family: mi300x
      exclude:
        matchLabels:
          maintenance: "true"
      profileRef:
        kind: PartitioningProfile
        name: mi300x-cpx-nps4

    - name: "inference-mi300x-keep-full"
      selector:
        matchLabels:
          nodepool: inference-gpu
          gpu.vendor: amd
          gpu.family: mi300x
      profileRef:
        kind: PartitioningProfile
        name: mi300x-spx-nps1

NodePartitioning: per-node work item (owned by the plan)

apiVersion: infrastructure.silogen.ai/v1alpha1
kind: NodePartitioning
metadata:
  name: training-pool-partition-gpu-node-02
  ownerReferences:
    - apiVersion: infrastructure.silogen.ai/v1alpha1
      kind: PartitioningPlan
      name: training-pool-partition
      uid: 12345678-1234-1234-1234-123456789abc
      controller: true
      blockOwnerDeletion: true
spec:
  planRef:
    name: training-pool-partition
    uid: 12345678-1234-1234-1234-123456789abc
  nodeName: gpu-node-02
  desiredHash: "sha256:a1b2c3d4e5f6..."  # profile + drain + verification
  profileRef:
    kind: PartitioningProfile
    name: mi300x-cpx-nps4
  drainPolicy:
    enabled: true
    timeoutSeconds: 1200
    evictionKind: Eviction
    respectPDB: true
  verification:
    gpuReadyLabel: "amd.com/gpu.ready"
    requirePluginResources:
      - "amd.com/gpu"
    timeoutSeconds: 900
status:
  phase: Pending
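The desiredHash noted in the sample (profile + drain + verification) could be computed over a canonical JSON encoding of exactly those fields. A sketch, assuming sorted-key compact JSON as the canonical form:

```python
import hashlib
import json

def desired_hash(profile_spec, drain_policy, verification):
    """sha256 over a canonical (sorted-key, compact) JSON encoding of
    the fields that define the desired node state. Any change to the
    profile, drain policy, or verification settings changes the hash,
    which is what triggers re-processing of the node."""
    payload = json.dumps(
        {"profile": profile_spec,
         "drain": drain_policy,
         "verification": verification},
        sort_keys=True, separators=(",", ":"),
    )
    return "sha256:" + hashlib.sha256(payload.encode()).hexdigest()
```

Canonicalizing before hashing matters: two semantically identical specs serialized with different key order must produce the same hash, or the controller would churn.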

Prerequisites & Assumptions

  • AMD GPU Operator + DCM installed in kube-amd-gpu
  • DCM ConfigMap exists/managed (controller may update it)
  • AMD device plugin DaemonSet running
  • System pods tolerate amd-dcm=up:NoExecute (handled by a one-time ops playbook; not automated in Phase 1)
  • Control plane nodes excluded via labels/selectors (no partitioning there)

Follow-ups (tracked in later issues)

  • Canary deployment sets
  • Change windows (e.g., apply only during the night)
  • DRA integration
