GPU Partitioning: Kubernetes-native Orchestration for AMD GPUs
Current AMD GPU partitioning docs:
- https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/gpu-partitioning/mi300x/overview.html
- https://instinct.docs.amd.com/projects/gpu-operator/en/latest/dcm/applying-partition-profiles.html
Problem & Outcome
AMD’s Device Config Manager (DCM) can’t partition GPUs in-place. Today, partitioning is a manual process that requires multiple kubectl commands and active monitoring of node status. To encourage the use of partitioning, and to later support dynamic partitioning, we need a Kubernetes-native controller stack that:
- lets users declare desired partitioning (profiles + plans),
- cordons/taints/drains nodes safely,
- applies DCM profiles,
- verifies readiness and resources,
- and reflects observable status at plan and node levels.
Scope
- State machine: Pending → Draining → Applying → WaitingOperator → Verifying → Succeeded
- Cordon/uncordon
- Apply/remove taint `amd-dcm=up:NoExecute`
- Basic drain via Eviction API (respecting PDBs by default)
- Apply DCM profile (node label + ConfigMap update if needed)
- Basic verification (ready label + allocatable resources)
- Update status + conditions + history
- Enforce `maxParallel`/`maxUnavailable` to limit the number of unavailable nodes
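The scoped state machine could be encoded as a small transition table; a minimal sketch (type, phase, and function names are assumptions, not the final API — including the assumed terminal `Failed` phase):

```go
package main

import "fmt"

// Phase mirrors the NodePartitioning state machine from the scope above.
type Phase string

const (
	Pending         Phase = "Pending"
	Draining        Phase = "Draining"
	Applying        Phase = "Applying"
	WaitingOperator Phase = "WaitingOperator"
	Verifying       Phase = "Verifying"
	Succeeded       Phase = "Succeeded"
	Failed          Phase = "Failed" // assumed terminal error state
)

// validNext encodes the allowed forward transitions; every non-terminal
// phase may additionally move to Failed on error.
var validNext = map[Phase]Phase{
	Pending:         Draining,
	Draining:        Applying,
	Applying:        WaitingOperator,
	WaitingOperator: Verifying,
	Verifying:       Succeeded,
}

// CanTransition reports whether moving from -> to is a legal step.
func CanTransition(from, to Phase) bool {
	if to == Failed {
		_, nonTerminal := validNext[from]
		return nonTerminal
	}
	return validNext[from] == to
}

func main() {
	fmt.Println(CanTransition(Pending, Draining))  // true
	fmt.Println(CanTransition(Pending, Verifying)) // false: phases may not be skipped
}
```

Guarding status updates through a check like this keeps the node controller from skipping phases (e.g. applying a profile before the drain finished).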
Architecture
CRDs
Add a subgroup for infrastructure-related CRDs, `infrastructure.silogen.ai`, to keep the separation of concerns clear.
- PartitioningProfile (cluster-scoped): reusable GPU partition “recipes” (mapping to DCM ConfigMaps)
- PartitioningPlan (cluster-scoped): maps selectors → profile; owns NodePartitioning
- NodePartitioning (cluster-scoped): per-node work item/state machine
Controllers
- Plan Controller: resolves node sets, owns NodePartitioning, aggregates status
- Node Controller: executes node ops (cordon/taint/drain → DCM apply → verify → uncordon)
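When the Plan Controller decides which NodePartitionings to start next, it has to honor the rollout limits from the scope. A minimal sketch of that gating logic (the helper and its inputs are illustrative; the real controller would derive the counts from NodePartitioning phases and node conditions):

```go
package main

import "fmt"

// Rollout carries the limits from PartitioningPlan.spec.rollout.
type Rollout struct {
	MaxParallel    int // max nodes being partitioned at once
	MaxUnavailable int // max nodes unschedulable at once
}

// NewStarts returns how many additional nodes may enter the pipeline,
// given the nodes currently in progress, currently unavailable, and
// still pending. The tightest of the three budgets wins.
func NewStarts(r Rollout, inProgress, unavailable, pending int) int {
	byParallel := r.MaxParallel - inProgress
	byUnavail := r.MaxUnavailable - unavailable
	n := byParallel
	if byUnavail < n {
		n = byUnavail
	}
	if pending < n {
		n = pending
	}
	if n < 0 {
		return 0
	}
	return n
}

func main() {
	r := Rollout{MaxParallel: 2, MaxUnavailable: 1}
	fmt.Println(NewStarts(r, 0, 0, 5)) // 1: the unavailability budget dominates
	fmt.Println(NewStarts(r, 1, 1, 5)) // 0: one node is already cordoned
}
```

With `maxParallel: 2` and `maxUnavailable: 1` (as in the plan sample below), at most one node is ever cordoned at a time even though two work items may be active.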
CR Samples
1) PartitioningProfile for MI300X CPX+NPS4

```yaml
apiVersion: infrastructure.silogen.ai/v1alpha1
kind: PartitioningProfile
metadata:
  name: mi300x-cpx-nps4
spec:
  displayName: "MI300X CPX+NPS4 (4 partitions, ~16GB VRAM each)"
  targetSelector:
    matchLabels:
      gpu.vendor: amd
      gpu.family: mi300x
  expectedResources:
    - name: "amd.com/gpu"
      count: 63
  verification:
    readyLabel: "amd.com/gpu.ready"
    timeoutSeconds: 600
  operatorPayload:
    kind: ConfigMap
    name: dcm-profile-cpx-nps4
    namespace: kube-amd-gpu
```

PartitioningProfile for MI300X SPX+NPS1 (no partitioning)
```yaml
apiVersion: infrastructure.silogen.ai/v1alpha1
kind: PartitioningProfile
metadata:
  name: mi300x-spx-nps1
spec:
  displayName: "MI300X SPX+NPS1 (full GPUs, no partitioning)"
  targetSelector:
    matchLabels:
      gpu.vendor: amd
      gpu.family: mi300x
  expectedResources:
    - name: "amd.com/gpu"
      count: 8
  verification:
    readyLabel: "amd.com/gpu.ready"
    timeoutSeconds: 600
  operatorPayload:
    kind: ConfigMap
    name: dcm-profile-spx-nps1
    namespace: kube-amd-gpu
```

2) PartitioningPlan for Training/Inference pools
```yaml
apiVersion: infrastructure.silogen.ai/v1alpha1
kind: PartitioningPlan
metadata:
  name: training-pool-partition
spec:
  paused: false
  dryRun: false
  rollout:
    maxParallel: 2
    maxUnavailable: 1
  drainPolicy:
    enabled: true
    timeoutSeconds: 1200
    evictionKind: Eviction
    respectPDB: true
  verification:
    gpuReadyLabel: "amd.com/gpu.ready"
    requirePluginResources:
      - "amd.com/gpu"
    timeoutSeconds: 900
  partitionings:
    - name: "training-mi300x-to-cpx-nps4"
      selector:
        matchLabels:
          nodepool: training-gpu
          gpu.vendor: amd
          gpu.family: mi300x
      exclude:
        matchLabels:
          maintenance: "true"
      profileRef:
        kind: PartitioningProfile
        name: mi300x-cpx-nps4
    - name: "inference-mi300x-keep-full"
      selector:
        matchLabels:
          nodepool: inference-gpu
          gpu.vendor: amd
          gpu.family: mi300x
      profileRef:
        kind: PartitioningProfile
        name: mi300x-spx-nps1
```

3) NodePartitioning: matching per-node work item (owned by plan)
```yaml
apiVersion: infrastructure.silogen.ai/v1alpha1
kind: NodePartitioning
metadata:
  name: training-pool-partition-gpu-node-02
  ownerReferences:
    - apiVersion: infrastructure.silogen.ai/v1alpha1
      kind: PartitioningPlan
      name: training-pool-partition
      uid: 12345678-1234-1234-1234-123456789abc
      controller: true
      blockOwnerDeletion: true
spec:
  planRef:
    name: training-pool-partition
    uid: 12345678-1234-1234-1234-123456789abc
  nodeName: gpu-node-02
  desiredHash: "sha256:a1b2c3d4e5f6..." # profile + drain + verification
  profileRef:
    kind: PartitioningProfile
    name: mi300x-cpx-nps4
  drainPolicy:
    enabled: true
    timeoutSeconds: 1200
    evictionKind: Eviction
    respectPDB: true
  verification:
    gpuReadyLabel: "amd.com/gpu.ready"
    requirePluginResources:
      - "amd.com/gpu"
    timeoutSeconds: 900
status:
  phase: Pending
```

Prerequisites & Assumptions
- AMD GPU Operator + DCM installed in `kube-amd-gpu`
- DCM ConfigMap exists/managed (controller may update it)
- AMD device plugin DaemonSet running
- System pods tolerate `amd-dcm=up:NoExecute` (one-time ops playbook; Phase 1 does not automate this)
- Control plane nodes excluded via labels/selectors (no partitioning there)
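The `desiredHash` field in the NodePartitioning sample covers "profile + drain + verification", letting the node controller detect drift without diffing whole objects. One way this could be computed, as a sketch (struct shape and field names are assumptions, not the final API):

```go
package main

import (
	"crypto/sha256"
	"encoding/json"
	"fmt"
)

// DesiredSpec collects the inputs that define a node's desired state:
// the referenced profile plus drain and verification settings.
type DesiredSpec struct {
	ProfileName  string `json:"profileName"`
	DrainPolicy  any    `json:"drainPolicy"`
	Verification any    `json:"verification"`
}

// DesiredHash returns a stable "sha256:<hex>" digest of the spec.
// encoding/json marshals struct fields in declaration order and sorts
// map keys, which keeps the digest deterministic for this shape.
func DesiredHash(s DesiredSpec) (string, error) {
	b, err := json.Marshal(s)
	if err != nil {
		return "", err
	}
	return fmt.Sprintf("sha256:%x", sha256.Sum256(b)), nil
}

func main() {
	h, _ := DesiredHash(DesiredSpec{ProfileName: "mi300x-cpx-nps4"})
	fmt.Println(h)
}
```

Because the hash changes whenever any of the three inputs changes, a reconcile loop can compare the stored hash against a freshly computed one and only re-run the cordon/drain/apply pipeline on a real change.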
Follow-ups (tracked in later issues)
- Canary deployment sets
- Change windows (apply during night, etc.)
- DRA integration