What would you like to be added?
GPU memory isolation for shared GPU workloads, integrating with HAMI Core to enforce hard memory limits on containers sharing a GPU.
Design: #60
The integration consists of two independently deployed components:
- kai-resource-isolator (external, hosted under HAMI) — a DaemonSet that deploys HAMI Core libraries to GPU nodes, and a mutating webhook that injects volume mounts into GPU-sharing pods.
- KAI Scheduler — injects a GPU_MEMORY_LIMIT environment variable into containers requesting shared GPUs. For GPU-memory requests, the value is known at pod creation. For GPU-fraction requests, the value is resolved after the scheduling decision determines which GPU node the pod lands on.
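To illustrate the two resolution paths for the GPU_MEMORY_LIMIT value, here is a minimal sketch (the function name, parameter names, and MiB unit are assumptions for illustration, not the actual KAI Scheduler implementation):

```python
def resolve_gpu_memory_limit(requested_memory_mib=None,
                             gpu_fraction=None,
                             node_gpu_memory_mib=None):
    """Sketch: resolve the GPU_MEMORY_LIMIT value (in MiB) for a container.

    A gpu-memory request is explicit, so the limit is known at pod creation.
    A gpu-fraction request can only be resolved once scheduling has picked
    a node, because it depends on that node's total GPU memory.
    """
    if requested_memory_mib is not None:
        # gpu-memory request: the limit is the requested amount itself
        return int(requested_memory_mib)
    if gpu_fraction is not None:
        if node_gpu_memory_mib is None:
            raise ValueError(
                "gpu-fraction requests resolve only after scheduling")
        # gpu-fraction request: fraction of the scheduled node's GPU memory
        return int(gpu_fraction * node_gpu_memory_mib)
    raise ValueError("container has no GPU sharing request")
```

For example, a 0.5 fraction request landing on an 80 GiB GPU would resolve to half that GPU's memory, while a 2048 MiB gpu-memory request resolves to 2048 regardless of node.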
Flow once both components are deployed:
- Pod requesting GPU sharing is submitted
- HAMI mutating webhook injects a volume mount for the HAMI Core library
- KAI Scheduler determines the appropriate node and sets GPU_MEMORY_LIMIT accordingly
- Container runs with HAMI Core enforcing the memory limit
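The webhook step in the flow above would return a JSON patch of roughly this shape; the volume name, host path, and mount path below are placeholders, not the actual kai-resource-isolator values:

```python
def hami_volume_patch(container_index=0):
    """Sketch of a mutating-webhook JSONPatch that mounts the HAMI Core
    library into a GPU-sharing pod (paths and names are hypothetical)."""
    return [
        # Add a hostPath volume exposing the HAMI Core library on the node
        {"op": "add", "path": "/spec/volumes/-",
         "value": {"name": "hami-core-lib",
                   "hostPath": {"path": "/usr/local/hami"}}},
        # Mount that volume into the target container
        {"op": "add",
         "path": f"/spec/containers/{container_index}/volumeMounts/-",
         "value": {"name": "hami-core-lib",
                   "mountPath": "/usr/local/hami"}},
    ]
```

With the library mounted and GPU_MEMORY_LIMIT set, HAMI Core can intercept CUDA allocations in the container and reject those that would exceed the limit.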
After both components are implemented and tested, a user guide will be added to the documentation.
Why is this needed?
KAI Scheduler currently does not enforce resource isolation when using GPU sharing. Multiple containers sharing a GPU can consume more memory than allocated, leading to OOM kills and instability. Related issues: #49, #45.