feat: integrate HAMI Core for GPU memory isolation on shared GPUs #1364

@enoodle

Description

What would you like to be added?

GPU memory isolation for shared GPU workloads, integrating with HAMI Core to enforce hard memory limits on containers sharing a GPU.

Design: #60

The integration consists of two independently deployed components:

  1. kai-resource-isolator (external, hosted under HAMI) — a DaemonSet that deploys HAMI Core libraries to GPU nodes, and a mutating webhook that injects volume mounts into GPU-sharing pods.
  2. KAI Scheduler — injects a GPU_MEMORY_LIMIT environment variable into containers requesting shared GPUs. For GPU-memory requests, the value is known at pod creation. For GPU-fraction requests, the value is resolved after the scheduling decision determines which GPU node the pod lands on.
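As a rough illustration of the second component, the sketch below shows how a GPU-fraction request might be resolved into a concrete `GPU_MEMORY_LIMIT` value once the scheduling decision has picked a node (and therefore a GPU with known total memory). The function name and the MiB unit are assumptions for illustration; the actual KAI Scheduler logic may differ.

```python
import math

def resolve_gpu_memory_limit(gpu_fraction: float, node_gpu_memory_mib: int) -> int:
    """Convert a GPU-fraction request into a hard memory limit (MiB)
    for the GPU the pod was scheduled onto.

    Hypothetical helper for illustration only: the real scheduler
    resolves this after node placement, as described above.
    """
    if not 0 < gpu_fraction <= 1:
        raise ValueError("gpu_fraction must be in (0, 1]")
    # Round down so the sum of limits never exceeds the GPU's memory.
    return math.floor(gpu_fraction * node_gpu_memory_mib)

# A pod asking for half of an 80 GiB (81920 MiB) GPU:
print(resolve_gpu_memory_limit(0.5, 81920))  # 40960
```

For GPU-memory requests the value is already explicit in the pod spec, so no such resolution step is needed.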

Flow once both components are deployed:

  1. Pod requesting GPU sharing is submitted
  2. HAMI mutating webhook injects a volume mount for the HAMI Core library
  3. KAI Scheduler determines the appropriate node and sets GPU_MEMORY_LIMIT accordingly
  4. Container runs with HAMI Core enforcing the memory limit
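To make step 2 concrete, here is a minimal sketch of the kind of JSON Patch (RFC 6902) a mutating admission webhook could return to mount the HAMI Core library into each container of a GPU-sharing pod. The volume name and host/mount paths are hypothetical; the actual kai-resource-isolator webhook under HAMI may use different names, paths, and patch logic.

```python
# Illustrative values only; HAMI's real paths and volume names may differ.
HAMI_LIB_HOST_PATH = "/usr/local/hami"   # where the DaemonSet places the library
HAMI_LIB_MOUNT_PATH = "/usr/local/hami"

def build_hami_mount_patch(container_count: int) -> list:
    """Build a JSON Patch that a mutating webhook could return to inject
    a hostPath volume and per-container volumeMounts.

    Assumes /spec/volumes and each container's volumeMounts already exist
    (otherwise the '-' append paths would need 'add' ops creating them first).
    """
    patch = [{
        "op": "add",
        "path": "/spec/volumes/-",
        "value": {
            "name": "hami-core-lib",
            "hostPath": {"path": HAMI_LIB_HOST_PATH, "type": "Directory"},
        },
    }]
    for i in range(container_count):
        patch.append({
            "op": "add",
            "path": f"/spec/containers/{i}/volumeMounts/-",
            "value": {
                "name": "hami-core-lib",
                "mountPath": HAMI_LIB_MOUNT_PATH,
                "readOnly": True,
            },
        })
    return patch
```

With the library mounted and `GPU_MEMORY_LIMIT` set by the scheduler, HAMI Core can enforce the limit inside the container at runtime.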

After both components are implemented and tested, a user guide will be added to the documentation.

Why is this needed?

KAI Scheduler currently does not enforce resource isolation when using GPU sharing. Multiple containers sharing a GPU can consume more memory than allocated, leading to OOM kills and instability. Related issues: #49, #45.

Metadata

Labels

enhancement (New feature or request)
