feat: integrate HAMI Core for GPU memory isolation on shared GPUs #1364

@enoodle

Description

What would you like to be added?

GPU memory isolation for shared GPU workloads, integrating with HAMI Core to enforce hard memory limits on containers sharing a GPU.

Design: #60

The integration consists of two independently deployed components:

  1. kai-resource-isolator (external, hosted under HAMI) — a DaemonSet that deploys HAMI Core libraries to GPU nodes, and a mutating webhook that injects volume mounts into GPU-sharing pods.
  2. KAI Scheduler — injects a GPU_MEMORY_LIMIT environment variable into containers requesting shared GPUs. For GPU-memory requests, the value is known at pod creation. For GPU-fraction requests, the value is resolved after the scheduling decision determines which GPU node the pod lands on.
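As a rough illustration of the second component, the sketch below shows how a GPU-fraction request might be resolved into a concrete `GPU_MEMORY_LIMIT` value once the scheduling decision has picked a node (and therefore a GPU with known total memory). The function name and the MiB unit are assumptions for illustration; the actual KAI Scheduler logic may differ.

```python
import math

def resolve_gpu_memory_limit(gpu_fraction: float, node_gpu_memory_mib: int) -> int:
    """Convert a GPU-fraction request into a hard memory limit (MiB)
    for the GPU the pod was scheduled onto.

    Hypothetical helper for illustration only: the real scheduler
    resolves this after node placement, as described above.
    """
    if not 0 < gpu_fraction <= 1:
        raise ValueError("gpu_fraction must be in (0, 1]")
    # Round down so the sum of limits never exceeds the GPU's memory.
    return math.floor(gpu_fraction * node_gpu_memory_mib)

# A pod asking for half of an 80 GiB (81920 MiB) GPU:
print(resolve_gpu_memory_limit(0.5, 81920))  # 40960
```

For GPU-memory requests the value is already explicit in the pod spec, so no such resolution step is needed.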

Flow once both components are deployed:

  1. Pod requesting GPU sharing is submitted
  2. HAMI mutating webhook injects a volume mount for the HAMI Core library
  3. KAI Scheduler determines the appropriate node and sets GPU_MEMORY_LIMIT accordingly
  4. Container runs with HAMI Core enforcing the memory limit
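To make step 2 concrete, here is a minimal sketch of the kind of JSON Patch (RFC 6902) a mutating admission webhook could return to mount the HAMI Core library into each container of a GPU-sharing pod. The volume name and host/mount paths are hypothetical; the actual kai-resource-isolator webhook under HAMI may use different names, paths, and patch logic.

```python
# Illustrative values only; HAMI's real paths and volume names may differ.
HAMI_LIB_HOST_PATH = "/usr/local/hami"   # where the DaemonSet places the library
HAMI_LIB_MOUNT_PATH = "/usr/local/hami"

def build_hami_mount_patch(container_count: int) -> list:
    """Build a JSON Patch that a mutating webhook could return to inject
    a hostPath volume and per-container volumeMounts.

    Assumes /spec/volumes and each container's volumeMounts already exist
    (otherwise the '-' append paths would need 'add' ops creating them first).
    """
    patch = [{
        "op": "add",
        "path": "/spec/volumes/-",
        "value": {
            "name": "hami-core-lib",
            "hostPath": {"path": HAMI_LIB_HOST_PATH, "type": "Directory"},
        },
    }]
    for i in range(container_count):
        patch.append({
            "op": "add",
            "path": f"/spec/containers/{i}/volumeMounts/-",
            "value": {
                "name": "hami-core-lib",
                "mountPath": HAMI_LIB_MOUNT_PATH,
                "readOnly": True,
            },
        })
    return patch
```

With the library mounted and `GPU_MEMORY_LIMIT` set by the scheduler, HAMI Core can enforce the limit inside the container at runtime.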

After both components are implemented and tested, a user guide will be added to the documentation.

Why is this needed?

KAI Scheduler currently does not enforce resource isolation when using GPU sharing. Multiple containers sharing a GPU can consume more memory than allocated, leading to OOM kills and instability. Related issues: #49, #45.

Metadata

Labels

enhancement (New feature or request)
