-
Notifications
You must be signed in to change notification settings - Fork 26
[Feature]: Add HAMi integration as an AICR recipe/component #481
Description
Prerequisites
- I searched existing issues
Feature Summary
HAMi is a CNCF sandbox project. It is a cloud-native GPU virtualization middleware that brings sharing, isolation and scheduling of heterogeneous accelerators to AI workloads on Kubernetes.
Support integrating HAMi as a “recipe” capability in this project, so users can enable GPU resource-isolation and topology-aware scheduling through AICR’s existing recipe/overlay/bundle workflow.
Problem/Use Case
HAMi provides a unified scheduler extender and device plugins for GPU sharing, and has been adopted by over 200+ companies, i think it would be great to offer an out-of-the-box recipe/component that installs and configures HAMi in a way consistent with AICR’s deployment model.
Otherwise, users must manage HAMi via manual Helm installation (and custom cluster configuration), which:
- increases setup effort and configuration drift risk,
- complicates upgrades and reproducibility across environments,
- makes it harder to combine HAMi with AICR-managed components (e.g., GPU operator, gang scheduling, inference/training overlays).
Proposed Solution
- Add a new AICR component and recipe for HAMi, installed via Helm including:
- HAMi scheduler extender components
- HAMi device plugin components
- any required admission webhook/mutating webhook components
- Register HAMi in
recipes/registry.yamlso it can be selected consistently by AICR recipes. - Add one or more overlays so HAMi can be enabled for relevant inference recipes, while keeping the default behavior unchanged.
- Add health checks/assertions (Chainsaw-based, consistent with other components) so
aicr validatecan confirm HAMi is installed and healthy.
Note HAMi will not be enabled by default in any recipe, it can only be installed by manually set
overrides.enabled=trueduring bundle.
Success Criteria
- A user can generate a recipe/bundle that includes HAMi and successfully installs it with correct namespace/RBAC/config when manually enable in during bundle.
aicr validate(or component health checks) can detect HAMi readiness/health.- In a cluster, workloads that rely on HAMi’s sharing/scheduling can be scheduled successfully, and AICR bundle output remains reproducible.
- Clear documentation describes how users enable HAMi and what cluster prerequisites are required.
Alternatives Considered
Manual Helm installation of HAMi outside AICR (current approach).
Component
Bundlers (gpu-operator, network-operator, etc.)
Priority
Important (would improve my workflow)
Compatibility / Breaking Changes
Conflict with 'KAI-scheduler'
Conflict with 'NVIDIA/k8s-device-plugin'
Conflict with 'NVIDIA/Mig-manager'
because of that, we won't enable hami by default in any existing overlays, so, either we set 'overlays.enabled to false' or we add our 'hami' overlays.
Operational Considerations
No response
Are you willing to contribute?
Yes, I can open a PR