[Feature]: Add HAMi integration as an AICR recipe/component #481

@archlitchi

Prerequisites

  • I searched existing issues

Feature Summary

HAMi is a CNCF sandbox project. It is a cloud-native GPU virtualization middleware that brings sharing, isolation and scheduling of heterogeneous accelerators to AI workloads on Kubernetes.
Support integrating HAMi as a “recipe” capability in this project, so users can enable GPU resource-isolation and topology-aware scheduling through AICR’s existing recipe/overlay/bundle workflow.

Problem/Use Case

HAMi provides a unified scheduler extender and device plugins for GPU sharing, and has been adopted by more than 200 companies. I think it would be great to offer an out-of-the-box recipe/component that installs and configures HAMi in a way consistent with AICR’s deployment model.
Otherwise, users must manage HAMi via manual Helm installation (and custom cluster configuration), which:

  • increases setup effort and configuration drift risk,
  • complicates upgrades and reproducibility across environments,
  • makes it harder to combine HAMi with AICR-managed components (e.g., GPU operator, gang scheduling, inference/training overlays).

Proposed Solution

  1. Add a new AICR component and recipe for HAMi, installed via Helm, including:
      • HAMi scheduler extender components
      • HAMi device plugin components
      • any required admission webhook/mutating webhook components
  2. Register HAMi in recipes/registry.yaml so it can be selected consistently by AICR recipes.
  3. Add one or more overlays so HAMi can be enabled for relevant inference recipes, while keeping the default behavior unchanged.
  4. Add health checks/assertions (Chainsaw-based, consistent with other components) so aicr validate can confirm HAMi is installed and healthy.

Note: HAMi will not be enabled by default in any recipe; it can only be installed by manually setting overrides.enabled=true during bundling.

Success Criteria

  • A user can generate a recipe/bundle that includes HAMi and, when HAMi is manually enabled during bundling, install it successfully with the correct namespace/RBAC/config.
  • aicr validate (or component health checks) can detect HAMi readiness/health.
  • In a cluster, workloads that rely on HAMi’s sharing/scheduling can be scheduled successfully, and AICR bundle output remains reproducible.
  • Clear documentation describes how users enable HAMi and what cluster prerequisites are required.

Alternatives Considered

Manual Helm installation of HAMi outside AICR (current approach).

Component

Bundlers (gpu-operator, network-operator, etc.)

Priority

Important (would improve my workflow)

Compatibility / Breaking Changes

HAMi conflicts with the following components:

  • 'KAI-scheduler'
  • 'NVIDIA/k8s-device-plugin'
  • 'NVIDIA/Mig-manager'

Because of these conflicts, we won't enable HAMi by default in any existing overlay. Either we set 'overlays.enabled' to false, or we add our own 'hami' overlays.
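
To make the opt-in and conflict rules concrete, here is a minimal Python sketch of the check a bundler could run before including HAMi. It is illustrative only and does not reflect AICR's actual code: the `Bundle` type, the function names, the lowercase component identifiers, and the `overrides["hami"]["enabled"]` layout are all assumptions made for this example.

```python
from dataclasses import dataclass, field

# Components this issue lists as conflicting with HAMi. The lowercase
# identifiers are hypothetical; AICR's real component names may differ.
HAMI_CONFLICTS = frozenset({"kai-scheduler", "k8s-device-plugin", "mig-manager"})


@dataclass
class Bundle:
    """Hypothetical stand-in for the components selected for a bundle."""
    components: set[str] = field(default_factory=set)
    overrides: dict[str, dict[str, bool]] = field(default_factory=dict)


def hami_enabled(bundle: Bundle) -> bool:
    """HAMi is opt-in: it is enabled only when explicitly overridden."""
    return bundle.overrides.get("hami", {}).get("enabled", False)


def hami_conflicts(bundle: Bundle) -> list[str]:
    """Return the sorted conflicting components, or [] when HAMi is disabled."""
    if not hami_enabled(bundle):
        return []
    return sorted(bundle.components & HAMI_CONFLICTS)


if __name__ == "__main__":
    bundle = Bundle(
        components={"gpu-operator", "kai-scheduler"},
        overrides={"hami": {"enabled": True}},
    )
    print(hami_conflicts(bundle))  # ['kai-scheduler']
```

Whether a detected conflict should fail the bundle outright or only emit a warning is a design choice left open here.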

Operational Considerations

No response

Are you willing to contribute?

Yes, I can open a PR
