feat(recipes): add NIM Operator recipe for CNCF AI Conformance #478

Merged
mchmarny merged 1 commit into NVIDIA:main from yuanchen8911:feat/nim-operator-recipe
Apr 2, 2026

@yuanchen8911
Contributor

Summary

  • Add k8s-nim-operator as a new AICR component for NVIDIA NIM inference platform
  • Create H100/EKS/Ubuntu inference recipe for NIM on EKS conformance certification
  • Fix health check assert file loading (pre-existing gap affecting all components)

Changes

Recipe & Component:

  • Add nim platform type to recipe criteria (pkg/recipe/criteria.go)
  • Register k8s-nim-operator v3.1.0 in recipes/registry.yaml with health check, node scheduling paths (operator.nodeSelector, operator.tolerations), and admission controller config
  • Create h100-eks-ubuntu-inference-nim.yaml overlay that inherits from h100-eks-ubuntu-inference and adds the NIM Operator, DRA support, and the full conformance validation suite (11 checks)
  • Add recipes/components/k8s-nim-operator/values.yaml with correct chart value paths and EKS-compatible affinity

Health Check Fix:

  • Load healthCheck.assertFile content into ComponentRef.HealthCheckAsserts in ApplyRegistryDefaults. Previously the file path was registered but its content was never read, so deployment validation never executed Chainsaw health checks for any component
  • Add targeted unit tests for the loading path (success, no-overwrite, missing file)
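
For reviewers who want a concrete picture of the fix, here is a minimal sketch of the loading step and the three tested cases (success, no-overwrite, missing file). It is not the code from this PR: the struct shapes, the `loadHealthCheckAsserts` helper, the `demoFS` fixture, and the assert file path are all hypothetical stand-ins, and returning an error for a missing file is an assumption, since the description does not say how that case is handled.

```go
package main

import (
	"fmt"
	"io/fs"
	"testing/fstest"
)

// HealthCheck and ComponentRef are simplified, hypothetical stand-ins for the
// real AICR types; only the fields needed for this sketch are modeled.
type HealthCheck struct {
	AssertFile string // path of the Chainsaw assert file as registered
}

type ComponentRef struct {
	Name               string
	HealthCheckAsserts string // assert content consumed by deployment validation
}

// demoAssertPath is an illustrative path, not the one used in the repository.
const demoAssertPath = "checks/k8s-nim-operator/assert.yaml"

// demoFS returns an in-memory file system holding one placeholder assert file.
func demoFS() fs.FS {
	return fstest.MapFS{
		demoAssertPath: {Data: []byte("kind: Deployment\n")},
	}
}

// loadHealthCheckAsserts copies the assert file content into ref unless the
// recipe already set it. Assumption: an unreadable registered path is an error.
func loadHealthCheckAsserts(fsys fs.FS, hc HealthCheck, ref *ComponentRef) error {
	if hc.AssertFile == "" || ref.HealthCheckAsserts != "" {
		return nil // nothing registered, or an explicit value must not be overwritten
	}
	data, err := fs.ReadFile(fsys, hc.AssertFile)
	if err != nil {
		return fmt.Errorf("component %q: read assert file %q: %w", ref.Name, hc.AssertFile, err)
	}
	ref.HealthCheckAsserts = string(data)
	return nil
}

func main() {
	ref := &ComponentRef{Name: "k8s-nim-operator"}
	hc := HealthCheck{AssertFile: demoAssertPath}
	if err := loadHealthCheckAsserts(demoFS(), hc, ref); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("loaded %d bytes of asserts\n", len(ref.HealthCheckAsserts))
}
```

Taking an `fs.FS` keeps the helper testable without touching disk; the real implementation may read from an embedded or on-disk recipe directory instead.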

Demo & Workload:

  • Add demos/workloads/inference/nimservice-llama-3-2-1b.yaml — NIMService CR deploying Llama 3.2 1B with NGC secrets, PVC, and GPU tolerations
  • Add demos/workloads/inference/nim-chat-server.sh and nim-chat.html — chat UI for testing NIM inference

Test plan

  • go test -race ./pkg/recipe/... passes (criteria + health check loading tests)
  • aicr recipe --platform nim generates correct 15-component recipe
  • aicr bundle generates correct values with operator.tolerations and operator.nodeSelector
  • Full deploy on EKS cluster (H100, 6 nodes) — all 15 components deployed successfully
  • NIMService CR deploys Llama 3.2 1B, serves /v1/chat/completions
  • CNCF conformance validation passes 9/9 checks
  • Chat UI works via nim-chat-server.sh
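
The /v1/chat/completions check above can also be exercised from a small Go client. The sketch below is an illustration, not part of this PR: the `NIM_URL` environment variable, the `http://localhost:8000` default (for example behind a port-forward), the model name `meta/llama-3.2-1b-instruct`, and the `buildChatRequest` helper are assumptions, and the request body follows the OpenAI-style chat schema that the endpoint path suggests.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"os"
	"time"
)

// chatMessage and chatRequest model the minimal OpenAI-style chat payload.
type chatMessage struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

type chatRequest struct {
	Model     string        `json:"model"`
	Messages  []chatMessage `json:"messages"`
	MaxTokens int           `json:"max_tokens"`
}

// buildChatRequest returns the JSON body for a single-turn user prompt.
func buildChatRequest(model, prompt string) ([]byte, error) {
	return json.Marshal(chatRequest{
		Model:     model,
		Messages:  []chatMessage{{Role: "user", Content: prompt}},
		MaxTokens: 64,
	})
}

func main() {
	// Assumption: the NIMService is reachable at NIM_URL or a local port-forward.
	baseURL := os.Getenv("NIM_URL")
	if baseURL == "" {
		baseURL = "http://localhost:8000"
	}
	// Assumption: the served model name; adjust to what the deployment reports.
	body, err := buildChatRequest("meta/llama-3.2-1b-instruct", "Say hello in one sentence.")
	if err != nil {
		fmt.Println("encode failed:", err)
		return
	}
	client := &http.Client{Timeout: 30 * time.Second}
	resp, err := client.Post(baseURL+"/v1/chat/completions", "application/json", bytes.NewReader(body))
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```

The client only reports the HTTP status; decoding the completion text is left out to keep the sketch short.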

Closes #473

@mchmarny enabled auto-merge (squash) on April 2, 2026 11:29
Add k8s-nim-operator as a new AICR component and create an H100/EKS/Ubuntu
inference recipe for NIM. This supports the CNCF AI Conformance submission
where NIM on EKS is the certified product and AICR is the validation tooling.

- Add `nim` platform type to recipe criteria with tests
- Register k8s-nim-operator v3.1.0 in component registry with health check
- Create h100-eks-ubuntu-inference-nim overlay with DRA support
- Add NIMService workload manifest (Llama 3.2 1B)
- Add NIM chat demo UI (nim-chat-server.sh, nim-chat.html)
- Fix: load healthCheck.assertFile content in ApplyRegistryDefaults so
  deployment validation actually executes Chainsaw health checks

Closes NVIDIA#473
@yuanchen8911 force-pushed the feat/nim-operator-recipe branch from 5240df9 to 1ca8a58 on April 2, 2026 15:41
@mchmarny merged commit a66de21 into NVIDIA:main on Apr 2, 2026
57 of 58 checks passed
Development

Successfully merging this pull request may close these issues.

feat: add NIM Operator recipe for CNCF AI Conformance certification

3 participants