A helper tool that deploys AMD Enterprise AI Suite into a Kubernetes cluster.
Cluster-Forge is a tool designed to bundle various third-party, community, and in-house components into a single, streamlined stack that can be deployed in Kubernetes clusters. By automating the deployment process, Cluster-Forge simplifies the creation of consistent, ready-to-use clusters.
This tool is ideal for scenarios such as:
- Ephemeral test clusters - Create temporary environments quickly
- CI/CD pipeline clusters - Ensure consistent testing environments
- Multiple production clusters - Manage a fleet of clusters efficiently
- Reproducible environments - Ensure consistency across deployments
Run the following bootstrap script to deploy AMD Enterprise AI Suite into your Kubernetes cluster. More details on the script's execution steps are available here.

    ./scripts/bootstrap.sh <domain>

Cluster-Forge deploys all necessary components within the cluster using the GitOps controller ArgoCD and the app-of-apps pattern, with Cluster-Forge itself acting as the app of apps.
- Longhorn - Cloud native distributed storage solution
- MetalLB - Load-balancer implementation for bare metal clusters
- CertManager - Certificate management controller
- External Secrets - Kubernetes operator for external secrets management
- Gateway API - Next generation Kubernetes Ingress
- KGateway - Kubernetes Gateway implementation
- Grafana - Metrics visualization and dashboards
- Prometheus - Monitoring system and time series database
- Grafana Loki - Log aggregation system
- Grafana Mimir - Highly available metrics backend
- Promtail - Log collector for Loki
- OpenObserve - Observability platform
- OpenTelemetry Operator - Telemetry collection and management
- OTEL-LGTM Stack - OpenTelemetry with Loki, Grafana, Tempo, and Mimir
- Kube-Prometheus-Stack - End-to-end Kubernetes cluster monitoring
- MinIO Operator - Kubernetes operator for MinIO object storage
- MinIO Tenant - Multi-tenant MinIO deployment
- CNPG Operator - Cloud Native PostgreSQL operator
- AMD GPU Operator - GPU operator for AMD Instinct GPUs
- AMD Device Config - Device configuration for AMD GPUs
- KubeRay Operator - Kubernetes operator for Ray
- Kueue - Job queue controller for Kubernetes
- AppWrapper - Application wrapper for job scheduling
- Kaiwo - ML workflow management
- KEDA - Kubernetes Event-driven Autoscaling
- Kedify OTEL - OpenTelemetry add-on for KEDA
- Kyverno - Kubernetes policy engine
- Keycloak - SSO and identity & access management
Storage classes are provided by default with Longhorn. These can be customized as needed.
| Purpose | StorageClass | Access Mode | Locality |
|---|---|---|---|
| GPU Job | mlstorage | RWO | LOCAL/remote |
| GPU Job | default | RWO | LOCAL/remote |
| Advanced usage | direct | RWO | LOCAL |
| Multi-container | multinode | RWX | ANYWHERE |
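To illustrate how a workload might request storage from the classes in the table above, here is a minimal sketch that writes two PersistentVolumeClaim manifests to a local file. The StorageClass names and access modes come from the table; the claim names, sizes, and file name are hypothetical and should be adapted to your workload.

```shell
set -eu

# Write two example PersistentVolumeClaims to a local file.
# Claim names and requested sizes are illustrative assumptions,
# not values defined by Cluster-Forge.
cat > example-pvcs.yaml <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: training-scratch          # hypothetical claim name
spec:
  storageClassName: mlstorage     # "GPU Job" class from the table
  accessModes:
    - ReadWriteOnce               # RWO: mounted by a single node
  resources:
    requests:
      storage: 50Gi               # assumed size
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-datasets           # hypothetical claim name
spec:
  storageClassName: multinode     # "Multi-container" class from the table
  accessModes:
    - ReadWriteMany               # RWX: shared across nodes
  resources:
    requests:
      storage: 100Gi              # assumed size
EOF

echo "Wrote example-pvcs.yaml"

# To create the claims in a cluster where these classes exist:
#   kubectl apply -f example-pvcs.yaml
# To list the storage classes actually available:
#   kubectl get storageclass
```

The RWO classes suit a single pod that owns its volume, while `multinode` is the one to pick when several pods on different nodes need to mount the same data.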
Cluster-Forge deploys all ArgoCD applications into the cluster using a root Helm chart.
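As a rough illustration of the app-of-apps idea, the sketch below writes a single parent ArgoCD `Application` that points at a chart directory, which in turn would render the child applications. This is not the manifest Cluster-Forge ships: the repository URL, chart path `root`, revision, application name, and the `argocd` namespace are all assumptions made for the example.

```shell
set -eu

# Write a parent "app of apps" Application manifest to a local file.
# repoURL, path, targetRevision, and names are hypothetical placeholders.
cat > cluster-forge-root-app.yaml <<'EOF'
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: cluster-forge             # hypothetical parent application name
  namespace: argocd               # assumes ArgoCD runs in "argocd"
spec:
  project: default
  source:
    repoURL: https://git.example.com/your-org/cluster-forge.git  # placeholder
    targetRevision: main          # assumed branch
    path: root                    # assumed location of the root Helm chart
  destination:
    server: https://kubernetes.default.svc   # the cluster ArgoCD runs in
    namespace: argocd
  syncPolicy:
    automated:
      prune: true                 # remove resources deleted from Git
      selfHeal: true              # revert manual drift
EOF

echo "Wrote cluster-forge-root-app.yaml"

# To register the parent application and inspect the children it creates:
#   kubectl apply -f cluster-forge-root-app.yaml
#   kubectl get applications -n argocd
```

With this arrangement, adding or removing a component is a change to the chart in Git rather than a manual action against the cluster.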
Cluster-Forge is licensed under the Apache License, Version 2.0. See the LICENSE file for details.
Give Cluster-Forge a try and let us know how it works for you!