A comprehensive GPU management system designed for multi-node Kubernetes clusters, specifically optimized for environments with 4x NVIDIA RTX 5090 GPUs per node.
This system consists of four core components managed as Git Submodules:
1. k8s-cluster-setup
K8s Cluster Infrastructure
- Builds multi-node K8s clusters
- Configures multi-GPU resource pool sharing mechanisms
- Sets up networking and storage infrastructure
- Integrates with k8s-device-plugin for GPU scheduling support
2. k8s-device-plugin
GPU Device Plugin (Custom Build)
- Custom build based on the NVIDIA Device Plugin
- Core Feature: Supports multi-GPU scheduling within a single Pod
- Enables Multi-GPU CUDA workloads
- Provides GPU MPS (Multi-Process Service) support
- Implements fine-grained GPU resource allocation
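To make the multi-GPU-in-one-Pod feature concrete, here is a minimal sketch that writes a Pod manifest requesting several GPUs for a single container. It assumes the custom plugin still advertises GPUs under the standard `nvidia.com/gpu` resource name; the Pod name, the file name, and the CUDA image tag are placeholders chosen for illustration, not values taken from this repository.

```shell
#!/usr/bin/env bash
set -euo pipefail

GPU_COUNT="${GPU_COUNT:-4}"
# Placeholder image (assumption): use any CUDA image that matches your driver.
IMAGE="${IMAGE:-nvidia/cuda:12.8.0-base-ubuntu24.04}"
# Hypothetical output file name.
MANIFEST="${MANIFEST:-multi-gpu-pod.yaml}"

# Write a Pod spec whose single container asks for GPU_COUNT GPUs.
cat > "$MANIFEST" <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: multi-gpu-smoke-test   # hypothetical name
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: ${IMAGE}
      command: ["nvidia-smi", "-L"]   # lists the GPUs visible inside the container
      resources:
        limits:
          nvidia.com/gpu: ${GPU_COUNT}   # assumed resource name
EOF

echo "Wrote ${MANIFEST}; apply it with: kubectl apply -f ${MANIFEST}"
```

If the Pod is scheduled, its log should list one line per GPU granted to the container. The example manifests in k8s-device-plugin/tests/ remain the authoritative reference.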
3. frontend
User Interface Layer
- Provides the sole interface for users to access the K8s cluster
- Web UI for cluster management and monitoring
- Offers multiple functionalities:
  - GPU resource visualization
  - Pod/Job management
  - User access control
  - Real-time monitoring dashboard
4. backend
API Server
- Provides backend API services
- Handles all API requests from the frontend
- Interacts with K8s API Server
- Manages user authentication and authorization
- Handles GPU resource allocation logic
┌─────────────────────────────────────────────────────┐
│                     User Access                     │
└──────────────────────┬──────────────────────────────┘
                       │
                       ▼
            ┌──────────────────────┐
            │       frontend       │  (Web UI)
            └──────────┬───────────┘
                       │ API Calls
                       ▼
            ┌──────────────────────┐
            │       backend        │  (API Server)
            └──────────┬───────────┘
                       │ K8s API
                       ▼
            ┌──────────────────────┐
            │     K8s Cluster      │
            │  k8s-cluster-setup   │  (Infrastructure)
            └──────────┬───────────┘
                       │
                       ▼
            ┌──────────────────────┐
            │  k8s-device-plugin   │  (GPU Scheduler)
            │ (Multi-GPU Support)  │
            └──────────┬───────────┘
                       │
                       ▼
            ┌──────────────────────┐
            │   4x RTX 5090 GPUs   │  (Per Node)
            └──────────────────────┘
# Clone main repo with all submodules
git clone --recurse-submodules <main-repo-url>
cd k8s
# Or if already cloned, initialize submodules
git submodule update --init --recursive
# Update all submodules to their latest remote state
git submodule update --remote --merge
# Or update individually
cd k8s-cluster-setup && git pull origin master && cd ..
cd k8s-device-plugin && git pull origin mps-individual-gpu && cd ..
cd frontend && git pull origin main && cd ..
cd backend && git pull origin main && cd ..
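The per-submodule pulls above can also be written as a loop. The following is only a sketch: the branch names come from the commands above, while the helper names `branch_for` and `update_submodule` are made up for illustration and are not part of this repository. It does nothing unless it is run from a directory containing `.gitmodules`.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Map each submodule to the branch used in the individual pull commands.
branch_for() {
  case "$1" in
    k8s-cluster-setup) echo master ;;
    k8s-device-plugin) echo mps-individual-gpu ;;
    frontend|backend)  echo main ;;
    *) echo "unknown submodule: $1" >&2; return 1 ;;
  esac
}

# Pull the tracked branch inside one submodule without changing directory.
update_submodule() {
  local name="$1" branch
  branch="$(branch_for "$name")"
  git -C "$name" pull origin "$branch"
}

# Only act from the main repo root, where .gitmodules lives.
if [ -f .gitmodules ]; then
  for name in k8s-cluster-setup k8s-device-plugin frontend backend; do
    update_submodule "$name"
  done
  # Show the commit each submodule now points at.
  git submodule status
fi
```

After the loop, the main repo still needs a `git add` and `git commit` for any submodule whose recorded commit changed.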
- Build K8s Cluster
  cd k8s-cluster-setup
  # Follow its README.md to deploy K8s cluster
- Deploy GPU Device Plugin
  cd k8s-device-plugin
  # Build and deploy custom GPU plugin (mps-individual-gpu branch)
  make build
  kubectl apply -f deployments/
- Deploy API Server
  cd backend
  # Deploy backend API service
- Deploy Frontend UI
  cd frontend
  # Deploy Web UI
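Once the device plugin is deployed, it can be useful to confirm that every GPU node advertises the expected number of GPUs. The sketch below assumes the plugin exposes GPUs as `nvidia.com/gpu` in each node's allocatable resources; `check_gpu_counts` and `EXPECTED_GPUS` are hypothetical names introduced here.

```shell
#!/usr/bin/env bash
set -euo pipefail

EXPECTED_GPUS="${EXPECTED_GPUS:-4}"

# Reads "NODE GPU_COUNT" lines on stdin and reports every node whose count
# differs from EXPECTED_GPUS. Returns non-zero if any node is reported.
check_gpu_counts() {
  awk -v want="$EXPECTED_GPUS" '
    $2 != want { printf "%s: expected %s GPUs, found %s\n", $1, want, $2; bad = 1 }
    END { exit bad }
  '
}

# Query the cluster only when kubectl is available.
if command -v kubectl >/dev/null 2>&1; then
  kubectl get nodes --no-headers \
    -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu' \
    | check_gpu_counts || echo "Some nodes do not advertise ${EXPECTED_GPUS} GPUs" >&2
fi
```

Nodes without GPUs, such as a dedicated control-plane node, show `<none>` in the GPU column and are reported too, so filter them out if that is expected in your cluster.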
- Start with the k8s-cluster-setup submodule. It creates the Kubernetes cluster and network settings that other components depend on.
- Network and interface settings are environment specific. Update the scripts in k8s-cluster-setup/scripts/ (interface name, CIDR, MASTER_IP, HARBOR_IP) before running them.
- The backend requires an initial database and an admin account. Run the SQL seed at backend/infra/db/schema.sql. Provide a .env file or a K8s Secret with correct DB credentials so initialization can succeed.
- Before deploying the backend, build and push the backend image to your registry. Use backend/scripts/build_image.sh, but first edit its registry/namespace/image variables to match your Harbor or other registry.
- Development manifests may use hostPath mounts. Do not use these in production. Replace hostPath with Secret, ConfigMap, or PVC for shared clusters.
- Suggested backend apply order after the image is available in the registry:
  1. kubectl apply -f ca.yaml (if present)
  2. kubectl apply -f go-api.yaml
  3. kubectl apply -f postgres.yaml
- Frontend development: run the dev server on the host for fast iteration (for example, use tmux and npm run dev). To deploy the frontend to K8s, build static files (npm run build), build a Docker image, push it to your registry, and update the image in the frontend manifest.
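The backend notes above can be strung together roughly as follows. This is a sketch under assumptions, not the project's deployment script: the Secret name `backend-db`, the `MANIFEST_DIR` variable, and the helper names are invented for illustration, and the script only prints the commands unless `DRY_RUN=0` is set.

```shell
#!/usr/bin/env bash
set -euo pipefail

DRY_RUN="${DRY_RUN:-1}"                    # 1 = print commands, 0 = execute them
MANIFEST_DIR="${MANIFEST_DIR:-.}"          # assumption: directory holding the backend manifests
SECRET_NAME="${SECRET_NAME:-backend-db}"   # hypothetical Secret name
ENV_FILE="${ENV_FILE:-.env}"

# Print the command in dry-run mode, otherwise run it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

deploy_backend() {
  # Turn the DB credentials in the .env file into a K8s Secret.
  run kubectl create secret generic "$SECRET_NAME" --from-env-file="$ENV_FILE"
  # ca.yaml is optional, so apply it only if it exists.
  if [ -f "$MANIFEST_DIR/ca.yaml" ]; then
    run kubectl apply -f "$MANIFEST_DIR/ca.yaml"
  fi
  run kubectl apply -f "$MANIFEST_DIR/go-api.yaml"
  run kubectl apply -f "$MANIFEST_DIR/postgres.yaml"
}

deploy_backend
```

Review the printed commands first, then rerun with `DRY_RUN=0` once the image has been pushed and the manifests reference it.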
# Enter submodule
cd k8s-device-plugin
# Create new branch and develop
git checkout -b feature/new-feature
# ... make changes ...
git add .
git commit -m "Add new feature"
git push origin feature/new-feature
# Return to main repo and update submodule reference
cd ..
git add k8s-device-plugin
git commit -m "Update k8s-device-plugin to latest"
# One-command update all submodules to latest
./scripts/update-all-submodules.sh
- Node Configuration: Each node requires 4x NVIDIA RTX 5090 GPUs
- Network: High-speed interconnect network (recommended 10GbE or higher)
- Storage: Shared storage system (e.g., NFS, Ceph)
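A quick way to check the GPU requirement on a node before it joins the cluster is to count the devices reported by `nvidia-smi`. This is a minimal sketch; `has_required_gpus` and `REQUIRED_GPUS` are hypothetical names, and the check does nothing on machines without `nvidia-smi`.

```shell
#!/usr/bin/env bash
set -euo pipefail

REQUIRED_GPUS="${REQUIRED_GPUS:-4}"

# Counts non-empty lines on stdin (one GPU per line) and succeeds only if
# the count equals REQUIRED_GPUS.
has_required_gpus() {
  local count
  count="$(grep -c . || true)"
  echo "detected ${count} GPU(s), required ${REQUIRED_GPUS}"
  [ "$count" -eq "$REQUIRED_GPUS" ]
}

# Run the check only where the NVIDIA driver tools are installed.
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi --query-gpu=name --format=csv,noheader | has_required_gpus \
    || echo "This node does not meet the GPU requirement" >&2
fi
```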
- Infrastructure: Kubernetes, Containerd
- GPU Management: NVIDIA Device Plugin (Custom), CUDA, MPS
- Backend: Go (backend)
- Frontend: TypeScript, React, Vite (frontend)
- Build Tools: Docker, Helm
- k8s-cluster-setup: @linskybing
- k8s-device-plugin: @linskybing (mps-individual-gpu branch)
- backend: @linskybing
- frontend: @ted1204
Please refer to individual LICENSE files in each submodule.
kubectl describe nodes | grep nvidia.com/gpu
Refer to example YAML files in k8s-device-plugin/tests/.
cd <submodule-name>
git pull origin <branch-name>
cd ..
git add <submodule-name>
git commit -m "Update <submodule-name>"
Last Updated: 2026-01-15