Multi-Node K8s GPU Management System

A comprehensive GPU management system designed for multi-node Kubernetes clusters, specifically optimized for environments with 4x NVIDIA RTX 5090 GPUs per node.

System Architecture

This system consists of four core components, each managed as a Git submodule:

1. k8s-cluster-setup

K8s Cluster Infrastructure

Responsible for building multi-node K8s clusters
Configures multi-GPU resource pool sharing mechanisms
Sets up networking and storage infrastructure
Integrates with k8s-device-plugin for GPU scheduling support

2. k8s-device-plugin

GPU Device Plugin (Custom Build)

Custom build based on the NVIDIA Device Plugin
Core Feature: Supports multi-GPU scheduling within a single Pod
Enables Multi-GPU CUDA workloads
Provides GPU MPS (Multi-Process Service) support
Implements fine-grained GPU resource allocation

3. frontend

User Interface Layer

Provides the sole interface for users to access the K8s cluster
Web UI for cluster management and monitoring
Offers multiple functionalities:
- GPU resource visualization
- Pod/Job management
- User access control
- Real-time monitoring dashboard

4. backend

API Server

Provides backend API services
Handles all API requests from the frontend
Interacts with K8s API Server
Manages user authentication and authorization
Handles GPU resource allocation logic

Component Architecture Diagram

┌─────────────────────────────────────────────────────┐
│                   User Access                        │
└──────────────────────┬──────────────────────────────┘
                       │
                       ▼
            ┌──────────────────────┐
            │   frontend           │ (Web UI)
            └──────────┬───────────┘
                       │ API Calls
                       ▼
            ┌──────────────────────┐
            │   backend            │ (API Server)
            └──────────┬───────────┘
                       │ K8s API
                       ▼
            ┌──────────────────────┐
            │  K8s Cluster         │
            │  k8s-cluster-setup   │ (Infrastructure)
            └──────────┬───────────┘
                       │
                       ▼
            ┌──────────────────────┐
            │ k8s-device-plugin    │ (GPU Scheduler)
            │ (Multi-GPU Support)  │
            └──────────────────────┘
                       │
                       ▼
            ┌──────────────────────┐
            │  4x RTX 5090 GPUs    │ (Per Node)
            └──────────────────────┘

Quick Start

1. Clone Repository (with all submodules)

# Clone main repo with all submodules
git clone --recurse-submodules <main-repo-url>
cd k8s

# Or if already cloned, initialize submodules
git submodule update --init --recursive
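
After cloning, it can help to confirm that every submodule was actually checked out. The sketch below relies on the documented `git submodule status` format, where a leading `-` marks a submodule that has not been initialized. The function name `uninitialized_submodules` is hypothetical and not part of this repository.

```shell
# Reads `git submodule status` output on stdin and prints the path of every
# submodule that is not initialized (status lines that start with "-").
# Returns non-zero if at least one such submodule is found.
uninitialized_submodules() {
  local missing=0 line path
  while IFS= read -r line; do
    case "$line" in
      -*)
        # Line format: "-<sha> <path>"; keep only the path field.
        path=${line#* }
        path=${path%% *}
        printf '%s\n' "$path"
        missing=1
        ;;
    esac
  done
  return "$missing"
}
```

Typical use from the main repo root: `git submodule status | uninitialized_submodules`. Any path it prints still needs `git submodule update --init --recursive`.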

2. Update All Submodules to Latest

# Update all submodules to their latest remote state
git submodule update --remote --merge

# Or update individually
cd k8s-cluster-setup && git pull origin master && cd ..
cd k8s-device-plugin && git pull origin mps-individual-gpu && cd ..
cd frontend && git pull origin main && cd ..
cd backend && git pull origin main && cd ..

3. Deployment Order

Build K8s Cluster

cd k8s-cluster-setup
# Follow its README.md to deploy K8s cluster

Deploy GPU Device Plugin

cd k8s-device-plugin
# Build and deploy custom GPU plugin (mps-individual-gpu branch)
make build
kubectl apply -f deployments/

Deploy API Server

```
cd backend
# Deploy backend API service
```

Deploy Frontend UI

```
cd frontend
# Deploy Web UI
```

Important Deployment Notes

Start with the k8s-cluster-setup submodule. It creates the Kubernetes cluster and the network settings that the other components depend on.
Network and interface settings are environment-specific. Before running the scripts in k8s-cluster-setup/scripts/, update the interface name, CIDR, MASTER_IP, and HARBOR_IP in them.
The backend requires an initial database and an admin account. Run the SQL seed at backend/infra/db/schema.sql, and provide a .env file or a K8s Secret with the correct DB credentials so that initialization can succeed.
Before deploying the backend, build the backend image and push it to your registry. Use backend/scripts/build_image.sh, but first edit its registry, namespace, and image variables to match your Harbor instance or other registry.
Development manifests may use hostPath mounts. Do not use these in production or on shared clusters; replace hostPath with a Secret, ConfigMap, or PVC.
Suggested backend apply order after image is available in the registry:
1. kubectl apply -f ca.yaml (if present)
2. kubectl apply -f go-api.yaml
3. kubectl apply -f postgres.yaml
Frontend development: run the dev server on the host for fast iteration (for example, use tmux and npm run dev). To deploy the frontend to K8s, build static files (npm run build), build a Docker image, push to your registry, and update the frontend manifest image.
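
For the DB credentials note above, one option is to validate the .env file before turning it into a Secret. This is a minimal sketch under assumptions: the helper name `env_file_has_keys`, the key names `DB_USER` and `DB_PASSWORD`, and the Secret name `backend-db` are all hypothetical, so check the backend manifests for the names they actually expect.

```shell
# Checks that an env file defines every key passed as an argument, so a
# Secret is not created with credentials missing.
# Prints one line per missing key and returns non-zero if any are missing.
env_file_has_keys() {
  local file=$1 key missing=0
  shift
  for key in "$@"; do
    if ! grep -q "^${key}=" "$file"; then
      printf 'missing key: %s\n' "$key"
      missing=1
    fi
  done
  return "$missing"
}

# Example (hypothetical key and Secret names):
#   env_file_has_keys .env DB_USER DB_PASSWORD \
#     && kubectl create secret generic backend-db --from-env-file=.env
```

`kubectl create secret generic` with `--from-env-file` turns each `KEY=value` line into one Secret entry, which keeps credentials out of the manifests themselves.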

A checklist of exact commands can be added if needed.

Development Workflow

Working Within Submodules

# Enter submodule
cd k8s-device-plugin

# Create new branch and develop
git checkout -b feature/new-feature
# ... make changes ...
git add .
git commit -m "Add new feature"
git push origin feature/new-feature

# Return to main repo and update submodule reference
cd ..
git add k8s-device-plugin
git commit -m "Update k8s-device-plugin to latest"

Keep All Modules Synchronized

# Update all submodules to latest with a single command
./scripts/update-all-submodules.sh
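
The contents of that script are not shown in this README. As a rough illustration only, a helper of this kind could loop over the submodules and pull the branch each one tracks. The sketch below assumes the branch names listed in the Quick Start section; the function name `update_submodules` and its `--dry-run` flag are hypothetical.

```shell
# Pulls the tracked branch of each submodule, or only prints the commands
# when called with --dry-run. Entries are "<path>:<branch>" pairs.
update_submodules() {
  local dry_run=${1:-} entry path branch
  for entry in \
    k8s-cluster-setup:master \
    k8s-device-plugin:mps-individual-gpu \
    frontend:main \
    backend:main
  do
    path=${entry%%:*}
    branch=${entry#*:}
    if [ "$dry_run" = "--dry-run" ]; then
      printf 'git -C %s pull origin %s\n' "$path" "$branch"
    else
      git -C "$path" pull origin "$branch" || return 1
    fi
  done
}
```

Running `update_submodules --dry-run` first shows what would be executed. After a real run, the main repo still needs a `git add` and commit of the submodule paths to record the new commits.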

Hardware Requirements

Node Configuration: Each node requires 4x NVIDIA RTX 5090 GPUs
Network: High-speed interconnect (10GbE or faster recommended)
Storage: Shared storage system (e.g., NFS, Ceph)

Technology Stack

Infrastructure: Kubernetes, containerd
GPU Management: NVIDIA Device Plugin (Custom), CUDA, MPS
Backend: Go (backend)
Frontend: TypeScript, React, Vite (frontend)
Build Tools: Docker, Helm

Maintainers

k8s-cluster-setup: @linskybing
k8s-device-plugin: @linskybing (mps-individual-gpu branch)
backend: @linskybing
frontend: @ted1204

License

Please refer to individual LICENSE files in each submodule.

FAQ

How to verify GPUs are configured correctly?

kubectl describe nodes | grep nvidia.com/gpu
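
To turn that check into a pass/fail result, the per-node allocatable count can be compared against the expected 4 GPUs. This is a sketch, not part of the repository: the function name `nodes_with_wrong_gpu_count` is hypothetical, and it assumes the plugin advertises the resource as `nvidia.com/gpu`. If MPS sharing is enabled, the advertised count may differ from the number of physical GPUs.

```shell
# Reads "<node> <gpu-count>" pairs on stdin and prints every node whose
# allocatable GPU count differs from the expected value (default: 4).
# Returns non-zero if any node is reported.
nodes_with_wrong_gpu_count() {
  local expected=${1:-4} node count bad=0
  while read -r node count; do
    [ -z "$node" ] && continue
    if [ "$count" != "$expected" ]; then
      printf '%s has %s GPUs, expected %s\n' "$node" "${count:-0}" "$expected"
      bad=1
    fi
  done
  return "$bad"
}

# Example:
#   kubectl get nodes --no-headers \
#     -o custom-columns='NODE:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu' \
#     | nodes_with_wrong_gpu_count 4
```

Nodes without the resource show up as `<none>` in the kubectl output and are reported like any other mismatch.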

How to test multi-GPU Pods?

Refer to example YAML files in k8s-device-plugin/tests/.
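
If those examples are not at hand, a generic smoke test is a Pod that requests several GPUs and lists the devices it can see. The sketch below is an assumption-laden stand-in, not one of the repository's test files: the function name, the pod name `multi-gpu-smoke-test`, and the image tag are placeholders, and the resource name `nvidia.com/gpu` is assumed to match what the custom plugin advertises.

```shell
# Prints a Pod manifest that requests <gpu-count> GPUs (default: 4) and runs
# `nvidia-smi -L` once to list the devices visible inside the container.
# Pod name and image tag are placeholders, not taken from the repository.
render_multi_gpu_pod() {
  local gpus=${1:-4}
  cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: multi-gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvidia/cuda:12.8.0-base-ubuntu24.04
      command: ["nvidia-smi", "-L"]
      resources:
        limits:
          nvidia.com/gpu: ${gpus}
EOF
}

# Example:
#   render_multi_gpu_pod 4 | kubectl apply -f -
#   kubectl logs multi-gpu-smoke-test
```

If scheduling works as intended, the pod log should list as many GPUs as were requested. A pod stuck in Pending usually means no single node has that many GPUs free.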

How to update a single submodule?

cd <submodule-name>
git pull origin <branch-name>
cd ..
git add <submodule-name>
git commit -m "Update <submodule-name>"

Last Updated: 2026-01-15
