diff --git a/docs/deploying-on-openshift.md b/docs/deploying-on-openshift.md
new file mode 100644
index 0000000..cb8e9cd
--- /dev/null
+++ b/docs/deploying-on-openshift.md
@@ -0,0 +1,138 @@
+# Deploying fake-gpu-operator on OpenShift
+
+This document will guide you through deploying fake-gpu-operator on an OpenShift cluster and performing some basic validation. Unlike most Kubernetes-based clusters, OpenShift places tighter security controls on what pods and workloads can do, so this guide covers the extra steps needed to get the operator running on OpenShift.
+
+Note that this operator is limited in what it can simulate. Do NOT expect to plug it in and have GPU-dependent workloads such as vLLM or llm-d run against it; as of this writing, the simulated GPUs do not support simulated inferencing. If you need that functionality, consider [llm-d/llm-d simulated-accelerators](https://github.com/llm-d/llm-d/tree/main/guides/simulated-accelerators) and/or [llm-d/llm-d-inference-sim](https://github.com/llm-d/llm-d-inference-sim).
+
+Note: Comparable OpenShift infrastructure and/or older OCP versions may work but have not been tested.
+Note: The `oc` CLI is used throughout this guide, but `kubectl` should work as well.
+
+## Prerequisites
+
+In addition to what is already mentioned on the main README page, keep the following points in mind.
+
+### Infrastructure Setup
+
+- AWS - An [EC2](https://aws.amazon.com/ec2/instance-types/m6a/) m6a.4xlarge instance type was used for the OpenShift control-plane nodes. The cluster was scaled up with two extra worker nodes for demonstration purposes. If you're a Red Hat associate, partner, or customer, you can provision a cluster through the Red Hat demo system.
+
+### Platform Setup
+
+- OpenShift - This guide was tested on OpenShift 4.20.
+- Cluster administrator access is required to grant the appropriate privileges to some of the fake-gpu-operator component service accounts.
+
+## Deployment Steps
+
+1. After logging into your OpenShift cluster, list the cluster nodes and note the ones you want to host the simulated GPUs. This guide selects the workers.
+   ```
+   $ oc get nodes
+   NAME                                        STATUS   ROLES                         AGE   VERSION
+   ip-10-0-13-14.us-east-2.compute.internal    Ready    worker                        21h   v1.33.5
+   ip-10-0-28-20.us-east-2.compute.internal    Ready    control-plane,master,worker   43h   v1.33.5
+   ip-10-0-30-218.us-east-2.compute.internal   Ready    worker                        21h   v1.33.5
+   ```
+1. Label the selected nodes.
+   ```
+   oc label node ip-10-0-13-14.us-east-2.compute.internal run.ai/simulated-gpu-node-pool=default
+   oc label node ip-10-0-30-218.us-east-2.compute.internal run.ai/simulated-gpu-node-pool=default
+   ```
+1. Deploy the Helm chart. Pick the version you want from the fake-gpu-operator repository releases page, and drop the "v" prefix when running the Helm command. Override the `environment.openshift` value from the values file so the appropriate security context constraints are configured.
+   ```
+   helm upgrade -i gpu-operator oci://ghcr.io/run-ai/fake-gpu-operator/fake-gpu-operator --namespace gpu-operator --create-namespace --version 0.0.64 --set environment.openshift=true
+   ```
+1. You should see Helm output similar to the following.
+   ```
+   Release "gpu-operator" does not exist. Installing it now.
+   Pulled: ghcr.io/run-ai/fake-gpu-operator/fake-gpu-operator:0.0.64
+   Digest: sha256:f3a96f26ebc3bd77a2c50c4f792c692064826b99906aead51720413e6936e08b
+   NAME: gpu-operator
+   LAST DEPLOYED: Wed Dec 10 13:30:57 2025
+   NAMESPACE: gpu-operator
+   STATUS: deployed
+   REVISION: 1
+   TEST SUITE: None
+   ```
+1. Validate that the deployments and daemonsets are up.
+   ```
+   $ oc get deploy,ds -n gpu-operator
+   NAME                                     READY   UP-TO-DATE   AVAILABLE   AGE
+   deployment.apps/gpu-operator             1/1     1            1           14m
+   deployment.apps/kwok-gpu-device-plugin   1/1     1            1           14m
+   deployment.apps/nvidia-dcgm-exporter     1/1     1            1           14m
+   deployment.apps/status-updater           1/1     1            1           14m
+   deployment.apps/topology-server          1/1     1            1           14m
+
+   NAME                                   DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                    AGE
+   daemonset.apps/device-plugin           2         2         2       2            2           nvidia.com/gpu.deploy.device-plugin=true         14m
+   daemonset.apps/mig-faker               0         0         0       0            0           node-role.kubernetes.io/runai-dynamic-mig=true   14m
+   daemonset.apps/nvidia-dcgm-exporter    2         2         2       2            2           nvidia.com/gpu.deploy.dcgm-exporter=true         14m
+   ```
+1. Save the following content as a YAML file named `gpu-test-pod.yaml`.
+   ```
+   apiVersion: v1
+   kind: Pod
+   metadata:
+     name: gpu-test-pod
+     namespace: gpu-operator
+   spec:
+     containers:
+     - name: gpu-container
+       image: image-registry.openshift-image-registry.svc:5000/openshift/tools:latest
+       resources:
+         limits:
+           nvidia.com/gpu: 1
+       env:
+       - name: NODE_NAME
+         valueFrom:
+           fieldRef:
+             fieldPath: spec.nodeName
+       command:
+       - /bin/sh
+       - -c
+       - |
+         while true; do
+           echo "NODE_NAME=$NODE_NAME"
+           sleep 10
+         done
+   ```
+1. Create the pod, which "needs" a GPU, to simulate scheduling onto a "GPU" node.
+   ```
+   $ oc apply -f gpu-test-pod.yaml
+   pod/gpu-test-pod created
+   ```
+1. Confirm that the mock `nvidia-smi` command was injected into the pod's runtime and that it reports the default simulated `Tesla-K80` GPU info.
+   ```
+   $ oc exec pod/gpu-test-pod -n gpu-operator -- nvidia-smi
+   Wed Dec 10 03:15:26 2025
+   +------------------------------------------------------------------------------+
+   | NVIDIA-SMI 470.129.06     Driver Version: 470.129.06     CUDA Version: 11.4  |
+   +--------------------------------+----------------------+----------------------+
+   | GPU  Name      Persistence-M   | Bus-Id        Disp.A | Volatile Uncorr. ECC |
+   | Fan  Temp  Perf  Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
+   |                                |                      |               MIG M. |
+   +--------------------------------+----------------------+----------------------+
+   |   0  Tesla-K80            Off  | 00000001:00:00.0 Off |                  Off |
+   | N/A   33C    P8    11W /  70W  | 11441MiB / 11441MiB  |    100%      Default |
+   |                                |                      |                  N/A |
+   +--------------------------------+----------------------+----------------------+
+
+   +------------------------------------------------------------------------------+
+   | Processes:                                                                    |
+   |  GPU   GI   CI        PID   Type   Process name                  GPU Memory  |
+   |        ID   ID                                                   Usage       |
+   +------------------------------------------------------------------------------+
+   |    0   N/A  N/A        17      G   /bin/sh-cwhile true; do                   |
+   |        ..                                                          11441MiB  |
+   +------------------------------------------------------------------------------+
+   ```
+
+
+## Tips
+- Have at least two nodes labelled, as for some reason the nvidia-dcgm-exporter pods otherwise complain with the following strange messages:
+  ```
+  2025/12/09 05:16:40 Topology update not received within interval, publishing...
+  2025/12/09 05:16:40 Error getting configmap: topology-ip-10-0-13-14.us-east-2.compute.internal
+  ```
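+- To confirm that the labelled nodes actually advertise simulated GPU capacity, you can inspect the node status. The snippet below is a minimal sketch; it assumes the `run.ai/simulated-gpu-node-pool=default` label and the default `nvidia.com/gpu` resource name used elsewhere in this guide.
+  ```
+  # List the labelled nodes along with the nvidia.com/gpu capacity they report.
+  oc get nodes -l run.ai/simulated-gpu-node-pool=default \
+    -o custom-columns='NAME:.metadata.name,GPUS:.status.capacity.nvidia\.com/gpu'
+  ```
+  Each labelled node should report a non-zero GPU count; `<none>` typically means the device plugin has not populated that node yet.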
+- If for any reason you need to remove the Helm release, execute the following. Replace the release name and namespace with your own if they differ.
+  ```
+  $ helm uninstall gpu-operator --namespace gpu-operator
+  release "gpu-operator" uninstalled
+  ```
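+- If you also want to revert the node labels added earlier, append `-` to the label key to remove it. This is an optional cleanup step; the node names below are the examples used in this guide, so substitute your own.
+  ```
+  # Remove the simulated GPU node-pool label from the previously labelled nodes.
+  oc label node ip-10-0-13-14.us-east-2.compute.internal run.ai/simulated-gpu-node-pool-
+  oc label node ip-10-0-30-218.us-east-2.compute.internal run.ai/simulated-gpu-node-pool-
+  ```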