Releases: ROCm/gpu-operator
gpu-operator-charts-v1.4.1
GPU Operator v1.4.1 Release Notes
The AMD GPU Operator v1.4.1 release extends platform support to OpenShift v4.20 and Debian 12, and introduces the ability to build amdgpu kernel modules directly within air-gapped OpenShift clusters.
Important Notice
- New AMDGPU Driver Versioning Scheme
- Starting with ROCm 7.1, the AMD GPU driver version numbering has diverged from the ROCm release version. The amdgpu driver now uses an independent versioning scheme (e.g., driver version 30.20 corresponds to ROCm 7.1). When specifying driver versions in the DeviceConfig CR spec.driver.version field, reference the amdgpu driver version (e.g., "30.20") for ROCm 7.1 and later releases. For ROCm versions prior to 7.1, continue to use the ROCm version number (e.g., "6.4", "7.0"). Please refer to the AMD ROCm documentation for the driver version that corresponds to your desired ROCm release. All published amdgpu driver versions are available in the Radeon Repository. A minimal example is shown below.
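For illustration, here is a minimal DeviceConfig sketch covering only the driver-version aspect of this notice; the metadata and other fields are placeholders, and the apiVersion should be verified against the installed CRD:

```yaml
apiVersion: amd.com/v1alpha1        # verify against your installed DeviceConfig CRD
kind: DeviceConfig
metadata:
  name: example-deviceconfig        # placeholder name
  namespace: kube-amd-gpu
spec:
  driver:
    enable: true
    # ROCm 7.1 and later: reference the amdgpu driver version (30.20 corresponds to ROCm 7.1)
    version: "30.20"
    # For ROCm releases prior to 7.1, keep using the ROCm version instead, e.g.:
    # version: "6.4"
```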
Release Highlights
- OpenShift Platform Support Enhancements
- Build Driver Images Directly within Disconnected OpenShift Clusters
- Starting from v1.4.1, the AMD GPU Operator supports building driver kernel modules directly within disconnected OpenShift clusters.
- For Red Hat Enterprise Linux CoreOS (used by OpenShift), OpenShift will download the source code and firmware from AMD-provided amdgpu-driver images into its DriverToolKit and build the kernel modules directly from source, without depending on a large set of RPM packages.
- Cluster Monitoring Enablement
- The v1.4.1 AMD GPU Operator automatically creates the RBAC resources required by the OpenShift Cluster Monitoring stack. This removes one manual configuration step when setting up the OpenShift monitoring stack to scrape metrics from the device metrics exporter.
- Integration with OpenShift Cluster Observability Operator Accelerator Dashboard
- Starting with v1.4.1, the AMD GPU Operator automatically creates a PrometheusRule that translates key metrics into formats compatible with the OpenShift Cluster Observability Operator's accelerator dashboard, providing an improved out-of-the-box experience.
- Device-Metrics-Exporter Enhancements
- Enhanced Pod and Service Annotations
- Custom annotations can now be applied to exporter pods and services via the DeviceConfig CRD, providing greater flexibility in metadata management (see the sketch below).
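A hypothetical sketch of what such annotations could look like in the DeviceConfig; the field names podAnnotations and serviceAnnotations are assumptions for illustration only — consult the DeviceConfig CRD reference for the exact schema:

```yaml
spec:
  metricsExporter:
    enable: true
    podAnnotations:                     # assumed field name, for illustration
      example.com/scrape-team: "infra"
    serviceAnnotations:                 # assumed field name, for illustration
      example.com/owner: "gpu-platform"
```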
- Test Runner Enhancements
- Level-Based and Partitioned GPU Test Recipe Support
- Test runner now supports level-based test recipes and partitioned GPU test recipes, enabling more granular and flexible GPU testing scenarios.
- Enhanced Test Result Events
- Test runner Kubernetes events now include additional information such as pod UID and test framework name (e.g., RVS, AGFHC) as event labels, providing more comprehensive test run information for improved tracking and diagnostics.
Fixes
- Node Feature Discovery Rule Fix
- Fixed the PCI device ID for the Virtual Function (VF) of MI308X and MI300X-HF GPUs
- Helm Chart default DeviceConfig Fix
- Fixed an issue where the Helm chart could not render the metrics exporter's pod resource API socket path in the default DeviceConfig when the path was specified via values.yaml or the --set option.
Known Limitations
- Test Runner
- RVS-generated result.json files may contain redundant brackets at the end for the level-based recipes newly introduced in v1.4.1, resulting in an invalid JSON schema.
- Device Config Manager
- Memory partition operations may occasionally fail due to leaked device handlers that prevent the amdgpu driver from being unloaded when applying a new memory partition profile. This issue has been observed on Debian 12 with MI325X GPU when using the v1.4.1 Device Config Manager.
- Workaround: Reboot the affected worker nodes and retry the partitioning operation (an example sequence is sketched below).
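A possible workaround sequence, assuming kubectl access to the cluster and SSH access to the node; all names are placeholders:

```bash
# Drain the affected worker node so workloads are rescheduled elsewhere
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

# Reboot the node (directly or via your provisioning tooling)
ssh <node-name> sudo reboot

# Once the node is Ready again, allow scheduling and re-apply the partition profile
kubectl uncordon <node-name>
```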
gpu-operator-charts-v1.4.0
GPU Operator v1.4.0 Release Notes
The AMD GPU Operator v1.4.0 adds MI35X platform support and updates all managed operands to ROCm 7 runtime libraries, aligning the full stack with the ROCm 7 release.
Release Highlights
- Test Runner
- MI35X Support
- Introduced support for testing MI35X series GPUs.
- Expanded Test Framework
- Enabled execution of AMD GPU Field Health Check (AGFHC) test recipes.
- Note: Public test runner images support only ROCmValidationSuite (RVS) test recipes. Using AGFHC-related features requires a licensed private test runner image. Contact AMD representatives for access.
- For details, refer to the AGFHC documentation.
- Device Config Manager
- MI35X Support
- Added support for MI35X series GPUs to enable the configuration of GPU partitions.
- Device-Metrics-Exporter enhancements
- MI35X Support
- Added support for MI35X series GPUs to enable the collection of GPU metrics.
- Mask Unsupported Fields
- Platform-specific unsupported fields (marked as N/A by amd-smi) will not be exported. Boot logs will indicate which fields are supported by the platform (logged once during startup).
- New Profiler Fields
- New profiler fields have been added to provide better insight into application behavior.
- Deprecated Fields Notice
- The following fields are deprecated from the 6.14.14 driver onwards:
- GPU_MMA_ACTIVITY
- GPU_JPEG_ACTIVITY
- GPU_VCN_ACTIVITY
- These fields are replaced by the following fields:
- GPU_JPEG_BUSY_INSTANTANEOUS
- GPU_VCN_BUSY_INSTANTANEOUS
- Platform Support
- Validated on vanilla Kubernetes 1.32 and 1.33
Fixes
- Failed to load GPU Operator managed amdgpu kernel module on Ubuntu 24.04
- When using the GPU Operator to build and manage the amdgpu kernel module, module loading could fail on Ubuntu 24.04 worker nodes if the node didn't have linux-modules-extra-$(uname -r) installed.
- This issue is fixed in this release; linux-modules-extra-$(uname -r) no longer needs to be installed on the worker node (a manual install example for earlier releases is shown below).
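For users still on earlier operator releases, the missing package can be installed manually on the Ubuntu 24.04 worker node (standard apt usage, shown for reference):

```bash
# Install the extra kernel modules matching the running kernel
sudo apt-get update
sudo apt-get install -y "linux-modules-extra-$(uname -r)"
```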
- Improved Test Runner Result Handling
- Previously, if some test cases in a recipe were skipped while others passed, the test runner would incorrectly mark the entire recipe as failed.
- Now, the test runner marks the recipe as passed if at least some test cases pass. If all test cases are skipped, the recipe is marked as skipped.
- Device Config Manager keeps retrying and waiting for unsupported memory partition type
- This issue has been fixed: if users provide a memory partition type that is unsupported for the GPU model, DCM now immediately fails the workflow instead of retrying indefinitely.
Known Limitations
Note: All current and historical limitations for the GPU Operator, including their latest statuses and any associated workarounds or fixes, are tracked in the following documentation page: Known Issues and Limitations.
Please refer to this page regularly for the most up-to-date information.
gpu_operator_helm_chart_v1.3.1
GPU Operator v1.3.1 Release Notes
The AMD GPU Operator v1.3.1 release extends platform support to OpenShift v4.19 for GPU partitioning on MI300 series GPUs with the Device Config Manager (DCM) component.
Release Highlights
- Device Config Manager
- OpenShift Support for configuring GPU compute & memory partitions
- Added OpenShift platform support for the Device-Config-Manager to enable the configuration of GPU partitions.
- Device-Metrics-Exporter enhancements
- New Metric Fields
- GPU_GFX_BUSY_INSTANTANEOUS, GPU_VC_BUSY_INSTANTANEOUS, and GPU_JPEG_BUSY_INSTANTANEOUS have been added to represent partition activities at a more granular level.
- GPU_GFX_ACTIVITY is only applicable to unpartitioned systems; users must rely on the new BUSY_INSTANTANEOUS fields on partitioned systems.
- Health Service Config
- Health services can now be disabled through a ConfigMap (an illustrative sketch is shown below).
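A purely illustrative ConfigMap sketch for this option; the data keys shown here are hypothetical — refer to the Device Metrics Exporter documentation for the actual configuration schema:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: metrics-exporter-config       # referenced from the DeviceConfig (name is a placeholder)
  namespace: kube-amd-gpu
data:
  # Hypothetical key and structure; see the exporter docs for the real schema
  config.json: |
    {
      "GPUConfig": {
        "HealthService": false
      }
    }
```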
- Profiler Metrics Default Config Change
- In the previous exporter release (v1.3.0), the ConfigMap under the example directory had profiler metrics enabled by default. From v1.3.1 onwards they are disabled by default, because profiling is generally needed only by application developers. If needed, enable them through the ConfigMap and make sure no other exporter instance or other tool is running the ROCm profiler at the same time.
- Platform Support
- OpenShift 4.19 platform support has been added in this release.
Documentation Updates
Note: All current and historical limitations for the GPU Operator, including their latest statuses and any associated workarounds or fixes, are tracked in the following documentation page: Known Issues and Limitations.
Please refer to this page regularly for the most up-to-date information.
Fixes
- Test Runner pod restart failure with a GPU partition profile change
- Previously, in v1.3.0, if users disabled the test runner (cutting short an ongoing test) and then re-enabled it after an underlying GPU partition profile change, the test runner could fail to restart because the partition profile change altered the device IDs.
- This has been fixed in the v1.3.1 release.
- Device Config Manager memory partition failure when the driver was installed by the Kernel Module Management (KMM) Operator
- Previously, in v1.3.0, if worker nodes had no inbox/pre-installed amdgpu driver (ROCm 6.4+) and users installed the driver via the KMM operator, memory partition configuration through the Device Config Manager would fail.
- Users who are using an inbox/pre-installed amdgpu driver (ROCm 6.4+) are not affected.
- This issue has been fixed in the v1.3.1 release.
gpu_operator_helm_chart_v1.3.0
GPU Operator v1.3.0 Release Notes
The AMD GPU Operator v1.3.0 release introduces new features, the most notable of which is support for GPU partitioning on MI300 series GPUs with the new Device Config Manager component.
Release Highlights
- Support for configuring, scheduling & monitoring GPU compute & memory partitions
A new component called Device-Config-Manager enables configuration of GPU partitions.
- XGMI & PCIE Topology Aware GPU workload scheduling
Local topology-aware scheduling has been implemented to prioritize scheduling GPUs and GPU partitions from the same node or the same GPU where possible.
- Device-Metrics-Exporter enhancements
a. New exporter metrics generated by the ROCm profiler
b. Ability to add user-defined Pod labels on the exported metrics
c. Ability to prefix all metric names with a user-defined prefix
- DeviceConfig creation via Helm install
Installing the GPU Operator via Helm now also installs a default DeviceConfig custom resource. Use the --set crds.defaultCR.install=false flag to skip this during the Helm install if desired (see the example below).
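For example, a typical Helm install that skips the default DeviceConfig; the repository URL, chart name, and namespace below follow the usual install flow and may differ in your environment:

```bash
helm repo add rocm https://rocm.github.io/gpu-operator
helm repo update

# Install the operator but skip creation of the default DeviceConfig custom resource
helm install amd-gpu-operator rocm/gpu-operator-charts \
  --namespace kube-amd-gpu --create-namespace \
  --set crds.defaultCR.install=false
```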
Platform Support
- No new platform support has been added in this release. While the GPU Operator now supports OpenShift 4.17, the newly introduced features in this release (GPU Health Monitoring, Automatic Driver & Component Upgrade, and Test Runner) are currently only available for vanilla Kubernetes deployments. These features are not yet supported on OpenShift, and OpenShift support will be introduced in the next minor release.
Documentation Updates
- Updated Release notes detailing new features in v1.3.0.
- New section added describing the new Device Config Manager component responsible for configuring GPU partitions on GPU worker nodes.
- Updated GPU Operator install instructions to include the default DeviceConfig custom resource that gets created and how to skip installing it if desired.
Known Limitations
- The Device Config Manager is currently only supported on Kubernetes. We will be adding a Debian package to support bare metal installations in the next release of DCM. For the time being:
- The Device Config Manager requires running a Docker container if you wish to run it in standalone mode (without Kubernetes).
- Impact: Users wishing to use a standalone version of the Device Config Manager will need to run a standalone Docker image and configure the partitions using a config.json file.
- Root Cause: DCM does not currently support standalone installation via a Debian package like other standalone components of the GPU Operator. We will be adding a Debian package to support standalone bare metal installations in the next release of DCM.
- Recommendation: Those wishing to use GPU partitioning in a bare metal environment should instead use the standalone Docker image for DCM. Alternatively, users can use amd-smi to change partitioning modes. See the amdgpu-docs documentation for how to do this.
- The GPU Operator will report an error when the ROCm driver install version doesn't match the version string in the Radeon repository.
- Impact: The DeviceConfig will report an error if you specify "6.4.0" or "6.3.0" for spec.driver.version.
- Root Cause: The version specified in the CR still has to match the version string in the Radeon repository.
- Recommendation: Although this will be fixed in a future version of the GPU Operator, for the time being you will instead need to specify "6.4" or "6.3" when installing those versions of the ROCm amdgpu driver (see the sketch below).
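A minimal illustrative fragment of the driver section; only the version string matters here, the rest is placeholder:

```yaml
spec:
  driver:
    enable: true
    # Use the short version string that matches the Radeon repository,
    # e.g. "6.4" rather than "6.4.0", until this is fixed.
    version: "6.4"
```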
Note: All current and historical limitations for the GPU Operator, including their latest statuses and any associated workarounds or fixes, are tracked in the following documentation page: Known Issues and Limitations.
Please refer to this page regularly for the most up-to-date information.
gpu_operator_helm_chart_v1.2.2
GPU Operator v1.2.2 Release Notes
The AMD GPU Operator v1.2.2 release introduces new features to support the Device Metrics Exporter's integration with the Prometheus Operator via the ServiceMonitor custom resource, and also introduces several bug fixes.
Release Highlights
- Enhanced Metrics Integration with Prometheus Operator
This release introduces a streamlined method for integrating the metrics endpoint of the metrics exporter with the Prometheus Operator.
Users can now leverage the DeviceConfig custom resource to specify the necessary configuration for metrics collection. The GPU Operator will automatically read the relevant DeviceConfig and manage the creation and lifecycle of a corresponding ServiceMonitor custom resource. This automation simplifies the process of exposing metrics to the Prometheus Operator, allowing for easier scraping and monitoring of GPU-related metrics within your Kubernetes environment (an illustrative configuration sketch is shown below).
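An illustrative DeviceConfig fragment for this integration; the field names under metricsExporter are assumptions and should be checked against the DeviceConfig CRD reference:

```yaml
spec:
  metricsExporter:
    enable: true
    prometheus:                  # assumed structure for ServiceMonitor-related settings
      serviceMonitor:
        enable: true
        interval: 30s            # example scrape interval
        labels:
          release: prometheus    # match your Prometheus Operator's selector labels
```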
Documentation Updates
- Updated Release notes detailing new features in v1.2.2.
Known Limitations
Note: All current and historical limitations for the GPU Operator, including their latest statuses and any associated workarounds or fixes, are tracked in the following documentation page: Known Issues and Limitations.
Please refer to this page regularly for the most up-to-date information.
Fixes
- Node labeller failed to report node labels when users are using DeviceConfig with spec.driver.enable=false and a customized node selector in spec.selector [#183]
- Issue: When users are using the inbox driver, they set spec.driver.enable=false within the DeviceConfig spec. If they are also using a customized node selector in spec.selector, once the node labeller was brought up, its GPU property labels did not show up among the Node resource labels.
- Root Cause: When users are using spec.driver.enable=false and a customized non-default selector in spec.selector, the operator controller manager used the wrong selector to clean up the node labeller's labels on non-GPU nodes.
- Resolution: This issue has been fixed in v1.2.2. Users can upgrade to v1.2.2 and the GPU property node labels will show up once the node labeller is brought up again.
- Users' self-defined node labels under the amd.com domain are unexpectedly removed [#151]
- Issue: When users created node labels under the amd.com domain (e.g., amd.com/gpu: "true") for their own usage, the labels were unexpectedly removed during bootstrapping.
- Root Cause:
- When the node labeller pod launched, it removed all node labels within amd.com and beta.amd.com from the current node and then posted the labels managed by itself.
- When the operator executed its reconcile function, the removal of the DevicePlugin would remove all node labels under the amd.com or beta.amd.com domain, even if they were not managed by the node labeller.
- Resolution: This issue has been fixed in v1.2.2 on both the operator and node labeller side. Users can upgrade to the v1.2.2 operator Helm chart and use the latest node labeller image; then only node-labeller-managed labels will be automatically removed, and other user-defined labels under amd.com or beta.amd.com won't be automatically removed by the operator or node labeller.
- During automatic driver upgrade, nodes can get stuck in reboot-in-progress
- Issue: When users upgrade the driver version using the DeviceConfig automatic upgrade feature with spec.driver.upgradePolicy.enable=true and spec.driver.upgradePolicy.rebootRequired=true, some nodes may get stuck in the reboot-in-progress state.
- Root Cause:
- The upgrade manager checked the generationID of the DeviceConfig to make sure any spec change during an upgrade wouldn't interfere with the ongoing upgrade. However, if the CR changed in parts of the spec unrelated to the upgrade, this check became a problem: a new driver upgrade would not start because of the unrelated CR changes.
- During the driver upgrade, when a node reboot happened, the controller manager pod could also be affected and rescheduled to another node. When it came back, it checked for reboot-in-progress in its init phase and attempted to delete the reboot pod, but the reboot pod might have already terminated by then.
- Resolution: The controller manager's upgrade manager module has been patched to fix this issue in release v1.2.2; upgrading to the new controller manager image resolves it.
gpu_operator_helm_chart_v1.2.1
GPU Operator v1.2.1 Release Notes
The GPU Operator v1.2.1 release adds support for Red Hat OpenShift versions 4.16, 4.17 and 4.18. The AMD GPU Operator v1.2.1 has gone through a rigorous validation process and is certified for use on OpenShift. It can be deployed via the Red Hat Catalog.
Release Highlights
- The AMD GPU Operator has now been certified for use with Red Hat OpenShift v4.16, v4.17 and v4.18
New Platform Support
- Red Hat OpenShift 4.16-4.18
- Supported features:
- GPU Health Monitoring
- Automatic Driver Upgrades
- Test Runner for GPU Diagnostics
- Requirements: Red Hat OpenShift version 4.16, 4.17 or 4.18
gpu_operator_helm_chart_v1.2.0
GPU Operator v1.2.0 Release Notes
The GPU Operator v1.2.0 release introduces significant new features, including GPU health monitoring, automated component and driver upgrades, and a test runner for enhanced validation and troubleshooting. These improvements aim to increase reliability, streamline upgrades, and provide enhanced visibility into GPU health.
Release Highlights
- GPU Health Monitoring
- Real-time health checks via metrics exporter
- Integration with Kubernetes Device Plugin for automatic removal of unhealthy GPUs from compute node schedulable resources
- Customizable health thresholds via K8s ConfigMaps
- GPU Operator and Automated Driver Upgrades
- Automatic and manual upgrades of the device plugin, node labeller, test runner and metrics exporter via configurable upgrade policies
- Automatic driver upgrades are now supported, with node cordon, drain, version tracking and optional node reboot
- Test Runner for GPU Diagnostics
- Automated testing of unhealthy GPUs
- Pre-start job tests embedded in workload pods
- Manual and scheduled GPU tests with event logging and result tracking
Platform Support
- No new platform support has been added in this release. While the GPU Operator now supports OpenShift 4.17, the newly introduced features in this release (GPU Health Monitoring, Automatic Driver & Component Upgrade, and Test Runner) are currently only available for vanilla Kubernetes deployments. These features are not yet supported on OpenShift, and OpenShift support will be introduced in the next minor release.
Documentation Updates
- Updated Release notes detailing new features in v1.2.0
- Continued effort to respond to all user-reported GitHub issues on the ROCm/gpu-operator repo; 5 of those issues were resolved with this new GPU Operator release [#2], [#23], [#25], [#30], [#55]
- Updated Known Issues and Limitations section to highlight limitations users should be aware of including 6 new issues and 1 fixed issue
- Updated Quick Start Guide for GPU Operator to make it easier for users to get started
- Revamped Driver Upgrade Guide that includes utilizing Automatic Upgrade Process (New feature in v1.2.0)
- New Documentation section on Upgrading the GPU Operator, as well as Upgrading GPU Operator Components (new features for v1.2.0)
- New Metrics Exporter documentation on Health Checks feature (new feature for v1.2.0)
- New Documentation section for Test Runner (new feature for v1.2.0)
- New and updated GPU Operator examples in the GPU Operator repo including Test Runner job examples
- Whole site doc review conducted for the AMD GPU Operator Instinct docs site and many corrections made to outdated or incorrect documentation
Known Limitations
- Incomplete Cleanup on Manual Module Removal
- Impact: When AMD GPU drivers are manually removed (instead of using the operator for uninstallation), not all GPU modules are cleaned up completely.
- Recommendation: Always use the GPU Operator for installing and uninstalling drivers to ensure complete cleanup.
- Inconsistent Node Detection on Reboot
- Impact: In some reboot sequences, Kubernetes fails to detect that a worker node has reloaded, which prevents the operator from installing the updated driver. This happens as a result of the node rebooting and coming back online too quickly before the default time check interval of 50s.
- Recommendation: Consider tuning the kubelet and controller-manager flags (such as --node-status-update-frequency=10s, --node-monitor-grace-period=40s, and --node-monitor-period=5s) to improve node status detection; a sketch is shown below. Refer to the Kubernetes documentation for more details.
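A sketch of where these flags typically live on a kubeadm-provisioned cluster; the file paths are common defaults and may differ in your environment:

```bash
# kubelet (each worker node): report node status more frequently.
# For example, append to KUBELET_EXTRA_ARGS in /etc/default/kubelet and restart kubelet:
KUBELET_EXTRA_ARGS="--node-status-update-frequency=10s"

# kube-controller-manager (control plane): detect node state changes sooner.
# For example, add these flags to /etc/kubernetes/manifests/kube-controller-manager.yaml:
#   - --node-monitor-period=5s
#   - --node-monitor-grace-period=40s
```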
- Inconsistent Metrics Fetch Using NodePort
- Impact: When accessing metrics via a NodePort service (NodeIP:NodePort) from within a cluster, Kubernetes' built-in load balancing may sometimes route requests to different pods, leading to occasional inconsistencies in the returned metrics. This behavior is inherent to Kubernetes networking and is not a defect in the GPU Operator.
- Recommendation: Only use the internal PodIP and configured pod port (default: 5000) when retrieving metrics from within the cluster instead of the NodePort (see the example below). Refer to the Metrics Exporter documentation section for more details.
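A quick illustration of scraping one exporter pod directly from within the cluster, assuming the default port 5000 and the standard /metrics path; the pod name and namespace are placeholders:

```bash
# Look up the exporter pod's IP, then query its metrics endpoint directly
POD_IP=$(kubectl get pod <metrics-exporter-pod> -n kube-amd-gpu -o jsonpath='{.status.podIP}')
curl "http://${POD_IP}:5000/metrics"
```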
- Helm Install Fails if GPU Operator Image is Unavailable
- Impact: If the image provided via --set controllerManager.manager.image.repository and --set controllerManager.manager.image.tag does not exist, the controller manager pod may enter a CrashLoopBackOff state and hinder uninstallation unless --no-hooks is used.
- Recommendation: Ensure that the correct GPU Operator controller image is available in your registry before installation. To uninstall the operator after seeing an ErrImagePull error, use --no-hooks to bypass all pre-uninstall Helm hooks (example below).
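For illustration, assuming the release was installed as amd-gpu-operator in the kube-amd-gpu namespace (both names are placeholders):

```bash
# Skip pre-uninstall hooks so the failing controller pod does not block removal
helm uninstall amd-gpu-operator --namespace kube-amd-gpu --no-hooks
```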
- Driver Reboot Requirement with ROCm 6.3.x
- Impact: While using ROCm 6.3+ drivers, the operator may not complete the driver upgrade properly unless a node reboot is performed.
- Recommendation: Manually reboot the affected nodes after the upgrade to complete driver installation. Alternatively, we recommend setting rebootRequired to true in the upgrade policy for driver upgrades (see the fragment below). This ensures that a reboot is triggered after the driver upgrade, guaranteeing that the new driver is fully loaded and applied. This workaround should be used until the underlying issue is resolved in a future release.
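An illustrative DeviceConfig fragment; the upgradePolicy field names follow those referenced elsewhere in these notes, but verify them against your operator version's CRD reference:

```yaml
spec:
  driver:
    enable: true
    upgradePolicy:
      enable: true
      # Trigger a node reboot after the driver upgrade so the new driver is fully loaded
      rebootRequired: true
```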
- Driver Upgrade Timing Issue
- Impact: During an upgrade, if a node's ready status fluctuates (e.g., from Ready to NotReady to Ready) before the driver version label is updated by the operator, the old driver might remain installed. The node might continue running the previous driver version even after an upgrade has been initiated.
- Recommendation: Ensure nodes are fully stable before triggering an upgrade, and if necessary, manually update node labels to enforce the new driver version. Refer to driver upgrade documentation for more details.
Fixes
- Driver Upgrade Failure with Exporter Enabled
- Previously, enabling the exporter alongside the operator caused driver upgrades to fail.
- Status: This issue has been fixed in v1.2.0.
gpu_operator_helm_chart_v1.1.0
GPU Operator v1.1.0 Release Notes
The GPU Operator v1.1.0 release adds support for Red Hat OpenShift versions 4.16 and 4.17. The AMD GPU Operator has gone through a rigorous validation process and is now certified for use on OpenShift. It can now be deployed via the Red Hat Catalog.
The latest AMD GPU Operator OLM Bundle for OpenShift is tagged with version v1.1.1 as the operator image has been updated to include a minor driver fix.
Release Highlights
- The AMD GPU Operator has now been certified for use with Red Hat OpenShift v4.16 and v4.17
- Updated documentation with installation and configuration steps for Red Hat OpenShift
Platform Support
New Platform Support
- Red Hat OpenShift 4.16-4.17
- Supported features:
- Driver management
- Workload scheduling
- Metrics monitoring
- Requirements: Red Hat OpenShift version 4.16 or 4.17
Known Limitations
- Due to an issue with KMM 2.2, deletion of the DeviceConfig Custom Resource gets stuck in Red Hat OpenShift
- Impact: Not able to delete the DeviceConfig Custom Resource if the node reboots during uninstall.
- Affected Configurations: This issue only affects Red Hat OpenShift
- Workaround: This issue will be fixed in the next release of KMM. For the time being, you can use a version of KMM other than 2.2, or manually remove the status from the NMC:
- List all the NMC resources and pick the correct NMC (there is one NMC per node, named the same as the node it relates to).
oc get nmc -A
- Edit the NMC.
oc edit nmc <nmc name>
- Remove from the NMC status all the data related to your module and save. That should allow the module to be finally deleted.
gpu_operator_helm_chart_v1.0.0
AMD GPU Operator v1.0.0 Release Notes
This release is the first major release of AMD GPU Operator. The AMD GPU Operator simplifies the deployment and management of AMD Instinct™ GPU accelerators within Kubernetes clusters. This project enables seamless configuration and operation of GPU-accelerated workloads, including machine learning, Generative AI, and other GPU-intensive applications.
Release Highlights
- Manage AMD GPU drivers with desired versions on Kubernetes cluster nodes
- Customized scheduling of AMD GPU workloads within Kubernetes cluster
- Metrics and statistics monitoring solution for AMD GPU hardware and workloads
- Support for specialized networking environments like HTTP proxy or air-gapped networks
Hardware Support
New Hardware Support
- AMD Instinct™ MI300
- Required driver version: ROCm 6.2+
- AMD Instinct™ MI250
- Required driver version: ROCm 6.2+
- AMD Instinct™ MI210
- Required driver version: ROCm 6.2+
Platform Support
New Platform Support
- Kubernetes 1.29+
- Supported features:
- Driver management
- Workload scheduling
- Metrics monitoring
- Requirements: Kubernetes version 1.29+
Breaking Changes
Not Applicable as this is the initial release.
New Features
Feature Category
- Driver management
- Managed Driver Installations: Users can install the ROCm 6.2+ DKMS driver on Kubernetes worker nodes; they can also optionally choose to use an inbox or pre-installed driver on the worker nodes
- DeviceConfig Custom Resource: Users can configure a new DeviceConfig CRD (Custom Resource Definition) to define the driver management behavior of the GPU Operator
- GPU Workload Scheduling
- Custom Resource Allocation "amd.com/gpu": After the deployment of the GPU Operator, a new custom resource, amd.com/gpu, will be present on each GPU node, listing the allocatable GPU resources against which GPU workloads can be scheduled
- Assign Multiple GPUs: Users can easily specify the number of AMD GPUs required by each workload in the deployment/pod spec, and the Kubernetes scheduler will automatically take care of assigning the correct GPU resources (see the sketch below)
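A minimal pod sketch requesting a single AMD GPU through the amd.com/gpu resource; the container image is only a placeholder workload:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload-example
spec:
  containers:
    - name: rocm-app
      image: rocm/pytorch:latest    # placeholder image; substitute your workload image
      resources:
        limits:
          amd.com/gpu: 1            # number of AMD GPUs requested for this container
```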
- Metrics Monitoring for GPUs and Workloads:
- Out-of-box Metrics: Users can optionally enable the AMD Device Metrics Exporter when installing the AMD GPU Operator to enable a robust out-of-box monitoring solution for Prometheus to consume
- Custom Metrics Configurations: Users can utilize a configmap to customize the configuration and behavior of Device Metrics Exporter
- Specialized Network Setups:
- Air-gapped Installation: Users can install the GPU Operator in a secure air-gapped environment where the Kubernetes cluster has no external network connectivity
- HTTP Proxy Support: The AMD GPU Operator supports usage within a Kubernetes cluster that is behind an HTTP proxy. HTTPS support will be added in a future release.
Known Limitations
- GPU Operator driver installs only the DKMS package
- Impact: Applications which require ROCm packages will need to install the respective packages.
- Affected Configurations: All configurations
- Workaround: None, as this is the intended behavior; other ROCm software packages should be managed inside the containers/workloads running on the cluster
- When using the Operator to install amdgpu 6.1.3/6.2, a reboot is required to complete the install
- Impact: The node requires a reboot when an upgrade is initiated, due to a ROCm bug. Driver install failures may be seen in dmesg
- Affected configurations: Nodes with driver version >= ROCm 6.2.x
- Workaround: Manually reboot the upgraded nodes to finish the driver install. This has been fixed in ROCm 6.3+
- GPU Operator unable to install the amdgpu driver if an existing driver is already installed
- Impact: The driver install will fail if the amdgpu in-box driver is present/already installed
- Affected Configurations: All configurations
- Workaround: When installing the amdgpu drivers using the GPU Operator, worker nodes should have amdgpu blacklisted, or the amdgpu driver should not be pre-installed on the node. Blacklist the in-box driver so that it is not loaded, or remove the pre-installed driver (see the sketch below)
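One common way to blacklist the in-box module on a worker node, assuming a modprobe.d-based distribution (Debian/Ubuntu shown); a reboot is needed for the change to take effect:

```bash
# Prevent the in-box amdgpu module from loading at boot
echo "blacklist amdgpu" | sudo tee /etc/modprobe.d/blacklist-amdgpu.conf

# Rebuild the initramfs so the blacklist applies during early boot, then reboot
sudo update-initramfs -u    # Debian/Ubuntu; other distributions use dracut
sudo reboot
```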
- When the GPU Operator is used in skip-driver-install mode, if the amdgpu module is removed while the device plugin is installed, the active GPUs available on the server will not be reflected
- Impact: Workload scheduling is affected, as workloads may be scheduled on nodes which do not have active GPUs.
- Affected Configurations: All configurations
- Workaround: Restart the deployed device plugin pod.
- Worker nodes where the kernel needs to be upgraded must be taken out of the cluster and re-added with the Operator installed
- Impact: The node upgrade will not proceed automatically and requires manual intervention
- Affected Configurations: All configurations
- Workaround: Manually mark the node as unschedulable, preventing new pods from being scheduled on it, by cordoning it off:
kubectl cordon <node-name>
- When the GPU Operator is installed with the Exporter enabled, driver upgrades are blocked because the exporter is actively using the amdgpu module
- Impact: Driver upgrade is blocked
- Affected Configurations: All configurations
- Workaround: Disable the Metrics Exporter to allow the driver upgrade by updating the DeviceConfig in the operator namespace to set the metrics exporter's enable field to false:
kubectl patch deviceconfig gpu-operator -n kube-amd-gpu --type='json' -p='[{"op": "replace", "path": "/spec/metricsExporter/enable", "value": false}]'