Set proper path for MPS shm dir mount #978

Merged
k8s-ci-robot merged 1 commit into kubernetes-sigs:main from kasia-kujawa:kkujawa_gke_shm_dir
Apr 4, 2026
Conversation

@kasia-kujawa
Contributor

kasia-kujawa commented Apr 2, 2026

Fixes #974

This unblocks MPS on GPUs that support pre-Volta MPS (e.g. NVIDIA P4) on GKE clusters, and it doesn't change the default configuration.

The MPS control daemon template hardcoded /driver-root/dev/shm as the container mount path for the MPS shm directory. This works when the daemon runs inside a chroot (standard NVIDIA DRA driver install, GPU Operator), but fails on GKE COS, where the daemon runs directly in the container namespace and expects /dev/shm.

The change was also tested on an NVIDIA T4, which supports Volta MPS.

  1. volumeMounts on EKS, MPS control daemon uses chroot
      volumeMounts:
        - name: driver-root
          mountPath: /driver-root
        - name: mps-shm-directory
          mountPath: /driver-root/dev/shm
        - name: mps-pipe-directory
          mountPath: /driver-root/tmp/nvidia-mps
        - name: mps-log-directory
          mountPath: /driver-root/var/log/nvidia-mps
        - name: kube-api-access-b7kx9
          readOnly: true
          mountPath: /var/run/secrets/kubernetes.io/serviceaccount
  2. volumeMounts on GKE, MPS control daemon doesn't use chroot
      volumeMounts:
        - name: driver-root
          mountPath: /driver-root
        - name: mps-shm-directory
          mountPath: /dev/shm
        - name: mps-pipe-directory
          mountPath: /driver-root/tmp/nvidia-mps
        - name: mps-log-directory
          mountPath: /driver-root/var/log/nvidia-mps
        - name: kube-api-access-7pbnm
          mountPath: /var/run/secrets/kubernetes.io/serviceaccount
          readOnly: true
          recursiveReadOnly: Disabled

@copy-pr-bot

copy-pr-bot Bot commented Apr 2, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@shivamerla
Contributor

Good catch! The dev root changes as per the driver installation directory in other cases (host-installed driver or the driver container), but not for GKE.

Contributor

@visheshtanksale visheshtanksale left a comment

Thanks for looking into this.

Comment thread cmd/gpu-kubelet-plugin/main.go Outdated
@kasia-kujawa
Contributor Author

I have another idea for fixing this issue. The shm mount path can be set this way:

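// setMpsShmMountPath returns the container mount path for the MPS shm directory.
// If hostDriverRoot contains a shell, it is treated as a proper root filesystem
// and the path is placed under it; otherwise /dev/shm is used.
func setMpsShmMountPath(hostDriverRoot string) string {
	for _, sh := range []string{"/bin/sh", "/usr/bin/sh"} {
		if _, err := os.Stat(filepath.Join(hostDriverRoot, sh)); err == nil {
			return filepath.Join(hostDriverRoot, "dev", "shm")
		}
	}
	return "/dev/shm"
}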

and

func (m *MpsControlDaemon) Start(ctx context.Context, config *configapi.MpsConfi
                MpsLogDirectory:                 m.logDir,
                MpsImageName:                    m.manager.config.flags.imageName,
                FeatureGates:                    featuregates.ToMap(),
-               MpsShmMountPath:                 m.manager.mpsShmMountPath,
+               MpsShmMountPath:                 setMpsShmMountPath(m.manager.hostDriverRoot),
        }


This will mirror the logic in the template https://github.com/NVIDIA/k8s-dra-driver-gpu/blob/476b4787f1f75bb4a9c0b8e20604125536514b8c/templates/mps-control-daemon.tmpl.yaml#L32

If you think this is a better idea, I can test it tomorrow.

@shivamerla
Contributor

Yes, validating whether driverRoot is actually a proper root filesystem seems like a better approach than introducing yet another Helm variable for this.

@k8s-triage-robot

Unknown CLA label state. Rechecking for CLA labels.

Send feedback to sig-contributor-experience at kubernetes/community.

/check-cla
/easycla

k8s-ci-robot added the cncf-cla: yes and size/S labels Apr 2, 2026
kasia-kujawa changed the title from "Make MPS shm dir mount path configurable" to "Set proper path for MPS shm dir mount" Apr 3, 2026
kasia-kujawa force-pushed the kkujawa_gke_shm_dir branch from 5dd6ae8 to 91329cb Apr 3, 2026 09:56
k8s-ci-robot added the size/L label and removed the size/S label Apr 3, 2026
@kasia-kujawa
Contributor Author

@shivamerla the approach has changed and the PR is ready for review.
I tested this on EKS and GKE.

Comment thread cmd/gpu-kubelet-plugin/sharing_test.go Outdated
Comment thread cmd/gpu-kubelet-plugin/sharing.go Outdated
Comment thread cmd/gpu-kubelet-plugin/sharing_test.go Outdated
kasia-kujawa force-pushed the kkujawa_gke_shm_dir branch from 91329cb to aa57f8f Apr 3, 2026 17:26
k8s-ci-robot added the size/M label and removed the size/L label Apr 3, 2026
@kasia-kujawa
Contributor Author

@shivamerla this is now ready for the next review iteration.

Contributor

@shivamerla shivamerla left a comment

LGTM

@shivamerla
Contributor

@dims please review and provide approval as well.

@dims
Member

dims commented Apr 4, 2026

thanks for the very focused fix @kasia-kujawa

@dims
Member

dims commented Apr 4, 2026

/approve
/lgtm

k8s-ci-robot added the lgtm and approved labels Apr 4, 2026
@dims
Member

dims commented Apr 4, 2026

@kasia-kujawa do you mind fixing up the linting issue?

run golangci-lint
  Running [/home/runner/golangci-lint-2.11.4-linux-amd64/golangci-lint config path] in [/home/runner/work/nvidia-dra-driver-gpu/nvidia-dra-driver-gpu] ...
  Running [/home/runner/golangci-lint-2.11.4-linux-amd64/golangci-lint config verify] in [/home/runner/work/nvidia-dra-driver-gpu/nvidia-dra-driver-gpu] ...
  Running [/home/runner/golangci-lint-2.11.4-linux-amd64/golangci-lint run  -v --timeout 5m] in [/home/runner/work/nvidia-dra-driver-gpu/nvidia-dra-driver-gpu] ...
  Error: cmd/gpu-kubelet-plugin/sharing.go:59:2: Comment should end in a period (godot)
  	// driverRootMountDir is the directory where the driver root is mounted inside the kubelet plugin container
  	^
  1 issues:
  * godot: 1

Signed-off-by: Katarzyna Kujawa <katarzyna@cast.ai>
kasia-kujawa force-pushed the kkujawa_gke_shm_dir branch from aa57f8f to da1274b Apr 4, 2026 04:21
k8s-ci-robot removed the lgtm label Apr 4, 2026
@kasia-kujawa
Contributor Author

> @kasia-kujawa do you mind fixing up the linting issue?

@dims I added the missing dot, hope that now it will be ok 🤞 😄

@dims
Member

dims commented Apr 4, 2026

/approve
/lgtm

🤞🏾 @kasia-kujawa :)

k8s-ci-robot added the lgtm label Apr 4, 2026
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dims, kasia-kujawa, shivamerla

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit 6ac32c2 into kubernetes-sigs:main Apr 4, 2026
10 checks passed
Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.

Development

Successfully merging this pull request may close these issues.

Missing support for preVolta MPS?

6 participants