nvidia-modprobe to potentially early out when nvidia blacklisted (Wayland + driver init issue) #5

@tim-rex

I've been exploring the purpose of nvidia-modprobe recently, and the implications for anyone using a dual-gpu setup and occasionally needing to blacklist the nvidia drivers. I'm using Wayland exclusively.

It's my understanding that nvidia-modprobe is provided as a fallback mechanism to ensure the nvidia driver is initialised with root privileges (should it not already be properly initialised). The mechanism for calling nvidia-modprobe appears to be triggered by the nvidia libraries themselves when they are invoked by the relevant ICD, e.g.:

libnvidia-egl-gbm.so
libGLX_nvidia.so
libnvidia-egl-wayland.so

I've found that even when the nvidia drivers themselves are blacklisted, any program that tries to invoke or interrogate the ICDs for available devices causes nvidia-modprobe to be called (which, in turn, attempts to modprobe nvidia as root).

Unfortunately, modprobe isn't the quickest in town and it takes a while for it to fail when the nvidia drivers are blacklisted (close to 1 second in my testing).

The problem is compounded by diagnostic tools such as inxi.
For example, inxi -Fxz will repeatedly poll the ICD layer (approximately 33 times), which in turn loads the nvidia shared libraries (33 times), which triggers nvidia-modprobe (33 times).

This chain of events takes approximately 30 seconds to complete, while my journal log shows (correctly) that Module nvidia is blacklisted (33 times).
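The per-call delay can be measured with a small helper. This is only a sketch in Python; the function name `time_command` is made up for illustration, and which command to time (for example an ICD-querying tool such as eglinfo) is left to the reader.

```python
import subprocess
import sys
import time


def time_command(argv, repeats=1):
    """Run argv `repeats` times; return each run's wall-clock duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        # check=False: a failing command (e.g. a blacklisted module) is
        # exactly the case being measured, so do not raise on non-zero exit.
        subprocess.run(
            argv,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            check=False,
        )
        durations.append(time.perf_counter() - start)
    return durations


# Example call (the command is a placeholder; substitute the tool under test):
# print(time_command([sys.executable, "-c", "pass"], repeats=3))
```

Comparing the timings with nvidia blacklisted and not blacklisted should show whether the failed modprobe accounts for the delay.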

This isn't the end of the world, though I've tried to mitigate the issue as follows:

Workaround
It's been suggested that I should be able to move nvidia-modprobe out of the way, short-circuiting this chain of events somewhat. This does have the desired effect when the nvidia drivers are blacklisted.

Problem
This has a side effect when the nVidia drivers are not blacklisted.
Specifically, despite the nvidia module being present and accounted for (via lsmod) it seems the appropriate device files have not been created (or the driver otherwise not fully initialised).

This is evidenced by the likes of eglinfo / vulkaninfo not showing the nVidia device whatsoever.

This can be rectified by one of the following approaches

  • Manually run the renamed nvidia-modprobe
  • Run vulkaninfo as root
  • Run nvidia-debugdump --list as root
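The "module loaded but not initialised" state described above can be told apart from a working setup with a check along these lines. It is a minimal sketch, assuming the missing device files are `/dev/nvidiactl` and `/dev/nvidia0` (the issue does not name them); the function name `nvidia_state` and its parameters are hypothetical.

```python
import os


def nvidia_state(proc_modules="/proc/modules", dev_dir="/dev"):
    """Report whether the nvidia module is loaded and which device nodes exist."""
    loaded = False
    try:
        with open(proc_modules) as f:
            # Each line of /proc/modules starts with the module name.
            loaded = any(line.split(" ", 1)[0] == "nvidia" for line in f)
    except OSError:
        pass  # treat an unreadable file as "not loaded"
    # Assumption: these are the nodes that nvidia-modprobe would normally create.
    nodes = {
        name: os.path.exists(os.path.join(dev_dir, name))
        for name in ("nvidiactl", "nvidia0")
    }
    return {"module_loaded": loaded, "device_nodes": nodes}
```

A result with `module_loaded` true but both device nodes false would match the symptom above, where lsmod shows the module yet eglinfo and vulkaninfo do not list the device.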

Theory
I believe that this isn't an issue for X11 users, as the Xorg service runs as root and thus has no trouble when the nvidia shared libraries are instantiated (thus, the driver fully initialises without need for the nvidia-modprobe fallback mechanism).

For GDM and Wayland users, this isn't the case. Since these services do not run with superuser privileges, the nvidia libraries will ultimately be loaded without special privileges and will try to initiate the fallback mechanism by default. That obviously does not work if nvidia-modprobe cannot be found.


So, to restate the problem (with the above taken into account)...

A Linux system running Wayland without nvidia-modprobe will be unable to initialise the nVidia device without user intervention.


Potential paths forward

  1. Accept that when the nvidia device is blacklisted, nvidia-modprobe will trigger a modprobe any time a userspace application tries to query or use the available ICDs - and that this may not be immediate.
  2. Accept that the removal of nvidia-modprobe will prevent proper initialisation of nVidia devices under Wayland.

Or we could consider a check within nvidia-modprobe (or indeed the shared libraries/drivers themselves):

  3. Have nvidia-modprobe proactively check if the nvidia drivers are blacklisted before calling out to /sbin/modprobe, and fail fast if that is the case.
  4. Have the nvidia shared libraries proactively check if the nvidia drivers are blacklisted before attempting the fallback nvidia-modprobe mechanism.

#1 is a minor irritation (it drove me to research this issue).
#2 could be scripted around via user code or udev rules, but doesn't help the wider community.

Perhaps #3 or #4 could be considered, if it doesn't introduce too much complexity?
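The kind of fail-fast check proposed above can be sketched as follows. It is written in Python purely for illustration and is not how nvidia-modprobe is implemented. It assumes the blacklist comes from the kernel command line via `modprobe.blacklist=` or `module_blacklist=`, and it ignores `blacklist` entries in modprobe configuration files. The function names are hypothetical.

```python
def _norm(name):
    # '-' and '_' are treated as equivalent in kernel module names.
    return name.strip().replace("-", "_")


def blacklisted_on_cmdline(module, cmdline):
    """Return True if `module` is listed in a blacklist parameter in `cmdline`."""
    target = _norm(module)
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        # Only the two boot parameters assumed above are recognised here.
        if sep and key in ("modprobe.blacklist", "module_blacklist"):
            if target in (_norm(m) for m in value.split(",")):
                return True
    return False


def read_cmdline(path="/proc/cmdline"):
    """Read the running kernel's command line."""
    with open(path) as f:
        return f.read()


# Intended use: skip the slow modprobe call entirely when this returns True.
# if blacklisted_on_cmdline("nvidia", read_cmdline()): ...
```

Reading `/proc/cmdline` needs no privileges and no child process, so a check like this should cost far less than the roughly one second a failed modprobe takes.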


Background:
My specific setup includes a GTX 960 with driver version 545.29.06.
I've been testing across both Arch Linux and Fedora Linux (same drivers + kernel). It's worth noting that on Arch I'm using regular kernel modules, while Fedora uses akmods. I do not observe any difference in behaviour between the two.

I'm also running an AMD RX 580.
For development purposes, I frequently switch between the nvidia, nouveau and amdgpu drivers, using boot-time kernel parameters to blacklist as appropriate.

Related forum posts here and here
