I run multiple instances in Azure on Standard_NV and Standard_NC series virtual machines with more than one GPU device assigned to the VM. Without the LIS RPMs, the VM doesn't see all of the NVIDIA GPU devices assigned to the guest, and if there's a kernel+LIS mismatch, the results can be unpredictable (e.g. 0 GPU devices, or 1 GPU device).
The patching strategy I follow is to adopt new kernels within a week of release, but LIS packages that match a new kernel are usually not available that quickly (I'm in CentOS 7 land).
I was using the OpenLogic repository to manage the LIS installation process, but that repository hasn't been updated since version 4.2.6. It's no longer possible to reliably run yum install kmod-microsoft-hyper-v microsoft-hyper-v to install the LIS RPMs, because there is a specific set of RPMs for each small patch level of every kernel.
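To illustrate the patch-level problem, here is a minimal sketch of how a mismatch between the kernel a module package was built for and the running kernel could be detected. The function names are hypothetical and it assumes CentOS-style release strings such as 3.10.0-957.10.1.el7.x86_64; it is not how the LIS installer itself decides which RPMs to use.

```python
import platform
import re


def kernel_build(release: str) -> tuple:
    """Return the leading numeric fields of a kernel release string.

    Example: "3.10.0-957.10.1.el7.x86_64" gives (3, 10, 0, 957, 10, 1).
    """
    fields = []
    for part in re.split(r"[.-]", release):
        # Stop at the first non-numeric field (e.g. "el7", "x86_64").
        if not part.isdigit():
            break
        fields.append(int(part))
    return tuple(fields)


def kernels_match(module_kernel: str, running_kernel: str) -> bool:
    """True only if both strings name the exact same kernel patch level."""
    return kernel_build(module_kernel) == kernel_build(running_kernel)


def running_kernel_matches(module_kernel: str) -> bool:
    """Compare a module's target kernel against the kernel currently booted."""
    return kernels_match(module_kernel, platform.release())
```

With the two kernels mentioned in this post, kernels_match("3.10.0-957.10.1", "3.10.0-957.12.1") is False, which is exactly the situation where the previously installed RPMs no longer apply.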
That brings me to the impracticality of having to download the >400 MB .tar or ISO and run the shell scripts to install this set of packages. (It also makes things more complicated in an airgapped environment where http://aka.ms/LIS is not reachable.)
My questions about LIS and how it relates to Azure VMs running Linux:
- Can LIS be distributed as a single set of RPMs for each operating system distribution?
- If so, can the packages be added to the microsoft-prod yum repository?
- Can you rely on DKMS to compile the modules automatically when the kernel changes (e.g. 3.10.0-957.10.1 vs 3.10.0-957.12.1), so the current installation process could be streamlined?
- Can you improve testing for Azure GPU VMs, ensuring that a Standard_NC12 host reports 2 GPUs when LIS has been installed? (For example, LIS 4.3.0 was broken and only revealed 1 GPU.)
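As a rough idea of the kind of GPU check I mean in the last question, here is a minimal sketch that counts NVIDIA GPUs in lspci output and compares the result with the expected number for the VM size. The function names and the EXPECTED_GPUS table are my own assumptions; only the Standard_NC12 value of 2 comes from this post, and the check assumes lspci (pciutils) is installed on the guest.

```python
import subprocess

# Expected GPU count per VM size. Only Standard_NC12 is taken from this post;
# add the other sizes you run.
EXPECTED_GPUS = {"Standard_NC12": 2}

# PCI class names under which lspci lists GPUs (as opposed to, say,
# an NVIDIA audio function).
GPU_CLASSES = ("3D controller", "VGA compatible controller")


def count_nvidia_gpus(lspci_output: str) -> int:
    """Count lspci lines that describe an NVIDIA GPU device."""
    count = 0
    for line in lspci_output.splitlines():
        if "NVIDIA" in line and any(cls in line for cls in GPU_CLASSES):
            count += 1
    return count


def gpu_count_ok(vm_size: str, lspci_output: str) -> bool:
    """True if the visible NVIDIA GPU count matches the expected count."""
    return count_nvidia_gpus(lspci_output) == EXPECTED_GPUS[vm_size]


def read_lspci() -> str:
    """Run lspci on the guest and return its text output."""
    result = subprocess.run(
        ["lspci"], capture_output=True, text=True, check=True
    )
    return result.stdout
```

On a guest, gpu_count_ok("Standard_NC12", read_lspci()) could run after every kernel or LIS update, so that a regression like the one in LIS 4.3.0 shows up as a failed check rather than a silently missing GPU.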