Solution to cuda driver failure after a kernel update

after a recent kernel update (presumably applied automatically via unattended-upgrades), cuda stopped working on my server. the primary symptom was the following error when running nvidia-smi:

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver ...

this indicates a mismatch between the installed nvidia driver and the running kernel: the driver's kernel module was built against the old kernel, and no module exists for the new one. the fix is to use dkms (Dynamic Kernel Module Support) to rebuild the nvidia module against the new kernel.
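
to confirm the diagnosis before changing anything, compare the running kernel against the modules available for it. a quick check along these lines (modinfo fails with "Module nvidia not found" when no module has been built for the current kernel):

uname -r         # the kernel the server is now running
modinfo nvidia   # errors out if no nvidia module exists for this kernel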

solution

  1. install dkms (if not already installed):
sudo apt install dkms

dkms is a framework that allows drivers to be automatically rebuilt when the kernel is updated. this ensures compatibility between the driver and the running kernel.
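
note that dkms can only rebuild a module when the headers for the running kernel are present. on debian/ubuntu (which the apt commands in this post assume), they can be installed like this:

sudo apt install linux-headers-$(uname -r)  # headers matching the running kernel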

  2. determine the installed nvidia driver version:

running ls -l /usr/src/ | grep nvidia may list multiple versions, so a more reliable method is to query dpkg:

dpkg -l | grep nvidia-driver

this will list the installed nvidia driver packages, including their version numbers. look for the package that corresponds to your installed driver (e.g., nvidia-driver-535) and note its full version string (e.g., 535.171.04); one way to extract it is shown below. if you're unsure which driver matches your gpu, check your distribution's package manager or nvidia's website.
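
for the dkms commands in the next step, the full version string matters, not just the major number. one way to extract it, assuming the driver registered a source tree under /usr/src as ubuntu's packaged drivers do:

ls /usr/src | grep ^nvidia  # e.g. nvidia-535.171.04
dpkg -l 'nvidia-driver-*' | awk '/^ii/ {print $2, $3}'  # installed package and full version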

  3. rebuild the nvidia driver using dkms:

replace <version> with the full version string identified in the previous step (dkms expects the complete version, e.g., 535.171.04, not just the major number 535):

sudo dkms status  # verify the module name and version (e.g., nvidia/535.171.04)
sudo dkms add -m nvidia -v <version>  # skip if dkms status shows the module is already added
sudo dkms build -m nvidia -v <version>
sudo dkms install -m nvidia -v <version>
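
on success, dkms status should now report the module as installed against the new kernel. the output looks roughly like this (the version and kernel strings will differ on your system):

nvidia/535.171.04, 6.5.0-28-generic, x86_64: installed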

if any of these commands fail, examine the error messages carefully. common causes are missing kernel headers or missing build dependencies; the dkms build log (see below) usually pinpoints the problem. consult your distribution's documentation or online forums for further troubleshooting.
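
dkms keeps the full compiler output from a failed build; the log lives under a path following this pattern, with <version> as above:

cat /var/lib/dkms/nvidia/<version>/build/make.log  # compiler output from the failed build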

  4. reboot the server:
sudo reboot

rebooting is crucial for loading the newly built kernel module.
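
if a reboot is inconvenient, it is sometimes possible to reload the module in place instead. this only works when nothing is using the gpu (drop any modules from the list that are not loaded on your system), so a reboot remains the reliable option:

sudo modprobe -r nvidia_uvm nvidia_drm nvidia_modeset nvidia  # unload in dependency order; fails if the gpu is busy
sudo modprobe nvidia
nvidia-smi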

  5. verify driver functionality:

after rebooting, run nvidia-smi again. you should see output similar to this, indicating a successful driver installation:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 535.171.04   Driver Version: 535.171.04   CUDA Version: 12.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================+
|   0  NVIDIA GeForce ...  On   | 00000000:01:00.0 Off |                  N/A |
| 20%   35C    P8    N/A /  N/A |      0MiB / 11019MiB |      0%      Default |
|                               |                      |                  N/A |
+-----------------------------------------------------------------------------+
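
as a final check beyond nvidia-smi, you can confirm that the module actually loaded matches the rebuilt driver:

lsmod | grep ^nvidia             # nvidia modules currently loaded
cat /proc/driver/nvidia/version  # driver version reported by the loaded module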