Solution to CUDA driver failure after kernel update
After a recent kernel update (presumably applied automatically via unattended-upgrades), CUDA stopped working on my server. The primary symptom was the following error when running nvidia-smi:
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver ...
This indicates a mismatch between the installed NVIDIA driver and the current kernel version. The solution is to use DKMS (Dynamic Kernel Module Support) to rebuild the NVIDIA driver against the new kernel.
Solution
- Install DKMS (if not already installed):
sudo apt install dkms
DKMS is a framework that rebuilds drivers automatically whenever the kernel is updated, which keeps the driver compatible with the running kernel.
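One prerequisite worth checking before anything is rebuilt: DKMS can only compile the module if the headers for the running kernel are present. On a Debian/Ubuntu-style system the following should cover it (package name assumed to follow the usual linux-headers-<kernel> pattern):
uname -r                                      # show the running kernel version
sudo apt install linux-headers-$(uname -r)    # install matching headers if they are missing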
- Determine the installed NVIDIA driver version:
Using ls -l /usr/src/ | grep nvidia might list multiple versions. A more reliable method is to query dpkg:
dpkg -l | grep nvidia-driver
This lists the installed NVIDIA driver packages, including the version number. Look for the package that corresponds to your installed driver (e.g., nvidia-driver-535) and extract the version number from the package name (e.g., 535). If you're unsure, check your distribution's package manager or NVIDIA's website for the correct driver version for your GPU.
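If scanning the dpkg output by hand feels error-prone, a short pipeline can narrow it down; the package and version strings in the comments below are only illustrative:
dpkg -l 'nvidia-driver-*' | awk '/^ii/ {print $2, $3}'   # e.g. nvidia-driver-535  535.171.04-0ubuntu1
ls /usr/src/ | grep -i nvidia                            # the DKMS source directory name carries the full version, e.g. nvidia-535.171.04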
- Rebuild the NVIDIA driver using DKMS:
Replace <version> with the full driver version you identified in the previous step (e.g., 535.171.04; it should match the directory name under /usr/src/):
sudo dkms status                          # verify the module name and version (e.g., nvidia/535.171.04)
sudo dkms add -m nvidia -v <version>      # skip if the module is already added
sudo dkms build -m nvidia -v <version>
sudo dkms install -m nvidia -v <version>
If any of these commands fail, examine the error messages carefully. Common issues include missing kernel headers or build dependencies. Consult your distribution's documentation or online forums for troubleshooting assistance.
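When the build step fails, DKMS writes the full compiler output to a make.log under /var/lib/dkms, which is usually the quickest way to see what went wrong; the headers check shown under the first step covers the most common cause (missing kernel headers). Using the same <version> placeholder as above:
cat /var/lib/dkms/nvidia/<version>/build/make.log   # full build log written by dkms build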
- Reboot the server:
sudo reboot
Rebooting is crucial so that the newly built kernel module is loaded.
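If a full reboot is genuinely not an option, it is sometimes possible to reload the module in place instead, but only when nothing is using the GPU (a display server or a running CUDA process will block the unload), so treat this as a best-effort shortcut rather than the recommended path:
sudo rmmod nvidia_uvm nvidia_drm nvidia_modeset nvidia   # unload the old modules (fails if the GPU is busy)
sudo modprobe nvidia                                     # load the freshly built module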
- Verify driver functionality:
After rebooting, run nvidia-smi again. You should see output similar to the following, indicating a successful driver installation:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 535.171.04    Driver Version: 535.171.04    CUDA Version: 12.2   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================+
|   0  NVIDIA GeForce ...   On  | 00000000:01:00.0 Off |                  N/A |
| 20%   35C    P8    N/A /  N/A |      0MiB / 11019MiB |      0%      Default |
|                               |                      |                  N/A |
+-----------------------------------------------------------------------------+
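As a final sanity check, dkms status should now report the module as installed against the new kernel, and future kernel updates should trigger the rebuild automatically; the version and kernel strings below are examples only:
dkms status
# nvidia/535.171.04, 6.5.0-28-generic, x86_64: installed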