Problem
A GPU-enabled instance cannot use its GPU. The NVIDIA driver does not appear to be running, and running "nvidia-smi" returns an error such as:
$> nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
This usually happens after an update of the operating system kernel. The NVIDIA driver module was built for the previous kernel and must be rebuilt to be compatible with the new one.
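To confirm from a shell that a kernel update is the likely cause, you can check whether an nvidia kernel module exists for the running kernel. The following is a minimal sketch, not part of the official procedure: the function name nvidia_module_status is illustrative, and it assumes the standard modinfo tool (from kmod) is installed on the instance.

```shell
#!/bin/sh
# Sketch: check whether an "nvidia" kernel module is available for a given kernel.
# Assumption: modinfo (kmod) is installed. The function name is hypothetical.

nvidia_module_status() {
    # $1: kernel release to check; defaults to the currently running kernel
    kernel="${1:-$(uname -r)}"
    if modinfo -k "$kernel" nvidia >/dev/null 2>&1; then
        echo "nvidia module found for kernel $kernel"
    else
        echo "no nvidia module for kernel $kernel"
        return 1
    fi
}

# Report the status for the running kernel without aborting the shell on failure
nvidia_module_status || true
```

If the function reports that no module exists for the running kernel, the driver most likely needs to be rebuilt as described in the Solution below.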
Solution
Using the Morpheus web portal:
- Navigate to the instance showing the problems.
- Click on ACTIONS - Run Workflow.
- Pick "Nvidia driver refresh" and click EXECUTE.
- Morpheus will show the progress of this operation, and after a few moments, the GPUs should be available again.
Once your instance is running, you can check whether it can see the GPU with:
$> nvidia-smi
Tue Nov 17 15:20:38 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.87       Driver Version: 440.87       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GRID V100-4C        On   | 00000000:00:05.0 Off |                    0 |
| N/A   N/A    P0    N/A /  N/A |    304MiB /  4096MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
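If you want to run this check from a script rather than by reading the table, the exit status of nvidia-smi can be used. The following is a minimal sketch under stated assumptions: the function name check_gpu and its optional argument (a path to the nvidia-smi binary, added here only so the function can be pointed at a different executable) are illustrative and not part of any NVIDIA or Morpheus tooling. The --query-gpu and --format options are standard nvidia-smi options.

```shell
#!/bin/sh
# Sketch: report whether nvidia-smi can communicate with the driver.
# The function name and its optional argument are hypothetical.

check_gpu() {
    # $1: nvidia-smi executable to use; defaults to the one found in PATH
    smi="${1:-nvidia-smi}"
    if ! command -v "$smi" >/dev/null 2>&1; then
        echo "$smi not found"
        return 2
    fi
    # A non-zero exit status means nvidia-smi could not query the driver
    if gpus=$("$smi" --query-gpu=name,driver_version --format=csv,noheader 2>/dev/null); then
        echo "GPU visible: $gpus"
    else
        echo "driver not responding"
        return 1
    fi
}

# Run the check without aborting the shell on failure
check_gpu || true
```

A "driver not responding" result after a kernel update suggests the driver refresh workflow has not been run yet or did not complete.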