Bug 24546 - CUDA does not work without x11-driver
Summary: CUDA does not work without x11-driver
Status: NEW
Alias: None
Product: Mageia
Classification: Unclassified
Component: RPM Packages
Version: Cauldron
Hardware: All
OS: Linux
Priority: Normal
Severity: normal
Target Milestone: ---
Assignee: Kernel and Drivers maintainers
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2019-03-21 22:36 CET by Richard Walker
Modified: 2023-06-23 09:08 CEST (History)
4 users

See Also:
Source RPM:
CVE:
Status comment:


Attachments
nvidia-cuda-toolkit-headless package (1.49 KB, text/x-matlab)
2019-05-20 17:03 CEST, Giuseppe Ghibò
spec changes found necessary for functional CUDA/OpenCL (2.37 KB, patch)
2019-05-24 00:19 CEST, Richard Walker
CUDA - Successful result (1.28 KB, text/plain)
2019-05-24 00:21 CEST, Richard Walker
OpenCL-Successful result (8.70 KB, text/plain)
2019-05-24 00:23 CEST, Richard Walker
journalctl -b | grep -i nvidia (1.44 KB, text/plain)
2019-05-29 19:52 CEST, Richard Walker
Similarly filtered output from 2nd test machine journal (1.96 KB, text/plain)
2019-05-29 19:58 CEST, Richard Walker

Description Richard Walker 2019-03-21 22:36:11 CET
Description of problem:
Trying to use GPU-accelerated processing with an Nvidia graphics card in a system with a non-nvidia screen driver is made more difficult than it should be by bundling CUDA and possibly OpenCL libraries in the same packages as OpenGL-related libraries.

This forces the system to be mis-configured by the nvidia gl config file in order to make the CUDA (and possibly also OpenCL) libraries visible to programs such as Blender, by hiding the Mesa GL libraries needed for the display.

The only way I have found to avoid this problem is to delete all libraries from /usr/lib64/nvidia-current which look like they might not be needed for CUDA.

A proper solution would be to re-package the components in a way that avoids inappropriate groupings of on-screen-versus-off-screen processing libraries, perhaps creating a set of GPU-processing component rpms which would NOT require an nvidia X server as a dependency (because it isn't one) and would allow the nvidia kernel module to be used without destroying a Mesa GL setup on, for example, amdgpu systems.


How reproducible:

Steps to Reproduce:
1. Configure a system to use Mesa and amdgpu. Check that it works as expected using (for example) glxspheres64.

2. Install dkms-nvidia-current, nvidia-cuda-toolkit, nvidia-cuda-toolkit-devel, nvidia-current-cuda-opencl and nvidia-current-devel.

3. Observe that additional nvidia rpms have been installed as dependencies/requires.

4. Verify that glxspheres64 no longer runs at all.

5. Run ldconfig -p and observe that the Mesa GL libraries are now _masked_ by the useless nvidia ones.
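For illustration, a minimal sketch of what step 5 reveals; the ldconfig output below is hypothetical, modelled on the paths named in this report:

```shell
# Hypothetical "ldconfig -p" lines from an affected system; the
# nvidia-current copy of libGL.so.1 is listed first, so the dynamic
# loader resolves to it and the Mesa copy is effectively masked.
sample='libGL.so.1 (libc6,x86-64) => /usr/lib64/nvidia-current/libGL.so.1
libGL.so.1 (libc6,x86-64) => /usr/lib64/libGL.so.1'
printf '%s\n' "$sample" | head -n 1
# On a live system: ldconfig -p | grep libGL.so.1
```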
Comment 1 Richard Walker 2019-03-21 22:41:02 CET
At the very least there should be a clearly documented method published in, for example, nvidia-cuda-toolkit package to guide the average user through a very complicated and unintuitive process involving initrd re-building, kernel module blacklisting and library deletions.
Marja Van Waes 2019-03-24 11:08:06 CET

Assignee: bugsquad => kernel
CC: (none) => marja11

Comment 2 Richard Walker 2019-04-23 12:45:38 CEST
https://bugs.mageia.org/show_bug.cgi?id=22862 is a much older version of the same, or VERY similar bug. The only significant difference I see is that in my case I wish to use the accelerated 3D graphics capability (and potentially OpenCL too) of my AMD GPU.

Richard Walker 2019-04-26 14:09:19 CEST

Summary: Nvidia rpms enforce spurious dependency on x11-driver-video-nvidia-current and nvidia-current-devel when using only CUDA and other free screen drivers => CUDA does not work without x11-driver

Comment 3 Giuseppe Ghibò 2019-05-20 17:02:26 CEST
The mechanism is to expose the /usr/lib64/nvidia-current libraries through an ld.so.conf.d plugin, and also with update-alternatives (e.g. for nvidia-smi) when the nvidia driver is configured through the system utils for display output. In this case the OpenGL system libraries are those coming from nvidia, not Mesa.
Tweaking manually to get CUDA working when the X11 card is not also an nvidia card is possible, but will probably lead to unconfigured stuff at every upgrade, or every time the system utils are called.

I'll try to add a simple package, "nvidia-cuda-toolkit-headless", so that the CUDA libraries are also available to the system when the X11 card is not an nvidia one. In this case you need to install all the nvidia utils (the important thing is that the nvidia kernel drivers are correctly built), but in the end switch to the AMD (or other) card (so that the nvidia OpenGL libraries are not exposed to the system libs).
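A rough sketch of the ld.so.conf.d mechanism described above; the drop-in file name is illustrative, and a throwaway staging root is used so the commands run without touching a real system (where the file would sit directly under /etc/ld.so.conf.d/ and be followed by running ldconfig):

```shell
# Stage the drop-in that adds /usr/lib64/nvidia-current to the library
# search path; on a real system this is what makes the nvidia libraries
# visible system-wide.
root=$(mktemp -d)
mkdir -p "$root/etc/ld.so.conf.d"
echo '/usr/lib64/nvidia-current' > "$root/etc/ld.so.conf.d/nvidia-current.conf"
cat "$root/etc/ld.so.conf.d/nvidia-current.conf"
# Real system: run "ldconfig" afterwards to rebuild the loader cache.
```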

CC: (none) => ghibomgx

Comment 4 Giuseppe Ghibò 2019-05-20 17:03:44 CEST
Created attachment 11005 [details]
nvidia-cuda-toolkit-headless package

Package to expose cuda libraries when the first head is not an nvidia card
Comment 5 Richard Walker 2019-05-22 00:25:11 CEST
Apologies for the delay, Giuseppe. You caught me in the middle of trying to set up another headless CUDA GPU. At first I thought I could use your spec file to build a package to help make the job easier, but it would not work on the machine in question, or on my two, due to the requirement to have the x11-driver-video-nvidia-current package installed.

I have worked out an approach which almost works perfectly. I install all of the nvidia stuff which does not pull in the X11 driver. This means that the nvidia-smi, nvidia-cuda-mps-control, nvidia-cuda-mps-server and nvidia-modprobe commands are not available. These four appear to be essential requirements and nvidia-smi will not run without the other three.

I then download x11-driver-video-nvidia-current and nvidia-current-devel packages and move all of the files I need from these into the appropriate locations. 

With all the files in the correct places in /usr/bin, /usr/lib64/nvidia-current and /etc/ld.so.conf.d I was able to reboot to check it would work.

The first problem was that I discovered three nvidia kernel modules had been loaded despite the blacklist entries in /etc/modprobe.d, but the AMD screen setup had not been messed up. Partial success.

I later discovered that the boot time on this box had extended by nearly two minutes and that its xorg.conf file had been re-written by something to add a section for the nvidia card! This is despite there being NO x11 driver available for it and ALL kernel modules (nouveau and nvidia) are blacklisted.

I have tried manually removing the nvidia kernel modules and re-building initrd in order to get the machine to boot quickly and allow the operator to manually load the nvidia-current kernel module when it is needed for Blender, but I have not yet succeeded.

Using the spec file you proposed I see that I would have to start with a messed up system which is already fighting me to make the nvidia card a screen driver. The benefit of your spec file appears to be the way it symlinks the CUDA/OpenCL stuff to /usr/lib64 and thus dodges the problem of the OpenGL takeover, but the bigger and less tractable problem is how to stop some script somewhere from taking control over my x11 configuration.

Effectively, my solution also eliminates the OpenGL issue and makes a serious attempt to prevent the rogue script (or whatever it is) from changing screen settings, or from wasting boot time trying to do so.

When I get the chance I will try building a package to your spec and see what it really does. I have one machine available for testing this, but at the moment it is busy being my remote connection to my brother's computer.
Comment 6 Richard Walker 2019-05-23 03:27:20 CEST
The good news is that using your symlink-generating spec has proved successful but with snags. It is too late now to go into the details. Not all libraries needed for a working CUDA and OpenCL are included and it needs a few more links in /usr/bin.

There is also a minor issue with an extra ".so" appearing in the symlinks created in lines 44-45, but that is easily fixed. Just make the first suffix a simple ".1" and remove the ".so" from the second one completely.

When I have figured out what I had to add to get it running I will attach a suitable diff. Unfortunately I have hit a bigger problem - purely coincidental I hope.

Blender rendering has gone crazy again with render tiles appearing and disappearing while the render continues, and continuing to blink on and off in some pattern when I click on the rendered image. I can mitigate this by turning off GLSL screen drawing (it is normally chosen as the method by the "automatic" setting) but it is not a complete solution.

I don't know yet what is causing it as I changed so much on the testbed machine to try out your headless CUDA hack; I updated all packages and that brought me a new kernel, a new nvidia stack, and a complete Mesa update. I haven't found out yet how much of this I can revert to try to locate the source of the problem. I have a strong suspicion it is Mesa, but I really hope not!...

I'll have to get back to this tomorrow.
Comment 7 Richard Walker 2019-05-24 00:19:09 CEST
Created attachment 11021 [details]
spec changes found necessary for functional CUDA/OpenCL

I am happy to use this version on a computer where I have already solved the problem of the corrupted display config. I still don't know how to stop the first re-boot from taking control of my xorg.conf file. The best defence seems to be to keep a known good backup and repeatedly replace the auto-generated one with the good one until something somewhere finally gets the message and stops doing it.

It is also useful to strip any nvidia kernel modules from the initrd to prevent things like nvidia-drm from loading. When the machine boots cleanly and my display config is correct (this can take a while and a lot of trial and error) it is only necessary to do a "modprobe nvidia-uvm" to start using GPU acceleration in Blender or Darktable.

Blacklisting kernel modules appears not to have the expected effect.
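One hedged sketch of the initrd-stripping approach described in this comment, using dracut's standard omit_drivers option; the drop-in file name is an assumption, and a staging root is used here so it can be run without root:

```shell
# Keep the nvidia modules out of the initrd so nothing nvidia loads at
# boot; the operator later runs "modprobe nvidia-uvm" only when needed.
root=$(mktemp -d)
mkdir -p "$root/etc/dracut.conf.d"
cat > "$root/etc/dracut.conf.d/no-nvidia.conf" <<'EOF'
omit_drivers+=" nvidia nvidia-uvm nvidia-drm nvidia-modeset "
EOF
cat "$root/etc/dracut.conf.d/no-nvidia.conf"
# Real system: place the file in /etc/dracut.conf.d/ and run "dracut -f".
```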
Comment 8 Richard Walker 2019-05-24 00:21:00 CEST
Created attachment 11022 [details]
CUDA - Successful result
Comment 9 Richard Walker 2019-05-24 00:23:29 CEST
Created attachment 11023 [details]
OpenCL-Successful result

Obviously a bit of manual setup is still necessary, but this seems to be just putting the nvidia.icd file in /etc/OpenCL/vendors.
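The OpenCL ICD loader scans /etc/OpenCL/vendors for *.icd files, each naming a vendor library. A sketch of the manual step, staged so it runs without root; the library name written here is the usual content of nvidia's icd file, but verify against your own /etc/nvidia-current/nvidia.icd:

```shell
# Place the vendor descriptor where the ICD loader looks for it.
root=$(mktemp -d)
mkdir -p "$root/etc/OpenCL/vendors"
echo 'libnvidia-opencl.so.1' > "$root/etc/OpenCL/vendors/nvidia.icd"
cat "$root/etc/OpenCL/vendors/nvidia.icd"
# Real system: cp /etc/nvidia-current/nvidia.icd /etc/OpenCL/vendors/
# then check the platform appears with "clinfo".
```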
Comment 10 Thomas Backlund 2019-05-26 17:58:34 CEST
Please remove any nvidia packages and manual customizations you have done.

Then install nvidia-current-cuda-opencl-430.14-3.mga7 from nonfree updates_testing (currently building).

It will pull in nvidia-current-utils, which now carries nvidia-smi and nvidia-persistenced.

Does it work OOB without needing any manual configuration?

CC: (none) => tmb

Comment 11 Thomas Backlund 2019-05-26 21:38:23 CEST
Gah, c/p errors in post scripts..

Should be fixed in nvidia-current-430.14-4.mga7 currently building
Comment 12 Richard Walker 2019-05-27 22:27:58 CEST
Roger that. I will strip a test machine and rebuild as you suggest, but I am just back from a road trip to the other site where I had remotely installed a CUDA headless setup which was still being plagued by nonsensical messages about screen driver incompatibilities during the actual boot process. 

It will likely be Tuesday evening after work before I can tackle the test.

Thank you for your help in this.
Comment 13 Richard Walker 2019-05-28 01:30:36 CEST
Method:

stage 1
I uninstalled all nvidia packages
Checked for loaded kernel modules and found/removed nvidia-uvm and nvidia
Finally executed dracut -f and re-booted

stage 2
Checked on reboot that no nvidia kernel modules were loaded.
Ran updatedb and searched for residual nvidia components
Remaining nvidia-named files are documents or parts of free packages or firmware files
Reboot machine

stage 3
Checked boot log : journalctl -b | grep -i nvidia
Found repeated references to a systemd nVidia persistence service. This service was discovered in stage 2 above but ignored:
rpm -qf /usr/lib/systemd/system/nvidia-persistenced.service --qf '%{SOURCERPM}'
file /usr/lib/systemd/system/nvidia-persistenced.service is not owned by any package
Deleted the service file. I'm not sure where it came from, but I reckon if I really need it, it will re-appear by the same magic.
Reboot

stage 4
Enabled Nonfree updates testing repo
Updated all repositories
Selected nvidia-current-cuda-opencl:
To satisfy dependencies, the following package(s) also need to be installed:

- dkms-nvidia-current-430.14-4.mga7.nonfree.x86_64
- nvidia-current-utils-430.14-4.mga7.nonfree.x86_64
OK so far:~)
The dkms module build completed successfully then dracut -f presumably put them all in my initrd. As I only need nvidia-uvm and nvidia-current I hope the other two will not cause problems on reboot for my screen setup. We'll see...
Reboot

stage 5
The reboot was clean with absolutely no interference with my screen setup. Can you hear me cheering?
Checked for loaded modules: just nvidia-current
Run nvidia-smi -l 1
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
I downloaded the x11-driver-video-nvidia-current package and copied nvidia-modprobe to /usr/bin and retried nvidia-smi -l 1. This time it worked.
Blender (2.79 and 2.80) reports no usable GPUs for CUDA.

OK, I have poked around for an hour or so and I cannot make any progress. 
Tomorrow I will try to do a detailed comparison of a working system and the one under test. Nevertheless, overall, the experience has been a good one. The single most valuable improvement achieved is that there has been absolutely no attempt made by the system to mess with xorg.conf! I'll be back
Comment 14 Richard Walker 2019-05-28 01:48:29 CEST
That was surprisingly quick! "All" I had to do was install nvidia-cuda-toolkit-devel. 

I have reverted all the changes I made to the files list supplied in your rebuilt packages and re-booted.

The kernel modules loaded at startup are nvidia-uvm and nvidia-current. The nvidia-smi utility runs and the Blenders both report usable GPUs. After manually transferring nvidia.icd from /etc/nvidia-current to /etc/OpenCL/vendors, clinfo reports two usable OpenCL platforms. 

All appears well, but I got here by a roundabout route so I will try again. Tomorrow I will strip this machine back and precisely follow your instructions from comment 10, then I will install nvidia-cuda-toolkit-devel and get a genuine OOTB result for you.

Thank you
Comment 15 Richard Walker 2019-05-28 03:12:48 CEST
Why do I keep fooling myself? I am not going to bed until I have checked a new fresh install.

I completed stages 1 to 3 in comment 13.
Next I select nvidia-cuda-toolkit-devel to install. It pulls in nvidia-cuda-toolkit.
Next I select nvidia-current-cuda-opencl which pulls in the dkms package and your new nvidia-current-utils.

The following 5 packages are going to be installed:

- dkms-nvidia-current-430.14-4.mga7.nonfree.x86_64
- nvidia-cuda-toolkit-10.1.168-1.mga7.nonfree.x86_64
- nvidia-cuda-toolkit-devel-10.1.168-1.mga7.nonfree.x86_64
- nvidia-current-cuda-opencl-430.14-4.mga7.nonfree.x86_64
- nvidia-current-utils-430.14-4.mga7.nonfree.x86_64

That was stage 4

On re-boot the nvidia-smi program failed to run in the same way as in the previous stage 5. In fact, the whole of stage 5 applies as before with one glorious exception; this time, copying a handy version of nvidia-modprobe to /usr/lib64/nvidia-current/bin and symlinking it to /usr/bin produced BOTH a working nvidia-smi AND working CUDA in Blenders. The only other manual intervention required is to copy nvidia.icd to the correct place.

I call that a good result. Perfect would be nicer:~)
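The manual fix described above, sketched against a staging root (the real paths are /usr/lib64/nvidia-current/bin and /usr/bin; the empty stand-in file here replaces the actual nvidia-modprobe binary copied from the x11-driver-video-nvidia-current package):

```shell
root=$(mktemp -d)
mkdir -p "$root/usr/lib64/nvidia-current/bin" "$root/usr/bin"
# Stand-in for the real nvidia-modprobe binary.
touch "$root/usr/lib64/nvidia-current/bin/nvidia-modprobe"
chmod +x "$root/usr/lib64/nvidia-current/bin/nvidia-modprobe"
# Relative symlink so /usr/bin/nvidia-modprobe resolves inside the root.
ln -s ../lib64/nvidia-current/bin/nvidia-modprobe "$root/usr/bin/nvidia-modprobe"
ls -l "$root/usr/bin/nvidia-modprobe"
```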
Comment 16 Richard Walker 2019-05-29 03:19:21 CEST
I repeated the process on a second machine with similar results.

The chief difference was that no kernel modules loaded on first reboot.

A simple modprobe nvidia-current fixed that and I was able to get a copy of nvidia-modprobe from a "--no-install" copy of x11-driver-video-nvidia-current-430.14-4.mga7 to put in /usr/bin.

Ultimately the result was the same with both nvidia-smi and CUDA support in Blender working correctly.

I will find out tomorrow if the kernel module persists in not loading on boot. It may have something to do with that systemd service I found on the other machine, or maybe not.

I noted that the 430.14-4 rpms are no longer in updates nonfree testing.
Comment 17 Thomas Backlund 2019-05-29 08:40:50 CEST
(In reply to Richard Walker from comment #15)


> On re-boot the nvidia-smi program failed to run in the same way as in the
> previous stage 5. In fact, the whole of stage 5 applies as before with one
> glorious exception; this time, copying a handy version of nvidia-modprobe to
> /usr/lib64/nvidia-current/bin and symlinking it to /usr/bin produce BOTH a
> working nvidia-smi AND working CUDA in Blenders. The only other manual
> intervention required is to copy nvidia.icd to the correct place.
> 
> I call that a good result. Perfect would be nicer:~)


Ok, so some minor tweaks are still needed

(In reply to Richard Walker from comment #16)

> I noted that the 430.14-4 rpms are no longer in updates nonfree testing.

Yeah, I moved them to release in time for RC builds
Comment 18 Richard Walker 2019-05-29 19:52:58 CEST
Created attachment 11035 [details]
journalctl -b | grep -i nvidia

This is taken from the machine which boots with a loaded nvidia-current kernel module. Any CUDA or OpenCL operations will work immediately after boot.
Comment 19 Richard Walker 2019-05-29 19:58:40 CEST
Created attachment 11036 [details]
Similarly filtered output from 2nd test machine journal

There was no module loaded during boot. I executed 

modprobe nvidia-current

at 18:44:20
Comment 20 Thomas Backlund 2019-06-10 22:02:35 CEST
There is now an nvidia-current-430.26-1.mga7.nonfree that should hopefully fix the last bits.
Comment 21 Richard Walker 2019-06-10 23:43:38 CEST
Thank you Thomas, I know you have been busy and I appreciate this. I am preparing the first test machine as before, with detail changes.

I updated all repos and started mcc, selecting "All updates" for display. 
Next I clicked the "Select All" button and manually de-selected the new nvidia dkms etc.
The kernel install and dkms build of the "old" nvidia kernel module completed and I THEN remembered to remove the existing nvidia rpms.
I thought that would not only remove the just-built nvidia kernel module, but also remove it from the initrd, if it were there. It was, and it didn't.
I am still not sure I understand why a kernel module not needed during boot (and maybe not at all in a session) should be included in the initrd, but there it is. Ah well.

I am doing more cleanup now so that I can try the updated packages as a "fresh" install.
Comment 22 Richard Walker 2019-06-11 00:40:44 CEST
Test #1 completed successfully. I followed the process in comment 13 with only minor tweaks; I no longer need the updates testing repo, obviously, and this time I selected for install both nvidia-current-cuda-opencl AND nvidia-cuda-toolkit-devel to make sure I had a fully functional CUDA which Blender could find. 

On the first boot after installation everything just worked! nvidia-smi AND clinfo produced a working result (thank you for the nvidia.icd symlink) and Blender found the GTX 1050 at the first attempt. 

I will try again on the GTX 960 machine and complete this report.
Comment 23 Richard Walker 2019-06-11 01:15:40 CEST
Test #2 completed with good results. The strangest thing, though not by any means a show-stopper, is that once more it did not load the nvidia kernel module on boot. I have no idea why these two machines should behave differently in this; they have the same motherboard, CPU and both run fully up-to-date MGA7.

The only difference I can see without digging too deeply is that Test #1 machine has a Grub boot loader while the Test #2 machine uses Grub2.

Anyway, this latest set of nvidia packages works beautifully, thank you Thomas. If you want "perfection" then the only thing I could think of adding would be a meta-package, called something like "nvidia-current-cuda-opencl-headless", which would depend only on nvidia-current-cuda-opencl AND nvidia-cuda-toolkit-devel. That would help to avoid some head-scratching when a new user discovers that not everything works with just the nvidia-current-cuda-opencl set installed.
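The suggested meta-package could look roughly like this rpm spec fragment (a sketch only; the package name and dependencies are Richard's suggestion, not an existing Mageia package):

```spec
%package -n nvidia-current-cuda-opencl-headless
Summary: Meta-package for headless CUDA/OpenCL with the nvidia-current driver
Requires: nvidia-current-cuda-opencl
Requires: nvidia-cuda-toolkit-devel
BuildArch: noarch

%description -n nvidia-current-cuda-opencl-headless
Installs everything needed to use CUDA and OpenCL via the nvidia kernel
module without the nvidia X11 display driver.
```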

I consider that this bug may now be closed with big grins all round.
Comment 24 Morgan Leijström 2023-06-23 09:08:48 CEST
Above sounds great, but as you say in comment 1 a short guide would be good.

I assume it is still working?

Can you summarise something we can put into 
https://wiki.mageia.org/en/Setup_the_graphical_server#NVIDIA_CUDA.2C_OpenCL_and_more
?

CC: (none) => fri

