Bug 33678 - unable to use proprietary nvidia driver
Summary: unable to use proprietary nvidia driver
Status: RESOLVED FIXED
Alias: None
Product: Mageia
Classification: Unclassified
Component: RPM Packages
Version: 9
Hardware: All Linux
Priority: Normal, Severity: normal
Target Milestone: ---
Assignee: Kernel and Drivers maintainers
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2024-10-25 14:37 CEST by Tony Blackwell
Modified: 2025-02-09 23:22 CET

See Also:
Source RPM: nvidia-current-all-550.120-1.mga9.nonfree
CVE:
Status comment: For wiki (?) when we know more.


Attachments
journalctl of first boot after reconfig as per Comment 4 (138.36 KB, text/plain)
2024-10-26 00:10 CEST, Tony Blackwell

Description Tony Blackwell 2024-10-25 14:37:01 CEST
Hardware context: 
CPU Intel Core I7-7700K
32 GB RAM
hard disk consists of 2 NVME drives in RAID0 for speed
Graphics nvidia GTX 1080Ti, XFCE desktop

Description of problem:
Uneventful as recently as nvidia 550.90.07, where nvidia-settings and cuda-z 'just work'

Recent updates seem to have resulted in
1. NVIDIA Settings can no longer see my GTX 1080Ti card
2. CUDA-Z can no longer talk to the card. 

 My first impression is I am prevented from using the proprietary driver for my 1080 Ti GPU.  Using nvidia 550.120-1.mga9.nonfree.

currently: /etc/X11/xorg.conf includes:

Section "Device"
    Identifier "device1"
    VendorName "NVIDIA Corporation"
    BoardName "NVIDIA GeForce 745 series and later"
    Driver "nouveau"
    Option "DPMS"
    Option "DynamicTwinView" "false"
    Option "AddARGBGLXVisuals"
EndSection
.

I open a terminal, su root, run XFdrake to completion, choosing the proprietary driver.  Subsequently the above Driver "nouveau" line in xorg.conf has changed to  Driver "nvidia".  Looks as expected.

Reboot.
Get message that system requires a re-boot for driver change.
Reboot again.
nvidia-settings can't see the 1080Ti, and CUDA-Z says to change the driver.
Looking at xorg.conf shows that the change written there by XFdrake has been reverted by the reboot back to Driver "nouveau", which explains why CUDA is non-functional.

Looks like a bug to me.

(FWIW the slightly later nvidia driver 550.127.05 on nvidia website today still lists the 1080Ti as supported, so that's not the issue.) 

So, something has changed recently.  Another M9 (testing) installation on same hardware, different partition, at nvidia 550.90.07 continues to work just fine.
Comment 1 Tony Blackwell 2024-10-25 14:43:50 CEST
There has been some discussion of this in QA.  I'll take the liberty of transcribing some discussion here - trust that is OK.

"That reads like "nokmsboot" isn't being added to the kernel options in Grub. Try running XFdrake again, and selecting the proprietary driver. Then when you reboot, press "e" during the delay in Grub. Look in the line for the kernel options for "nokmsboot" and if it's not there, add it and continue the boot.

If that works, you'll need to use MCC/Boot to make the change permanent. We had a problem with this years ago - Mageia 6? 5? - and the script in XFdrake was fixed to do it for the user. If that's not happening, it is indeed a bug.

TJ "
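
For reference, a minimal sketch of checking whether an option such as "nokmsboot" actually reached the running kernel, without going through the Grub editor. The helper name has_kernel_option is made up for this illustration; /proc/cmdline is where the kernel exposes its boot options. It only reports whether the option is present, not whether the kernel acts on it.

```shell
# has_kernel_option OPTION [CMDLINE]
# Succeeds if OPTION appears as a whole, space-separated word in CMDLINE.
# CMDLINE defaults to the running kernel's command line in /proc/cmdline.
# (has_kernel_option is a hypothetical helper name, not a Mageia tool.)
has_kernel_option() {
    option=$1
    cmdline=${2-$(cat /proc/cmdline)}
    case " $cmdline " in
        *" $option "*) return 0 ;;
        *) return 1 ;;
    esac
}
```

Usage on the affected system would be something like: has_kernel_option nokmsboot && echo "nokmsboot is on the command line"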
Comment 2 Tony Blackwell 2024-10-25 14:44:40 CEST
Also transcribed from QA discussion:


Seems pretty subtle and more complex.

The 'nokmsboot' string was added to the grub cmdline, i.e. in /etc/default/grub, by drakx11 when the nvidia driver was selected and installed for the first time. Sometimes drakx11 does not rewrite the "nokmsboot" string properly (especially if you switch back and forth between drivers), but it is not easy to track when this happens, and that might be another bug. The problem at the moment is not the missing 'nokmsboot' flag, though, because that flag is "not supported anymore", not by drakx11 but by the kernel. Support for nokmsboot was provided by a custom kernel patch. That patch was removed long ago, well before the mga9 release, by tmb (the patch was also no longer applying). See:

https://svnweb.mageia.org/packages/cauldron/kernel/current/SPECS/kernel.spec?r1=1945648&r2=1945907

when we switched from 6.1.14 to 6.2.1 kernel (See Patch1500: drm-gpu-drm-treat-nokmsboot-as-nomodeset.patch).

Now what exactly happens? My thought is that when you switch to 'nvidia' with drakx11, it configures everything correctly, but on your hardware, at the next boot, the machine is probably too slow (or the kernel boots too fast, which amounts to the same thing) to load the requested nvidia kernel modules "in time", that is, at the latest by the point when X11 is started (e.g. by sddm). Because the kernel is, for some reason, slow at loading the nvidia modules at that point in the boot timeline, it autoprobes the nouveau kernel module first, and once nouveau is loaded, the nvidia kernel module can no longer be loaded properly. The system then tries to start X11 with nvidia but fails, because it is blocked by nouveau. harddrake2 detects that something has "gone wrong" in starting X11 and intervenes, kicking out the nvidia configuration in /etc/X11/xorg.conf and falling back to nouveau there, which is where you see the Driver "nouveau" line despite the previous nvidia choice. Note that this is not a matter of using /etc/X11/xorg.conf or not; that is yet another case study.
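
To see which of the two drivers actually got loaded on a given boot, something like the following sketch could be used. The helper name gpu_modules_loaded is invented here; it simply lists nouveau and any nvidia* entries from /proc/modules (the same data lsmod shows).

```shell
# gpu_modules_loaded [MODULES_FILE]
# Prints the names of loaded kernel modules that are either "nouveau" or
# start with "nvidia" (nvidia, nvidia_drm, nvidia_modeset, ...).
# MODULES_FILE defaults to /proc/modules; the argument exists only so the
# function can be tried on a saved copy. (Hypothetical helper name.)
gpu_modules_loaded() {
    modules_file=${1:-/proc/modules}
    awk '$1 == "nouveau" || $1 ~ /^nvidia/ { print $1 }' "$modules_file"
}
```

If this prints nouveau after XFdrake was told to use the proprietary driver, that would be consistent with the theory above, though it does not prove the timing explanation.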

One thing that can be tried here is to tell harddrake not to autoreconfigure xorg.conf when you (or the drakx utils) change it. IIRC this can be done by editing /etc/sysconfig/harddrake2/service.conf and replacing AUTORECONFIGURE_RIGHT_XORG_DRIVER=yes with AUTORECONFIGURE_RIGHT_XORG_DRIVER=no. That should not be a complete fix anyway: it should only prevent the automatic modification of xorg.conf back to nouveau, not the loading of the nouveau module, and I have not received any feedback on whether it works at preventing even that.
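
As a sketch only, the edit could be scripted like this. The function name set_autoreconfigure is made up; the file path and key are the ones quoted above, and whether harddrake2 honours the setting in this situation is exactly what is untested.

```shell
# set_autoreconfigure VALUE [CONF_FILE]
# Rewrites the AUTORECONFIGURE_RIGHT_XORG_DRIVER= line to VALUE (yes/no),
# keeping a backup of the original as CONF_FILE.bak.
# CONF_FILE defaults to the harddrake2 service configuration.
# (Hypothetical helper; run as root when touching the real file.)
set_autoreconfigure() {
    value=$1
    conf=${2:-/etc/sysconfig/harddrake2/service.conf}
    sed -i.bak \
        "s/^AUTORECONFIGURE_RIGHT_XORG_DRIVER=.*/AUTORECONFIGURE_RIGHT_XORG_DRIVER=$value/" \
        "$conf"
}
```

Calling set_autoreconfigure no as root changes the real file; set_autoreconfigure yes puts it back.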

Now why does this happen on some systems with nvidia and not on others? There are many systems where it does not happen, and the 1080Ti is one of the cards well served by many driver series (proprietary or not).

A hypothesis (but don't take this as written in stone) is that it depends on how fast the system boots. A faster system with SSD/NVMe disks and a faster processor would have the nvidia kernel modules loaded faster and properly, before the nouveau module tries to step in.

What is needed? A second mechanism is needed to prevent nouveau from being loaded. Just adding "blacklist nouveau" to some /etc/modprobe.d/<file.conf>, as typically found in plenty of online tutorials (not ours), is not enough, because the nouveau.ko kernel module is loaded before that exclusion is reached. Something more "powerful" is needed. The more "powerful" string to add to the /etc/default/grub cmdline could be:

module_blacklist=nouveau nouveau.modeset=0 nouveau.noaccel=1

or some of them, added to GRUB_CMDLINE_LINUX_DEFAULT=... in /etc/default/grub (and then applied with update-grub2). The module_blacklist= option (note, *important*: it is *module_blacklist=...*, not modules_blacklist=... nor module_blacklists=...) expects as its argument a comma-separated list of kernel modules (in any order) to be excluded from loading altogether. Loading of such modules is prevented even at a later stage, so if you later run something like "modprobe nouveau" you get the error: "modprobe: ERROR: could not insert 'nouveau': Operation not permitted". This is quite different from 'blacklist <something>' in /etc/modprobe.d/*.conf. As an example, try to modprobe 'pcspkr', which is typically blacklisted in /etc/modprobe.d/blacklist-mga.conf, and you will see that you can still modprobe it (as root, of course).

If those options work for you, we might see, in future, whether that combination of strings can be added in some way to the drakx11 logic when configuring nvidia, to make things more robust. Here problems arise with multi-argument flags. In bootloader.pm there is the subroutine bootloader::get_append_with_key(), which we might use, but 'module_blacklist=' might already be present with a list of modules other than nouveau, like:

module_blacklist=mymodule1,mymodule2,mymodule3

and so on, so there should be a sub/logic able to add and remove the string without touching the other arguments.
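
To make the add/remove requirement concrete, here is a rough shell sketch of the logic. It is not the Perl that drakx11 would need, and the function names are invented. It assumes the command line is a plain space-separated string with no quoted values containing spaces or glob characters.

```shell
# add_to_module_blacklist MODULE CMDLINE
# Prints CMDLINE with MODULE present in module_blacklist=, leaving every
# other argument (and any other blacklisted module) untouched.
add_to_module_blacklist() {
    module=$1; cmdline=$2; out=; found=no
    for word in $cmdline; do
        case $word in
            module_blacklist=*)
                found=yes
                list=${word#module_blacklist=}
                case ",$list," in
                    *",$module,"*) ;;   # already listed, keep as is
                    *) word="module_blacklist=${list:+$list,}$module" ;;
                esac
                ;;
        esac
        out="${out:+$out }$word"
    done
    # No module_blacklist= option at all: append a new one.
    [ "$found" = yes ] || out="${out:+$out }module_blacklist=$module"
    printf '%s\n' "$out"
}

# remove_from_module_blacklist MODULE CMDLINE
# Prints CMDLINE with MODULE taken out of module_blacklist=; the option
# is dropped entirely if no other module remains in its list.
remove_from_module_blacklist() {
    module=$1; cmdline=$2; out=
    for word in $cmdline; do
        case $word in
            module_blacklist=*)
                kept=
                old_ifs=$IFS; IFS=,
                for m in ${word#module_blacklist=}; do
                    [ "$m" = "$module" ] || kept="${kept:+$kept,}$m"
                done
                IFS=$old_ifs
                [ -n "$kept" ] || continue   # list is now empty, skip option
                word="module_blacklist=$kept"
                ;;
        esac
        out="${out:+$out }$word"
    done
    printf '%s\n' "$out"
}
```

For example, add_to_module_blacklist nouveau "splash quiet module_blacklist=mymodule1,mymodule2" keeps both existing modules and appends nouveau to the same list.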

G.
Comment 3 Tony Blackwell 2024-10-25 14:53:44 CEST
Three more brief comments/suggestions from QA discussion:
> Thank you for grabbing this and explaining.
>
> I think a bug should be opened to contain the knowledge above and the discussion for implementation.
>
> /Morgan
>


Agreed. And we need to somehow solve this so that the drivers will work for all of our users without them needing to make custom changes like those listed above.

I'm not having the problem Tony is seeing, but my computer is using NVME drives. It will, unfortunately, be difficult to find people with affected hardware willing to test the solution that is found. (Other than Tony, of course...)

TJ 

Try changing AUTORECONFIGURE_RIGHT_XORG_DRIVER in
/etc/sysconfig/harddrake2/service.conf from yes to no, but do open a bug report,
and include whether or not the nvidia driver works ok with the change in the
service.conf file.

Regards, Dave Hodgins 


Tony: Back to my thoughts as OP.  Lots of good suggestions here which I've not explored tonight.

I don't understand why this has not been an issue on this hardware through M8 and then M9 until now.  Something has clearly changed with a recent update.  I note the suggestions re nvme and timing issues, but the hardware hasn't changed.  I'll look at the suggestions folk have generously worked on in the light of a new day.  Thank you all.
Tony
Marja Van Waes 2024-10-25 16:17:53 CEST

CC: (none) => kernel, marja11

Comment 4 Tony Blackwell 2024-10-25 23:48:40 CEST
in /etc/sysconfig/harddrake2/service.conf, changing AUTORECONFIGURE_RIGHT_XORG_DRIVER from yes to no has no effect.  

XFdrake still writes DRIVER "nvidia" to /etc/X11/xorg.conf, and the reboot process still changes it back to DRIVER "nouveau"
Comment 5 Tony Blackwell 2024-10-26 00:10:55 CEST
Created attachment 14724 [details]
journalctl of first boot after reconfig as per Comment 4

added graphics stuff in journalctl from first boot after reconfig as per Comment 4
Comment 6 Tony Blackwell 2024-10-26 00:19:22 CEST
whoops, dates are wrong in journalctl.  I'll get a driving lesson.
Comment 7 Tony Blackwell 2024-10-26 00:55:17 CEST
Hmmm, having trouble downsizing journalctl output to the relevant parts.  Syntax suggestions?
Comment 8 Giuseppe Ghibò 2024-10-28 01:10:47 CET
(In reply to Tony Blackwell from comment #4)
> in /etc/sysconfig/harddrake2/service.conf, changing
> AUTORECONFIGURE_RIGHT_XORG_DRIVER from yes to no has no effect.  
> 
> XFdrake still writes DRIVER "nvidia" to /etc/X11/xorg.conf, and the reboot
> process still changes it back to DRIVER "nouveau"

Try this:

Use XFdrake to switch to the nvidia proprietary driver as before. Wait for it to rebuild the dkms modules. Then, before rebooting, edit the file /etc/default/grub and append the new entries to the GRUB_CMDLINE_LINUX_DEFAULT="..." line:

GRUB_CMDLINE_LINUX_DEFAULT="...<your old stuff> module_blacklist=nouveau  nouveau.modeset=0 nouveau.noaccel=1"

then save the file and run 'update-grub2'. Then reboot.
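
The manual edit above could also be done with a small helper, sketched here under the assumption that GRUB_CMDLINE_LINUX_DEFAULT is a single double-quoted line in /etc/default/grub. The function name append_grub_default is hypothetical, and the text to append must not contain "|", "&" or backslashes.

```shell
# append_grub_default EXTRA [GRUB_FILE]
# Appends EXTRA inside the closing quote of GRUB_CMDLINE_LINUX_DEFAULT="...",
# keeping the old options and a backup copy in GRUB_FILE.bak.
# Does nothing if EXTRA is already present somewhere in the file.
# GRUB_FILE defaults to /etc/default/grub (edit it as root).
append_grub_default() {
    extra=$1
    grub_file=${2:-/etc/default/grub}
    grep -qF -- "$extra" "$grub_file" && return 0
    sed -i.bak \
        "s|^\(GRUB_CMDLINE_LINUX_DEFAULT=\"[^\"]*\)\"|\1 $extra\"|" \
        "$grub_file"
}
```

After append_grub_default "module_blacklist=nouveau nouveau.modeset=0 nouveau.noaccel=1", update-grub2 still has to be run by hand, as described above, before rebooting.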

CC: (none) => ghibomgx

Comment 9 Morgan Leijström 2024-10-30 14:40:57 CET
Source RPM says .mga9, setting the bug version accordingly.

---

(In reply to Tony Blackwell from comment #7)
> Hmmm, having trouble downsizing journalctl output to relevant.
> Syntax suggestions?

You will find plenty if you search the internet for journalctl, e.g. to filter by time:
 journalctl --since=10:00:47 --until=10:01:54
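
Besides filtering by time, the output can be cut down by keyword. A small sketch follows; the function name gpu_log_lines and the choice of keywords are only suggestions, while -b (current boot) and --no-pager are standard journalctl options.

```shell
# gpu_log_lines
# Reads log text on standard input and keeps only the lines that mention
# nvidia, nouveau or harddrake (case-insensitive).
# (Hypothetical helper name; the keyword list is a guess at what matters.)
gpu_log_lines() {
    grep -iE 'nvidia|nouveau|harddrake'
}

# Typical use, current boot only, saved to a file for attaching here:
#   journalctl -b --no-pager | gpu_log_lines > gpu-boot.log
```

Using journalctl -b -1 instead selects the previous boot, which helps if the interesting boot was the one before the current one.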

CC: (none) => fri
Version: Cauldron => 9
Status comment: (none) => For wiki (?) when we know more.

Comment 10 Lewis Smith 2024-10-31 20:26:10 CET
Is this the same thing as Bug 33549 ?
Comment 11 Tony Blackwell 2024-11-01 06:01:27 CET
Sorry folk, there is going to be a 3 week hiatus in this from my end; I am going interstate tomorrow, so no relevant hardware access.  I'll resume from 23rd Nov onwards.

Re comment 10, this is not the same as bug 33549, which relates to nvidia 470 and its support for old nvidia hardware - different chipset, different issues.  The 1080Ti is still very much supported at present by current nvidia drivers.

With thanks,  Tony
Comment 12 Lewis Smith 2024-11-18 21:39:02 CET
(In reply to Tony Blackwell from comment #0)
> Hardware context: 
> CPU Intel Core I7-7700K
> 32 GB RAM
> hard disk consists of 2 NVME drives in RAID0 for speed
> Graphics nvidia GTX 1080Ti, XFCE desktop
> 
The following point seems to have been overlooked:
>  My first impression is I am prevented from using the proprietary driver for
> my 1080 Ti GPU.  Using nvidia 550.120-1.mga9.nonfree.
> 
> Another M9 (testing) installation on
> same hardware, different partition, at nvidia 550.90.07 continues to work
> just fine.
Note *on same hardware*. It is just the nVidia version that differs.

I see that we are now at nvidia-current-550.127.05-1.mga9.nonfree, which is obviously worth trying.
And this being Mageia 9, can you, Tony, if necessary downgrade the nvidia package to nvidia-current-550.90.07-1.mga9.nonfree? Or what happens with
nvidia-current-550.100-1.mga9.nonfree, which has not been mentioned?

To summarise:
550.90.07 works OK
550.100 ?
550.120 Does NOT work
550.127.05 ?
Hoping this is not nonsense...

CC: (none) => lewyssmith
Assignee: bugsquad => kernel

Comment 13 Lewis Smith 2024-11-19 09:01:16 CET
(In reply to Lewis Smith from comment #12)
> Hoping this is not nonsense...
Alas, it is! Almost certainly barking up the wrong tree.

Try again:
> Another M9 (testing) installation on
> same hardware, different partition, at nvidia 550.90.07 continues to work
> just fine.
Note *on same hardware*.

What about the *kernel version* ?
It is a more likely factor. Please post this for the system which works and that which does not.

Also *systemd* which has been updated recently.
Same request.
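
A possible way to collect that information in one go on each installation, as a sketch. The function name report_versions is invented; it relies only on uname and, where available, rpm.

```shell
# report_versions
# Prints the running kernel release, then (if rpm is available) the
# installed systemd package and every installed package whose name
# contains "nvidia". (Hypothetical helper name.)
report_versions() {
    echo "kernel: $(uname -r)"
    if command -v rpm >/dev/null 2>&1; then
        rpm -q systemd
        rpm -qa | grep -i nvidia | sort
    fi
}
```

Running report_versions on both the working and the failing partition and pasting the two outputs here would allow a direct comparison.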
Comment 14 Tony Blackwell 2025-02-09 22:19:54 CET
With the passage of time and updates this issue has resolved.
I'm unfortunately not able to identify where the problem actually was, but it is resolved.

Resolution: (none) => FIXED
Status: NEW => RESOLVED

Comment 15 Morgan Leijström 2025-02-09 22:29:41 CET
Thank you Tony for the closure :-)
Comment 16 Giuseppe Ghibò 2025-02-09 23:22:11 CET
So you can update to your own NVIDIA architecture, which was released a few weeks ago :-)
