Bug 28632 - kernel-desktop-5.10.25-1.mga7 and nvidia-current-460.67-1.mga7 is a no go! SDDM doesn't work
Status: RESOLVED FIXED
Alias: None
Product: Mageia
Classification: Unclassified
Component: RPM Packages
Version: 7
Hardware: All Linux
Priority: Normal
Severity: normal
Target Milestone: ---
Assignee: Kernel and Drivers maintainers
Reported: 2021-03-23 14:23 CET by Adelson Oliveira
Modified: 2021-03-28 03:14 CEST

Source RPM: kernel-desktop-5.10.25-1.mga7

Description Adelson Oliveira 2021-03-23 14:23:56 CET
Description of problem:
Last night I updated the kernel to 5.10.25-1.mga7 and nvidia to 460.67-1.mga7, and the system no longer comes up. All I get is a black screen with some messages (unfortunately I did not record them).

With the old kernel 5.10.20-2.mga7 and the new nvidia-current-460.67-1.mga7 it works fine.

I have already removed and reinstalled the nvidia modules for the new kernel tree, and it did not help.

Version-Release number of selected component (if applicable):
kernel-desktop-5.10.25-1.mga7

Comment 1 Adelson Oliveira 2021-03-23 14:36:30 CET
About the messages: they are the same as those shown when booting kernel-desktop-5.10.20-2.mga7, which works fine. So all I can report is that the SDDM screen does not show up.
Comment 2 Lewis Smith 2021-03-23 19:24:37 CET
Thank you for the report, and sorry about the inconvenience.
It is fortunate that you have found that using the older kernel
5.10.20-2.mga7 with the new nvidia-current-460.67-1.mga7 works for you.

Assigning this to the kernel team.

Assignee: bugsquad => kernel

Comment 3 Adelson Oliveira 2021-03-27 04:30:21 CET
There are lines in dmesg that may belong to this problem:

[   23.554664] NVRM: API mismatch: the client has the version 460.67, but
               NVRM: this kernel module has the version 460.56.  Please
               NVRM: make sure that this kernel module and all NVIDIA driver
               NVRM: components have the same version.
[   23.554940] NVRM: API mismatch: the client has the version 460.67, but
               NVRM: this kernel module has the version 460.56.  Please
               NVRM: make sure that this kernel module and all NVIDIA driver
               NVRM: components have the same version.
[   23.555245] NVRM: API mismatch: the client has the version 460.67, but
               NVRM: this kernel module has the version 460.56.  Please
               NVRM: make sure that this kernel module and all NVIDIA driver
               NVRM: components have the same version.
[   23.555575] NVRM: API mismatch: the client has the version 460.67, but
               NVRM: this kernel module has the version 460.56.  Please
               NVRM: make sure that this kernel module and all NVIDIA driver
               NVRM: components have the same version.
Comment 4 Dave Hodgins 2021-03-27 05:02:16 CET
Which nvidia packages do you have installed? On the only system I have that
has a nvidia gpu, I have ...
[dave@x8t ~]$ rpm -qa|grep nvidia
dkms-nvidia-current-460.67-1.mga8.nonfree
nvidia-current-doc-html-460.67-1.mga8.nonfree
nvidia-cuda-toolkit-11.2.0-8.mga8.nonfree
nvidia-current-utils-460.67-1.mga8.nonfree
x11-driver-video-nvidia-current-460.67-1.mga8.nonfree
lib64nvidia-egl-wayland1-1.1.5-3.mga8
nvidia-current-cuda-opencl-460.67-1.mga8.nonfree
[dave@x8t ~]$ uname -r
5.10.25-desktop-1.mga8

CC: (none) => davidwhodgins

Comment 5 Guy Gallagher 2021-03-27 05:07:16 CET
I had this same issue when I upgraded earlier this week. Solved by booting into old kernel and then reinstalling (remove/add) the 5.10.25 kernel packages (desktop and dev). Am happily running on the latest packages now:

[gallaghg@Wolverine ~]$ uname -a
Linux Wolverine 5.10.25-desktop-1.mga7 #1 SMP Sat Mar 20 17:16:25 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

[gallaghg@Wolverine ~]$ dmesg|grep -i nvid
[    3.246716] nvidia: loading out-of-tree module taints kernel.
[    3.246725] nvidia: module license 'NVIDIA' taints kernel.
[    3.271293] nvidia-nvlink: Nvlink Core is being initialized, major device number 245
[    3.271644] nvidia 0000:08:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=io+mem
[    3.471437] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  460.67  Thu Mar 11 00:11:45 UTC 2021

Not sure why the setup failed initially, however.

CC: (none) => guy.gallagher

Comment 6 Thomas Backlund 2021-03-27 10:20:46 CET
(In reply to Adelson Oliveira from comment #3)
> There are lines in dmesg that may belong to this problem:
> 
> [   23.554664] NVRM: API mismatch: the client has the version 460.67, but
>                NVRM: this kernel module has the version 460.56.  Please
>                NVRM: make sure that this kernel module and all NVIDIA driver
>                NVRM: components have the same version.
> [   23.554940] NVRM: API mismatch: the client has the version 460.67, but
>                NVRM: this kernel module has the version 460.56.  Please
>                NVRM: make sure that this kernel module and all NVIDIA driver
>                NVRM: components have the same version.
> [   23.555245] NVRM: API mismatch: the client has the version 460.67, but
>                NVRM: this kernel module has the version 460.56.  Please
>                NVRM: make sure that this kernel module and all NVIDIA driver
>                NVRM: components have the same version.
> [   23.555575] NVRM: API mismatch: the client has the version 460.67, but
>                NVRM: this kernel module has the version 460.56.  Please
>                NVRM: make sure that this kernel module and all NVIDIA driver
>                NVRM: components have the same version.

This is a transaction ordering issue that sometimes happens when the kernel and the nvidia drivers get updated at the same time...


The new kernel got installed and the old dkms-nvidia-current 460.56 rebuilt its module. Then the kernel posttrans created the initrd, adding the "old" nvidia module.

In the next transaction, the nvidia driver updated itself from 460.56 to 460.67, causing the kernel vs. userspace mismatch...
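
For anyone checking whether they hit the same mismatch, here is a minimal sketch (not part of any package) that pulls the two versions out of kernel log text shaped like the NVRM lines quoted above. The function name nvrm_mismatch is made up for illustration, and the parsing assumes the message wording stays as quoted.

```shell
#!/bin/sh
# nvrm_mismatch (hypothetical helper): read kernel log text on stdin and
# print each distinct client/module version pair once.
nvrm_mismatch() {
    awk '
        # Return the leading dotted version number of s, or "" if none.
        function leading_version(s) {
            if (match(s, /^[0-9]+(\.[0-9]+)*/))
                return substr(s, RSTART, RLENGTH)
            return ""
        }
        /NVRM: API mismatch: the client has the version / {
            line = $0
            sub(/.*the client has the version /, "", line)
            client = leading_version(line)
        }
        /NVRM: this kernel module has the version / {
            line = $0
            sub(/.*this kernel module has the version /, "", line)
            module = leading_version(line)
            if (client != "" && module != "" && !seen[client " " module]++)
                print "client=" client " module=" module
            client = ""
        }
    '
}

# Usage (reading the kernel log may need root):
#   dmesg | nvrm_mismatch
```

If this prints a line at all, userspace and the loaded kernel module disagree, which is the situation described here.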

When this happens, if you get as far as a command prompt, you should be able to resolve it with a simple "dracut -f" to get the newest nvidia driver into the initrd, and then reboot...

In the worst case you might need to re-trigger the dkms build before creating the initrd, with:

/usr/sbin/dkms_autoinstaller start
dracut -f

or, if you want to do it while running an older kernel:

/usr/sbin/dkms_autoinstaller start 5.10.25-desktop-1.mga8
dracut -f /boot/initrd-5.10.25-desktop-1.mga8.img 5.10.25-desktop-1.mga8

(just change the "5.10.25-desktop-1.mga8" to match the kernel you want to trigger build for)
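
The two steps above could be wrapped in a small helper so the kernel version is only typed once. This is just a sketch: the function name rebuild_nvidia_initrd and the RUN switch are made up for illustration, and by default it only prints the commands rather than running them.

```shell
#!/bin/sh
# rebuild_nvidia_initrd (hypothetical helper): print the dkms + dracut steps
# for a kernel version, or execute them when RUN=1 is set (needs root).
rebuild_nvidia_initrd() {
    kver=${1:-$(uname -r)}      # default to the running kernel
    run=${RUN:-0}
    for cmd in \
        "/usr/sbin/dkms_autoinstaller start $kver" \
        "dracut -f /boot/initrd-$kver.img $kver"
    do
        if [ "$run" = 1 ]; then
            # Stop at the first failing step so dracut never packs a stale module.
            $cmd || return 1
        else
            echo "$cmd"
        fi
    done
}

# Usage:
#   rebuild_nvidia_initrd 5.10.25-desktop-1.mga7          # dry run, prints commands
#   RUN=1 rebuild_nvidia_initrd 5.10.25-desktop-1.mga7    # actually run them
```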
Comment 7 Morgan Leijström 2021-03-27 12:05:16 CET
FWIW no problems here mga7-64 SDDM Plasma

[morgan@svarten ~]$ uname -a
Linux svarten.tribun 5.10.25-desktop-1.mga7 #1 SMP Sat Mar 20 17:16:25 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

[morgan@svarten ~]$ rpm -qa|grep nvidia
dkms-nvidia-current-460.67-1.mga7.nonfree
x11-driver-video-nvidia-current-460.67-1.mga7.nonfree
nvidia-current-cuda-opencl-460.67-1.mga7.nonfree
nvidia-cuda-toolkit-10.1.168-1.2.mga7.nonfree
nvidia-current-utils-460.67-1.mga7.nonfree
nvidia-current-doc-html-460.67-1.mga7.nonfree

$ sudo journalctl -b | grep NVRM
mar 25 23:44:19 svarten.tribun kernel: NVRM: loading NVIDIA UNIX x86_64 Kernel Module  460.67  Thu Mar 11 00:11:45 UTC 2021

CC: (none) => fri

Comment 8 Adelson Oliveira 2021-03-27 15:29:55 CET
Yes, I had already noticed that this problem seemed related to updating the kernel and nvidia too close together in time, but I could not confirm the cause-and-effect relationship. It is good to know!

Problem solved with

dracut -f

in recovery mode.

Just to report, the option 

dkms_autoinstaller start 5.10.25-desktop-1.mga7

in a session with the older kernel didn't work. The output says that the development package for this kernel is not installed, although it is in fact installed.

Thanks for the information and the solution!

Should I mark this as solved, or is that done by the managers of the Bugzilla?
Comment 9 Aurelien Oudelet 2021-03-27 16:01:46 CET
(In reply to Adelson Oliveira from comment #8)
> Yes, I've already noticed that this problem seemed related to updating
> kernel and nvidia too close in time but I could not confirm this
> cause-effect relationship. It is good to know!
> 
> Problem solved with
> 
> dracut -f
> 
> in recovery mode.
> 
> Just to report, the option 
> 
> dkms_autoinstaller start 5.10.25-desktop-1.mga7
> 
> in a session with the older kernel didn't work. The output is that the
> development package for this kernel is not installed although it is
> installed in fact.
> 
> Thanks for the information and the solution!
> 
> Should I mark this as solved or is this made by managers of the bugzilla?

No, thanks for reporting that this is fixed.

CC: (none) => ouaurelien
Status: NEW => RESOLVED
Resolution: (none) => FIXED

Comment 10 Morgan Leijström 2021-03-27 16:39:57 CET
(In reply to Thomas Backlund from comment #6)
>
> This is a transaction ordering issue that sometimes happens when the kernel
> and the nvidia drivers get updated at the same time...

That's sad.

Do we have a bug report for that?
Comment 11 Thomas Backlund 2021-03-27 17:56:35 CET
(In reply to Morgan Leijström from comment #10)
> (In reply to Thomas Backlund from comment #6)
> >
> > This is a transaction ordering issue that sometimes happens when the kernel
> > and the nvidia drivers get updated at the same time...
> 
> That's sad.
> 
> Do we have a bug report for that?

There is nothing that can be fixed...
Comment 12 Giuseppe Ghibò 2021-03-27 23:04:26 CET
I think I've found the problem.

There are two sets of installed nvidia modules, one in:

/var/lib/dkms/nvidia-current/460.67-1.mga8.nonfree/$(uname -r)/x86_64/module/nvidia-current.ko.xz

and one in:

/usr/lib/modules/$(uname -r)/dkms/drivers/char/drm/nvidia-current.ko.xz

The first one is generated by the command:

/usr/sbin/dkms --rpm_safe_upgrade build -m nvidia-current -v 460.67-1.mga8.nonfree

and

the second one is generated by the command:

/usr/sbin/dkms --rpm_safe_upgrade install -m nvidia-current -v 460.67-1.mga8.nonfree --force

but *if and only if* the first command is successful; otherwise the second command is skipped. Both commands are in the dkms-nvidia-current %postinstall scriptlets.

I guess that for some reason (machine hang, non-zero exit code, etc.) the second command was not executed, leaving you with an incomplete installation and two mismatching module sets. If you do:

modinfo /var/lib/dkms/nvidia-current/460.67-1.mga8.nonfree/$(uname -r)/x86_64/module/nvidia-current.ko.xz | grep ^version

and

modinfo /usr/lib/modules/$(uname -r)/dkms/drivers/char/drm/nvidia-current.ko.xz | grep ^version

you'll probably get 460.67 in the first case and 460.56 in the second. In that case, completing the "interrupted" installation stage with:

/usr/sbin/dkms --rpm_safe_upgrade install -m nvidia-current -v 460.67-1.mga8.nonfree --force -k $(uname -r)

should fix the problem. However, this is a manual fix.

I'll dig to see if something can be done to make this more robust (or less fragile).
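
In the meantime, the two modinfo checks above can be combined into one comparison. A rough sketch, assuming the two paths given in this comment; the function names are made up for illustration, and "modinfo -F version" is used instead of grep to print just the version field.

```shell
#!/bin/sh
# Hypothetical helpers: compare the DKMS-built module with the copy
# installed in the kernel's module tree.

# Print the "version" field of a module file (empty if it cannot be read).
module_version() {
    modinfo -F version "$1" 2>/dev/null
}

# Compare two version strings and report the result.
report_versions() {
    if [ -z "$1" ] || [ -z "$2" ]; then
        echo "unknown: built='$1' installed='$2'"
        return 2
    elif [ "$1" = "$2" ]; then
        echo "ok: both modules are $1"
    else
        echo "mismatch: built=$1 installed=$2"
        return 1
    fi
}

# $1 = dkms module version (e.g. 460.67-1.mga8.nonfree), $2 = kernel (default: running)
check_nvidia_modules() {
    dkms_ver=$1
    kver=${2:-$(uname -r)}
    built=/var/lib/dkms/nvidia-current/$dkms_ver/$kver/x86_64/module/nvidia-current.ko.xz
    installed=/usr/lib/modules/$kver/dkms/drivers/char/drm/nvidia-current.ko.xz
    report_versions "$(module_version "$built")" "$(module_version "$installed")"
}

# Usage:
#   check_nvidia_modules 460.67-1.mga8.nonfree
```

A "mismatch" result would point at the skipped install step described above; "ok" means the two module sets agree and the stale copy, if any, is only in the initrd.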

CC: (none) => ghibomgx

Comment 13 Adelson Oliveira 2021-03-28 03:14:52 CEST
Well, as I reported above, I did 

# dracut -f

in recovery mode and now SDDM works fine. Anyway, I tried both modinfo commands suggested by Giuseppe Ghibò and got 460.67 from each. That is perhaps not surprising, since dracut solved the problem ...

Thanks anyway
