Bug 33375 - Update request: nvidia-newfeature
Summary: Update request: nvidia-newfeature
Status: RESOLVED FIXED
Alias: None
Product: Mageia
Classification: Unclassified
Component: RPM Packages
Version: 9
Hardware: All Linux
Priority: Normal normal
Target Milestone: ---
Assignee: QA Team
QA Contact:
URL:
Whiteboard: MGA9-64-OK
Keywords: advisory, validated_update
Depends on:
Blocks:
 
Reported: 2024-07-06 15:50 CEST by Giuseppe Ghibò
Modified: 2024-07-29 20:27 CEST

See Also:
Source RPM: nvidia-newfeature-555.58.02-1.mga9.nonfree
CVE:
Status comment:


Attachments
Screenshot of text after suspend-resume, top part (152.37 KB, image/jpeg)
2024-07-24 13:26 CEST, Morgan Leijström
Screenshot of text after suspend-resume, bottom part (112.27 KB, image/jpeg)
2024-07-24 13:28 CEST, Morgan Leijström

Description Giuseppe Ghibò 2024-07-06 15:50:53 CEST
This is a bugfix release; the fixes are listed here:

https://www.nvidia.com/Download/driverResults.aspx/228410/en-us/
Comment 1 Lewis Smith 2024-07-08 20:53:09 CEST
Assigning to Drivers.

Assignee: bugsquad => kernel

Comment 2 Morgan Leijström 2024-07-09 00:34:27 CEST
mga9-64 initial tests: works except for resuming:

Sorry to say that on my system this version also hangs on resuming from suspend, as described at https://bugs.mageia.org/show_bug.cgi?id=33316#c35, just like the previous test version.

And that is for both 6.6.28 and 6.6.37 kernels.

CC: (none) => fri

Morgan Leijström 2024-07-09 00:42:20 CEST

Assignee: kernel => qa-bugs

Comment 3 katnatek 2024-07-09 04:00:51 CEST
Source RPMS
nvidia-newfeature-555.58.02-1.mga9.nonfree

Binaries RPMS in x86_64

Repository: 9-x86_64-nonfree-updates_testing
dkms-nvidia-newfeature-555.58.02-1.mga9.nonfree
nvidia-newfeature-all-555.58.02-1.mga9.nonfree
nvidia-newfeature-cuda-opencl-555.58.02-1.mga9.nonfree
nvidia-newfeature-devel-555.58.02-1.mga9.nonfree
nvidia-newfeature-doc-html-555.58.02-1.mga9.nonfree
nvidia-newfeature-lib32-555.58.02-1.mga9.nonfree
nvidia-newfeature-utils-555.58.02-1.mga9.nonfree
x11-driver-video-nvidia-newfeature-555.58.02-1.mga9.nonfree
katnatek 2024-07-10 00:01:36 CEST

Keywords: (none) => advisory

Comment 4 Thomas Andrews 2024-07-10 22:57:59 CEST
MGA9-64 Plasma, i5-7500, nvidia Quadro K620. 

First test, using desktop and server 6.6.28 kernels. Installed with server kernel using MCC. No installation issues. 

No apparent issues with either kernel except that auto-login is disabled, no matter what the setting in MCC is. Restoring nvidia-current restores auto-login.

CC: (none) => andrewsfarm

Comment 5 Thomas Andrews 2024-07-11 02:41:31 CEST
Also disabled auto-login for 6.6.37 kernels.
Comment 6 katnatek 2024-07-24 03:41:35 CEST
Ping: do we leave this in the freezer for more time?
Comment 7 Morgan Leijström 2024-07-24 08:22:18 CEST
As usual, I think we are too few testing this...
On my system it hangs on resuming from suspend, which is a regression from nvidia-current.

Anyway, now that time has gone by, let's also test it with the new mesa and kernel-6.6.41 now in testing.

I will also try a couple of reboots before suspending.

ref: https://bugs.mageia.org/show_bug.cgi?id=33426#c2

CC: (none) => ghibomgx

Comment 8 Giuseppe Ghibò 2024-07-24 11:22:43 CEST
The aim of newfeature has usually been to give users the opportunity to run the newest driver, but IMHO it can have side effects on resuming or power management on certain hardware. The latest driver also has a shorter life cycle (usually close to 3 months).

Suspend problems usually don't occur on newer systems with many PCI lanes. Beyond this, even a look at the nvidia forums shows plenty of reports about such problems (for nvidia-current too, not just nvidia-newfeature).

What is strange here is the report about auto-login no longer working, so I wonder whether it could be a packaging problem rather than the drivers themselves (which are closed source, so mostly we take what there is).

For the next driver rounds things seem to be getting even more complicated. The upcoming 560.xx driver introduces the nvidia open source kernel modules alongside the closed source ones (upstream advises using them), and we don't have those packaged yet. So there could be closed source kernel modules, open source kernel modules, ...
Comment 9 Morgan Leijström 2024-07-24 13:24:21 CEST
Testing with kernel-linus-6.6.41-1:

VT switching kind of works, except that a few seconds after returning to Plasma the desktop goes black for a couple of seconds.

From journal:

jul 24 13:08:15 svarten.tribun kernel: QSGRenderThread[11403]: segfault at 7f4af90b0c77 ip 00007f4d69fb7c28 sp 00007f4d35137a40 error 4 in libQt5Quick.so.5.15.7[7f4d69f14000+2da000] likely on CPU 2 (core 0, socket 0)

And plasma-plasmashell.service restarted.
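
For anyone wanting to dig into the journal line above, here is a minimal sketch (not something used in this bug) that pulls the fields out of the kernel segfault message and decodes the x86 page-fault error code. The function and field names are made up for illustration. The computed offset is relative to the mapping start printed by the kernel, which is not necessarily the offset into the library file.

```python
import re

# Journal line copied from this comment.
LINE = (
    "jul 24 13:08:15 svarten.tribun kernel: QSGRenderThread[11403]: segfault at "
    "7f4af90b0c77 ip 00007f4d69fb7c28 sp 00007f4d35137a40 error 4 in "
    "libQt5Quick.so.5.15.7[7f4d69f14000+2da000] likely on CPU 2 (core 0, socket 0)"
)

# Layout of the x86 kernel "segfault at ..." message; all numbers are hexadecimal.
SEGFAULT_RE = re.compile(
    r"(?P<comm>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+) "
    r"in (?P<obj>\S+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def parse_segfault(line):
    """Return a dict describing the fault, or None if the line does not match."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    ip = int(m["ip"], 16)
    base = int(m["base"], 16)
    err = int(m["err"], 16)
    return {
        "thread": m["comm"],
        "pid": int(m["pid"]),
        "object": m["obj"],
        # Offset of the faulting instruction from the start of the reported mapping.
        "offset": ip - base,
        # x86 page-fault error code bits.
        "page_present": bool(err & 0x1),  # 0 means the page was not present
        "write": bool(err & 0x2),         # 0 means the access was a read
        "user_mode": bool(err & 0x4),     # 1 means the fault came from user space
    }


if __name__ == "__main__":
    info = parse_segfault(LINE)
    print(info["object"], hex(info["offset"]))
```

With the matching debuginfo installed, such an offset is the kind of value one could try feeding to a tool like addr2line, keeping the caveat about mapping versus file offsets in mind.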

---

Resuming after suspend shows a full text screen with log output, then hangs; the only way out is REISUB (no reaction until the R)
- Will attach photos of it.
Comment 10 Morgan Leijström 2024-07-24 13:26:30 CEST
Created attachment 14602
Screenshot of text after suspend-resume, top part

The top line says nvidia exited with IRQs disabled.
Comment 11 Morgan Leijström 2024-07-24 13:28:36 CEST
Created attachment 14603
Screenshot of text after suspend-resume, bottom part

I note that sdc, where the system is installed, has had its size changed to 0 (bottom lines).
Comment 12 Giuseppe Ghibò 2024-07-24 13:33:33 CEST
(In reply to Morgan Leijström from comment #9)
> Testing with kernel-linus-6.6.41-1:
> 
> vt switching kind works minus that after some seconds after returning to
> Plasma, desktop goes black for a couple second.
> 
> From journal:
> 
> jul 24 13:08:15 svarten.tribun kernel: QSGRenderThread[11403]: segfault at
> 7f4af90b0c77 ip 00007f4d69fb7c28 sp 00007f4d35137a40 error 4 in
> libQt5Quick.so.5.15.7[7f4d69f14000+2da000] likely on CPU 2 (core 0, socket 0)
> 
> And plasma-plasmashell.service restarted.

Maybe it triggers another bug in qt5/qtdeclarative5 5.15.7, which has not been updated for a while. You might check whether the same happens without Plasma, e.g. in a plain desktop like icewm.
Comment 13 Morgan Leijström 2024-07-24 13:40:32 CEST
Yes... I might try that too.

Anyway... the suspend-resume hang does not seem to be desktop related.

Given this, comment 8, and Bug 33426#c6: yes, it feels like we need more devs and more QA to really be able to provide this nvidia proprietary mess... :(
I believe my next computer will not have an nvidia GPU, so there will be even less QA here.

For now I am off to verify our released nvidia470 with kernel 6.6.41.
Comment 14 Giuseppe Ghibò 2024-07-24 14:09:58 CEST
(In reply to Morgan Leijström from comment #13)
> Yes... I might try that too.
> 
> Anyway... the suspend-resume hang seem not desktop related.
> 
> This, Comment 8, and Bug 33426#c6 yes it feels we need to be more of both
> devs and QA to be able to really provide this nvidia proprietary mess... :(
> I believe in next computer I will not have nvidia GPU, so even less QA here.
> 

Let's see how it goes with 560.xx and the open source kernel modules (the open source kernel module code relies mostly on the card firmware). The open source modules also have further restrictions on supported cards; IIRC cards below GTX 16xx are not supported, see here:

https://github.com/NVIDIA/open-gpu-kernel-modules/blob/main/README.md

So I guess we would have a further dkms-nvidia-open, besides the current one...

From what I could test, most of the resume and VT switching problems (apart from one upstream case that we fixed with a patch) don't happen on a newer RTX card or Quadro with a newer motherboard. This of course doesn't mean the bugs aren't there, because the same bug might simply not be triggered with some older kernel series, but that hardware is probably where most of the upstream developers are.

Nasty problems also occur on other chipsets, though. I saw severe problems on Intel HD graphics (e.g. one works, another doesn't) and on AMD cards (sometimes it is even difficult to determine which card supports a certain feature; I saw this with our ROCm tests). The only advantage is that there are fewer complications with external kernel modules, so in the end less packaging (and thus fewer packaging bugs), but even on AMD we saw, for instance, that some proprietary package was required for full OpenCL/HIP compatibility.

Mostly it seems that the more you dig into features, including access to full or advanced ones (e.g. even color profiles, 10 bit, ROCm), the more problems arise...
Comment 15 Thomas Andrews 2024-07-24 14:40:00 CEST
Just so you know, I will be going on vacation on Saturday, and will not be back to my Nvidia computer until the following Saturday, so I will be unable to test anything Nvidia-related until then.
Comment 16 Morgan Leijström 2024-07-24 15:29:43 CEST
I think we should release this 555.58.02-1 as is.

- we don't know if it is even possible to make it more reliable on the probably rather rare systems where it has problems.

And users may want this 555 series.

Giving this an OK knowing it is what it is, known to have problems on other distros too.

If you agree TJ, validate.

---

(In reply to Morgan Leijström from comment #13)
> For now I am off to verify our released nvidia470 with kernel 6.6.41.

Result: I see no problem there.

Bugs on this kernel version need to be opened for QA.

---

Ah, dkms-nvidia-open too. Maybe you should open a bug now, assigned to the tools maintainers, to discuss how this should be handled in drakX11.
Maybe develop it in Cauldron, for mga10, and for now in mga9 only handle it manually and/or test with a drakx11 in backports.

Whiteboard: (none) => MGA9-64-OK

Comment 17 Thomas Andrews 2024-07-24 15:50:48 CEST
(In reply to Morgan Leijström from comment #16)
> I think we should release this 555.58.02-1 as is.
> 
> - we dont now if it is even possible to make it more reliable on the
> probably rather rare systems where it have problems.
> 
> And users may want this 555 series.
> 
> Giving this an OK knowing it is what it is, known to have problems on other
> distros too.
> 
> If you agree TJ, validate.
> 
> ---
> 
Your opinion, Giuseppe?
Comment 18 Thomas Andrews 2024-07-24 15:54:36 CEST
(In reply to Giuseppe Ghibò from comment #14)
> Let's see with 560.xx and kernel opensource modules (mostly in the
> opensource kernel modules code rely on card firmware). Also open source
> modules have a further restrictions on supported card, IIRC below GTX 16xx
> are not supported, see here:
> 
> https://github.com/NVIDIA/open-gpu-kernel-modules/blob/main/README.md
> 
> So I guess we would have a further dkms-nvidia-open, beside current...
> 
I don't see the Quadro K620 on that list, so it looks like I won't be able to test that one. I'm not going to buy yet another Nvidia card that I don't need just to test new drivers.
Comment 19 Giuseppe Ghibò 2024-07-24 21:12:41 CEST
(In reply to Thomas Andrews from comment #18)
> (In reply to Giuseppe Ghibò from comment #14)
> > Let's see with 560.xx and kernel opensource modules (mostly in the
> > opensource kernel modules code rely on card firmware). Also open source
> > modules have a further restrictions on supported card, IIRC below GTX 16xx
> > are not supported, see here:
> > 
> > https://github.com/NVIDIA/open-gpu-kernel-modules/blob/main/README.md
> > 
> > So I guess we would have a further dkms-nvidia-open, beside current...
> > 
> I don't see the Quadro K620 on that list, so it looks like I won't be able
> to test that one. I'm not going to buy yet another Nvidia card that I don't
> need just to test new drivers.

For the autologin problem, it could be a general problem in the tools (e.g. packaging, or deeper in the drakx utils), but at the moment I haven't found anything.

As for 555.58.02, I think it is stable enough, but IMHO it will be the last of this series (555.45 series is also bundled into the latest upstream cuda-toolkit 12.5.0 [we don't have it yet]).

As for 560, 560.28.03-beta is currently out, so a newer 560.xx release will probably appear pretty soon as the new "production" branch. The upstream "new feature" branch will then be shut off for some months, with the main interest going to the newer 560.xx, as happened in the past. Of course, since the production branch has a 6-month cycle and newfeature a 3-month one (more or less), there will always be some point in time where the production version surpasses the new-feature version.

For 560.xx, the upstream binaries will *default* to the open source kernel modules (the rest, i.e. the GL libraries, x11-driver-video-nvidia, etc., is still required), but the proprietary kernel modules are still provided, as they were in the past [https://developer.nvidia.com/blog/nvidia-transitions-fully-towards-open-source-gpu-kernel-modules/].

For the K620, according to 560.28.03 docs, here:

https://download.nvidia.com/XFree86/Linux-x86_64/560.28.03/README/supportedchips.html

the Quadro K620 (which has PCI-id 10de:13bb) is still supported (by the proprietary kernel modules). What we might try next is to provide a testing dkms-nvidia-open package based on the nvidia open source modules and see how it goes alongside the older one.
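
To look up a card in that supported-chips list you first need its PCI-id. A rough sketch of extracting the NVIDIA vendor:device pairs from `lspci -nn` style output follows. The sample lines are illustrative and not taken from any machine in this bug, and the function name is made up; only the vendor ID 10de and the K620 ID 10de:13bb come from the text above.

```python
import re

# `lspci -nn` prints the numeric ID as "[vendor:device]", four hex digits each.
PCI_ID_RE = re.compile(r"\[([0-9a-f]{4}):([0-9a-f]{4})\]")

NVIDIA_VENDOR = "10de"

# Illustrative sample only; real output would come from running `lspci -nn`.
SAMPLE = """\
00:02.0 VGA compatible controller [0300]: Intel Corporation HD Graphics 630 [8086:5912] (rev 04)
01:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM107GL [Quadro K620] [10de:13bb] (rev a2)
"""


def nvidia_device_ids(lspci_nn_output):
    """Return the "vendor:device" strings of all NVIDIA devices in the output."""
    ids = []
    for line in lspci_nn_output.splitlines():
        for vendor, device in PCI_ID_RE.findall(line.lower()):
            if vendor == NVIDIA_VENDOR:
                ids.append(f"{vendor}:{device}")
    return ids


if __name__ == "__main__":
    for pci_id in nvidia_device_ids(SAMPLE):
        print(pci_id)
```

The device part of each ID (13bb here) is what would then be searched for in the supportedchips.html page of the driver version in question.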
Comment 20 Giuseppe Ghibò 2024-07-24 21:14:45 CEST
(In reply to Morgan Leijström from comment #16)

> Ah dkms-nvidia-open too.  Maybe you should already now open a bug set to
> tools maintainers to discuss how this should be handled in drakX11.
> Maybe to develop in Cauldron, for mga10, and for now in mga9 only handle it
> manually and/or testing with a drakx11 in backport.

yes, why not?
Comment 21 Thomas Andrews 2024-07-25 04:10:29 CEST
Neither of the issues we have found makes the system unusable, and the newfeature driver *should* be considered, um, "experimental" by users, so I guess we can send it on.

Giuseppe, I have only tried this driver with a Plasma system, so it's possible that autologin might still work with a DM other than sddm. 

Validating.

Keywords: (none) => validated_update
CC: (none) => sysadmin-bugs

Comment 22 Morgan Leijström 2024-07-25 11:42:47 CEST
Another point making the update less critical is that newfeature is never preselected by our tools, i.e. at install, so users who use it have actively chosen to, and know more about drivers than users using the default.
Comment 23 Mageia Robot 2024-07-29 20:27:51 CEST
An update for this issue has been pushed to the Mageia Updates repository.

https://advisories.mageia.org/MGAA-2024-0166.html

Status: NEW => RESOLVED
Resolution: (none) => FIXED

