Bug 33316 - Update request: nvidia-current-550.90.07-1.mga9.nonfree
Summary: Update request: nvidia-current-550.90.07-1.mga9.nonfree
Status: RESOLVED FIXED
Alias: None
Product: Mageia
Classification: Unclassified
Component: RPM Packages
Version: 9
Hardware: All Linux
Priority: Normal (severity: normal)
Target Milestone: ---
Assignee: QA Team
QA Contact:
URL:
Whiteboard: MGA9-64-OK
Keywords: advisory, validated_update
Depends on:
Blocks:
 
Reported: 2024-06-19 14:32 CEST by Giuseppe Ghibò
Modified: 2024-07-08 23:14 CEST

See Also:
Source RPM: nvidia-current-550.90.07-1.mga9.nonfree, ldetect-lst-0.6.58-1.mga9
CVE:
Status comment:


Description Giuseppe Ghibò 2024-06-19 14:32:14 CEST
It's a bugfix release. The fixes are listed at:

https://www.nvidia.com/Download/driverResults.aspx/226768/en-us/

There is also ldetect-lst, which refreshes the pci-table for the newer driver.
Comment 1 Morgan Leijström 2024-06-20 00:08:53 CEST
I assume you meant to set this to QA.

Please provide packages list.

ldetect-lst is still in build queue.
Is it supposed to go in this bug or a separate one?

That said, I am already using 550. It worked OK on the host with all three 6.6.28 kernels when I tested the VirtualBox update, Bug 33273.

I updated using drakrpm, manually selecting all nvidia 550 packages I want:

- dkms-nvidia-current-550.90.07-1.mga9.nonfree.x86_64
- nvidia-current-cuda-opencl-550.90.07-1.mga9.nonfree.x86_64
- nvidia-current-doc-html-550.90.07-1.mga9.nonfree.x86_64
- nvidia-current-utils-550.90.07-1.mga9.nonfree.x86_64
- x11-driver-video-nvidia-current-550.90.07-1.mga9.nonfree.x86_64

kmods got built for the running kernel, and then automatically during boot when switching kernels.

CC: (none) => fri
Assignee: bugsquad => qa-bugs

Comment 2 Brian Rockwell 2024-06-20 04:17:03 CEST
MGA9-64, AMD Ryzen 5 2600, Nvidia 1650 super, GNOME

The following 4 packages are going to be installed:

- dkms-nvidia-current-550.90.07-1.mga9.nonfree.x86_64
- nvidia-current-cuda-opencl-550.90.07-1.mga9.nonfree.x86_64
- nvidia-current-utils-550.90.07-1.mga9.nonfree.x86_64
- x11-driver-video-nvidia-current-550.90.07-1.mga9.nonfree.x86_64

1.1MB of additional disk space will be used.


---- rebooted

$ nvidia-smi
Wed Jun 19 20:55:04 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce GTX 1650 ...    Off |   00000000:07:00.0  On |                  N/A |
| 35%   37C    P8             11W /  100W |      96MiB /   4096MiB |      2%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      3189      G   /usr/libexec/Xorg                              46MiB |
|    0   N/A  N/A      3328      G   /usr/bin/gnome-shell                           43MiB |
+-----------------------------------------------------------------------------------------+

browser working
calc working


works for me

CC: (none) => brtians1

Comment 3 Giuseppe Ghibò 2024-06-20 10:34:14 CEST
(In reply to Morgan Leijström from comment #1)

> I assume you meant to set this to QA.
> 
> Please provide packages list.
> 
> ldetect-lst is still in build queue.
> Is it supposed to go in this bug or a separate one?

ldetect-lst is supposed to go in this bug.
katnatek 2024-06-20 21:05:58 CEST

Keywords: (none) => advisory

Comment 4 katnatek 2024-06-23 03:46:14 CEST
SRPMS
ldetect-lst-0.6.58-1.mga9
nvidia-current-550.90.07-1.mga9.nonfree

RPMS in:

9/x86_64/nonfree/updates_testing

dkms-nvidia-current-550.90.07-1.mga9.nonfree
nvidia-current-all-550.90.07-1.mga9.nonfree
nvidia-current-cuda-opencl-550.90.07-1.mga9.nonfree
nvidia-current-devel-550.90.07-1.mga9.nonfree
nvidia-current-doc-html-550.90.07-1.mga9.nonfree
nvidia-current-lib32-550.90.07-1.mga9.nonfree
nvidia-current-utils-550.90.07-1.mga9.nonfree
x11-driver-video-nvidia-current-550.90.07-1.mga9.nonfree

9/x86_64/core-updates/testing

ldetect-lst-0.6.58-1.mga9
ldetect-lst-devel-0.6.58-1.mga9

9/i586/core-updates/testing

ldetect-lst-0.6.58-1.mga9
ldetect-lst-devel-0.6.58-1.mga9
Comment 5 Morgan Leijström 2024-06-23 11:47:56 CEST
Continuing testing from comment 1 after having been running nvidia470 a couple days.

Switching the nvidia driver using drakx11 while running the 6.6.28 server kernel: OK

Rebooted, used the system for some hours, then suspended it overnight.

After resuming from suspend, the desktop was very sluggish in responding when I switched between open applications.

Also the Plasma panel was unresponsive and not updating.

Switched to vt4 (Ctrl-alt-F4) and back and got a full hang: black screen with a frozen mouse pointer. It did not even react to REISUB.

I have seen (and reported) this before, but not with our previous version.

In the journal I see nothing suspicious except
jun 23 09:56:00 svarten.tribun DiscoverNotifier[28204]: KCrash: Attempting to start /usr/libexec/DiscoverNotifier
jun 23 09:56:00 svarten.tribun DiscoverNotifier[28204]: KCrash: Application 'DiscoverNotifier' crashing...
- which of course is a result of the desktop crash, not a cause.
Comment 6 Giuseppe Ghibò 2024-06-23 12:11:21 CEST
(In reply to Morgan Leijström from comment #5)

> Continuing testing from comment 1 after having been running nvidia470 a
> couple days.
> 
> Switching nvidia driver by using drakx11 on running 6.6.28 server kernel: OK
> 
> Rebooted, used the system for some hours, suspend over night.
> 
> After having resumed from suspend, desktop was very sluggish to respond to
> me switching between open applications.
> 
> Also the Plasma panel was unresponsive and not updating.
> 
> Switched to vt4 (Ctrl-alt-F4) and back and got full hang, black screen with
> frozen mouse pointer. Did not even react to REISUB.
> 
> I have seen (and reported) this before, but not with our previous version.
> 
> In journal i see nothing suspicious except
> jun 23 09:56:00 svarten.tribun DiscoverNotifier[28204]: KCrash: Attempting
> to start /usr/libexec/DiscoverNotifier
> jun 23 09:56:00 svarten.tribun DiscoverNotifier[28204]: KCrash: Application
> 'DiscoverNotifier' crashing...
> - which of course is a result of desktop crashed but not a cause.

So in your tests 550.90.07 doesn't show any problem and works as well as 550.78, while 470.256.02 seems worse than 470.239.06?
Comment 7 Morgan Leijström 2024-06-23 13:34:19 CEST
With 470 now in testing, I had a problem once but only with kernel 6.6.34 which we will not release.

550.90.07 (this bug) is the one I have so far had a problem with, once, after resuming from suspend, and that was with our released server kernel 6.6.28.

I am now switching to desktop kernel 6.6.28, still with nvidia 550.90.07.
Comment 8 Giuseppe Ghibò 2024-06-23 15:56:58 CEST
How exactly did you suspend to RAM?

a) from Plasma menu Power Session -> Sleep
b) using command "systemctl suspend" (as root)
c) using "echo mem > /sys/power/state" (as root)

If you are using a) or b), do you see any difference with respect to your problem when using c) instead?
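
To compare methods b) and c) over repeated cycles, something like the following sketch might help. It is only an illustration: the function name suspend_once and the DRY_RUN switch are made up here, and it assumes rtcwake (from util-linux) is available so the machine wakes up on its own. By default it only prints the commands it would run.

```shell
#!/bin/sh
# suspend_once METHOD [SECONDS]
# METHOD is "b" (systemctl suspend) or "c" (echo mem > /sys/power/state),
# matching the letters used in this comment. Method a) is GUI-only.
# Hypothetical helper: prints the commands unless DRY_RUN=0 is set.
suspend_once() {
    method=$1
    secs=${2:-60}
    case "$method" in
        b) cmd='systemctl suspend' ;;
        c) cmd='echo mem > /sys/power/state' ;;
        *) echo "usage: suspend_once b|c [seconds]" >&2; return 2 ;;
    esac
    if [ "${DRY_RUN:-1}" = 1 ]; then
        # Only show what would happen.
        echo "rtcwake -m no -s $secs"
        echo "$cmd"
    else
        # Arm the RTC alarm without suspending, then suspend (needs root).
        rtcwake -m no -s "$secs" && sh -c "$cmd"
    fi
}
```

Running "DRY_RUN=0 suspend_once c 90" as root would then suspend via method c) and wake the machine after about 90 seconds.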
Comment 9 Morgan Leijström 2024-06-23 19:13:35 CEST
Always used a).

Now tried a few cycles of a) and of c), no problem.
Only difference I see is that using c) there is no login after resume.
Tried both with desktop kernel 6.6.28.

Then I switched to server kernel flavour.
Made a few suspend-resume cycles, no problem.

Then tried after uninstalling nvidia-current-cuda-opencl, as it was not installed in either of the two cases: comment 5 and the other transient problem with nvidia470 on the non-release kernel.

Made a few cycles, no problem.

Then, suddenly, about 20 seconds after resuming OK, the screen briefly went black, then the app windows repainted, then the Plasma background (which had also been black for a moment). (Probably unrelated, but this was after suspend method c).)

In Journal:
jun 23 18:49:55 svarten.tribun kernel: QSGRenderThread[10628]: segfault at fe00000000d ip 00007f86093b7e5c sp 00007f85b7ffea80 error 4 in libQt5Quick.so.5.15.7[7f8609314000+2da000] likely on CPU 2 (core 0, socket 0)
jun 23 18:49:55 svarten.tribun kernel: Code: 89 f1 48 89 d6 66 0f 1f 84 00 00 00 00 00 48 8b 56 08 48 85 d2 74 1c 8b 41 04 83 7e 1c 01 0f 45 01 39 46 18 7f 28 48 8b 76 10 <48> 8b 56 08 48 85 d2 75 e4 0f b6 5e 20 84 db 74 09 c6 46 20 00 e8
 ---< a few normal post resume lines here >---
jun 23 18:49:56 svarten.tribun ksystemstats[66702]: Could not retrieve information for NVidia GPU "0000:07:00.0"

The system kept running OK anyway and still works OK while I write this, including the Plasma panel, launching apps, and switching virtual desktops.

"QSGRenderThread" does not appear in the journal except today, for the whole period since July 4.

So the problem seems to hit rarely and in a somewhat random way on my system.

The machine is pretty old, so it *may* be a hardware problem, and maybe the problem is more likely to hit when cold (after having slept), but generally hardware faults are more usual at temperature extremes...
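
A check like this can be repeated with a small filter over the journal text. This is a minimal sketch; the function name count_segfaults is made up for illustration, and it only looks for the kernel's "segfault at" wording seen in the lines above.

```shell
#!/bin/sh
# count_segfaults NAME
# Reads journal text on stdin and prints how many "segfault at" lines
# mention NAME (a process or thread name, matched as a fixed string).
# Hypothetical helper, not part of any package.
count_segfaults() {
    grep -F -- "$1" | grep -c 'segfault at'
}

# Example, to be run manually:
#   journalctl --no-pager | count_segfaults QSGRenderThread
```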

I do not think the problem on my system should hold up this bug.
Comment 10 Giuseppe Ghibò 2024-06-23 20:58:59 CEST
nvidia-current-cuda-opencl was pulled out of the deps long ago because it doesn't fit in the live ISOs, so it has to be installed manually.

Apart from the 550<->470 transition, what is weird is the hang on VT switching with 550.xx, which I thought had been left behind with a fix long ago. BTW, next time you switch 550<->470, try a full power-down cycle, not just a reboot.

As for the machine hardware, I also have an older kernel series (lowlatency flavour) based on 5.10.x, here (it uses the old version scheme, so it can be installed beside other kernels without always having to push the version forward; only *-desktop and *-desktop-devel are required for dkms building):

https://download.copr.fedorainfracloud.org/results/ghibo/mageia9-bonus/mageia-9-x86_64/07622029-kernel/

Sometimes with 5.10.x you don't get the same problems as with the latest series (it's just a way of splitting the problem). BTW, if 6.6.34 interferes when you upgrade or downgrade packages, try adding it (package by package) to /etc/urpmi/skip.list.
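
Adding entries to the skip list could look roughly like this. It is a sketch under assumptions: the helper name skip_package is invented, and the example pattern is only illustrative (skip.list takes one package name, or one regular expression enclosed in slashes, per line).

```shell
#!/bin/sh
# skip_package PATTERN [FILE]
# Appends PATTERN to the urpmi skip list unless the exact line is
# already there. FILE defaults to /etc/urpmi/skip.list (needs root).
# Hypothetical helper for illustration only.
skip_package() {
    pattern=$1
    file=${2:-/etc/urpmi/skip.list}
    grep -qxF -- "$pattern" "$file" 2>/dev/null ||
        printf '%s\n' "$pattern" >> "$file"
}

# Example, to be run manually as root (the pattern is an assumption):
#   skip_package '/^kernel-.*6\.6\.34/'
```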
Comment 11 Thomas Andrews 2024-06-24 03:29:11 CEST
MGA9-64 Plasma, i5-7500, Quadro K620, server and desktop 6.6.28 kernel.

Tested the 470 driver first, with no issues. Then, while in the desktop kernel, used MCC to switch to this nvidia-current. Rebooted into the server kernel, and the modules were built during the boot. No issues to report after the boot.

The production install on the same hardware, using the desktop kernel, was updated from the previous nvidia-current a couple of days ago. There have been no issues to report so far.

CC: (none) => andrewsfarm

Comment 12 Morgan Leijström 2024-06-24 11:45:40 CEST
(In reply to Giuseppe Ghibò from comment #10)
> Apart transition from/to 550<->470, what is weird are the hang on VT
> switching on 550-xx that I thought were left behind with a fix long ago.

Yes this is definitely a regression.
Tested again, and it hung when VT switching back to the Plasma desktop, even when showing no other problems after a resume.


Others testing nvidia drivers, could you also test vt switching?
i.e.
! Save work first: close the mail program and anything else that has user files open, etc.!
ctrl-alt-F4 (gives fullscreen text terminal)
ctrl-alt-F1 (some log)
ctrl-alt-F2 (back to Plasma desktop - here system hangs hard for me)
I believe some other desktop systems use login at -F2, desktop at -F3 or similar.
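
The same round trip can be scripted with chvt (from the kbd package), which may be handy when testing over ssh. This is only a sketch: the function name vt_roundtrip and the DRY_RUN switch are invented, and the VT numbers follow the sequence above, so the desktop VT may need adjusting on other setups.

```shell
#!/bin/sh
# vt_roundtrip [DESKTOP_VT]
# Switches to VT 4, then VT 1, then back to the desktop VT (default 2).
# Hypothetical helper: prints the commands unless DRY_RUN=0 is set.
vt_roundtrip() {
    desktop_vt=${1:-2}
    for vt in 4 1 "$desktop_vt"; do
        if [ "${DRY_RUN:-1}" = 1 ]; then
            echo "chvt $vt"
        else
            # chvt needs root; pause so each switch can settle.
            chvt "$vt" && sleep 3
        fi
    done
}
```

Save all work first, then run "DRY_RUN=0 vt_roundtrip 2" as root.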


> BTW, during the transition 550<->470 next times try to do a power down
> cycle, not just reboot.

OK

Will try that too, and vt switching on 470.

Later - Got to work...

And backup files before playing much...
Pity that the only system I have capable of running nvidia drivers is my work computer...


> I've also an old series (in the lowlatency) based on 5.10.x series

I may try your low latency kernels like I have earlier, if I experience more problems. But generally I like to keep to Mageia standard for QA purposes.
Comment 13 Thomas Andrews 2024-06-24 15:53:02 CEST
Quadro K620 here, and the cuda package is NOT installed. If the default is to not have it, then we must test without it as well as with it. I have never really had any reason to miss it, that I know of. 

Asus Prime Q270M-C motherboard, latest UEFI firmware installed, with an i5-7500 processor, 48GB of RAM, two M.2 SSDs. Logitech K330 keyboard and M325 mouse, using the Unifying receiver.

I don't remember ever using these commands before, so I have no idea what is *supposed* to happen with this hardware.

ctrl-alt-F4 gives me the terminal.
ctrl-alt-F1 takes me back to the desktop, with a notification from kwin that desktop effects were restarted due to a graphics reset.
ctrl-alt-F2 from the desktop gives me the terminal again.
ctrl-alt-F1 takes me back to the desktop.
ctrl-alt-F3 from the desktop also gives me the terminal.

Nothing seems to lock up.
Comment 14 Thomas Andrews 2024-06-24 16:19:27 CEST
Same hardware, another install, this time I installed the cuda package.

No difference in the response to the commands. (I didn't expect one, but checked, anyway) I even logged in as "tom" and tried a command before issuing the ctrl-alt-F1 command, and it dropped me back to the desktop as if I had never left.
Comment 15 Morgan Leijström 2024-06-24 16:30:07 CEST
Thank you Thomas.

Good to see that so far it is only my system that makes this problem show.
Also, for the earlier versions which had this problem on my hardware, no one else reported it.

Switching VTs is not something ordinary users do.

If we do not see more testers soon, I think we can send this out, as well as the 470.
Comment 16 Giuseppe Ghibò 2024-06-24 16:36:01 CEST
(In reply to Morgan Leijström from comment #12)

> I may try your low latency kernels like I have earlier, if I experience more
> problems. But generally I like to keep to Mageia standard for QA purposes.

The idea behind suggesting 5.10.x is just to have a further alternative for filtering out possible hardware degradation problems (e.g. overheating, bad memory, etc.), beyond buggy firmware, buggy BIOS, etc.

In your case, did you change the hardware topology recently? E.g. by adding a new device, or moving a card between slots, e.g. a USB xhci_hcd device, that might cause a lockup on suspend?
Comment 17 Morgan Leijström 2024-06-24 16:46:18 CEST
(In reply to Giuseppe Ghibò from comment #16)
> The idea of the 5.10.x

OK, will try later

>  did you changed the hardware

No hardware at all changed for over a year.
Comment 18 Morgan Leijström 2024-06-25 19:19:17 CEST
No change in HW or SW since last comment.
Today I went to a customer, and three hours later came back and resumed the computer, only to see a black screen.
This too is an old issue that I have not seen with the last couple of nvidia releases.
Sometimes when it happened earlier, it was only the screen that did not wake up, but now, like sometimes before, I had to REISUB to make it reboot back to life.
It was not a clean shutdown though; there was no logging between suspend and shutdown, and it performed file system checks.

--

Now I have changed to nvidia470 using drakx11, and have not installed -cuda-opencl. Shut down, and started again.

Let's see how 470 works for a few days.

Later I will try the lowlatency kernel.
Comment 19 Giuseppe Ghibò 2024-06-25 20:43:55 CEST
Remember to add module_blacklist=nouveau to /etc/default/grub, and run update-grub.
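
The edit to /etc/default/grub could be done by hand or with a small helper like the one below. This is a sketch under assumptions: the function name add_kernel_param is made up, it relies on GNU sed's -i option, and it expects a double-quoted GRUB_CMDLINE_LINUX_DEFAULT line. The boot configuration still has to be regenerated afterwards.

```shell
#!/bin/sh
# add_kernel_param PARAM [FILE]
# Appends PARAM to GRUB_CMDLINE_LINUX_DEFAULT in FILE (default
# /etc/default/grub) unless it is already present.
# Hypothetical helper; PARAM must not contain '|', '&' or '\'.
add_kernel_param() {
    param=$1
    file=${2:-/etc/default/grub}
    # Already on the command line: nothing to do.
    grep '^GRUB_CMDLINE_LINUX_DEFAULT=' "$file" | grep -qF -- "$param" && return 0
    # Insert the parameter just before the closing double quote.
    sed -i "s|^\(GRUB_CMDLINE_LINUX_DEFAULT=\"[^\"]*\)\"|\1 $param\"|" "$file"
}

# Example, to be run manually as root:
#   add_kernel_param module_blacklist=nouveau
```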

What does cat /proc/acpi/wakeup return?
Comment 20 Morgan Leijström 2024-06-25 22:00:42 CEST
Why should we need to manually add module_blacklist=nouveau to /etc/default/grub and run update-grub?

(I have not added it)

$ cat /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="noiswmd nokmsboot resume=/dev/vg-mga/lv_swap audit=0 vga=794"
GRUB_DEFAULT=saved
GRUB_DISABLE_OS_PROBER=false
GRUB_DISABLE_RECOVERY=false
GRUB_DISABLE_SUBMENU=n
GRUB_DISTRIBUTOR=Mageia
GRUB_ENABLE_CRYPTODISK=y
GRUB_GFXMODE=1024x768x32
GRUB_GFXPAYLOAD_LINUX=text
GRUB_SAVEDEFAULT=true
GRUB_TERMINAL_OUTPUT=gfxterm
GRUB_THEME=/boot/grub2/themes/maggy/theme.txt
GRUB_TIMEOUT=5

$ cat /proc/acpi/wakeup
Device	S-state	  Status   Sysfs node
P0P1	  S4	*disabled
P0P3	  S4	*disabled  pci:0000:00:03.0
P0P4	  S4	*disabled
P0P5	  S4	*disabled
P0P6	  S4	*disabled
BR1E	  S4	*disabled  pci:0000:00:1e.0
PS2K	  S4	*enabled   pnp:00:05
		*disabled  serio:serio0
PS2M	  S4	*disabled
UAR1	  S4	*disabled  pnp:00:06
EUSB	  S4	*disabled
USB0	  S4	*enabled   pci:0000:00:1d.0
USB1	  S4	*disabled
USB2	  S4	*disabled
USB3	  S4	*disabled
USBE	  S4	*disabled
USB4	  S4	*enabled   pci:0000:00:1a.0
USB5	  S4	*disabled
USB6	  S4	*disabled
BR20	  S4	*disabled  pci:0000:00:1c.0
BR21	  S4	*disabled  pci:0000:00:1c.1
BR22	  S4	*disabled  pci:0000:00:1c.2
BR23	  S4	*disabled  pci:0000:00:1c.3
BR24	  S4	*disabled  pci:0000:00:1c.4
BR25	  S4	*disabled
BR26	  S4	*disabled
BR27	  S4	*disabled
SLPB	  S4	*disabled
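
To pick out only the devices that are allowed to wake the machine from a listing like the one above, a one-line awk filter is enough. This is a minimal sketch; the function name enabled_wakeup_devices is invented for illustration.

```shell
#!/bin/sh
# enabled_wakeup_devices [FILE]
# Prints the name of every device whose status column is "*enabled".
# FILE defaults to /proc/acpi/wakeup. Continuation lines (those that
# start with the status column) are ignored because their third
# field is empty. Hypothetical helper.
enabled_wakeup_devices() {
    awk '$3 == "*enabled" { print $1 }' "${1:-/proc/acpi/wakeup}"
}
```

For the listing above this would print PS2K, USB0 and USB4. Writing a device name to /proc/acpi/wakeup as root (for example "echo USB0 > /proc/acpi/wakeup") toggles its state, which may be worth trying when chasing resume problems.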
Comment 21 Thomas Andrews 2024-06-26 00:54:54 CEST
It's been a while since I dealt with this, but if I recall correctly that's what the nokmsboot is for.
Comment 22 Morgan Leijström 2024-06-26 09:37:25 CEST
This is the boot command line as taken from journal:

jun 25 18:50:05 svarten.tribun kernel: Command line: BOOT_IMAGE=/vmlinuz-6.6.28-desktop-1.mga9 root=/dev/mapper/vg--mga-lv_root ro noiswmd nokmsboot resume=/dev/vg-mga/lv_swap audit=0 vga=794
Comment 23 Morgan Leijström 2024-06-26 19:24:21 CEST
(In reply to Giuseppe Ghibò from comment #10)
> https://download.copr.fedorainfracloud.org/results/ghibo/mageia9-bonus/
> mageia-9-x86_64/07622029-kernel/

Which version of cpupower and lib64bpf1 to use for that kernel?
- Should I just keep the 6.6.28 versions? 

So I only need three packages of that 5.10.219-2 ?
kernel-desktop, kernel-desktop-devel, kernel-userspace-headers
or not even -headers ?
Comment 24 Giuseppe Ghibò 2024-06-26 21:49:19 CEST
(In reply to Morgan Leijström from comment #23)
> (In reply to Giuseppe Ghibò from comment #10)
> > https://download.copr.fedorainfracloud.org/results/ghibo/mageia9-bonus/
> > mageia-9-x86_64/07622029-kernel/
> 
> Which version of cpupower and lib64bpf1 to use for that kernel?
> - Should I just keep the 6.6.28 versions? 

yes, keep those 6.6.28 versions.

> 
> So I only need three packages of that 5.10.219-2 ?

only 2.

> kernel-desktop, kernel-desktop-devel, kernel-userspace-headers
> or not even -headers ?

not even headers, only kernel-desktop and kernel-desktop-devel (-devel is for dkms building).
Comment 25 Brian Rockwell 2024-06-27 04:20:24 CEST
MGA9-64, Plasma, Ryzen 5600, Nvidia 1050

The following 4 packages are going to be installed:

- dkms-nvidia-current-550.90.07-1.mga9.nonfree.x86_64
- nvidia-current-cuda-opencl-550.90.07-1.mga9.nonfree.x86_64
- nvidia-current-utils-550.90.07-1.mga9.nonfree.x86_64
- x11-driver-video-nvidia-current-550.90.07-1.mga9.nonfree.x86_64

1.1MB of additional disk space will be used.



--rebooted
$ nvidia-smi
Wed Jun 26 20:06:08 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce GTX 1050        Off |   00000000:05:00.0  On |                  N/A |
| 45%   32C    P0             N/A /   75W |     395MiB /   2048MiB |      3%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      1941      G   /usr/libexec/Xorg                             122MiB |
|    0   N/A  N/A      2234      G   /usr/bin/kwalletd5                              1MiB |
|    0   N/A  N/A      2390      G   /usr/bin/ksmserver                              1MiB |
|    0   N/A  N/A      2392      G   /usr/bin/kded5                                  1MiB |
|    0   N/A  N/A      2393      G   /usr/bin/kwin_x11                              39MiB |
|    0   N/A  N/A      2472      G   /usr/bin/plasmashell                           21MiB |
|    0   N/A  N/A      2500      G   ...c/polkit-kde-authentication-agent-1          1MiB |
|    0   N/A  N/A      2503      G   /usr/libexec/xdg-desktop-portal-kde             1MiB |
|    0   N/A  N/A      2568      G   /usr/bin/nextcloud                              7MiB |
|    0   N/A  N/A      2570      G   /usr/bin/python3                                1MiB |
|    0   N/A  N/A      2579      G   /usr/libexec/kdeconnectd                        1MiB |
|    0   N/A  N/A      2585      G   /usr/bin/kaccess                                1MiB |
|    0   N/A  N/A      2591      G   /usr/bin/kalendarac                             1MiB |
|    0   N/A  N/A      2653      G   /usr/bin/akonadi_control                        1MiB |
|    0   N/A  N/A      2719      G   /usr/bin/akonadi_akonotes_resource              1MiB |
|    0   N/A  N/A      2720      G   /usr/bin/akonadi_archivemail_agent              1MiB |
|    0   N/A  N/A      2721      G   /usr/bin/akonadi_birthdays_resource             1MiB |
|    0   N/A  N/A      2722      G   /usr/bin/akonadi_contacts_resource              1MiB |
|    0   N/A  N/A      2723      G   .../bin/akonadi_followupreminder_agent          1MiB |
|    0   N/A  N/A      2724      G   /usr/bin/akonadi_ical_resource                  1MiB |
|    0   N/A  N/A      2733      G   /usr/bin/akonadi_indexing_agent                 1MiB |
|    0   N/A  N/A      2734      G   /usr/bin/akonadi_maildir_resource               1MiB |
|    0   N/A  N/A      2737      G   /usr/bin/akonadi_maildispatcher_agent           1MiB |
|    0   N/A  N/A      2738      G   /usr/bin/akonadi_mailfilter_agent               1MiB |
|    0   N/A  N/A      2740      G   /usr/bin/akonadi_mailmerge_agent                1MiB |
|    0   N/A  N/A      2743      G   /usr/bin/akonadi_migration_agent                1MiB |
|    0   N/A  N/A      2748      G   /usr/bin/akonadi_newmailnotifier_agent          1MiB |
|    0   N/A  N/A      2749      G   /usr/bin/akonadi_notes_agent                    1MiB |
|    0   N/A  N/A      2751      G   /usr/bin/akonadi_sendlater_agent                1MiB |
|    0   N/A  N/A      2753      G   /usr/bin/akonadi_unifiedmailbox_agent           1MiB |
|    0   N/A  N/A     13823      G   /usr/bin/firefox                              135MiB |
|    0   N/A  N/A     13949      G   /usr/lib/mozilla/kmozillahelper                 1MiB |
|    0   N/A  N/A     15764      G   /usr/libexec/baloorunner                        1MiB |
|    0   N/A  N/A     15770      G   /usr/bin/konsole                                1MiB |
+-----------------------------------------------------------------------------------------+

firefox working as expected including videos
libreoffice painting properly

working as expected.
Comment 26 Morgan Leijström 2024-06-29 17:44:26 CEST
1)

I have been using nvidia470-470.256.02-1 Bug 33317 again for a while with the 6.6.28 desktop kernel and can say it does not show the problem that I see for nvidia-current in this bug.  470 never showed any problem on this system.  Tested vt switching, several short suspend-resume cycles, and also suspend overnight OK.

I also briefly tested it OK on kernel 5.10.219-desktop-2.lowlatency.500hz.mga9, comment 24.


2) 

With the lowlatency kernel I have not had a hard hang, and fewer problems overall.
But once, after vt switching and then suspend-resume, the desktop did not come up after logging in, and after a while the system shut down... ??

---

I see kernel 6.6.36 building. Ready to test?

---

Weird glitch

One weird thing, which is not to be taken further in this bug report (especially as I was running a non-official kernel): I was running that lowlatency kernel and nvidia470, and used drakx11 to switch to nvidia-current-550.90.07-1. Then, when shutting down the system, it dropped to a debug shell. Some interesting lines:
 dm_log dm_mod [last unloaded: vboxdrv]
...
 /shutdown: line 154: 312843 Killed        $ACTION -f -n
...
 dracut warning: poweroff failed!

I have since rebooted twice, no problem.
Hmmm
Comment 27 Thomas Andrews 2024-06-29 23:08:57 CEST
I'm a bit lost. I've been using this for several days with kernel 6.6.28 with zero problems. Brian tested with nothing out of the way to report.

Morgan, it is your issues that confuse me. If I'm reading correctly, your system works OK with the 470 driver, but has an intermittent problem with 550.90. Have you been able to determine if the problem is being caused by a glitch in your particular piece of aging hardware, or by the driver?

Meaning, should we continue to hold this back, or let it go?
Comment 28 Morgan Leijström 2024-06-30 10:18:14 CEST
IMO let it go.

The problem has shown up before, then was gone for a couple of versions, and now it is back. 470 had the problem too some versions ago.
So I do not think it is *aging* hardware, but a rarely showing bug in the nvidia driver/chip *combination*, maybe in conjunction with kernel, cpu, other hardware...

Pity we are so few testing compared to the myriad of chip / other hardware / kernel combinations possible.
Morgan Leijström 2024-06-30 10:37:27 CEST

Whiteboard: (none) => MGA9-64-OK
Keywords: (none) => validated_update
CC: (none) => sysadmin-bugs

Comment 29 Giuseppe Ghibò 2024-06-30 11:36:38 CEST
(In reply to Morgan Leijström from comment #26)

> ---
> 
> I see kernel 6.6.36 building. Ready to test?
> 
> ---

6.6.36-2 should be the good one. kmod-virtualbox and xtables-addons still have to be rebuilt once the build has finished.
Comment 30 Morgan Leijström 2024-06-30 13:56:40 CEST
(In reply to Thomas Andrews from comment #27)
> I've been using this for several days with kernel 6.6.28
> with zero problems. Brian tested with nothing out of the way to report.

I do not see any other reports than mine on vt switching nor suspend-resume.
And no report (including myself) on hibernation.

These are serious problems when they occur as work may not have been saved.

Other than in the tests mentioned, this nvidia-current performs well on my system "svarten" as well.

I have never seen any report on forum, nor other testers, so my system may have some kind of unusually bad luck of design in this respect.

---

Also nvidia-newfeature 555.52.04-1.mga9 (fresh in testing repo today) has problems with kernel 6.6.28, both desktop and linus flavours tested.

---

No hard hang with any of the three nvidia drivers with 5.10.219-desktop-2.lowlatency.500hz.mga9. Got a repaint problem once after resume, but it fixed itself after a window focus change.

---

In the last tests, kernel 5.10.219-desktop-2.lowlatency.500hz.mga9 *always* drops to a debug shell (Comment 26, last part) when trying to reboot or shut off.
Easy to REISUB from there.
Does it need the correct version of cpupower or another package?

---

(In reply to Giuseppe Ghibò from comment #29)
> (In reply to Morgan Leijström from comment #26)
> > I see kernel 6.6.36 building. Ready to test?
> 
> 6.6.36-2 should be the good one. It still have to rebuild the
> kmod-virtualbox and xtables-addons when build finished.

Anyway, it is good to test that the local vbox kmod build is working.

Next, I will try desktop kernel 6.6.36-2 with nvidia-current-550.90.07-1
Comment 31 Giuseppe Ghibò 2024-06-30 14:28:29 CEST
(In reply to Morgan Leijström from comment #30)

> (In reply to Thomas Andrews from comment #27)
> > I've been using this for several days with kernel 6.6.28
> > with zero problems. Brian tested with nothing out of the way to report.
> 
> I do not see any other reports than mine on vt switching nor suspend-resume.
> And no report (including myself) on hibernation.
> 
> These are serious problems when they occur as work may not have been saved.
> 
> Other than mentioned tests this nvidia-current perform well on my system
> "svarten" as well.
> 
> I have never seen any report on forum, nor other testers, so my system may
> have some kind of unusually bad luck of design in this respect.
> 
> ---
> 
> Also nvidia-newfeature 555.52.04-1.mga9 (fresh in testing repo today) have
> problems with kernel 6.6.28, both desktop and linus flavours tested.
> 
> ---
> 
> No hard hang with either of the three nvidia drivers with
> 5.10.219-desktop-2.lowlatency.500hz.mga9. Got repaint problem once after
> resume but fixed itself after window focus change.
> 

there are also 5.10.220-2.ll and 6.1.95-2.ll from the same source.

> ---
> 
> kernel 5.10.219-desktop-2.lowlatency.500hz.mga9 last tests *always* drop to
> debug shell (Comment 26 last part) when trying reboot or shut off.
> Easy to REISUB from there.
> Do it need correct version of cpupower or other package?
> 

no, keep the one bundled with the latest 6.6.36-2.mga, if that one is installed.

IMO it is not the aging of the hardware; it is probably more related to the motherboard (maybe by hw design or buggy firmware) than to the gfx card. Could it also be that some particular BIOS setting got lost during a reset, power outage, etc.?

In the past we bisected the problem down to the floppy controller being responsible (or co-responsible) for problems with resume from suspend, but now there are probably others.

Maybe, once the device or the bus affecting it is found, one could try to unbind that PCI device from its driver with a simple command like: echo "<pci-id>" > /sys/bus/pci/drivers/<driver>/unbind

Other attempts could be to see whether one could debug more with an RS232 or a crossed USB active cable (which is not easy to find, or to build yourself), or alternatively to get another used socket 1156 motherboard in the 20-30 EUR price range.
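
As a sketch of that unbind step: the helper below builds the sysfs path and, by default, only prints the command. The name pci_unbind and the DRY_RUN switch are made up, and the driver name in the example (ehci-pci for 0000:00:1d.0) is purely an assumption; the real driver can be read from lspci -k.

```shell
#!/bin/sh
# pci_unbind DRIVER PCI_ADDRESS
# Detaches one PCI device from its driver through sysfs.
# Hypothetical helper: prints the command unless DRY_RUN=0 is set.
pci_unbind() {
    driver=$1
    addr=$2
    node="/sys/bus/pci/drivers/$driver/unbind"
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "echo $addr > $node"
    else
        # Needs root; the device stays unbound until rebind or reboot.
        echo "$addr" > "$node"
    fi
}

# Example, to be run manually (driver name is an assumption):
#   DRY_RUN=0 pci_unbind ehci-pci 0000:00:1d.0
```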
Comment 32 Thomas Andrews 2024-06-30 17:58:54 CEST
(In reply to Morgan Leijström from comment #30)
> (In reply to Thomas Andrews from comment #27)
> > I've been using this for several days with kernel 6.6.28
> > with zero problems. Brian tested with nothing out of the way to report.
> 
> I do not see any other reports than mine on vt switching nor suspend-resume.
> And no report (including myself) on hibernation.
> 
> These are serious problems when they occur as work may not have been saved.
> 
> Other than mentioned tests this nvidia-current perform well on my system
> "svarten" as well.
> 
> I have never seen any report on forum, nor other testers, so my system may
> have some kind of unusually bad luck of design in this respect.
> 
Suspend/hibernate/resume in Mageia has never been consistent for me. I never use them with my desktops, so don't think of it, but I have used them, or tried to, with my laptops. With varying results - it has worked in the past but not for a while - and none of them have nvidia gpus. 

My belief is that each motherboard/firmware is enough different that one-size-fits-all solutions just don't work.
Comment 33 Giuseppe Ghibò 2024-07-01 00:16:58 CEST
We have to consider that several nvidia driver versions have run on your motherboard since mga9:

- 535.154.05
- 550.54.14
- 550.67
- 550.76
- 550.78
- 550.90.07

- 470.199.02
- 470.239.06
- 470.256.02

So for each version, beyond all the fixes we could have added on our side, there was something of a "compatibility" matrix (and the same holds upstream). But problems of this kind affect other distros too.

So, to summarize what Thomas said, 470.256.02 seems to be the most stable series on your motherboard with respect to suspend/resume with any kernel.

555.54.02 would probably show the same problems as 550.90.xx.

Do you get the same problems with hibernation instead of suspend? When the system fails to resume the video correctly, is the host still accessible via ethernet/ssh? If not, does it at least answer pings?
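
Those checks could be scripted from a second machine roughly like this. It is only a sketch, assuming iputils `ping` and OpenSSH on the second machine, with key-based login to the test host already set up. The function names and the host name in the usage note are hypothetical.

```shell
#!/bin/sh
# Sketch only: classify how alive a machine is after a failed resume.
# Meant to be run from another machine on the same network.

# True if the host answers ping (3 requests, 2 s wait per reply).
probe_ping() {
    ping -c 3 -W 2 "$1" >/dev/null 2>&1
}

# True if a non-interactive ssh login can run a trivial command.
probe_ssh() {
    ssh -o BatchMode=yes -o ConnectTimeout=5 "$1" true >/dev/null 2>&1
}

classify_host() {
    host="$1"
    if probe_ssh "$host"; then
        echo "ssh-ok"        # userspace is alive, only the display is stuck
    elif probe_ping "$host"; then
        echo "ping-only"     # network stack answers, userspace may be hung
    else
        echo "unreachable"   # hard hang, or the network did not come back
    fi
}
```

For example, `classify_host svarten.lan` right after a resume that ends in a black screen. An "ssh-ok" result would point towards the graphics side rather than a full system hang, though that reading is an assumption and not a diagnosis.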
Comment 34 Thomas Andrews 2024-07-01 02:27:18 CEST
(In reply to Thomas Andrews from comment #27)
> I've been using this for several days with kernel 6.6.28
> with zero problems. Brian tested with nothing out of the way to report.
>
>I do not see any other reports than mine on vt switching nor suspend-resume.
>And no report (including myself) on hibernation.

Against my better judgement, I booted into my test install with the server kernel, Plasma, Asus Prime Q270M-C motherboard, and nvidia Quadro K620, opened a few apps, and put it into hibernation using the "hibernate" selection of the logout menu. The LED showed hard drive access, then it shut down.

I waited a few minutes, then hit the desktop's power button. The power LED lit, and that's it. No signal to the monitor, no hard drive activity, no POST. Nothing. I gave it a couple of minutes, no difference. I shut down by holding the power button, waited again, tried again, same result.

Panic tried to ensue. Was my best hardware now a doorstop? I fought it back. Not knowing what else to try, I removed power from all the hardware by using the switch on the power strip. (When all else fails...) I waited until all LEDs went out, card reader, monitor, etc. Then waited another 30 seconds, restored power, and hit the desktop's power switch again. The POST appeared, then rEFInd, with the test install selected. Enter brought up the desktop, with all applications restored to their former states. WHEW!

Reaching deep within for courage, I repeated the test, with different apps open. It acted exactly the same. If I used hibernate from the Plasma logout, I had to remove all power from the system to get it back.

BUT IT DOES COME BACK. Morgan, hibernation works with the 550.90 driver and the server kernel on my hardware, if the user can avoid panicking.
Comment 35 Morgan Leijström 2024-07-01 09:20:38 CEST
@Thomas, good that you tested. I know the angst.
That you had to remove power mechanically is a sign that it did not power off completely when entering hibernation. I have a similar problem with my Thinkpad T510: I have to hold down the power button after the disk lamp has stopped and the power lamp starts flashing, indicating a kernel panic.  At power on it resumes correctly, so it is just the power off that fails. Why, I do not know how to investigate, as it had already shut off display and logging... That machine does not use nvidia drivers - too old in the nvidia world.

I think we should test if hibernation with free drivers works before trying nvidia as added complication.
- But I am inclined to test that dare game too now.

I am thinking it is my main board that has some nonstandard quirk and should not be used for this testing, but as we are too few testers, I go on... I do not have much more time and motivation for this though.

nvidia 555.54.01 is the worst so far regarding suspend-resume - it always fails with kernels 6.6.28 and .34 (not with 5.10.219-desktop-2.lowlatency).  It is also the only failure that returns with a text screen - looking like the system journal, but it is not saved as such. Odd.

Hm, now I see new minor version nvidia 555.54.*02* got built.

Yes, 470 has always seemed more reliable than 5xx on my "svarten".

Last testing: kernel-desktop-6.6.36-2.mga9 + nvidia-current 550.90.07-1.mga9
I had altered two settings in BIOS related to suspend (forgot the names...) but I do not see a difference, I think.  With this kernel and driver, it is back to the situation I experienced months ago: it succeeds in resuming after a short suspend sleep, but if I wait hours it resumes to a black screen.  I issued the "REI" part of REISUB, and the sddm login appeared and I could log in, but had no network. Hm, maybe this is the strange monitor state where I historically could also have power cycled the monitor.

It is a mystery that this combination fails to resume after hour(s?) of sleep but not after a minute's sleep.  Due to a timeout into an undocumented deep sleep in the monitor or graphics card?  A time/date jump in the system?  Chip temperature? - no, it works from a cold start...

Too many parameters for effective testing.

Anyway, just now I went into BIOS and disabled the floppy driver, to see if that goes better. Will see next morning. 6.6.36-2 + 550.90.07-1
Comment 36 Morgan Leijström 2024-07-01 09:46:46 CEST
Follow-up to the paragraph "Last testing" in Comment 35:
I let the system sleep during coffee break...
 (actually I went eating red currants directly from my plant :) )
...and now when I got back, the computer woke up to a black screen.
I power cycled the monitor and could log in to the restored desktop OK.
So no hang, and I recognise the situation from half a year or so ago.
Now with desktop kernel 6.6.36-2 + nvidia 550.90.07-1.
Comment 37 Morgan Leijström 2024-07-01 09:51:38 CEST
Hibernation test fully OK on my "svarten".
desktop kernel 6.6.36-2 + nvidia 550.90.07-1.
Comment 38 Morgan Leijström 2024-07-01 12:57:58 CEST
* Versions not in this bug but tests relevant for comparison. *


6.6.36-desktop-2.mga9 + nvidia 555.54.02
suspend-resume goes to a text screen which does not change whatever I press.
Notable successive lines in that text:
 note: irq/33-nvidia[6910] exited with irqs disabled
 Fixing recursive fault but reboot is needed!
 BUG: scheduling while atomic: irq/33-nvidia/6910/0x00000000
Issued REISUB; nothing visibly happened until the "B": black screen, then reboot.
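
If a later failure does leave a log behind (for example in a persistent journal or a serial console capture), messages like the three quoted above could be picked out with a small filter. This is a sketch only: the function name `filter_nvidia_faults` is hypothetical, and the patterns are just the strings seen on the text screen, not a complete list of relevant kernel messages.

```shell
#!/bin/sh
# Sketch only: keep kernel log lines that resemble the faults seen on the
# text screen. Reads log text on stdin, prints matching lines on stdout.
filter_nvidia_faults() {
    grep -E 'exited with irqs disabled|Fixing recursive fault|BUG: scheduling while atomic'
}
```

One possible use is `journalctl -k -b -1 | filter_nvidia_faults`, which reads the kernel messages of the previous boot. That only helps if the journal is stored persistently and the messages reached the disk before the reboot.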

---

(In reply to Morgan Leijström from comment #35)
> nvidia 555.54.01 is worst so far regarding suspend-resume - it always fail
> with kernels 6.6.28 and .34 (not with 5.10.219-desktop-2.lowlatency).  It is
> also the only fail that returns with a text screen - looking like system
> journal but it is not saved as such. odd.

I was now running nvidia 555.54.02 + 5.10.219-desktop-2.lowlatency, and resuming after lunch went to a black screen; power cycling the monitor did not help, so I had to do a full REISUB.
Comment 39 Thomas Andrews 2024-07-01 14:07:47 CEST
I think the hibernate/suspend discussion deserves a bug of its own, rather than here. This update has already been validated.

This ongoing problem has possible sources other than nvidia video drivers. My AMD-based Pavilion, for example, will suspend by closing the lid and resume like a champ, but try to hibernate and it reboots to a new session every time. With no nvidia hardware, nvidia drivers can't be the issue. 

My Intel-based Probook 6550b has also had problems in this area off and on over the years, some more serious than others. The latest one generated intermittent boot failures that didn't go away until I re-installed Mageia. No nvidia there, either.
Comment 40 Mageia Robot 2024-07-01 19:55:03 CEST
An update for this issue has been pushed to the Mageia Updates repository.

https://advisories.mageia.org/MGAA-2024-0154.html

Resolution: (none) => FIXED
Status: NEW => RESOLVED

Comment 41 Morgan Leijström 2024-07-02 10:34:53 CEST
The suspend-resume issue I see on my system "svarten" seems to be tied to the nvidia driver.  For some versions it is more or less mitigated by using a non-official kernel.

But using the free drivers nouveau or modesetting, there is no problem with any kernel.
This I have now verified also with kernel-desktop-6.6.36-2.mga9.

Currently also no problem when using latest nvidia470, or previous 470 and -current.

So the problem that shows up on my system is reintroduced by the version released in this bug. I have now also tested that it fails with kernel-desktop-6.1.95-2.lowlatency.mga9-1-1.  The nvidia newfeature 555.54.02 also fails as reported, now also tested with linus 6.6.281, desktop 6.6.36-2, desktop 6.1.95-2.lowlatency.

The reason I do not oppose 550.90.07-1 being released is that we have released several versions before, incl. the mga9 release version, with this problem, and I have not seen any other user complaining.  It seems so rare that I do not even think it should be in Errata until we see at least one more user hitting this.

Still, we should fight it: we know that very few of the users who experience problems report them, so we do not know how many are affected.

As far as Mageia is concerned, the problem I see is most closely tied to the nvidia driver, possibly in combination with the nvidia chip, implementation, main board, and kernel.
But I see no better place to handle this in our bugzilla than under the nvidia drivers.  So next up is probably reporting it in a coming nvidia-newfeature update bug.

For other suspend/hibernation problems, yes, separate bugs, probably to be assigned to kernel/driver maintainers. Like I had for a specific laptop in Bug 22804. That did not itself help, but some years later the problem is gone.  Similarly for another laptop, Bug 32122.  I know I have also commented on other laptops and maybe some stationary machines.  The general feeling is that support for suspend and hibernation has substantially improved over the last year, after previously having regressed.
Comment 42 Morgan Leijström 2024-07-02 12:12:13 CEST
Updated
https://wiki.mageia.org/en/Setup_the_graphical_server#Known_Nvidia_issues
which is linked from Errata.
Comment 43 Morgan Leijström 2024-07-08 23:14:43 CEST
The problem of my system svarten hanging on resume from suspend with this nvidia-current-550.90.07-1 seems to be resolved by kernels 6.6.37-1 :)
