31695 – Kernel 6.2+ regression resuming from suspend & vt switching problems on some nvidia systems - works with free drivers or kernel linus


Status:	RESOLVED FIXED

Alias:	None

Product:	Mageia
Classification:	Unclassified
Component:	RPM Packages
Version:	Cauldron
Hardware:	All Linux

Priority:	Normal
Severity:	major
Target Milestone:	---
Assignee:	Kernel and Drivers maintainers
QA Contact:

URL:
Whiteboard:
Keywords:	IN_ERRATA9

Duplicates (1):	32541
Depends on:
Blocks:

Reported:	2023-03-18 00:28 CET by Morgan Leijström
Modified:	2024-04-21 23:37 CEST
CC List:	4 users

See Also:
Source RPM:
CVE:
Status comment:	Depends on patch see c42, and absent diskette drive c58

Attachments
Journal from booting same kernel after switcing nvidia 535 to 545 (25.38 KB, application/x-xz) 2024-02-10 14:15 CET, Morgan Leijström

Description Morgan Leijström 2023-03-18 00:28:04 CET

Description of problem:
Resuming from suspend, I only see a mouse pointer on black screen.
I can Ctrl-Alt-F2 and log in as root and reboot

Plasma, nouveau

Kernel versions:
  Fail: 6.2.2-2, 6.2.6-1
  OK: 6.1.12-2, 6.1.14-1

How reproducible: Every try

Excerpt from journal:
mar 17 21:13:51 localhost plasmashell[4088]: Aborting shell load: The activity manager daemon (kactivitymanagerd) is not running.
mar 17 21:13:51 localhost plasmashell[4088]: If this Plasma has been installed into a custom prefix, verify that its D-Bus services dir is known to the system for>
mar 17 21:13:51 localhost kernel: ACPI: \_SB_.PCI0.LPC_.EC__.HKEY: BCSG: evaluate failed
mar 17 21:13:51 localhost kernel: ------------[ cut here ]------------
mar 17 21:13:51 localhost kernel: irq 26 handler nvkm_intr+0x0/0x240 [nouveau] enabled interrupts
mar 17 21:13:51 localhost kernel: WARNING: CPU: 2 PID: 3547 at kernel/irq/handle.c:161 __handle_irq_event_percpu+0x153/0x1a0
mar 17 21:13:51 localhost kernel: Modules linked in: rfcomm ip6t_REJECT nf_reject_ipv6 xt_comment ip6table_mangle ip6table_nat ip6table_raw ip6table_filter ip6_ta>
mar 17 21:13:51 localhost kernel:  btbcm btmtk videobuf2_common btrtl mei_wdt btintel mc kvm bluetooth snd_ctl_led irqbypass snd_hda_codec_conexant ecdh_generic e>
mar 17 21:13:51 localhost kernel:  drm_kms_helper mxm_wmi video wmi drm cec dm_mirror dm_region_hash dm_log dm_mod
mar 17 21:13:51 localhost kernel: CPU: 2 PID: 3547 Comm: Xorg Not tainted 6.2.6-desktop-1.mga9 #1
mar 17 21:13:51 localhost kernel: Hardware name: LENOVO 4349A13/4349A13, BIOS 6MET92WW (1.52 ) 09/26/2012
mar 17 21:13:51 localhost kernel: RIP: 0010:__handle_irq_event_percpu+0x153/0x1a0

Tell what info I shall look for.
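
For anyone collecting similar information, here is a minimal sketch for pulling just the kernel warning block out of saved journal text. It assumes the journal text arrives on standard input and that each block runs from the "cut here" marker to an "end trace" line, as kernel warnings usually do. The function name is hypothetical and not part of any tool.

```shell
# Hypothetical helper: print kernel warning blocks from journal text on stdin.
# Assumption: a block starts at "[ cut here ]" and ends at "[ end trace".
extract_kernel_warnings() {
    awk '
        /\[ cut here \]/ { inside = 1 }   # start of a warning block
        inside           { print }        # print every line while inside a block
        /\[ end trace/   { inside = 0 }   # the end marker itself is printed, then we stop
    '
}
```

Possible usage, for the previous boot: `journalctl -k -b -1 | extract_kernel_warnings`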

Comment 1 Morgan Leijström 2023-03-20 16:43:48 CET

Still valid kernel 6.2.7-desktop-1

With all other updates per now.

CC: (none) => tmb
Assignee: tmb => kernel

Comment 2 Morgan Leijström 2023-03-24 11:52:14 CET

Same with 6.2.8-desktop-1.mga9

Comment 3 Morgan Leijström 2023-04-10 00:07:52 CEST

Full update, not better.

Observation

When at the black screen with cursor, I can press Ctrl-Alt-Backspace twice and the login dialogue appears again.  I enter the password and continue, but the computer hangs: I can't Ctrl-Alt-F2 to another terminal, and neither Ctrl-Alt-Backspace twice nor Ctrl-Alt-Del twice works.

When I use xfce instead of Plasma, the desktop appears, but Firefox is unresponsive.  I can launch the notepad and enter some text, but soon the desktop is frozen.

Tried switching from SDDM to LightDM, slightly better.

Nothing obvious in journal.
Next, i should try another graphics driver.

Comment 4 Morgan Leijström 2023-04-10 23:05:15 CEST

Driver used is nouveau - I cannot find another that works at all.

GPU: GT218M [NVS 3100M]
Proprietary nvidia driver is supposed to be the 340, which we do not have.
I tried a couple of others but they failed.
Also tried Xorg vesa, which failed too.

So this bug I think is about regression in kernel 6.2 compatibility with nouveau driver for this GPU.

Comment 5 Giuseppe Ghibò 2023-04-10 23:20:45 CEST

I got the same under nouveau too, however I get exactly the same problem using modesetting under qemu-5.2.0 (and with native to intel driver), so probably it's not just limited to nouveau.

As for nvidia340, it's in obsolete now because it is EOL. However, with a bit of patience, and if you do not care about potential security problems, for testing it could be possible to rebuild the 340.xx driver locally (assuming it works with current xorg, given the increased API version) from here:

 https://svnweb.mageia.org/packages/obsolete/nvidia340/current/SPECS/

and then applying in sequence the following patchset, from the patch for kernel 5.11 up to kernel 6.2:

https://bugs.mageia.org/show_bug.cgi
https://svnweb.mageia.org/packages/obsolete/nvidia340/current/SPECS/nvidia340.spec?view=log

https://aur.archlinux.org/cgit/aur.git/tree/0005-kernel-5.11.patch?h=nvidia-340xx

https://aur.archlinux.org/cgit/aur.git/tree/0006-kernel-5.14.patch?h=nvidia-340xx

https://aur.archlinux.org/cgit/aur.git/tree/0007-kernel-5.15.patch?h=nvidia-340xx

https://aur.archlinux.org/cgit/aur.git/tree/0008-kernel-5.16.patch?h=nvidia-340xx

https://aur.archlinux.org/cgit/aur.git/tree/0009-kernel-5.17.patch?h=nvidia-340xx

https://aur.archlinux.org/cgit/aur.git/tree/0010-kernel-5.18.patch?h=nvidia-340xx

https://aur.archlinux.org/cgit/aur.git/tree/0011-kernel-6.0.patch?h=nvidia-340xx

https://aur.archlinux.org/cgit/aur.git/tree/0012-kernel-6.2.patch?h=nvidia-340xx

CC: (none) => ghibomgx

Comment 6 Morgan Leijström 2023-04-10 23:45:10 CEST

Some more testing:

§ Testing today above and below is with kernel 6.2.10-desktop-2.mga9 

§ same level of problem in IceWM as in xfce: after resume it kind of works for a while but soon hangs.

§ I see artefacts in xfce window frames (but that is an old problem), but not in Plasma nor IceWM

§ SDDM flickers when I move the mouse. I think I have seen it before. LightDM does not show this problem.

§ After disabling hardware acceleration and hardware mouse pointer, I could use IceWM after resume without problem (did not test very long, but definitely better).  Resuming into xfce, it hung pretty quickly.

Morgan Leijström 2023-04-10 23:49:26 CEST

Keywords: (none) => FOR_ERRATA9
Hardware: x86_64 => All
Summary: Kernel 6.2 breaks resuming from suspend on Thinkpad T510 (OK with 6.1 series) => Kernel 6.2 regression resuming from suspend on some systems

Comment 7 Morgan Leijström 2023-04-12 13:48:06 CEST

For simplest workaround in errata, is it a bad idea to use kernel 6.1 from mga8 backport?  (I have not tried)

Or could we get a 6.1 kernel as alternative in mga9?

Or will we be able to fix whatever problem there is with kernel6.2/nouveau/x11/..

I have not tried switching to Wayland, could that be an idea?

Comment 8 Giuseppe Ghibò 2023-04-12 20:03:07 CEST

We could try kernel-linus and if it shows the same problems we can try to do a report on upstream bugzilla.kernel.org.

Comment 9 Morgan Leijström 2023-04-13 01:23:11 CEST

Same problem using kernel-linus-6.2.10-1.mga9

Sidenote:
Another problem have showed up with Plasma: after last days updates it do not get to desktop even with kernel 6.1.14 and a fresh user.  Possibly worsened incompatibility with nouveau on this GPU? Or something broke due to the hard reboots from hang...  Will do fresh install with beta 2 ISOs when available.

Comment 10 Morgan Leijström 2023-04-13 01:46:36 CEST

Hm.
Installed GNOME and it too like Plasma fail to get to showing desktop, but it show a dialog about it and it works to log out.  This is true for both GNOME and GNOME X11, on both kernel desktop 6.2.10-2 and 6.1.14-1.  I dont think i ever tested GNOME on this machine, but...  another thing to test with next ISO.

XFCE and IceWM works like before.

Comment 11 Giuseppe Ghibò 2023-04-13 06:57:14 CEST

(In reply to Morgan Leijström from comment #9)

> Same problem using kernel-linus-6.2.10-1.mga9
> 
> Sidenote:
> Another problem have showed up with Plasma: after last days updates it do
> not get to desktop even with kernel 6.1.14 and a fresh user.  Possibly
> worsened incompatibility with nouveau on this GPU? Or something broke due to
> the hard reboots from hang...  Will do fresh install with beta 2 ISOs when
> available.

I've not noticed this kind of behaviour after plasma update, though the resume problem seems still there. There were also mesa (23.0.2) updates recently.

It could also be that there are two problems: one with resume after suspend and the other with 3D.

What we can also try is to disable hardware 3D acceleration for the nouveau driver in xorg.conf, and rely only on the llvmpipe software driver. Natively it should have an OpenGL compatibility level of 4.x, so enough for most operations. It would be a lot slower, especially if you are on a pretty old CPU, but at least it shouldn't hang.

Comment 12 Morgan Leijström 2023-04-13 10:33:43 CEST

As per my comment 6, it got slightly better by disabling hardware acceleration, but for IceWM only. I set it using drakx11; I don't know how to set it in xorg.conf.  I read somewhere that modern systems rely less on xorg.conf, e.g. Plasma overrides things, but this is not my cup of tea.


Now I intended to try old nv, but
Bug 31788 - drakx11 offers xorg nv, then tell there is no package x11-driver-video-nv

Comment 13 Morgan Leijström 2023-04-13 11:22:07 CEST

Driver "xorg modesetting" works!  :)
I can use desktop after resume from suspend in both XFCE and IceWM.

$ uname -a
Linux localhost 6.2.10-desktop-2.mga9 #1 SMP PREEMPT_DYNAMIC Mon Apr 10 13:11:58 UTC 2023 x86_64 GNU/Linux


---

Here is from last boot incl a few suspend/resume:  (all *successful*)

$ sudo journalctl -b | grep nouveau
apr 13 10:39:13 localhost kernel: nouveau 0000:01:00.0: vgaarb: deactivate vga console
apr 13 10:39:13 localhost kernel: nouveau 0000:01:00.0: NVIDIA GT218 (0a8600b1)
apr 13 10:39:13 localhost kernel: nouveau 0000:01:00.0: bios: version 70.18.87.00.00
apr 13 10:39:13 localhost kernel: nouveau 0000:01:00.0: fb: 512 MiB DDR3
apr 13 10:39:13 localhost kernel: nouveau 0000:01:00.0: DRM: VRAM: 512 MiB
apr 13 10:39:13 localhost kernel: nouveau 0000:01:00.0: DRM: GART: 1048576 MiB
apr 13 10:39:13 localhost kernel: nouveau 0000:01:00.0: DRM: TMDS table version 2.0
apr 13 10:39:13 localhost kernel: nouveau 0000:01:00.0: DRM: DCB version 4.0
apr 13 10:39:13 localhost kernel: nouveau 0000:01:00.0: DRM: DCB outp 00: 01800323 00010034
apr 13 10:39:13 localhost kernel: nouveau 0000:01:00.0: DRM: DCB outp 01: 02811300 00000000
apr 13 10:39:13 localhost kernel: nouveau 0000:01:00.0: DRM: DCB outp 02: 028223a6 0f220010
apr 13 10:39:13 localhost kernel: nouveau 0000:01:00.0: DRM: DCB outp 03: 02822362 00020010
apr 13 10:39:13 localhost kernel: nouveau 0000:01:00.0: DRM: DCB outp 04: 048333b6 0f220010
apr 13 10:39:13 localhost kernel: nouveau 0000:01:00.0: DRM: DCB outp 05: 04833372 00020010
apr 13 10:39:13 localhost kernel: nouveau 0000:01:00.0: DRM: DCB outp 06: 088443c6 0f220010
apr 13 10:39:13 localhost kernel: nouveau 0000:01:00.0: DRM: DCB outp 07: 08844382 00020010
apr 13 10:39:13 localhost kernel: nouveau 0000:01:00.0: DRM: DCB conn 00: 00000040
apr 13 10:39:13 localhost kernel: nouveau 0000:01:00.0: DRM: DCB conn 01: 00000100
apr 13 10:39:13 localhost kernel: nouveau 0000:01:00.0: DRM: DCB conn 02: 00101246
apr 13 10:39:13 localhost kernel: nouveau 0000:01:00.0: DRM: DCB conn 03: 00202346
apr 13 10:39:13 localhost kernel: nouveau 0000:01:00.0: DRM: DCB conn 04: 00410446
apr 13 10:39:13 localhost kernel: nouveau 0000:01:00.0: DRM: MM: using COPY for buffer copies
apr 13 10:39:13 localhost kernel: [drm] Initialized nouveau 1.3.1 20120801 for 0000:01:00.0 on minor 0
apr 13 10:39:13 localhost kernel: fbcon: nouveaudrmfb (fb0) is primary device
apr 13 10:39:13 localhost kernel: nouveau 0000:01:00.0: [drm] fb0: nouveaudrmfb frame buffer device
apr 13 10:39:16 localhost sensord[1241]: Chip: nouveau-pci-0100
apr 13 10:39:27 localhost kernel: nouveau 0000:01:00.0: Direct firmware load for nouveau/nva8_fuc084 failed with error -2
apr 13 10:39:27 localhost kernel: nouveau 0000:01:00.0: Direct firmware load for nouveau/nva8_fuc084d failed with error -2
apr 13 10:39:27 localhost kernel: nouveau 0000:01:00.0: msvld: unable to load firmware data
apr 13 10:39:27 localhost kernel: nouveau 0000:01:00.0: msvld: init failed, -19
apr 13 10:45:41 localhost kernel: nouveau 0000:01:00.0: Direct firmware load for nouveau/nva8_fuc084 failed with error -2
apr 13 10:45:41 localhost kernel: nouveau 0000:01:00.0: Direct firmware load for nouveau/nva8_fuc084d failed with error -2
apr 13 10:45:41 localhost kernel: nouveau 0000:01:00.0: msvld: unable to load firmware data
apr 13 10:45:41 localhost kernel: nouveau 0000:01:00.0: msvld: init failed, -19
apr 13 10:51:23 localhost kernel: nouveau 0000:01:00.0: Direct firmware load for nouveau/nva8_fuc084 failed with error -2
apr 13 10:51:23 localhost kernel: nouveau 0000:01:00.0: Direct firmware load for nouveau/nva8_fuc084d failed with error -2
apr 13 10:51:23 localhost kernel: nouveau 0000:01:00.0: msvld: unable to load firmware data
apr 13 10:51:23 localhost kernel: nouveau 0000:01:00.0: msvld: init failed, -19
apr 13 10:59:39 localhost sensord[1241]: Chip: nouveau-pci-0100
apr 13 11:13:04 localhost kernel: nouveau 0000:01:00.0: Direct firmware load for nouveau/nva8_fuc084 failed with error -2
apr 13 11:13:04 localhost kernel: nouveau 0000:01:00.0: Direct firmware load for nouveau/nva8_fuc084d failed with error -2
apr 13 11:13:04 localhost kernel: nouveau 0000:01:00.0: msvld: unable to load firmware data
apr 13 11:13:04 localhost kernel: nouveau 0000:01:00.0: msvld: init failed, -19
apr 13 11:14:44 localhost kernel: nouveau 0000:01:00.0: Direct firmware load for nouveau/nva8_fuc084 failed with error -2
apr 13 11:14:44 localhost kernel: nouveau 0000:01:00.0: Direct firmware load for nouveau/nva8_fuc084d failed with error -2
apr 13 11:14:44 localhost kernel: nouveau 0000:01:00.0: msvld: unable to load firmware data
apr 13 11:14:44 localhost kernel: nouveau 0000:01:00.0: msvld: init failed, -19



One odd thing is that "unable to load firmware data" occurred at boot and first two resume, but not two latest resume.
"Chip: nouveau-pci-0100" occurred for boot and one resume where it did not get the "unable to load" lines.
But some resumes - both to IceWM and XFCE rendered no journal messages containing "nouveau"
Coincidence with some timing?
* But resume did always work *
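
To compare these firmware messages against the times of each resume, a small sketch like the following could list when the msvld failures were logged. It assumes the default short journal format shown above (month, day, time, host, ...) with the journal text on standard input; the function name is hypothetical.

```shell
# Hypothetical helper: print the time of each "msvld: init failed" line.
# Assumption: default short journal format, so the time is the third field.
msvld_failure_times() {
    awk '/nouveau .*msvld: init failed/ { print $3 }'
}
```

Possible usage: `sudo journalctl -b | msvld_failure_times`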

Status comment: (none) => Suggest trying xorg modesetting?

Comment 14 Morgan Leijström 2023-04-13 16:47:18 CEST

And now with xorg modesetting, Plasma and GNOME works, both X11 and Wayland, XFCE do not show any artefacts in window borders, and SDDM never flicker when moving mouse.

To summarise: All DE I tried works including resume from suspend correctly

- Except Plasma on Wayland, but for that I blame Plasma itself, as GNOME on Wayland resumes OK. (Plasma only shows the pointer on a black screen; exit to login using ctrl-alt-bksp.)

Summary: Kernel 6.2 regression resuming from suspend on some systems => Kernel 6.2 regression resuming from suspend on some systems - but works with xorg modesetting

Comment 15 Morgan Leijström 2023-04-13 17:04:24 CEST

https://wiki.mageia.org/en/Mageia_9_Errata#Nvidia

Keywords: FOR_ERRATA9 => IN_ERRATA9

Comment 16 Morgan Leijström 2023-04-13 17:08:22 CEST

I guess there is no feasible way to automate detection and setting modesetting when needed and only when needed neither for install nor upgrade?

What to blame? nouveau, mesa, kernel.. or the combination?

Status comment: Suggest trying xorg modesetting? => (none)

Comment 17 Giuseppe Ghibò 2023-04-14 10:04:27 CEST

Trying to do other automatic switching could be more harmful at this point because we don't know yet the origin.

IMHO the problem arose in the last month; previously, with the Live ISO for instance, I wasn't getting the problem in qemu5 (which uses the modesetting driver too, with no nouveau involved). So it could be two different problems. The first is the suspend of X11 for inactivity (but at this point can it be only Plasma?), resulting in the black screen with just the mouse pointer after resume (maybe problems with DPMS?) and apparently no other visible problems in the logs. The second could be a certain instability in the nouveau driver, which was probably already there. Maybe the latest desktop upgrades just increased the minimal OpenGL capability required to run all the 3D effects, and nouveau lagged behind.

For using software 3D acceleration with llvmpipe and no other hardware acceleration, you may edit /etc/X11/xorg.conf and try:

Section "Device"
...
    Driver "nouveau"
    Option "NoAccel" "true"
EndSection

while to disable for modesetting:

Section "Device"
...
    Driver "modesetting"
    Option "AccelMethod" "none"
EndSection

Then you can check the OpenGL renderer using

glxinfo -B | grep "OpenGL renderer"
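
As a rough sketch, the renderer check above could be wrapped so it reports whether software or hardware rendering is in use. It assumes software rendering shows up as llvmpipe or softpipe in the renderer string; the function name is hypothetical, and it reads `glxinfo -B` output from standard input.

```shell
# Hypothetical helper: classify the renderer reported by `glxinfo -B` (text on stdin).
# Assumption: software rendering appears as llvmpipe or softpipe in the renderer string.
renderer_kind() {
    # Take the first renderer line only; sed exits 0 even when nothing matches.
    renderer=$(sed -n '/OpenGL renderer string/{p;q;}')
    case "$renderer" in
        "")                    echo "unknown" ;;
        *llvmpipe*|*softpipe*) echo "software" ;;
        *)                     echo "hardware" ;;
    esac
}
```

Possible usage: `glxinfo -B | renderer_kind`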

Comment 18 Morgan Leijström 2023-04-14 10:46:03 CEST

(In reply to Giuseppe Ghibò from comment #17)
> IMHO the problem arised in the last month, 

In retrospect, the flicker in SDDM when moving mouse as well as the artefacts in XFCE window borders have been there already several years but i dont remember if i saw it on this specific laptop.  I just did not care much, there was more important stuff to chase, and never tried modesetting before.  But with modesetting those problems too are gone.  What I mean is that nouveau may have always had this problem, it just got worse in combination with i.e mesa and kernel changes.

But this locking up a while after resume is new to me.

> So it could be two
> different problems, first the problem of suspend of X11 for inactivity (but
> at this point can be only Plasma?)

Since a few days neither Plasma nor GNOME in neither Wayland nor X11 mode start *at all* even in initial login without modesetting.

And with modesetting everything is OK in all DE i tried (except resuming Plasma Wayland)

I must admit I should read up on what modesetting actually is...

I might experiment with xorg.conf later, too much private and job backlog...

But first, isnt there something nowadays that writes to xorg.conf automatically, and how to turn that off?

Comment 19 Morgan Leijström 2023-04-18 14:14:45 CEST

Using only drakx11 to set it up, the well working modesetting choice give this section in xorg.conf:

Section "Device"
    Identifier "device1"
    Driver "modesetting"
    Option "DPMS"
EndSection


And if selecting nouveau (resuming badly):

Section "Device"
    Identifier "device1"
    Driver "nouveau"
    Option "DPMS"
EndSection


$ glxinfo -B | grep "OpenGL renderer"
OpenGL renderer string: NVA8


Adding a line to disable hardware acceleration:

Section "Device"
    Identifier "device1"
    Driver "nouveau"
    Option "DPMS"
    Option "NoAccel" "true"    
EndSection

then

$ glxinfo -B | grep "OpenGL renderer"
OpenGL renderer string: llvmpipe (LLVM 15.0.6, 128 bits)

And resuming works.
But performance is much slower than modesetting.

Updated errata.
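
To confirm which driver a given xorg.conf actually selects, a minimal sketch like this could print the Driver value of each Device section. It assumes the simple layout shown above (one keyword per line, values in double quotes); the function name is hypothetical.

```shell
# Hypothetical helper: print the Driver set in each Device section of an xorg.conf-style file.
# $1 = path to the file; assumes one keyword per line with double-quoted values.
xorg_device_drivers() {
    awk '
        tolower($1) == "section" && tolower($2) == "\"device\"" { in_dev = 1; next }
        tolower($1) == "endsection"                             { in_dev = 0 }
        in_dev && tolower($1) == "driver"                       { gsub(/"/, "", $2); print $2 }
    ' "$1"
}
```

Possible usage: `xorg_device_drivers /etc/X11/xorg.conf`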

Comment 20 Giuseppe Ghibò 2023-04-22 10:56:39 CEST

(In reply to Morgan Leijström from comment #19)
> Using only drakx11 to set it up, the well working modesetting choice give
> this section in xorg.conf:
> 
> Section "Device"
>     Identifier "device1"
>     Driver "modesetting"
>     Option "DPMS"
> EndSection
> 
> 
> And if selecting nouveau (resuming badly):
> 
> Section "Device"
>     Identifier "device1"
>     Driver "nouveau"
>     Option "DPMS"
> EndSection
> 
> 
> $ glxinfo -B | grep "OpenGL renderer"
> OpenGL renderer string: NVA8
> 
> 
> Adding a line to disable hardware acceleration:
> 
> Section "Device"
>     Identifier "device1"
>     Driver "nouveau"
>     Option "DPMS"
>     Option "NoAccel" "true"    
> EndSection
> 
> then
> 
> $ glxinfo -B | grep "OpenGL renderer"
> OpenGL renderer string: llvmpipe (LLVM 15.0.6, 128 bits)
> 
> And resuming works.
> But performance is much slower than modesetting.
> 
> Updated errata.

There is also a new button option in XFdrake 1.37 to do this automatically. It doesn't completely resolve this bug (as the problems are probably in the kernel or in the latest xorg), but it might be helpful.

Comment 21 Morgan Leijström 2023-04-22 22:43:22 CEST

Thank you. Verified.
I now changed errata to describe the GUI method.

Comment 22 Dusan Pavlik 2023-04-27 15:45:54 CEST

A similar problem occurs with an external monitor on mga8 with a radeon GPU and kernel 6.2.

This problem is starting with 6.2 kernel.

When the external monitor goes to sleep, it will not wake up.

CC: (none) => pavlikd

Comment 23 psyca 2023-06-28 10:58:37 CEST

Any changes with Kernel 6.3?

CC: (none) => linux

Comment 24 Dusan Pavlik 2023-06-28 11:31:51 CEST

Sorry for my mistake, but I meant the 6.1 kernel from backports on mga8.

Comment 25 Morgan Leijström 2023-06-29 22:35:33 CEST

Still need Xorg modesetting.

nouveau hangs on resume from suspend.
I did not try disabling hw accel this time.

Same system, full update incl kernel 6.3.10-desktop-1.mga9

Comment 26 Morgan Leijström 2024-01-29 10:06:33 CET

On my workstation the solution is to use linus kernel:

On my P55 main board with nvidia GTX750, resume from suspend fails to produce a picture on my monitor unless I power cycle the monitor, when kernel desktop or server is used with the nvidia driver. Workarounds: 1) manually power cycle the monitor after resume, 2) use the linus kernel, 3) use nouveau (slow) or modesetting (decent).

Comment 27 Giuseppe Ghibò 2024-01-29 10:11:39 CET

So, as usual 6.6.14-1 .desktop fails and 6.6.14-1 -linus works? 

We might try a sort of "bisect" until we find the offending patch. Here is a version of 6.6.14 with most of patches disabled:

https://download.copr.fedorainfracloud.org/results/ghibo/mageia9-bonus/mageia-9-x86_64/06965290-kernel/

please check whether it works or fails (just install -desktop and -devel).

Comment 28 Morgan Leijström 2024-01-29 10:13:15 CET

Great, I am on :)

Comment 29 Morgan Leijström 2024-01-29 20:08:05 CET

First "bisect" result:
kernel-desktop-6.6.14-1.s1.mga9-1-1.mga9.x86_64 fail

Summary: Kernel 6.2 regression resuming from suspend on some systems - but works with xorg modesetting => Kernel 6.2 regression resuming from suspend on some nvidia systems - works with free drivers or kernel linus

Comment 30 Giuseppe Ghibò 2024-01-30 10:35:24 CET

(In reply to Morgan Leijström from comment #29)

> First "bisect" result:
> kernel-desktop-6.6.14-1.s1.mga9-1-1.mga9.x86_64 fail

Try 6.6.14-1.s2 here:

https://download.copr.fedorainfracloud.org/results/ghibo/mageia9-bonus/mageia-9-x86_64/06971928-kernel/

Comment 31 Morgan Leijström 2024-01-31 08:47:18 CET

6.6.14-1.s2: Success :)    (test incl new microcode and dracut)

Comment 32 Giuseppe Ghibò 2024-01-31 17:55:50 CET

Try 6.6.14-1.s3 here:

https://download.copr.fedorainfracloud.org/results/ghibo/mageia9-bonus/mageia-9-x86_64/06975523-kernel/

Comment 33 Morgan Leijström 2024-01-31 20:43:17 CET

6.6.14-1.s3: Success :)

Comment 34 Giuseppe Ghibò 2024-02-01 15:36:18 CET

Try 6.6.14-1.s4 here:

https://download.copr.fedorainfracloud.org/results/ghibo/mageia9-bonus/mageia-9-x86_64/06977876-kernel/

Comment 35 Morgan Leijström 2024-02-01 17:34:26 CET

6.6.14-1.s4 : success :)

Comment 36 Giuseppe Ghibò 2024-02-01 23:31:14 CET

Try 6.6.14-1.s5 here:

https://download.copr.fedorainfracloud.org/results/ghibo/mageia9-bonus/mageia-9-x86_64/06980096-kernel/

Comment 37 Morgan Leijström 2024-02-02 01:38:12 CET

I find no rpm there, despite https://copr.fedorainfracloud.org/coprs/ghibo/mageia9-bonus/build/6980096 shows all succeeded.
And I also waited ten minutes and looked again.

Comment 38 Giuseppe Ghibò 2024-02-02 10:52:05 CET

You triggered a COPR bug, files are there but are not shown in the generated HTML page which seems an older cached version.

The individual URLs are:

https://download.copr.fedorainfracloud.org/results/ghibo/mageia9-bonus/mageia-9-x86_64/06980096-kernel/kernel-desktop-devel-6.6.14-1.s5.mga9-1-1.mga9.x86_64.rpm

https://download.copr.fedorainfracloud.org/results/ghibo/mageia9-bonus/mageia-9-x86_64/06980096-kernel/kernel-desktop-6.6.14-1.s5.mga9-1-1.mga9.x86_64.rpm

https://download.copr.fedorainfracloud.org/results/ghibo/mageia9-bonus/mageia-9-x86_64/06980096-kernel/kernel-6.6.14-1.s5.mga9.src.rpm

Comment 39 Morgan Leijström 2024-02-02 20:06:22 CET

6.6.14-1.s5 : Fail.

Comment 40 Giuseppe Ghibò 2024-02-03 01:10:28 CET

Try 6.6.14-2.s1:

https://download.copr.fedorainfracloud.org/results/ghibo/mageia9-bonus/mageia-9-x86_64/06982548-kernel/

Comment 41 Morgan Leijström 2024-02-03 18:59:31 CET

6.6.14-2.s1: Success :)

Comment 42 Giuseppe Ghibò 2024-02-03 19:11:45 CET

Ok, we got it!

The problem seems coming from this patch:

https://svnweb.mageia.org/packages/cauldron/kernel/current/SOURCES/block-floppy-disable-pnp-modalias.patch?view=markup

related to floppy autoloading. Apparently it was introduced in 2012 to fix bug https://bugs.mageia.org/show_bug.cgi?id=4696 

though the mechanism by which it interferes now is pretty weird.
We might try disabling it in a next build.

BTW, did you have a floppy reader installed in your desktop?
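
Since the patch concerns floppy autoloading, it may help when comparing kernels to note whether the floppy module is actually loaded on each one. The following is only a sketch under that assumption: it checks a /proc/modules-style listing for a module name, and the function name is hypothetical.

```shell
# Hypothetical helper: succeed if a module is listed in /proc/modules-style text.
# $1 = module name, $2 = file to read (defaults to /proc/modules).
module_loaded() {
    awk -v name="$1" '$1 == name { found = 1 } END { exit !found }' "${2:-/proc/modules}"
}
```

Possible usage: `module_loaded floppy && echo "floppy module is loaded"`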

Comment 43 Morgan Leijström 2024-02-03 20:06:25 CET

(In reply to Giuseppe Ghibò from comment #42)
> Ok, we got it!

Great :)

 
> BTW, did you have a floppy reader installed in your desktop?

Yes.

Just inserted a 1.44 diskette and clicked on it in Dolphin.
Can list and copy files from it.
It seem i have no right to add or delete files, have not tried to find out why.

Comment 44 Morgan Leijström 2024-02-03 21:28:51 CET

So in 6.6.14-2.s1 we have all patches we usually use except *only* that patch for bug 4696 ?

All tests in this bisect series was with nvidia-current, 535.154.05-1.mga9.
Before this test series i have acknowledged the problem on our usual -desktop kernel using also 470 and 545.

Now as next exercise I switch to nvidia-newfeature-545.29.06-2.mga9 with this 6.6.14-2.s1 kernel.

Comment 45 Morgan Leijström 2024-02-03 22:26:28 CET

(In reply to Morgan Leijström from comment #44)
> Now as next exercise I switch to nvidia-newfeature-545.29.06-2.mga9 with
> this 6.6.14-2.s1 kernel.

Bug 32579 Comment 23 - strikes again.

Comment 46 Giuseppe Ghibò 2024-02-03 22:28:47 CET

(In reply to Morgan Leijström from comment #44)

> So in 6.6.14-2.s1 we have all patches we usually use except *only* that
> patch for bug 4696 ?

Exactly.

> 
> All tests in this bisect series was with nvidia-current, 535.154.05-1.mga9.
> Before this test series i have acknowledged the problem on our usual
> -desktop kernel using also 470 and 545.
> 
> Now as next exercise I switch to nvidia-newfeature-545.29.06-2.mga9 with
> this 6.6.14-2.s1 kernel.

Comment 47 Giuseppe Ghibò 2024-02-03 22:33:38 CET

(In reply to Morgan Leijström from comment #45)

> (In reply to Morgan Leijström from comment #44)
> > Now as next exercise I switch to nvidia-newfeature-545.29.06-2.mga9 with
> > this 6.6.14-2.s1 kernel.
> 
> Bug 32579 Comment 23 - strikes again.

I don't think the patch is responsible for this. Consider that 6.6.14-2.s1 is in "oldversionedscheme" (while 6.6.14-2 was in newversionedscheme), so I think the nvidia-newfeature problem is triggered by some condition where the driver temporarily goes unconfigured (e.g. a missing devel package or something like that) and where drakx11 intervenes.

Morgan Leijström 2024-02-09 23:19:00 CET

Summary: Kernel 6.2 regression resuming from suspend on some nvidia systems - works with free drivers or kernel linus => Kernel 6.2+ regression resuming from suspend on some nvidia systems - works with free drivers or kernel linus

Comment 48 Morgan Leijström 2024-02-10 14:15:39 CET Comment hidden (obsolete)

Created attachment 14360 [details]
Journal from booting same kernel after switcing nvidia 535 to 545

Some times it works, sometimes not, I can not tell a clear pattern.
This time I was running experimental kernel-desktop-6.6.14-2.s1, Bug 31695 Comment 44 and nvidia535, used drakx11 to switch to nvidia-newfeature.
This bug then hit: graphics fail trying to boot that kernel.
OK booting another, dkms-autorebuild works.

This was a while ago, and now I tried again and saved and attach this log.

Instead of showing SDDM login, it boots to black screen with text(?) cursor top left.

I then issued Ctrl-Alt-F4 -> screen full black.
Blind, I logged in as root and issued reboot -> worked.
I chose another kernel (linus) -> dkms autorebuild did its job, and now I
  $ sudo journalctl -b-1 > ThisAttachement  (which I then compressed)

Comment 49 Morgan Leijström 2024-02-10 14:17:33 CET

OOPS wrong bug.

Comment 50 Morgan Leijström 2024-02-10 15:18:23 CET

Next step?

Should we ask other testers to test kernel-desktop-6.6.14-2.s1 for regressions compared to normal -desktop kernel ?

Or more theoretical thinking/analysing first?

Comment 51 Giuseppe Ghibò 2024-02-10 15:26:28 CET

Next step is to include the fix in a next kernel build. Indeed this was already done in kernel-6.6.16-1.mgaX, but the build hasn't completed yet, because it fails on the BS for the i586 arch due to VM memory exhaustion.

Comment 52 Morgan Leijström 2024-02-11 13:38:08 CET

Oh no...

I have switched to kernel-linus-6.6.14-1.mga9.x86_64
which had no problem before resuming from suspend.
But now this morning, monitor needed power cycle.

Changes to system since it worked:
mesa updated to testing, tainted.

And I removed all i586 packages from the system,
as logged in Bug 32826 Comment 12 to 14

Weird.

Will keep trying to see if it is consistent.
But apparently that kernel patch is not directly involved.
Complicated issue.

Comment 53 Giuseppe Ghibò 2024-02-11 14:10:15 CET

What about 6.6.14-1.s2 (which is basically a 6.6.15)? Does the same occur?

Comment 54 Morgan Leijström 2024-02-11 15:44:29 CET

1) Repeated the failure with kernel-linus-6.6.14-1

2) 6.6.14-1.s2, which worked before, also fail today.

Now trying older kernel-linus-6.5.13-2.mga9 which worked before.
Will report when it had a suspend - wait an hour - resume cycle.

Comment 55 Morgan Leijström 2024-02-11 19:47:34 CET

kernel-linus-6.5.13-2.mga9 which worked before also fail now.

Strange that it seems no one else but me has reported this issue.
Maybe it is very special to the combination of kernel, drivers, etc, GPU and its implementation, and of course sleep handling in the monitor.

I suggest to *not* change what patches we apply, unless there is another reason than the issue i am seeing.

Severity: critical => major

Comment 56 Morgan Leijström 2024-02-11 19:59:09 CET

... I am even leaning on setting this as wontfix unless we get some bright ideas soon.

Comment 57 Giuseppe Ghibò 2024-02-11 20:44:16 CET

What exactly you changed from last time it worked to now (package, drivers, etc.)? journalctl could show that.

Last time was pretty easy to trigger and in the end was due to the floppy patch.

Could it be that your floppy hardware is degrading in some way? How about simply trying to disconnect it from the motherboard?

Comment 58 Morgan Leijström 2024-02-12 08:14:04 CET

Hah you are kind of correct.

I had unplugged the diskette drive, because i was trying to see if other drives worked better to read old damaged setup backup diskettes from an old machine (some idiot had stored diskettes close to a transformer).

Plugged in diskette drive, rebooted, suspend over night.
Resumed now, all OK (linus-6.5.13-2)

Will try 6.6.14-1.s2 next.

Thanks for the idea!

Comment 59 Giuseppe Ghibò 2024-02-12 10:04:12 CET

Ok, so now, it's expected to:

- kernel-linus-6.5.13-2.mga9: working
- kernel-desktop-6.5.13-6.mga9: failing (newversionscheme, from updates)
- kernel-linus-6.6.14-1.mga9: working
- kernel-desktop-6.6.14-2.mga9: failing (newversionscheme, from updates)
- kernel-desktop-6.6.14-1.s2: working (oldversionscheme, copr kernel)
- kernel-linus-6.6.16-1.mga9: working (from updates_testing)
- kernel-desktop-6.6.16-1.mga9: working (newversionscheme, from updates_testing)

Comment 60 Morgan Leijström 2024-02-12 11:53:33 CET

(Crazy that having a floppy drive connected or not makes the monitor wake up or not...!)

---

Yes, Comment 59 summarises some of the results for 6.5.13 and 6.6.14

I will try the 6.6.16 linus and desktop after verifying 6.6.14-1.s2 still works.

Hmm... I also see linus 6.6.*14*-1.mga9 building again ?

---

After testing the new kernels I will see if I can borrow another monitor, to test with both "bad" and good kernels.
- Manufacturers often interpret standards differently,
plus standards often change...

Comment 61 Giuseppe Ghibò 2024-02-12 12:32:08 CET

(In reply to Morgan Leijström from comment #60)

> (Crazy that having a floppy drive connected or not makes the monitor wake up
> or not...!)

Yes...

> 
> ---
> 
> Yes, Comment 59 summarise some of the result of 6.5.13 and 6.6.14
> 
> I will try the 6.6.16 linus and desktop after verifying 6.6.14-1.s2 still
> works.
> 
> Hmm... I also see linus 6.6.*14*-1.mga9 building again ?

kernel-linus-6.6.14-1.mga9 was submitted by mistake and I couldn't interrupt the build. It should be removed. Do not consider it, as it's actually the same version already in updates.

Comment 62 Morgan Leijström 2024-02-12 22:24:43 CET

(with diskette drive attached...)

kernel-desktop-6.6.14-2.s1.mga9-1-1.mga9.x86: OK
kernel-desktop-6.6.16-1.mga9.x86_64: OK

Comment 63 Morgan Leijström 2024-02-13 10:47:26 CET

kernel-linus-6.6.16-1.mga9.x86_64: OK

Will next try latest -desktop when it is built.

Morgan Leijström 2024-02-13 10:50:23 CET

Status comment: (none) => Deends on patch see c42 and absent diskette drive c58

Comment 64 Morgan Leijström 2024-02-15 00:23:06 CET

kernel-desktop-6.6.16-3.mga9.x86_64: OK

--

Side note: I started VirtualBox with an MSW7 guest, and that worked, but afterwards I see in the journal a bunch of lines with call trace, register dump etc. First line:

feb 14 21:48:25 svarten.tribun kernel: WARNING: CPU: 0 PID: 74553 at /var/lib/dkms/virtualbox/7.0.14-1.mga9/build/vboxdrv/r0drv/linux/memobj-r0drv-linux.c:564 rtR0MemObjLinuxApplyPageRange+0x67/0xa0 [vboxdrv]

I can get more lines of course, but this should go in a separate bug if so.

The second time I started VirtualBox and the guest there was no such bunch of lines, only:

feb 15 00:12:26 svarten.tribun kernel: vboxdrv: 000000003b9337fc VMMR0.r0
feb 15 00:12:27 svarten.tribun kernel: vboxdrv: 00000000145cb2e7 VBoxDDR0.r0

and those lines were also at the end of the previous session.
The kmod was built locally:

$ dkms status|grep 6.6.16-desktop-3
virtualbox, 7.0.14-1.mga9, 6.6.16-desktop-3.mga9, x86_64: installed 
nvidia-newfeature, 545.29.06-2.mga9.nonfree, 6.6.16-desktop-3.mga9, x86_64: installed
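
When many kernels are being tested in a row, the same check can be scripted. This is only a minimal sketch: it assumes the comma-separated `dkms status` line layout shown above (other dkms releases may print a different layout), and the helper names are made up for illustration.

```python
import re

# Matches the line layout shown in this comment:
#   module, version, kernel, arch: state
LINE_RE = re.compile(
    r"^(?P<module>[^,]+), (?P<version>[^,]+), (?P<kernel>[^,]+), "
    r"(?P<arch>[^:]+): (?P<state>\S+)"
)


def parse_dkms_status(text):
    """Turn `dkms status` output into a list of dicts, skipping unknown lines."""
    entries = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            entries.append(match.groupdict())
    return entries


def modules_installed_for(text, kernel):
    """Names of the modules reported as installed for one kernel version."""
    return sorted(
        entry["module"]
        for entry in parse_dkms_status(text)
        if entry["kernel"] == kernel and entry["state"] == "installed"
    )


# The two lines quoted in this comment, used as sample input.
SAMPLE = """\
virtualbox, 7.0.14-1.mga9, 6.6.16-desktop-3.mga9, x86_64: installed
nvidia-newfeature, 545.29.06-2.mga9.nonfree, 6.6.16-desktop-3.mga9, x86_64: installed
"""

if __name__ == "__main__":
    print(modules_installed_for(SAMPLE, "6.6.16-desktop-3.mga9"))
```

On a real system the text would come from running `dkms status` instead of the `SAMPLE` string.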

Status comment: Deends on patch see c42 and absent diskette drive c58 => Depends on patch see c42, and absent diskette drive c58

Comment 65 Giuseppe Ghibò 2024-02-15 10:36:01 CET

Is that situation happening before or after a resume?

Comment 66 Morgan Leijström 2024-02-15 12:51:46 CET

Verified now it happens after reboot, before suspend.

Comment 67 Giuseppe Ghibò 2024-02-16 11:24:01 CET

There is newer 6.6.14-mga9 in updates_testing.

Did you get more info about the same VBox prob. using dmesg?

Probably is unrelated to this bug, but upstream for virtualbox. Seems pretty similar to these reports:

https://forums.virtualbox.org/viewtopic.php?t=110706

https://www.virtualbox.org/ticket/21964

https://www.virtualbox.org/ticket/21952

Comment 68 Morgan Leijström 2024-02-16 16:31:21 CET

(In reply to Giuseppe Ghibò from comment #67)
> There is newer 6.6.14

I suppose you mean 6.6.16-4: Running now, Same result for VB.

> Probably is unrelated to this bug, but upstream for virtualbox.

Agreed, I just wanted to give feedback while testing here, as we have no bug open on 6.6.16 yet.

I see in journal:

§ It first happened with 6.6.14-1, but I did not check the journal at the time, as the guest worked.

§ It did not appear with 6.5.13-2


> Seems pretty similar to these reports:

Yes. All three are very similar to the lines I see when viewing the journal.

I think we should open a bug for this.
- On VirtualBox 7.0.14 I suppose, and set it UPSTREAM, and with the links you gave?

Comment 69 Morgan Leijström 2024-02-16 20:14:36 CET

Created: 

Bug 32858 - UPSTREAM VirtualBox bug in vboxdrv module for kernel 6.6 host create warning in journal

Comment 70 Morgan Leijström 2024-02-17 16:35:03 CET

kernel-desktop-6.6.17-1.mga9: OK resuming

Comment 71 Morgan Leijström 2024-03-04 19:42:02 CET

From Bug 32922 Comment 28 and on:

kernel-desktop-6.6.18-1.mga9.x86_64: OK resuming

kernel-server-6.6.18-1.mga9.x86_64: strangely it may hard hang! :(

kernel-linus-6.6.18-1.mga9.x86_64: OK resuming


vt switching problem is partly back :(

From just referenced bug:

> § FAIL: vt switching Works before a suspend-resume cycle, but never after:
> vt switching ctrl-alt-F6 and back ctrl-alt-F2 fails: hard hang black screen
> with non moveable mouse cursor, even REISUB did not work, user lost work.  I
> also note this problem do *not* exist with linus 6.6.18-1.
> I remember we had this problem before but did we not get it sorted?

Giuseppe replied:
There was a patch/fix for that, and it is still included.

Comment 72 Morgan Leijström 2024-03-04 20:23:31 CET

(In reply to Giuseppe Ghibò from comment Bug 32922 Comment 29)
> Did you get the same fail with 535.xx or newer 470.239.07

You must mean 470.239.06-1, which is what is in testing repo

> or it's specific of 550.54.14?

All three - now verified with 535.154.05, -desktop kernel.


Now trying kernel-server-6.6.18-1.mga9.x86_64 with nvidia 535.154.05:
It is not as clear-cut, but maybe one time in five it hangs hard. I also experienced, about three times out of ten, a hard hang on resume (not a failure to wake up the monitor, like earlier -desktop kernels: it really needed the power button on the computer and did not even respond to REISUB).

-----------


(In reply to Giuseppe Ghibò from comment Bug 32922 Comment 33)

> You might try lstopo utility from hwloc package to detect if there is some weird configuration of PCIe slots, so maybe some PCIe line is shared between too many devices. Mostly these hard locks comes after the resume.

> If you swap the gfx card from one PCIe slot to another? cuda-z might also show a very different bandwidth when placed in one slot or one another.


OK tested:

Running -desktop 6.6.18-1 and nvidia 550.54.14 - now including nvidia-cuda-opencl for the suggested testing


$ lstopo --of txt
┌─────────────────────────────────────────────────────────────────────────────────────────────────┐
│ Machine (16GB total)                                                                            │
│                                                                                                 │
│ ┌────────────────────────────────┐  ├┤╶─┬─────┼┤╶───────┬──────────────────────┐  ┌───────────┐ │
│ │ Package L#0                    │      │9,8       9,8  │ PCI 07:00.0          │  │ Block fd0 │ │
│ │                                │      │               │                      │  │           │ │
│ │ ┌────────────────────────────┐ │      │               │ ┌──────────────────┐ │  │ 0 MB      │ │
│ │ │ NUMANode L#0 P#0 (16GB)    │ │      │               │ │ CoProc opencl0d0 │ │  └───────────┘ │
│ │ └────────────────────────────┘ │      │               │ │                  │ │                │
│ │                                │      │               │ │ 4 compute units  │ │                │
│ │ ┌────────────────────────────┐ │      │               │ │                  │ │                │
│ │ │ L3 (8192KB)                │ │      │               │ │ 970 MB           │ │                │
│ │ └────────────────────────────┘ │      │               │ └──────────────────┘ │                │
│ │                                │      │               └──────────────────────┘                │
│ │ ┌────────────┐  ┌────────────┐ │      │                                                       │
│ │ │ L2 (256KB) │  │ L2 (256KB) │ │      ├─────┼┤╶───────┬────────────────┐                      │
│ │ └────────────┘  └────────────┘ │      │0,2       0,2  │ PCI 02:00.0    │                      │
│ │                                │      │               │                │                      │
│ │ ┌────────────┐  ┌────────────┐ │      │               │ ┌────────────┐ │                      │
│ │ │ L1d (32KB) │  │ L1d (32KB) │ │      │               │ │ Net enp2s0 │ │                      │
│ │ └────────────┘  └────────────┘ │      │               │ └────────────┘ │                      │
│ │                                │      │               └────────────────┘                      │
│ │ ┌────────────┐  ┌────────────┐ │      │                                                       │
│ │ │ L1i (32KB) │  │ L1i (32KB) │ │      ├─────┼┤╶─┬─────┬─────────────┐                         │
│ │ └────────────┘  └────────────┘ │      │0,2      │0,2  │ PCI 03:00.0 │                         │
│ │                                │      │         │     └─────────────┘                         │
│ │ ┌────────────┐  ┌────────────┐ │      │         │                                             │
│ │ │ Core L#0   │  │ Core L#1   │ │      │         └─────┬───────────────┐                       │
│ │ │            │  │            │ │      │               │ PCI 03:00.1   │                       │
│ │ │ ┌────────┐ │  │ ┌────────┐ │ │      │               │               │                       │
│ │ │ │ PU L#0 │ │  │ │ PU L#2 │ │ │      │               │ ┌───────────┐ │                       │
│ │ │ │        │ │  │ │        │ │ │      │               │ │ Block sr0 │ │                       │
│ │ │ │  P#0   │ │  │ │  P#1   │ │ │      │               │ │           │ │                       │
│ │ │ └────────┘ │  │ └────────┘ │ │      │               │ │ 1023 MB   │ │                       │
│ │ │ ┌────────┐ │  │ ┌────────┐ │ │      │               │ └───────────┘ │                       │
│ │ │ │ PU L#1 │ │  │ │ PU L#3 │ │ │      │               └───────────────┘                       │
│ │ │ │        │ │  │ │        │ │ │      │                                                       │
│ │ │ │  P#2   │ │  │ │  P#3   │ │ │      └─────┬──────────────────────────────┐                  │
│ │ │ └────────┘ │  │ └────────┘ │ │            │ PCI 00:1f.2                  │                  │
│ │ └────────────┘  └────────────┘ │            │                              │                  │
│ └────────────────────────────────┘            │ ┌───────────┐  ┌───────────┐ │                  │
│                                               │ │ Block sdb │  │ Block sda │ │                  │
│                                               │ │           │  │           │ │                  │
│                                               │ │ 1863 GB   │  │ 465 GB    │ │                  │
│                                               │ └───────────┘  └───────────┘ │                  │
│                                               └──────────────────────────────┘                  │
└─────────────────────────────────────────────────────────────────────────────────────────────────┘


CUDA-Z Report
=============
Version: 0.11.291 SVN 64 bit Built Apr 28 2023 14:29:21 http://cuda-z.sf.net/
OS Version: Linux 6.6.18-desktop-1.mga9 #1 SMP PREEMPT_DYNAMIC Sat Feb 24 02:17:35 UTC 2024 x86_64
Driver Version: 550.54.14
Driver Dll Version: 12.40 (550.54.14)
Runtime Dll Version: 12.10

Core Information
----------------
	Name: NVIDIA GeForce GTX 750
	Compute Capability: 5.0 (Maxwell)
	Clock Rate: 1084.5 MHz
	PCI Location: 0:7:0
	Multiprocessors: 4 (512 Cores)
	Threads Per Multiproc.: 2048
	Warp Size: 32
	Regs Per Block: 65536
	Threads Per Block: 1024
	Threads Dimensions: 1024 x 1024 x 64
	Grid Dimensions: 2147483647 x 65535 x 65535
	Watchdog Enabled: Yes
	Integrated GPU: No
	Concurrent Kernels: Yes
	Compute Mode: Default
	Stream Priorities: Yes

Memory Information
------------------
	Total Global: 970.25 MiB
	Bus Width: 128 bits
	Clock Rate: 2505 MHz
	Error Correction: No
	L2 Cache Size: 2048 KiB
	Shared Per Block: 48 KiB
	Pitch: 2048 MiB
	Total Constant: 64 KiB
	Texture Alignment: 512 B
	Texture 1D Size: 65536
	Texture 2D Size: 65536 x 65536
	Texture 3D Size: 4096 x 4096 x 4096
	GPU Overlap: Yes
	Map Host Memory: Yes
	Unified Addressing: Yes
	Async Engine: 1 Yes, Unidirectional

Performance Information
-----------------------
Memory Copy
	Host Pinned to Device: 5835.06 MiB/s
	Host Pageable to Device: 4866.85 MiB/s
	Device to Host Pinned: 5037.22 MiB/s
	Device to Host Pageable: 4645.76 MiB/s
	Device to Device: 18.104 GiB/s
GPU Core Performance
	Single-precision Float: 88.5483 Gflop/s
	Double-precision Float: 16.529 Gflop/s
	64-bit Integer: 1151.24 Miop/s
	32-bit Integer: 8679.45 Miop/s
	24-bit Integer: 8230.53 Miop/s

Generated: Mon Mar  4 19:22:03 2024



Switching the graphics card from the blue slot (top, closest to the CPU) to the orange slot two positions down.
The blue one is obviously the fast one, having all pins in the connector, while the orange one has most pins missing (fewer lanes).
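
As a cross-check on the lane count, the kernel exposes the negotiated PCIe link width and speed in sysfs. A minimal sketch, assuming the card sits at `0000:07:00.0` (lstopo shows `PCI 07:00.0` in the blue slot; the `0000` domain prefix is an assumption, and the function names are made up for illustration):

```python
from pathlib import Path

# Assumption: GPU at PCI address 0000:07:00.0 (the 0000 domain prefix is assumed).
DEFAULT_DEVICE = Path("/sys/bus/pci/devices/0000:07:00.0")

# Standard PCI sysfs attributes describing the link of a PCIe device.
LINK_ATTRS = (
    "current_link_width",
    "max_link_width",
    "current_link_speed",
    "max_link_speed",
)


def read_link_info(device_dir):
    """Return the PCIe link attributes of one device; None where unreadable."""
    info = {}
    for attr in LINK_ATTRS:
        try:
            info[attr] = (Path(device_dir) / attr).read_text().strip()
        except OSError:
            info[attr] = None
    return info


def is_downgraded(info):
    """True if the negotiated width is below the maximum the device supports."""
    try:
        return int(info["current_link_width"]) < int(info["max_link_width"])
    except (TypeError, ValueError):
        return False  # attributes missing or not numeric: cannot tell


if __name__ == "__main__":
    link = read_link_info(DEFAULT_DEVICE)
    for name, value in link.items():
        print(f"{name}: {value}")
    print("link narrower than the card supports:", is_downgraded(link))
```

After moving the card the address changes (lstopo then shows `PCI 04:00.0`), so the path has to be adjusted accordingly.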


[morgan@svarten ~]$ lstopo --of txt
┌─────────────────────────────────────────────────────────────────────────────────────────────────┐
│ Machine (16GB total)                                                                            │
│                                                                                                 │
│ ┌────────────────────────────────┐  ├┤╶─┬─────┼┤╶───────┬────────────────┐        ┌───────────┐ │
│ │ Package L#0                    │      │0,2       0,2  │ PCI 02:00.0    │        │ Block fd0 │ │
│ │                                │      │               │                │        │           │ │
│ │ ┌────────────────────────────┐ │      │               │ ┌────────────┐ │        │ 0 MB      │ │
│ │ │ NUMANode L#0 P#0 (16GB)    │ │      │               │ │ Net enp2s0 │ │        └───────────┘ │
│ │ └────────────────────────────┘ │      │               │ └────────────┘ │                      │
│ │                                │      │               └────────────────┘                      │
│ │ ┌────────────────────────────┐ │      │                                                       │
│ │ │ L3 (8192KB)                │ │      ├─────┼┤╶─┬─────┬─────────────┐                         │
│ │ └────────────────────────────┘ │      │0,2      │0,2  │ PCI 03:00.0 │                         │
│ │                                │      │         │     └─────────────┘                         │
│ │ ┌────────────┐  ┌────────────┐ │      │         │                                             │
│ │ │ L2 (256KB) │  │ L2 (256KB) │ │      │         └─────┬───────────────┐                       │
│ │ └────────────┘  └────────────┘ │      │               │ PCI 03:00.1   │                       │
│ │                                │      │               │               │                       │
│ │ ┌────────────┐  ┌────────────┐ │      │               │ ┌───────────┐ │                       │
│ │ │ L1d (32KB) │  │ L1d (32KB) │ │      │               │ │ Block sr0 │ │                       │
│ │ └────────────┘  └────────────┘ │      │               │ │           │ │                       │
│ │                                │      │               │ │ 1023 MB   │ │                       │
│ │ ┌────────────┐  ┌────────────┐ │      │               │ └───────────┘ │                       │
│ │ │ L1i (32KB) │  │ L1i (32KB) │ │      │               └───────────────┘                       │
│ │ └────────────┘  └────────────┘ │      │                                                       │
│ │                                │      ├─────┼┤╶───────┬──────────────────────┐                │
│ │ ┌────────────┐  ┌────────────┐ │      │1,0       1,0  │ PCI 04:00.0          │                │
│ │ │ Core L#0   │  │ Core L#1   │ │      │               │                      │                │
│ │ │            │  │            │ │      │               │ ┌──────────────────┐ │                │
│ │ │ ┌────────┐ │  │ ┌────────┐ │ │      │               │ │ CoProc opencl0d0 │ │                │
│ │ │ │ PU L#0 │ │  │ │ PU L#2 │ │ │      │               │ │                  │ │                │
│ │ │ │        │ │  │ │        │ │ │      │               │ │ 4 compute units  │ │                │
│ │ │ │  P#0   │ │  │ │  P#1   │ │ │      │               │ │                  │ │                │
│ │ │ └────────┘ │  │ └────────┘ │ │      │               │ │ 970 MB           │ │                │
│ │ │ ┌────────┐ │  │ ┌────────┐ │ │      │               │ └──────────────────┘ │                │
│ │ │ │ PU L#1 │ │  │ │ PU L#3 │ │ │      │               └──────────────────────┘                │
│ │ │ │        │ │  │ │        │ │ │      │                                                       │
│ │ │ │  P#2   │ │  │ │  P#3   │ │ │      └─────┬──────────────────────────────┐                  │
│ │ │ └────────┘ │  │ └────────┘ │ │            │ PCI 00:1f.2                  │                  │
│ │ └────────────┘  └────────────┘ │            │                              │                  │
│ └────────────────────────────────┘            │ ┌───────────┐  ┌───────────┐ │                  │
│                                               │ │ Block sdb │  │ Block sda │ │                  │
│                                               │ │           │  │           │ │                  │
│                                               │ │ 1863 GB   │  │ 465 GB    │ │                  │
│                                               │ └───────────┘  └───────────┘ │                  │
│                                               └──────────────────────────────┘                  │
└─────────────────────────────────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────────────────────────────┐
│ Host: svarten.tribun                                                                            │
│                                                                                                 │
│ Date: mån  4 mar 2024 19:55:58                                                                  │
└─────────────────────────────────────────────────────────────────────────────────────────────────┘

CUDA-Z Report
=============
   (snipped out identical info here)

Memory Copy
	Host Pinned to Device: 736.296 MiB/s
	Host Pageable to Device: 732.628 MiB/s
	Device to Host Pinned: 788.843 MiB/s
	Device to Host Pageable: 779.851 MiB/s
	Device to Device: 30.8302 GiB/s

====================================================================

RESULT:

( still on -desktop 6.6.18-1 and nvidia 550.54.14 )
Reported "Device to Device" is higher.  Which two devices?
Because there are fewer lanes to the main board, can it communicate faster with another GPU via a bridge, if I had one?




Good: neither vt switching nor resuming hangs.

Bad: Painfully sluggish, several times slower, about half the speed of nouveau, so it is not limited only by the lower bandwidth; something else must be operating *much* more inefficiently now.

Bonus quirk: Each time it resumes from suspend, the optical drive ejects.


Painfully slow, as said, so I am now switching back to the blue socket and linus 6.6.18-1

Comment 73 Giuseppe Ghibò 2024-03-04 20:45:09 CET

(In reply to Morgan Leijström from comment #72)
> (In reply to Giuseppe Ghibò from comment Bug 32922 Comment 29)
> > Did you get the same fail with 535.xx or newer 470.239.07
> 
> You must mean 470.239.06-1, which is what is in testing repo

yes 470.239.06-1 not 470.239.07 (it was a typo)

Comment 74 Morgan Leijström 2024-03-04 21:16:57 CET

*** Bug 32541 has been marked as a duplicate of this bug. ***

Comment 75 Morgan Leijström 2024-03-04 21:24:30 CET


                                ORIENTATION
                     In this bug, focus has shifted

From: the old GT218M [NVS 3100M], worked around by using Xorg modesetting

To: GM107 problems with nvidia drivers: vt switching hangs, and either hanging (server 6.6.18) or not waking up the monitor (-desktop before 6.6.18) after resuming from suspend

Summary: Kernel 6.2+ regression resuming from suspend on some nvidia systems - works with free drivers or kernel linus => Kernel 6.2+ regression resuming from suspend & vt switching problems on some nvidia systems - works with free drivers or kernel linus

Comment 76 Giuseppe Ghibò 2024-03-04 21:47:29 CET

Device to Device means transfers within the gfx card itself. Device to Host is the transfer between main host RAM and the gfx card. You get worse performance in the 2nd slot.
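
To put a number on the difference, the host/device copy rates from the two CUDA-Z reports in comment 72 can be compared directly. A minimal sketch: the figures are copied from those reports, while the "blue"/"orange" names and the `slowdown` helper are only illustrative.

```python
# Host<->device copy rates in MiB/s, copied from the CUDA-Z reports in comment 72.
BLUE_SLOT = {
    "Host Pinned to Device": 5835.06,
    "Host Pageable to Device": 4866.85,
    "Device to Host Pinned": 5037.22,
    "Device to Host Pageable": 4645.76,
}
ORANGE_SLOT = {
    "Host Pinned to Device": 736.296,
    "Host Pageable to Device": 732.628,
    "Device to Host Pinned": 788.843,
    "Device to Host Pageable": 779.851,
}


def slowdown(fast, slow):
    """Map each test name to fast/slow, i.e. how many times slower `slow` is."""
    return {name: fast[name] / slow[name] for name in fast}


if __name__ == "__main__":
    for name, factor in slowdown(BLUE_SLOT, ORANGE_SLOT).items():
        print(f"{name}: {factor:.1f}x slower in the orange slot")
```

All four host/device figures come out roughly six to eight times lower in the orange slot, which matches the "several times slower" impression.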

Comment 77 Giuseppe Ghibò 2024-03-04 22:06:29 CET

> > or it's specific of 550.54.14?
> 
> All tree - now verified with 535.154.05, -desktop kernel.
> 
> 

So it's not specific to 550. Does 6.6.20-1.mga9 (copr) have any chance of working better?

Comment 78 Morgan Leijström 2024-03-04 22:24:48 CET

kernel-desktop-6.6.20-1.mga9-1-1.mga9.x86_64 with 550: OK so far after three iterations each of vt switching and suspend; I will report tomorrow or later if I see any problem.

BTW also updated to mesa (non tainted)

Comment 79 Giuseppe Ghibò 2024-03-04 22:40:26 CET

OK, then it should be fixed in the next round (when it is ready, as it requires all the other stuff [linus+kmods]).

Comment 80 Morgan Leijström 2024-04-21 23:37:18 CEST

This is resolved :)

Resolution: (none) => FIXED
Status: NEW => RESOLVED
