Bug 32541 - Resume-from-suspend problems using Nvidia drivers on mga9 desktop kernels with GM107
Status: RESOLVED DUPLICATE of bug 31695
Product: Mageia
Classification: Unclassified
Component: RPM Packages
Version: 9
Hardware: All
OS: Linux
Priority: Normal
Severity: normal
Assignee: Giuseppe Ghibò
Reported: 2023-11-21 01:21 CET by Morgan Leijström
Modified: 2024-03-04 21:16 CET

Source RPM: kernel-6.5.11-5.mga9.src.rpm

Description Morgan Leijström 2023-11-21 01:21:27 CET
Picked up from https://bugs.mageia.org/show_bug.cgi?id=32537#c7
Discussed now and then in various kernel bugs and on the dev and QA mailing lists with Giuseppe Ghibò since I upgraded the system from mga8 to 9 beta+.

The problem only appears when using the proprietary Nvidia drivers (470, 525, 535, 545), not with nouveau (painfully slow!) or Xorg modesetting (decent, but proprietary is faster).

___Version-Release number of selected component (if applicable):

§ Any Mageia 9 desktop kernel
§ No problem with linus kernels, currently using 6.5.11-2
§ Mageia 8 desktop kernels were OK on the same GPU, main board and CPU

We have tried different DMs (e.g. SDDM 0.20.0 from Ghibò) and DEs, and all desktop kernels and Nvidia driver versions so far, with no improvement.

Currently using Mageia sddm 0.19, Plasma on X11, Nvidia newfeature 545.29.02, also tested nvidia470 470.223.02-1.


___Hardware

GPU: GM107; NVIDIA GeForce GTX 750, VBIOS version 82.07.32.00.52
Chipset: Intel P55
CPU: Intel i7-870

Monitor: Philips PHL 436M6VBP, connected by DisplayPort

___How reproducible, Steps to Reproduce:

1. Suspend the system
2. Resume (hit a keyboard key)
3. Often the monitor wakes up and shows the lock screen, but sometimes the monitor only wakes up to report that there is no signal and then goes back to sleep. You then have to power cycle the monitor, after which it shows the lock screen.

It seems to fail mostly when the system has been sleeping for a long time (maybe the monitor goes into a deeper sleep and does not respond to the card quickly enough? But why would desktop and linus kernels differ?)

---

___Test plan:

A) Retest with server kernel, 6.5.11

B) Retest with kernel-desktop-devel-5.10.191-2.lowlatency.ck.500hz.mga9-1-1.mga9.x86_64, built by Giuseppe, from https://copr-be.cloud.fedoraproject.org/results/ghibo/mageia9-bonus/mageia-9-x86_64/. It was working in August; is it still OK with nvidia470?

C) Giuseppe to suggest/build test kernels outside Mageia repos.
Comment 1 Morgan Leijström 2023-11-21 03:20:23 CET
(In reply to Morgan Leijström from comment #0)

> A) Retest with server kernel, 6.5.11

kernel-server-6.5.11-5.mga9.x86_64 works/fails like desktop: resume after a short sleep is OK; then I let it sleep for an hour, and I had to power cycle the monitor after resume. Using newfeature 545.29.02.

Now on to test B, using newfeature 545.29.02 - dkms built the module.
Short sleep is OK, and now I am going to bed myself; I will test resume tomorrow.

Then maybe a later lowlatency kernel from the same repo.
Comment 2 Morgan Leijström 2023-11-21 11:27:40 CET
Test part B in progress

§ OK: kernel-desktop-devel-5.10.191-2.lowlatency.ck.500hz.mga9-1-1.mga9.x86

§ Fail: kernel-desktop-6.4.16-9.lowlatency.mga9-1-1.mga9.x86_64

Now on to test kernel-desktop-6.4.11-16.lowlatency.mga9-1-1.mga9.x86_64, then kernel-desktop-6.1.47-2.lowlatency.mga9-1-1.mga9.x86_64, choosing blindly
Comment 3 Giuseppe Ghibò 2023-11-21 15:16:19 CET
(In reply to Morgan Leijström from comment #2)

> Test part B in progress
> 
> § OK: kernel-desktop-devel-5.10.191-2.lowlatency.ck.500hz.mga9-1-1.mga9.x86
> 
> § Fail: kernel-desktop-6.4.16-9.lowlatency.mga9-1-1.mga9.x86_64
> 
> Now on to test kernel-desktop-6.4.11-16.lowlatency.mga9-1-1.mga9.x86_64,
> then kernel-desktop-6.1.47-2.lowlatency.mga9-1-1.mga9.x86_64, choosing
> blindly

So you narrowed the resume problem down to sleeps of about 1 hour.

There are more kernels aimed at this problem here:

1) standard: 6.5.12-0.1.mga9: https://download.copr.fedorainfracloud.org/results/ghibo/mageia9-bonus/mageia-9-x86_64/06671422-kernel/

2) standard 6.5.12-0.2.mga9: https://download.copr.fedorainfracloud.org/results/ghibo/mageia9-bonus/mageia-9-x86_64/06671423-kernel/

3) linus 6.5.12-0.1.mga9: https://download.copr.fedorainfracloud.org/results/ghibo/mageia9-bonus/mageia-9-x86_64/06671424-kernel-linus/

4) LTS series, low latency: 6.1.63: https://download.copr.fedorainfracloud.org/results/ghibo/mageia9-bonus/mageia-9-x86_64/06671886-kernel/

I also have a low-latency 6.5.12-4.mga9, but it is not yet in copr; try 1) and 2) first.

Since the testing matrix is growing, keep one driver fixed first, e.g. 535.129.03, and vary only the kernels.
Comment 4 Morgan Leijström 2023-11-21 15:45:00 CET
§ Fail: kernel-desktop-6.4.11-16.lowlatency.mga9-1-1.mga9.x86_64

Now running kernel-desktop-6.1.47-2.lowlatency.mga9-1-1.mga9.x86_64.

I will try your list in numeric order.

I am keeping to newfeature 545.29.02, as that is what I was using when I started the test series.
Comment 5 Morgan Leijström 2023-11-21 22:38:58 CET
§ OK: kernel-desktop-6.1.47-2.lowlatency.mga9-1-1.mga9.x86_64

Now going for 1) standard: 6.5.12-0.1.mga9
Comment 6 Morgan Leijström 2023-11-22 00:55:03 CET
Currently running your 1) (desktop kernel-linus-6.5.12-0.1.mga9.x86_64.rpm)

and have also installed 2) ( -0.2 )

In between I also tried to install 3), but
  urpmi kernel-linus-6.5.12-0.1.mga9.x86_64.rpm
hangs forever using one core, with no output at all; I have to kill it with Ctrl-C.


(I installed the -devel- file separately first)

Can you check?

(I also tried downloading it again, and the -devel- too; both are identical to the files from the first download.)
Comment 7 Giuseppe Ghibò 2023-11-22 01:21:09 CET
(In reply to Morgan Leijström from comment #6)
> Currently running your 1) (desktop kernel-linus-6.5.12-0.1.mga9.x86_64.rpm)
> 
> and have also installed 2) ( -0.2 )
> 
> In between i also tried to install 3), but 
>   urpmi kernel-linus-6.5.12-0.1.mga9.x86_64.rpm
> hangs forever using one core, no output at all, have to kill by ctrl-C
> 
> 
> (I installed the -devel- file separately first)
> 
> Can you check?

To get the dkms modules built you need to install the -devel package at the same time, e.g. urpmi ./kernel-xxx.x86_64.rpm ./kernel-devel-xxx.x86_64.rpm or even urpmi https://copr/..../kernel-xxx.x86_64.rpm https://copr/.../kernel-devel-xxx.x86_64.rpm

(or use dnf or dnfdragora after having added the copr repo).

For 6.5.12-linus, we'll see later.

> 
> ( I also tried downloading it again, and also the -devel-, bith identical as
> first attempt downloaded files. )

So kernel-desktop-6.5.12-0.2.mga9 fails too?
Comment 8 Morgan Leijström 2023-11-22 01:50:04 CET
§ Fail: kernel-desktop-6.5.12-0.1.mga9-1-1.mga9.x86_64

Now running kernel-desktop-6.5.12-0.2, will see tomorrow if it wakes up.

---

For the problem installing linus 6.5.12-0.1.mga9:

I know the corresponding -devel- needs to be installed at the same time as the kernel (or before it). I usually put everything in one urpmi command, or run "urpmi --no-recommends *" in a folder containing only the files I want to install.

I did the same with linus, but it hung, so to investigate I let it install only the -devel- first, and that succeeded. Then, when I tell urpmi to install kernel-linus-6.5.12-0.1.mga9.x86_64.rpm, it immediately consumes exactly one CPU core and does nothing. *Very* different from normal.
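
The "kernel plus matching -devel in one folder" practice described above can be checked before urpmi is called. What follows is only a minimal Python sketch, assuming the file naming seen in this thread (kernel-<flavour>-<version>.rpm paired with kernel-<flavour>-devel-<version>.rpm). The helper names devel_name and missing_devel are hypothetical and are not part of urpmi or rpm.

```python
import re

# Assumption: kernel RPMs are named kernel-<flavour>-<rest>.rpm and their
# matching devel RPMs kernel-<flavour>-devel-<rest>.rpm, as in this thread.
KERNEL_RE = re.compile(r"^kernel-(desktop|server|linus)-(?!devel-)(.+)\.rpm$")


def devel_name(kernel_rpm):
    """Return the expected -devel file name for a kernel RPM, or None."""
    match = KERNEL_RE.match(kernel_rpm)
    if match is None:
        return None  # not a kernel package, or already a -devel package
    flavour, rest = match.groups()
    return f"kernel-{flavour}-devel-{rest}.rpm"


def missing_devel(filenames):
    """List kernel RPMs in `filenames` whose -devel counterpart is absent."""
    present = set(filenames)
    missing = []
    for name in filenames:
        expected = devel_name(name)
        if expected is not None and expected not in present:
            missing.append(name)
    return missing


# File names taken from this thread; the linus -devel file is left out
# on purpose to show what the check reports.
folder = [
    "kernel-desktop-6.1.63-2.lowlatency.mga9-1-1.mga9.x86_64.rpm",
    "kernel-desktop-devel-6.1.63-2.lowlatency.mga9-1-1.mga9.x86_64.rpm",
    "kernel-linus-6.5.12-0.1.mga9.x86_64.rpm",
]
print(missing_devel(folder))
```

An empty list would mean every kernel RPM in the folder has its -devel next to it. Here the kernel-linus file is reported because its -devel is not in the list.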

---

When installing a lower version, urpmi does not want to downgrade cpupower, kernel-userspace-headers and lib64bpf1.
So I guess it is OK to install and use the 6.1.63 kernel with the 6.5.12 versions of those three packages?
- Or should I ideally force those three downgrades when I intend to fully test the 6.1.63 kernel?

$ ls 
cpupower-6.1.63-2.lowlatency.mga9.x86_64.rpm
kernel-desktop-6.1.63-2.lowlatency.mga9-1-1.mga9.x86_64.rpm
kernel-desktop-devel-6.1.63-2.lowlatency.mga9-1-1.mga9.x86_64.rpm
kernel-userspace-headers-6.1.63-2.lowlatency.mga9.x86_64.rpm
lib64bpf1-6.1.63-2.lowlatency.mga9.x86_64.rpm

$ LC_ALL=C sudo urpmi --no-recommends *
Some requested packages cannot be installed:
cpupower-6.1.63-2.lowlatency.mga9.x86_64 (in order to keep cpupower-6.5.12-0.2.mga9.x86_64)
kernel-userspace-headers-6.1.63-2.lowlatency.mga9.x86_64 (in order to keep kernel-userspace-headers-6.5.12-0.2.mga9.x86_64)
lib64bpf1-6.1.63-2.lowlatency.mga9.x86_64 (in order to keep lib64bpf1-6.5.12-0.2.mga9.x86_64)

(I use --no-recommends because otherwise it wants to install kernel-desktop-latest 6.5.11-5.)
Comment 9 Giuseppe Ghibò 2023-11-22 02:00:33 CET
6.5.12-0.2 is expected to work at this point...

Yes, it's OK to use --no-recommends too. Note that for this test round you don't need to install all the other packages (lib64bpf, cpupower, userspace headers, etc.) to match the kernel; keep them at stock so you don't need to downgrade later (the difficulty in downgrading is probably a side effect of the new naming scheme of the stock kernels). Using just the kernel-desktop + kernel-desktop-devel RPMs won't interfere, and they can easily be upgraded, downgraded or removed, since they use the old naming scheme.
Comment 10 Morgan Leijström 2023-11-22 02:12:55 CET
(In reply to Giuseppe Ghibò from comment #9)
> 6.5.12-0.2 is exected to work at this point...

Not tested long sleep yet - we will see tomorrow.


> Yes, it's ok to use --no-recommends too. Note that for this test round you
> don't need to install all the other libraries lib64bpf, cpupower, userspace

OK, proceeding.
Comment 11 Morgan Leijström 2023-11-22 08:51:24 CET
§ Fail: kernel-desktop-6.5.12-0.2.mga9-1-1.mga9.x86_64

Now on to desktop-6.1.63-2.lowlatency.
Comment 12 Giuseppe Ghibò 2023-11-22 10:49:47 CET
(In reply to Morgan Leijström from comment #11)

> § Fail: kernel-desktop-6.5.12-0.2.mga9-1-1.mga9.x86_64

OK, which means the cause is not the patch "i2c_nvidia_gpu-change-err-into-info.patch" for bug https://bugzilla.kernel.org/show_bug.cgi?id=206653#c19, which we had in the stock kernel but not in kernel-linus.
Comment 13 Giuseppe Ghibò 2023-11-22 11:22:41 CET
(In reply to Morgan Leijström from comment #6)
> Currently running your 1) (desktop kernel-linus-6.5.12-0.1.mga9.x86_64.rpm)
> 
> and have also installed 2) ( -0.2 )
> 
> In between i also tried to install 3), but 
>   urpmi kernel-linus-6.5.12-0.1.mga9.x86_64.rpm
> hangs forever using one core, no output at all, have to kill by ctrl-C
> 
> 
> (I installed the -devel- file separately first)
> 
> Can you check?
> 
> ( I also tried downloading it again, and also the -devel-, bith identical as
> first attempt downloaded files. )

kernel-linus-6.5.12-0.1.mga9 is version 6.5.12 plus the stable-queue as of the day before yesterday.

The kernel-linus RPM signatures are OK and the packages are intact, so the RPMs should be fine. However, kernel-linus uses the new naming scheme, so it is a multiple-version RPM, which is where urpmi has problems with this kind of operation (upgrade/downgrade/remove). The "hangs forever" is probably a long timeout or slow-down (possibly more than half an hour). I haven't had time to also produce a kernel-linus in the old naming scheme as a conditional build.

A workaround is to not use urpmi at all and instead call "rpm -i" directly on the packages, e.g.:

rpm -ivh ./kernel-linus-6.5.12-0.1.mga9.x86_64.rpm ./kernel-linus-devel-6.5.12-0.1.mga9.x86_64.rpm

and they will install. What exactly do you mean by "using one core"? I tried booting kernel-linus 6.5.12-0.1 with one core only, i.e. passing "maxcpus=1" (the parameter that enables only one core) on the kernel boot cmdline, and it boots OK.
Comment 14 Morgan Leijström 2023-11-22 13:35:16 CET
§ OK: desktop-6.1.63-2.lowlatency

(In reply to Giuseppe Ghibò from comment #13)
> kernel-linus-65.12-0.1.mga9
...
> rpm -ivh 
> and they will install.

Yep, done.

> With "using one core" what do you mean exactly?

*urpmi* was using one core at 100% for several minutes until I hit Ctrl-C.

Now rebooting to linus-6.5.12-0.1, expecting the dkms autorebuild to build the nvidia and vbox modules.
Comment 15 Morgan Leijström 2023-11-22 21:38:57 CET
§ OK: kernel-linus-6.5.12-0.1.mga9.x86_64

Ready to test more :)
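
For reference, the resume results reported in comments 1 to 15 (all with newfeature 545.29.02, per comments 1 and 4) can be tabulated as below. This is only a sketch in Python: the flavour labels and function names are illustrative, and the comparison looks only at the upstream version, ignoring the release suffix such as -0.1 or the lowlatency tag.

```python
# Resume-after-long-suspend results as reported in comments 1-15.
# Tuple layout: (flavour label, upstream kernel version, resume OK?)
RESULTS = [
    ("server", "6.5.11", False),    # comment 1
    ("desktop", "5.10.191", True),  # comment 2 (lowlatency build)
    ("desktop", "6.4.16", False),   # comment 2 (lowlatency build)
    ("desktop", "6.4.11", False),   # comment 4 (lowlatency build)
    ("desktop", "6.1.47", True),    # comment 5 (lowlatency build)
    ("desktop", "6.5.12", False),   # comment 8 (release -0.1)
    ("desktop", "6.5.12", False),   # comment 11 (release -0.2)
    ("desktop", "6.1.63", True),    # comment 14 (lowlatency build)
    ("linus", "6.5.12", True),      # comment 15
]


def version_key(version):
    """Turn '6.4.11' into (6, 4, 11) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def regression_window(results, exclude_flavours=("linus",)):
    """Return (newest OK version, oldest failing version) as tuples,
    considering only flavours not listed in exclude_flavours."""
    considered = [
        (version_key(version), ok)
        for flavour, version, ok in results
        if flavour not in exclude_flavours
    ]
    newest_ok = max(v for v, ok in considered if ok)
    oldest_fail = min(v for v, ok in considered if not ok)
    return newest_ok, oldest_fail


newest_ok, oldest_fail = regression_window(RESULTS)
print("newest OK:", ".".join(map(str, newest_ok)))
print("oldest failing:", ".".join(map(str, oldest_fail)))
```

With the data above, the non-linus kernels are OK up to 6.1.63 and fail from 6.4.11 on, while kernel-linus 6.5.12 is OK. That suggests, but does not prove, that whatever differs entered the Mageia-patched builds somewhere between those two versions.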
Comment 16 Morgan Leijström 2023-11-25 23:25:14 CET
The following test seems to confirm that the problem appears because the monitor goes into a deeper sleep after a while:

When I power on the monitor shortly before resuming the system, the login is displayed without needing to power-cycle the monitor, even with desktop kernels and a long suspend time.

--

Anyway, I have now switched back to kernel-linus-6.5.12-0.1, and it tests OK with nvidia 545.29.06.
Comment 17 Morgan Leijström 2023-12-01 01:02:22 CET
There is also the problem - related or not - that after switching to a tty (e.g. using Ctrl-Alt-F4) and then back to the desktop, my screen is completely black apart from a frozen mouse pointer. Hard hang - not even REISUB works - I have to cut power.

These are the last journal lines from that run:

nov 30 23:39:52 svarten.tribun systemd[1]: Started getty@tty4.service.
nov 30 23:39:52 svarten.tribun acpid[8158]: client 12149[0:0] has disconnected
nov 30 23:39:52 svarten.tribun plasmashell[15710]: org.kde.plasma.pulseaudio: No object for name "alsa_output.pci-0000_00_1b.0.analog-stereo.monitor"
nov 30 23:39:52 svarten.tribun plasmashell[15710]: org.kde.plasma.pulseaudio: No object for name "alsa_output.pci-0000_00_1b.0.analog-stereo.monitor"
nov 30 23:39:52 svarten.tribun plasmashell[15710]: org.kde.plasma.pulseaudio: No object for name "alsa_output.pci-0000_00_1b.0.analog-stereo"
nov 30 23:39:52 svarten.tribun plasmashell[15710]: org.kde.plasma.pulseaudio: No object for name "alsa_output.pci-0000_00_1b.0.analog-stereo.monitor"
nov 30 23:39:52 svarten.tribun plasmashell[15710]: org.kde.plasma.pulseaudio: No object for name "auto_null"
nov 30 23:39:52 svarten.tribun plasmashell[15710]: org.kde.plasma.pulseaudio: No object for name "auto_null.monitor"
nov 30 23:39:52 svarten.tribun plasmashell[15710]: org.kde.plasma.pulseaudio: No object for name "auto_null.monitor"
nov 30 23:39:52 svarten.tribun plasmashell[15710]: org.kde.plasma.pulseaudio: No object for name "auto_null.monitor"
nov 30 23:40:01 svarten.tribun systemd[1]: Started session-14.scope.
nov 30 23:40:01 svarten.tribun CROND[249650]: (morgan) CMD (/usr/bin/nice -n19 /usr/bin/ionice -c2 -n7 /usr/bin/backintime backup-job >/dev/null)
nov 30 23:40:01 svarten.tribun wireplumber[15500]: GetManagedObjects() failed: org.freedesktop.DBus.Error.NameHasNoOwner
nov 30 23:40:01 svarten.tribun CROND[249645]: (morgan) CMDEND (/usr/bin/nice -n19 /usr/bin/ionice -c2 -n7 /usr/bin/backintime backup-job >/dev/null)
nov 30 23:40:02 svarten.tribun python3[249658]: QSettings::value: Empty key passed
nov 30 23:40:02 svarten.tribun python3[249658]: QSettings::value: Empty key passed


------

This seems to have become worse since kernels 5.2 and/or the Nvidia driver updates.

- A month or so ago I could often switch back and forth a couple of times before the hang, and sometimes the mouse pointer was movable.

I have now tested both kernel-desktop-6.5.11-5.mga9.x86_64 and kernel-linus-6.5.11-2.mga9.x86_64 with nvidia 535.129.03, and both failed on the first try. I only tested each once, as this is a production system...
Comment 18 Giuseppe Ghibò 2023-12-07 19:34:14 CET
(In reply to Morgan Leijström from comment #17)
> There is also the problem - related or not - that after switching to tty
> (i.e using ctrl-alt-F4) and then back to desktop, my screen is completely
> black minus a mouse pointer, frozen.  Hard hang - Not even REISUB works -
> have to cut power.
> 

There is kernel-desktop-6.5.13-2.mga9 in backports testing (just install the -desktop and -devel packages, as it uses the old naming scheme), which should fix the VT problem.

It does not address your missed monitor resume after a longer suspend, which has not been tracked down yet.
Comment 19 Morgan Leijström 2023-12-07 23:20:16 CET
(In reply to Giuseppe Ghibò from comment #18)
> (In reply to Morgan Leijström from comment #17)
> 
> There is kernel-desktop-6.5.13-2.mga9 in backports testing 
> which should fix the VT problem.

I confirm that the problem seems to be fixed; I tested a few iterations of switching between tty4, tty6 and tty1 (desktop).
I will shout if I see it again.

Thank you.

I will keep running -desktop-6.5.13-2 for a while.

Using nvidia 470.223.02-1
Comment 20 Morgan Leijström 2024-03-04 21:16:57 CET
Handling of this issue has apparently merged into Bug 31695.

As described there:

§ The VT hang problem has partially reappeared

§ Resume from suspend mostly works for -desktop, not always for -server

*** This bug has been marked as a duplicate of bug 31695 ***

Status: NEW => RESOLVED
Resolution: (none) => DUPLICATE

