Bug 29845 - Suspend/resume only works once (until reboot) if Nouveau is used on a 2-GPU laptop
Status: NEW
Alias: None
Product: Mageia
Classification: Unclassified
Component: RPM Packages
Version: 8
Hardware: x86_64 Linux
Priority: Normal
Severity: normal
Target Milestone: ---
Assignee: Kernel and Drivers maintainers
 
Reported: 2022-01-05 10:22 CET by Omnio Torr
Modified: 2023-10-10 10:54 CEST



Attachments
A working sleep (20.08 KB, text/plain)
2022-01-08 10:01 CET, Omnio Torr
A non-working sleep (10.45 KB, text/plain)
2022-01-08 10:02 CET, Omnio Torr

Description Omnio Torr 2022-01-05 10:22:54 CET
Description of problem:
"You only suspend once!"


How reproducible:
Always


Steps to Reproduce:
1. Suspending to either RAM (also called "sleep") or disk (also called "hibernate") works fine the first time after the system boots up.
2. After resuming, neither sleeping nor hibernating will work again (when trying to suspend, the screen goes black and after a few seconds the lock screen shows up, asking for the user password).
3. After reboot sleeping and hibernating will work again (but only once, of course).


At first I thought a buggy BIOS was to blame, but since suspending works fine the first time, this laptop clearly CAN suspend (BTW, it is an ASUS X550CC).

* I tested with Plasma and SDDM (and I used the Plasma live DVD to install).
* I have 2 encrypted partitions (but I doubt this matters in any way).


Best wishes,
Omnio
Comment 1 Dave Hodgins 2022-01-05 18:26:58 CET
systemctl suspend, followed by resume, followed by suspend, followed by resume,
followed by hibernate, followed by resume, works on my laptop.

$ inxi -b
System:    Host: x8t.hodgins.homeip.net Kernel: 5.15.12-server-2.mga8 x86_64 bits: 64 Desktop: KDE 4 Distro: Mageia 8 mga8 
Machine:   Type: Laptop System: ASUSTeK product: TUF Gaming FA506IV_TUF506IV v: 1.0 serial: <superuser required> 
           Mobo: ASUSTeK model: FA506IV v: 1.0 serial: <superuser required> UEFI: American Megatrends v: FA506IV.309 
           date: 07/02/2020 
Battery:   ID-1: BAT1 charge: 85.4 Wh condition: 85.7/90.2 Wh (95%) 
CPU:       Info: 8-Core AMD Ryzen 7 4800H with Radeon Graphics [MCP] speed: 1396 MHz min/max: 1400/2900 MHz 
Graphics:  Device-1: NVIDIA TU106M [GeForce RTX 2060 Mobile] driver: nvidia v: 470.94 
           Device-2: Advanced Micro Devices [AMD/ATI] Renoir driver: amdgpu v: kernel 
           Device-3: IMC Networks USB2.0 HD UVC WebCam type: USB driver: uvcvideo 
           Display: x11 server: Mageia X.org 1.20.14 driver: modesetting,nvidia resolution: 1920x1080~144Hz 
           OpenGL: renderer: NVIDIA GeForce RTX 2060/PCIe/SSE2 v: 4.6.0 NVIDIA 470.94 
Network:   Device-1: Realtek RTL8111/8168/8411 PCI Express Gigabit Ethernet driver: r8169 
           Device-2: Realtek RTL8822CE 802.11ac PCIe Wireless Network Adapter driver: rtw_8822ce 
Drives:    Local Storage: total: 1.37 TiB used: 202.81 GiB (14.5%) 
Info:      Processes: 413 Uptime: 5m Memory: 15.12 GiB used: 1.47 GiB (9.7%) Shell: Bash inxi: 3.2.01

CC: (none) => davidwhodgins

Comment 2 Lewis Smith 2022-01-06 22:15:17 CET
Thank you Omnio for your report, and apologies for your problem.

Thanks Dave for your comparable tests, with their useful pointers.

@Omnio
1) To describe your system, please post the output (as per Dave) of:
 $ inxi -b

2) In your Description, please say exactly how you:
- suspend/sleep the system;
- hibernate the system;
- resume the system.

3) Re your
"2. After resuming, neither sleeping or hibernating will work again (when trying to suspend, the screen goes black and after a few seconds the lock screen shows up, asking for the user password)."

What happens when you give the password? Does the system continue normally?

4) Please try the commands Dave gave (Dave please comment):
 # systemctl suspend
or
 # systemctl hibernate
then resume (however).
The actions of 'suspend' and 'hibernate' are not defined; the man page says of them:
"This command [suspend] is asynchronous, and will
           return after the suspend operation is successfully enqueued. It
           will not wait for the suspend/resume cycle to complete."
"This command [hibernate] is asynchronous, and
           will return after the hibernation operation is successfully
           enqueued. It will not wait for the hibernate/thaw cycle to
           complete."

CC: (none) => lewyssmith

Comment 3 Omnio Torr 2022-01-08 09:58:53 CET
Thank you for your replies! First I'll answer Lewis and then I'll describe some tests I did.

1)
$ inxi -b
System:    Host: localhost Kernel: 5.15.11-desktop-3.mga8 x86_64 bits: 64 
           Desktop: KDE Plasma 5.20.4 Distro: Mageia 8 mga8 
Machine:   Type: Laptop System: ASUSTeK product: X550CC v: 1.0 serial: <superuser required> 
           Mobo: ASUSTeK model: X550CC v: 1.0 serial: <superuser required> 
           BIOS: American Megatrends v: X550CC.300 date: 03/24/2014 
Battery:   ID-1: BAT0 charge: 30.3 Wh condition: 30.8/44.2 Wh (70%) 
CPU:       Info: Dual Core Intel Core i5-3337U [MT MCP] speed: 2187 MHz min/max: 800/2700 MHz 
Graphics:  Device-1: Intel 3rd Gen Core processor Graphics driver: i915 v: kernel 
           Device-2: NVIDIA GK208M [GeForce GT 720M] driver: nouveau v: kernel 
           Device-3: Realtek USB2.0 HD UVC WebCam type: USB driver: uvcvideo 
           Display: x11 server: Mageia X.org 1.20.14 driver: intel,v4l resolution: 1366x768~60Hz 
           OpenGL: renderer: Mesa DRI Intel HD Graphics 4000 (IVB GT2) v: 4.2 Mesa 21.3.2 
Network:   Device-1: Qualcomm Atheros AR9485 Wireless Network Adapter driver: ath9k 
           Device-2: Realtek RTL8111/8168/8411 PCI Express Gigabit Ethernet driver: r8169 
Drives:    Local Storage: total: 577.55 GiB used: 278.97 GiB (48.3%) 
Info:      Processes: 241 Uptime: 22m Memory: 7.66 GiB used: 2.07 GiB (27.0%) Shell: Bash 
           inxi: 3.2.01 

2) - I suspend (or hibernate) using the Plasma Application Menu -> Power/Session -> Sleep (/Hibernate);
   - I resume from sleep pressing any key;
   - I resume from hibernation pressing the power button.

3) Yep. After giving the password I'm back into the session I left when suspending (the normal behavior).

4) When running "systemctl suspend" or "systemctl hibernate" (either as root or as a non-root user) the behavior is exactly the same as when suspending from the Plasma Application Menu (i.e. suspending works only the first time after boot), and there is no output in the terminal whatsoever.


OK, so what I did today was to open a terminal, start "journalctl -ef" and then try again to suspend a couple of times, so that I can find any possible hints in the logs. I will attach two files, one with the logs from a successful sleep (the first try), and another from an unsuccessful sleep (the second try). What troubles me in the unsuccessful case are these lines:

Jan 08 08:57:05 localhost kernel: nouveau 0000:01:00.0: DRM: failed to idle channel 1 [DRM]
Jan 08 08:57:05 localhost kernel: PM: pci_pm_suspend(): nouveau_pmops_suspend+0x0/0x70 [nouveau] returns -16
Jan 08 08:57:05 localhost kernel: PM: dpm_run_callback(): pci_pm_suspend+0x0/0x160 returns -16
Jan 08 08:57:05 localhost kernel: nouveau 0000:01:00.0: PM: failed to suspend async: error -16
Jan 08 08:57:05 localhost kernel: PM: Some devices failed to suspend, or early wake event detected
......
Jan 08 08:57:05 localhost systemd-sleep[209317]: Failed to suspend system. System resumed again: Device or resource busy


This is an "optimus" laptop and it seems that for some reason the nvidia GPU is causing problems? I didn't configure nvidia, I just installed using the default intel GPU. I'm not sure I'm on the right path here, I'm rather guessing.
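
For anyone comparing their own logs with the lines above, here is a minimal sketch of a filter that keeps only the suspend-failure messages. The function name suspend_failures and its pattern list are assumptions made for illustration, derived only from the messages quoted in this comment; other drivers may word their errors differently. Error -16 is EBUSY, which is the "Device or resource busy" that systemd-sleep reports.

```shell
# Sketch: keep only suspend-failure lines from a log read on stdin.
# The function name and the patterns are illustrative assumptions,
# based solely on the messages quoted in this bug report.
suspend_failures() {
    grep -iE 'failed to idle channel|failed to suspend|returns -[0-9]+'
}

# Typical use on the current boot (kernel messages only):
#   journalctl -k -b | suspend_failures
```

An empty result after a suspend attempt would suggest the failure lies elsewhere than in a device's suspend callback.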


Thanks and best wishes.
Comment 4 Omnio Torr 2022-01-08 10:01:02 CET
Created attachment 13077
A working sleep
Comment 5 Omnio Torr 2022-01-08 10:02:07 CET
Created attachment 13078
A non-working sleep
Comment 6 Dave Hodgins 2022-01-08 14:22:33 CET
As shown by comment 1, my laptop also has hybrid graphics, but with nvidia
and amd, rather than nvidia and intel.

I am using mageia-prime, so that the system uses the nvidia gpu. Before I
installed that, it was using the amd gpu built in to the cpu and ran much
hotter than with mageia-prime installed.

See https://wiki.mageia.org/en/Mageia-prime_for_Optimus for details on it,
and please test with it to see if that fixes the situation.
Comment 7 Lewis Smith 2022-01-09 14:06:55 CET
@Omnio

Thank you for all the information you supplied.

Assuming the two journal extracts are from the same session: the first (OK) ends 08:52, the second (failure) starts 08:56.
Need to study them...

@Dave : I am not sure that mageia-prime should be relevant here. If it does change the situation, we shall need to find out why. Would it equally be worth trying the appropriate nVidia driver rather than nouveau - which is the origin of errors?
Comment 8 Dave Hodgins 2022-01-09 16:52:18 CET
From the non working log, the problem is "kernel: OOM killer enabled."
Once the kernel starts killing processes to free up ram, any system becomes
unusable.

Before the first attempt to suspend or sleep, what's the output of "free -m"?
Comment 9 Lewis Smith 2022-01-09 20:07:55 CET
Thanks.
In the working case, there are just 2 lines for OOM killer:
08:52:22 localhost kernel: OOM killer disabled.
08:52:22 localhost kernel: Freezing remaining freezable tasks
...
08:52:22 localhost kernel: OOM killer enabled
08:52:22 localhost kernel: Restarting tasks ... done.
...
08:52:22 localhost systemd-sleep[189084]: System resumed.
08:52:22 localhost kernel: PM: suspend exit
08:52:22 localhost systemd[1]: systemd-suspend.service: Succeeded.
08:52:22 localhost systemd[1]: Finished Suspend
...
08:52:22 localhost systemd-logind[1139]: Operation 'sleep' finished
-------------------------------------------------------------------
And 4 in the non-working case:
08:56:48 localhost kernel: OOM killer disabled
08:56:48 localhost kernel: Freezing remaining freezable tasks
...
08:56:48 localhost kernel: OOM killer enabled
...
08:56:48 localhost kernel: Restarting tasks ... done
...
08:57:05 localhost kernel: OOM killer disabled.
08:57:05 localhost kernel: Freezing remaining freezable tasks
...
08:57:05 localhost kernel: PM: Some devices failed to suspend, or early wake event detected
...
08:57:05 localhost kernel: OOM killer enabled
...
08:57:05 localhost kernel: Restarting tasks ... done
08:57:05 localhost systemd-sleep[209317]: Failed to suspend system. System resumed again: Device or resource busy
...
08:57:05 localhost systemd[1]: systemd-suspend.service: Failed with result 'exit-code'
08:57:05 localhost systemd[1]: Failed to start Suspend
------------------------------------------------------
So "OOM killer disabled/enabled" looks like part of the normal process.

Hand this to kernel? - after the answer to your question.
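
The lines that actually separate the two extracts above are the systemd-suspend.service results. A minimal sketch that prints one verdict per suspend attempt follows; the function name suspend_verdicts is hypothetical, and the patterns match only the two systemd messages quoted in this comment (other systemd versions may word them differently).

```shell
# Sketch: print "ok" or "failed" for each suspend attempt in a journal on stdin.
# Assumption: the unit result is logged with the wording quoted in this report
# ("Succeeded" / "Failed with result ..."); this is not a general parser.
suspend_verdicts() {
    sed -n \
        -e 's/.*systemd-suspend\.service: Succeeded.*/ok/p' \
        -e 's/.*systemd-suspend\.service: Failed.*/failed/p'
}

# Typical use on the current boot:
#   journalctl -b | suspend_verdicts
```

On the session logged here this would be expected to report the first attempt as ok and the second as failed, independently of the OOM killer messages.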
Comment 10 Omnio Torr 2022-01-10 14:52:28 CET
@Lewis
Sure, both logs are from the same session (two consecutive tries).

@Dave
I took your advice and tried mageia-prime. It installed the proprietary driver and seemed to work fine, but I hit another problem: after suspending and resuming, the screen stayed black no matter what (even pressing CTRL+ALT+F1/F2/F3 didn't change it). I recall someone else complaining about this over IRC a while ago. Anyway, I'm back to intel (I uninstalled all the nvidia stuff).

Here are the results you requested (in chronological order):

Before the first sleep (the one that works):
$ free -m
               total        used        free      shared  buff/cache   available
Mem:            7844         976        5681         128        1186        6480
Swap:          10997           0       10997


Before the second sleep (the one that fails):
$ free -m
               total        used        free      shared  buff/cache   available
Mem:            7844         981        5664         133        1199        6468
Swap:          10997           0       10997


There is something I'd like to add here. As you know, mageia-prime (when switching to the proprietary driver) blacklists the nouveau kernel module (it does that in /etc/modprobe.d/00_mageia-prime.conf). So I ran a test: I kept a copy of that file and later, after I uninstalled all the mageia-prime stuff and switched back to intel, I used it to blacklist nouveau. Surprise: when nouveau is blacklisted I can suspend and hibernate fine, on intel, as many times as I need to, without rebooting. So I'd say yes, nouveau is the problem, even though it still seems a bit strange that a GPU that I don't use anyway is keeping me from suspending.
Comment 11 Lewis Smith 2022-01-10 21:00:05 CET
Thank you for your many tests & perseverance. And for finding a fix.

> Graphics:
>  Device-1: Intel 3rd Gen Core processor Graphics driver: i915 v: kernel 
>  Device-2: NVIDIA GK208M [GeForce GT 720M] driver: nouveau v: kernel 
> when nouveau is blacklisted I can suspend and hibernate fine, on intel,
> as many times I need to
> nouveau is the problem, even though it still seems a bit strange that
> a GPU that I don't use anyway is keeping me from suspending
Well, in the beginning, the (very nicely chosen) journal extracts only mention Nouveau, as does the end of your comment 3. So presumably it was being used when you hit the repeat sleep/suspend problem.

Assigning to kernel/drivers.

CC: lewyssmith => (none)
Assignee: bugsquad => kernel
Summary: After suspending (and resuming), suspending won't work again until reboot => Suspend/resume only works once (until reboot) if Nouveau is used on a 2-GPU laptop

Comment 12 Dave Hodgins 2022-01-11 02:05:54 CET
Mind conducting another test?

Compare the temperatures reported by gkrellm when using the intel gpu vs using
the nvidia gpu and watching a video.

When I installed mageia-prime on my laptop with amd/nvidia gpus, I didn't
expect it to actually work as I thought it was just for intel/nvidia
combinations and was just testing that the updated package installed cleanly.

I was surprised to find it worked, and drastically reduced the system
temperatures.

I'm just wondering if using the nvidia gpu instead of the intel one does the
same.
Comment 13 Omnio Torr 2022-01-12 11:58:40 CET
@Dave

I see. Strange, I don't see any difference here (I checked with gkrellm too). In both cases the cores are at about 48-49 C when watching a movie (and the CPU fan around 2600). When using nvidia, nvidia-settings says the temp increases from 44 C to 48 C when watching a movie. Also I noticed that, with nvidia, the Plasma desktop effects seem a bit slower (like switching between activities or opening submenus within the Application Menu).

Best wishes,
Omnio
Comment 14 Dave Hodgins 2022-01-12 19:43:23 CET
Ok, thanks for the info. Very different from my case.
Comment 15 Dave Hodgins 2022-01-13 13:39:33 CET
Just fyi. Retested my hybrid system without mageia-prime. Sometime since last
March when I last tried it, the amdgpu kernel module has been fixed to no longer
cause my system to heat up.
Comment 16 Omnio Torr 2022-01-13 14:09:58 CET
Good to hear. I heard some nasty stories about overheating, a friend of mine got his GPU unstuck/unglued/whatever from the mainboard because of it (I don't know the manufacturer).
Comment 17 Morgan Leijström 2023-06-23 09:47:08 CEST
Anything new?

Are you on Mageia 9 yet, or can you report after upgrading?

Maybe try Live RC... whenever that comes

CC: (none) => fri

Comment 18 Omnio Torr 2023-06-23 10:14:34 CEST
Hi, Morgan. No, I haven't tried anything related to Mageia 9 yet (I fit into the "lazy upgrader" profile). I'll keep an eye on live RCs and give them a try when they come around.
Comment 19 Omnio Torr 2023-10-10 10:36:31 CEST
I've just tested Mageia 9 (on two different installs, one with encryption and one without) and the issue is pretty much the same. The symptoms were a bit different (in one case it just refused to hibernate and in the other it seemed to hibernate and instead of resuming it started fresh) but in both cases blacklisting nouveau solved the problem. For any user having this kind of problem and finding this bug report I'm writing a few lines about the fix:

1) Blacklist nouveau by creating a file in /etc/modprobe.d that should look like this:
# cat /etc/modprobe.d/00_nouveau-blacklist.conf 
blacklist nouveau
options nouveau modeset=0
alias nouveau off

2) Blacklisting is not enough (nouveau will still be loaded after reboot). You should create a new initrd:
# dracut -fv

3) Reboot and check (you should see empty output):
$ lsmod | grep -i nouveau
$
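
The check in step 3 can also be wrapped in a small function, sketched below under a few assumptions: the name nouveau_loaded is hypothetical, and the optional file argument (defaulting to /proc/modules, which lsmod reads) exists only so the check can be tried against a saved module list.

```shell
# Sketch: succeed if nouveau appears in a module list.
# The optional argument (default /proc/modules) is an assumption of this
# example, added so the function can be run against a saved copy.
nouveau_loaded() {
    grep -q '^nouveau ' "${1:-/proc/modules}" 2>/dev/null
}

if nouveau_loaded; then
    echo "nouveau is still loaded"
else
    echo "nouveau is not loaded"
fi
```

If nouveau is still reported as loaded after the reboot, the initrd probably still contains the module; assuming dracut's lsinitrd tool is installed, "lsinitrd | grep -i nouveau" should show whether it does.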
Comment 20 Morgan Leijström 2023-10-10 10:43:28 CEST
Thanks for finding this fix for hibernating.

What drivers are now used for the two GPUs?
- or are you only using one?

This bug header is about suspend/resume.  Does that work too after that fix?
Comment 21 Omnio Torr 2023-10-10 10:54:36 CEST
Morgan, yes. I was having problems both with sleeping and hibernating but after blacklisting nouveau they both work. Now I use the integrated Intel GPU (inxi says "Intel 3rd Gen Core processor Graphics driver: i915 v: kernel").
