Bug 27357 - Recurring core notifier timeout with Nouveau driver involving loss of GUI
Summary: Recurring core notifier timeout with Nouveau driver involving loss of GUI
Status: REOPENED
Alias: None
Product: Mageia
Classification: Unclassified
Component: RPM Packages
Version: 8
Hardware: x86_64 Linux
Priority: Normal critical
Target Milestone: ---
Assignee: Kernel and Drivers maintainers
QA Contact:
URL: https://www.dailymotion.com/video/x7w...
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2020-10-04 11:24 CEST by kalagani kalagani
Modified: 2022-09-09 11:42 CEST
7 users

See Also:
Source RPM:
CVE:
Status comment: x11-driver-video-nouveau-1.0.16-3.mga7-> x11-driver-video-nouveau-1.0.17-1.mga8


Attachments
journalctl -b (216.95 KB, text/plain)
2020-10-04 11:26 CEST, kalagani kalagani
Xorg.0.log (31.84 KB, text/plain)
2020-10-04 11:29 CEST, kalagani kalagani
xorg.conf (790 bytes, text/plain)
2020-10-04 11:30 CEST, kalagani kalagani
dmesg (73.22 KB, text/plain)
2020-10-04 11:31 CEST, kalagani kalagani
systemctl status (11.70 KB, application/octet-stream)
2020-10-04 11:32 CEST, kalagani kalagani
systemctl list-unit-files (24.78 KB, text/plain)
2020-10-04 11:34 CEST, kalagani kalagani
inxi -Fm (2.90 KB, text/plain)
2020-10-04 11:36 CEST, kalagani kalagani
rpm -qa --last (191.00 KB, text/plain)
2020-10-04 11:37 CEST, kalagani kalagani
journalctl |grep -E "Reboot --|nouveau 0000:18:00.0: DRM|nouveaudrm" (44.00 KB, text/plain)
2020-10-04 11:40 CEST, kalagani kalagani
journalctl -b with "ifplugd invoked oom-killer" (227.70 KB, text/plain)
2020-10-07 18:14 CEST, kalagani kalagani
dmesg with "ifplugd invoked oom-killer" (91.69 KB, text/plain)
2020-10-07 18:16 CEST, kalagani kalagani
journalctl -p4 -b -1 (222 "core notifier timeout") (48.36 KB, text/plain)
2020-10-08 19:48 CEST, kalagani kalagani
creeIndexPhotos.bash (17.29 KB, application/x-shellscript)
2020-10-11 12:50 CEST, kalagani kalagani
communFunctions.include (13.94 KB, application/x-shellscript)
2020-10-11 12:54 CEST, kalagani kalagani
communDefine.include (7.71 KB, text/plain)
2020-10-11 12:55 CEST, kalagani kalagani
example of contact sheet obtained with the script creeIndexPhotos (514.35 KB, image/jpeg)
2020-10-11 13:05 CEST, kalagani kalagani
journalctl --no-hostame -b (38.12 KB, application/x-xz)
2020-10-12 23:11 CEST, kalagani kalagani
Xorg.0.log 2020-10-12 (5.98 KB, application/x-xz)
2020-10-12 23:13 CEST, kalagani kalagani
journalctl --no-hostame -b -1 with loss GUI (44.68 KB, application/x-xz)
2020-10-13 11:30 CEST, kalagani kalagani
Snapshot XFCE logout session involving: "The X11 connection broke (error 1). Did the X11 server die?" (134.05 KB, image/png)
2020-10-13 16:12 CEST, kalagani kalagani
journalctl -b with new x11-server-xorg-1.20.9-1.1 (88.51 KB, application/x-xz)
2020-10-27 16:15 CET, kalagani kalagani
top > top.log with new x11-server-xorg-1.20.9-1.1 (553.55 KB, application/x-xz)
2020-10-27 16:19 CET, kalagani kalagani
nouveau core notifier timeout with following trap (816.31 KB, text/plain)
2022-01-20 17:34 CET, kalagani kalagani
journalctl --since "2022-04-01" --until "2022-04-30" --no-hostname |grep -E "drivers/gpu/drm/nouveau|- Reboot|DRM: base|DRM: core" (50.60 KB, text/plain)
2022-04-30 18:29 CEST, kalagani kalagani

Description kalagani kalagani 2020-10-04 11:24:27 CEST
Description of problem:
I previously ran MGA5 without this problem and skipped MGA6.
I installed MGA7/Plasma from the DVD, formatting only the / system partition.
I then encountered many problems with Plasma and the Nvidia driver, resulting in repeated freezes.
Suspecting old configuration files from MGA4.1/MGA5,
I reinstalled the same way a second time, this time also formatting the home partition.
Unfortunately the freezes continued, so I switched to the Nouveau driver.
This time there are no freezes but a recurring loss of the GUI, which is the subject of this report!
Not believing too much, I still changed the graphics card by exactly the same one without any improvement.

Nevertheless the PC remains accessible via ssh. Everything seems normal, and no service has failed!
But every time GUI is lost, journalctl -b always shows the same error:
kernel: nouveau 0000:18:00.0: DRM: core notifier timeout

Version-Release number of selected component (if applicable):
5.7.19-desktop-1.mga7
x11-driver-video-nouveau-1.0.16-3.mga7
lib64drm_nouveau2-2.4.101-2.mga7

Attached files about last loss:
journalctl-b2020-10-03XfceNouveauNotComposite.log
Xorg.0.log2020-10-03XfceNouveauNotComposite.log
xorg.conf2020-10-03XfceNouveauNotComposite.log
dmesg2020-10-03XfceNouveauNotComposite.log
systemctlStatus2020-10-03XfceNouveauNotComposite.log
systemctlListUnitFiles2020-10-03XfceNouveauNotComposite.log

Hardware used and installed RPM
inxi-Fm2020-10-03XfxeNouveauNotComposite.log
rpmInstalled2020-10-03XfceNouveauNotComposite.log
For information, 
journalctlRecurringCoreNotifierTimeoutSince2020-10-22.log
summary of core notifier timeout since September 22, 2020
thanks to the command:
journalctl |grep -E "Reboot --|nouveau 0000:18:00.0: DRM|nouveaudrm"
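
For a per-boot count rather than a raw list, the same journal output can be summarised as in the sketch below. The function name count_timeouts_per_boot is made up for illustration, and it assumes that boots are separated by the "-- Reboot --" marker line that the grep above also matches (other systemd versions may print a different marker).

```shell
# Count "core notifier timeout" lines per boot from journalctl output on stdin.
# Assumption: each boot boundary appears as a "-- Reboot --" marker line.
count_timeouts_per_boot() {
    awk '
        BEGIN { boot = 1 }
        # A boot boundary: print the count for the boot that just ended.
        /-- Reboot --/ { printf "boot %d: %d\n", boot, n; boot++; n = 0; next }
        /core notifier timeout/ { n++ }
        # Print the count for the last (possibly current) boot.
        END { printf "boot %d: %d\n", boot, n }
    '
}
```

Usage would be: journalctl | count_timeouts_per_boot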
Comment 1 kalagani kalagani 2020-10-04 11:26:49 CEST
Created attachment 11900 [details]
journalctl -b
Comment 2 kalagani kalagani 2020-10-04 11:29:28 CEST
Created attachment 11901 [details]
Xorg.0.log
Comment 3 kalagani kalagani 2020-10-04 11:30:44 CEST
Created attachment 11902 [details]
xorg.conf
Comment 4 kalagani kalagani 2020-10-04 11:31:47 CEST
Created attachment 11903 [details]
dmesg
Comment 5 kalagani kalagani 2020-10-04 11:32:44 CEST
Created attachment 11904 [details]
systemctl status
Comment 6 kalagani kalagani 2020-10-04 11:34:42 CEST
Created attachment 11905 [details]
systemctl list-unit-files
Comment 7 kalagani kalagani 2020-10-04 11:36:00 CEST
Created attachment 11906 [details]
inxi -Fm

My hardware: HPxw9400 station
Comment 8 kalagani kalagani 2020-10-04 11:37:50 CEST
Created attachment 11907 [details]
rpm -qa --last

Installed RPM since second MGA7 installation
Comment 9 kalagani kalagani 2020-10-04 11:40:00 CEST
Created attachment 11908 [details]
journalctl |grep -E "Reboot --|nouveau 0000:18:00.0: DRM|nouveaudrm"

Summary of core notifier timeout since September 22, 2020
Comment 10 kalagani kalagani 2020-10-04 11:52:51 CEST
Exchanges on the MLO forum about my freeze and GUI loss problems with MGA7
https://www.mageialinux-online.org/forum/topic-27878-5+freeze-mga7.php#m274291
No problem previously with the same hardware on MGA5!

And about the specific loss of GUI already seen with Nouveau
https://bugzilla.redhat.com/show_bug.cgi?id=1618906
https://bugzilla.redhat.com/show_bug.cgi?id=1655788

URL: (none) => https://www.mageialinux-online.org/forum/topic-27878-5+freeze-mga7.php#m274291

Comment 11 Aurelien Oudelet 2020-10-05 15:45:50 CEST
Hi, thanks for reporting this bug.

Your graphics card is a Quadro FX 5600, which is seen as an NV50 (G80) card of Nvidia's Tesla family.
Support for this card in the current nouveau driver from freedesktop.org seems at least OK.

Moreover, the nonfree Nvidia driver recommended upstream is 340.108, and Mageia 7 provides it:
nvidia340-340.108-7.mga7.nonfree

Is this machine UEFI or BIOS? Do you want to use nonfree drivers or nouveau?

Suggestion:
Remove "vga" options on kernel command line, and try again.
Try passing GRUB this value in /etc/default/grub:
GRUB_GFXPAYLOAD_LINUX=keep
GRUB_GFXMODE={your screen resolution here}

This helps the framebuffer / console start with good options.


Try also to add "nouveau.modeset=1" to kernel command line.
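
After rebooting, one way to confirm that the suggestions above took effect is to inspect the running kernel's command line. The sketch below is only an illustration: check_cmdline is a hypothetical helper name, and it does a plain string check without changing anything.

```shell
# check_cmdline "<kernel command line>"
# Reports whether a vga= option is still present and whether
# nouveau.modeset=1 was passed. Pure string check; changes nothing.
check_cmdline() {
    _cl=" $1 "    # pad with spaces so every option is space-delimited
    case "$_cl" in
        *" vga="*) echo "vga= option still present" ;;
        *)         echo "no vga= option" ;;
    esac
    case "$_cl" in
        *" nouveau.modeset=1 "*) echo "nouveau.modeset=1 present" ;;
        *)                       echo "nouveau.modeset=1 missing" ;;
    esac
}
```

Usage would be: check_cmdline "$(cat /proc/cmdline)"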

CC: (none) => ouaurelien
Component: Release (media or process) => RPM Packages
Status comment: (none) => x11-driver-video-nouveau-1.0.16-3.mga7
Keywords: (none) => NEEDINFO

Comment 12 kalagani kalagani 2020-10-05 18:56:28 CEST
Hello Aurelien,
yes this is a Quadro FX 5600 and machine is BIOS.

Previously I used the nvidia340-340.108-7.mga7.nonfree driver under Plasma but it was worse and the PC froze with NVRM: Xid ... Graphics Exception in the journalctl.
So I switched to XFCE, still with the nvidia driver, but had problems there too!
I reported this on the MLO forum
https://www.mageialinux-online.org/forum/topic-27878-1+mga7-freeze-ou-perte-gui.php

That is when I moved to the Nouveau driver.
Alas, under Plasma the display was full of artifacts, making it almost unusable.
I have posted 2 screenshots showing this at the link below
https://www.mageialinux-online.org/forum/topic-27103-2+plantages-recurrents.php#m274425
So once again I switched to XFCE... only to run into the problem that is the subject of this bug!!!
For information, I have the impression that the last pb occurs when a virtualBox machine is launched. Indeed the 2 previous boots went well...chance????
Thanks for your suggestions, I will try!
Comment 13 Lewis Smith 2020-10-06 16:16:27 CEST
Thank you for all the documentation you provided.

Just to support Aurelien's notes.
I have tried to establish the relationship between 'Quadro FX 5600' as reported correctly by inxi, and 'G80GL' as seen partially (G80) in dmesg and the X log (GeForce 8 (G8x)). To ensure that they are talking about the same thing.

Wikipedia: "The 8800 series, codenamed G80". The 'G80' seems to be the GPU, for which the Nvidia driver 340.108 says (2007) "Added support for Quadro FX 4600 and Quadro FX 5600", clear enough. As for Nouveau:
 NOUVEAU driver for NVIDIA chipset families : GeForce 8 (G8x)
which, academic though it may be, seems to point the same way. It all goes back a long time.

> I have the impression that the last pb occurs when a virtualBox machine
> is launched.
> Indeed the 2 previous boots went well...
Does this imply that you had *not* launched VB? Can you please test this specifically? It should be easy.

Perusing the long MLO forum thread, I am dubious about a hardware cause:
- It all worked under Mageia 5.
- "I still changed the graphics card by exactly the same one without any improvement" (which looks like "I changed the card for a similar one").

There is one thing I have not seen mentioned (even if it was), but it is very easy - except you lose all work running; which you seemed willing to do when you say in the forum you remotely killed various applications: try killing the X-server with:
 Ctl/Alt/Bksp/Bksp
which should return you to the display manager login screen. And see whether that restores the GUI. It is much quicker than re-booting.

CC: (none) => lewyssmith

Comment 14 kalagani kalagani 2020-10-06 17:18:25 CEST
(In reply to Lewis Smith from comment #13)
> I have tried to establish the relationship between 'Quadro FX 5600' as
> reported correctly by inxi, and 'G80GL' as seen partially (G80) in dmesg and
> the X log (GeForce 8 (G8x)). To ensure that they are talking about the same
> thing.
> Wikipedia: "The 8800 series, codenamed G80". The 'G80' seems to be the GPU,
> for which the Nvidia driver 340.108 says (2007) "Added support for Quadro FX
> 4600 and Quadro FX 5600", clear enough. As for Nouveau:
>  NOUVEAU driver for NVIDIA chipset families : GeForce 8 (G8x)
> which, academic though it may be, seems to point the same way. It all goes
> back a long time.
I agree with your mapping; to avoid mistakes, I had done the same check of the correspondence myself...

> > I have the impression that the last pb occurs when a virtualBox machine
> > is launched.
> > Indeed the 2 previous boots went well...
> Does this imply that you had *not* launched VB? Can you please test this
> specifically? It should be easy.
Yes, for these 2 previous boots, VB was not launched.
You are right about the test: each time (always with lightdm/XFCE/Nouveau) I launch VB... and also Firefox, Thunderbird and a terminal!
These are my test conditions.
The current boot is running under these conditions... I am waiting to see whether the GUI is lost or not
(core notifier timeout in journalctl).

> There is one thing I have not seen mentioned (even if it was), but it is
> very easy - except you lose all work running; which you seemed willing to
> do when you say in the forum you remotely killed various applications: try
> killing the X-server with:
>  Ctl/Alt/Bksp/Bksp
> which should return you to the display manager login screen. And see whether
> that restores the GUI. It is much quicker than re-booting.
On the PC without GUI, neither Ctrl+Alt+F2 or Ctrl+Alt+Bksp (once or twice) were operating.
So from another PC, over ssh, I tried to restart the server, with no result except once,
when relaunching the lightdm service restored the login screen!
In the end, I stop the machine with shutdown -h now
Comment 15 kalagani kalagani 2020-10-07 18:10:24 CEST
Hello,
> You are true for test (always with lightdm/XFCE/Nouveau) at each time I
> launch VB...and also Firefox, Thunderbird and terminal!
> These are my test conditions.
> For boot in progress in these conditions...I am waiting for GUI loss or not?
> (core notifier timeout in journalctl) 
Yesterday the boot that was in progress under these conditions ended without any problem.
No error from Nouveau.

But today, same conditions, the "core notifier timeout" occurs but without loss of GUI!!! So, I write from the PC not with ssh!!! This is a new behavior.

journalctl shows several errors of this type and then one call to
"ifplugd invoked oom-killer" followed by a dump
then several other errors "core notifier timeout"!
This call/dump seems to be linked to a problem of memory...
I can't say if there is a relationship between this and the "core notifier timeout" from Nouveau.
Comment 16 kalagani kalagani 2020-10-07 18:14:35 CEST
Created attachment 11916 [details]
journalctl -b with "ifplugd invoked oom-killer"

Line at the end of dump:
Out of memory: Killed process 3477 (montage) total-vm:53805684kB, anon-rss:17697676kB, file-rss:0kB, shmem-rss:11716372kB, UID:1000 pgtables:71336kB oom_score_adj:0
Comment 17 kalagani kalagani 2020-10-07 18:16:02 CEST
Created attachment 11917 [details]
dmesg with "ifplugd invoked oom-killer"
Comment 18 kalagani kalagani 2020-10-07 18:43:42 CEST
The memory problem above appears to be related to the swap having filled up!
free -h
              total       utilisé      libre     partagé tamp/cache   disponible
Mem:           31Gi       4,0Gi        14Gi        11Gi        12Gi        15Gi
Partition d'échange:       8,0Gi       935Mi       7,1Gi
This free command shows that the swap is still almost at its limit while a lot of RAM is free.
Could it be a release of memory that goes wrong?
Comment 19 kalagani kalagani 2020-10-07 21:57:33 CEST
Hi,
I now understand that oom-killer is how the system recovers when memory is saturated, by killing the process responsible for the saturation.
I could see that the killed process (montage) belonged to a script that I had launched in parallel with my test conditions.
> journalctl -b with "ifplugd invoked oom-killer"
> 
> Line at the end of dump:
> Out of memory: Killed process 3477 (montage) total-vm:53805684kB,
> anon-rss:17697676kB, file-rss:0kB, shmem-rss:11716372kB, UID:1000
> pgtables:71336kB oom_score_adj:0
I was even able to reproduce it live by relaunching this same script.

from journalctl
...
oct. 07 20:04:40 HPxw9400 kernel: [  24193]  1000 24193  6214209  5978556 48037888        0             0 montage
oct. 07 20:04:40 HPxw9400 kernel: [  25468]  1000 25468 67167682      600   208896        0             0 baloo_file_extr
oct. 07 20:04:40 HPxw9400 kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0-1,global_oom,task_memcg=/user.slice/user-1000.slice/session-2.scope,task=montage,pid=24193,uid=1000
oct. 07 20:04:40 HPxw9400 kernel: Out of memory: Killed process 24193 (montage) total-vm:24856836kB, anon-rss:23914224kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:46912kB oom_score_adj:0
oct. 07 20:04:40 HPxw9400 kernel: oom_reaper: reaped process 24193 (montage), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
oct. 07 20:05:28 HPxw9400 kernel: nouveau 0000:18:00.0: DRM: core notifier timeout
oct. 07 20:05:47 HPxw9400 kernel: nouveau 0000:18:00.0: DRM: core notifier timeout
oct. 07 20:06:10 HPxw9400 kernel: nouveau 0000:18:00.0: DRM: core notifier timeout

From my script, same stopped 24193 PID
2020-10-07 16:46:25 != 2020-10-07 16:46:25 => CREATION planche contact sur "/mnt/E/patrick/Documents/Thunderbird/Profiles/patrick.default/cache2/entries" PAS DU TYPE: aaaa-mm-jj-t_
 traitement de 1839 photos
./creeIndexPhotos.bash : ligne 110 : 24193 Processus arrêté      montage -size 500x500 *_thumb.png -geometry 200x+0+0 -title "$1" -tile 6x -quality 100 "$1"_000_index.jpg 2>> "$1"_000_montage.error

Note that this script, even though it triggers core notifier timeouts, does not cause any loss of GUI, unlike the earlier occurrences!
By the way, we can see that the baloo PID consumes more memory (total_vm) than the montage PID.
journalctl -b |grep -E "baloo_file|montage|uid  tgid total_vm"

oct. 07 17:06:09 HPxw9400 kernel: [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
oct. 07 17:06:11 HPxw9400 kernel: [   5926]  1000  5926 67167403      588   442368     1125             0 baloo_file
oct. 07 17:06:11 HPxw9400 kernel: [   3477]  1000  3477 13451421  7353512 73048064  1744714             0 montage
oct. 07 17:06:11 HPxw9400 kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0-1,global_oom,task_memcg=/user.slice/user-1000.slice/session-2.scope,task=montage,pid=3477,uid=1000
oct. 07 17:06:11 HPxw9400 kernel: Out of memory: Killed process 3477 (montage) total-vm:53805684kB, anon-rss:17697676kB, file-rss:0kB, shmem-rss:11716372kB, UID:1000 pgtables:71336kB oom_score_adj:0
oct. 07 17:06:11 HPxw9400 kernel: oom_reaper: reaped process 3477 (montage), now anon-rss:0kB, file-rss:0kB, shmem-rss:11716372kB
...
oct. 07 20:04:37 HPxw9400 kernel: [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
oct. 07 20:04:39 HPxw9400 kernel: [   5926]  1000  5926 67167403      671   442368     1074             0 baloo_file
oct. 07 20:04:40 HPxw9400 kernel: [  24193]  1000 24193  6214209  5978556 48037888        0             0 montage
oct. 07 20:04:40 HPxw9400 kernel: [  25468]  1000 25468 67167682      600   208896        0             0 baloo_file_extr
oct. 07 20:04:40 HPxw9400 kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0-1,global_oom,task_memcg=/user.slice/user-1000.slice/session-2.scope,task=montage,pid=24193,uid=1000
oct. 07 20:04:40 HPxw9400 kernel: Out of memory: Killed process 24193 (montage) total-vm:24856836kB, anon-rss:23914224kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:46912kB oom_score_adj:0
oct. 07 20:04:40 HPxw9400 kernel: oom_reaper: reaped process 24193 (montage), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

So I think that these oom-killer cases are not the root cause of the GUI loss!!!
PS: montage is a tool from ImageMagick; I use it in the script to make a "contact sheet" (planche contact in French) of a photo directory.
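
To pull just the oom-killer victims out of a long journal, with their resident anonymous memory converted from kB to GiB, something like the sketch below could be used. The function name list_oom_kills is hypothetical, and it assumes the "Out of memory: Killed process PID (name) ... anon-rss:NkB" line format shown in the logs above, with a process name that contains no spaces.

```shell
# Read journal lines on stdin; for each OOM kill print "PID NAME X.Y GiB",
# where the size is the anon-rss value converted from kB (1 GiB = 1048576 kB).
list_oom_kills() {
    sed -n 's/.*Out of memory: Killed process \([0-9]*\) (\([^)]*\)).*anon-rss:\([0-9]*\)kB.*/\1 \2 \3/p' |
    LC_ALL=C awk '{ printf "%s %s %.1f GiB\n", $1, $2, $3 / 1048576 }'
}
```

Usage would be: journalctl -b | list_oom_kills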
Comment 20 kalagani kalagani 2020-10-08 19:48:52 CEST
Created attachment 11921 [details]
journalctl -p4 -b -1 (222 "core notifier timeout")
Comment 21 kalagani kalagani 2020-10-08 19:52:23 CEST
Hi,
End of the story for this boot (comment 15 to comment 19) with the 2 oom-killer calls:
It ended up freezing in the evening.
Over ssh from the other PC, top showed 100% CPU usage by the VirtualBoxVM process
and heavy memory usage.

Tâches: 233 total,   2 en cours, 231 en veille,   0 arrêté,   0 zombie
%Cpu0  :  2,3 ut, 97,3 sy,  0,0 ni,  0,3 id,  0,0 wa,  0,0 hi,  0,0 si,  0,0 st
%Cpu1  :  3,0 ut,  2,4 sy,  0,0 ni, 85,1 id,  9,5 wa,  0,0 hi,  0,0 si,  0,0 st
%Cpu2  :  2,7 ut,  1,0 sy,  0,0 ni, 92,7 id,  3,7 wa,  0,0 hi,  0,0 si,  0,0 st
%Cpu3  :  2,4 ut,  2,4 sy,  0,0 ni, 89,6 id,  5,7 wa,  0,0 hi,  0,0 si,  0,0 st
MiB Mem : 29,4/32165,1  [||||||||||||||||||||||||||||||                                                                      ]
MiB Éch : 93,3/8206,0   [|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||       ]

  PID UTIL.     PR  NI    VIRT    RES    SHR S  %CPU  %MEM    TEMPS+ COM.  
27038 patrick   20   0 5059356   2,2g   2,2g R 103,6   7,1 196:36.09 VirtualBoxVM

free -h
              total       utilisé      libre     partagé tamp/cache   disponible
Mem:           31Gi       4,4Gi        17Gi       4,5Gi       9,9Gi        22Gi
Swap:          8,0Gi       7,5Gi       551Mi

This load ended when I managed to stop (in fact save the state of) the virtual machine (W10) remotely.
I then immediately regained access to the PC.

The series of "core notifier timeout" errors started shortly before the 1st oom-killer call and continued until the 2nd (this time deliberately provoked) oom-killer call. The oom-killer interventions did not make them disappear.
The PC was not frozen at that point; there was only some slowness when switching windows, until it turned into a real freeze.
After the PC recovered, these errors kept being emitted until shutdown (222 in total, cf. the journalctl -p4 -b -1 in attachment 11921 [details] above).
Comment 22 Lewis Smith 2020-10-08 22:29:29 CEST
Thank you for all your dogged research.

Going back to comment 14:
> On the PC without GUI, neither Ctrl+Alt+F2 or Ctrl+Alt+Bksp (once or twice)
> were operating.
Not being able to get to a virtual console (Ctrl+Alt+F2-7) or re-start X (Ctrl+Alt+Bksp ... twice) implies to me a problem deeper than the graphics.

* Can you say, from comment 0:
> Nevertheless the PC remains accessible via ssh. Everything seems normal
whether, if you re-start X via SSH, the GUI + login screen re-appears?

From comment 15:
> But today, same conditions, the "core notifier timeout" occurs but without
> loss of GUI!!! So, I write from the PC not with ssh!!!
Again this points away from the video driver messages - rather, their cause - being the source of the problem. Perhaps it too is suffering from lack of memory, and these messages are another symptom of the memory problems you noted subsequently.

* Can you say with confidence that the loss of the GUI definitely follows the use of (a) particular application(s), memory-hungry, notably VBox?
[In fact you have a huge amount of memory, 32Gb, so swap seems superfluous. It is unusual to make it (8Gb) less than real memory, though.]

In the last attachment (I think we have enough journals now, thank you) from comment 20, is:
oct. 07 17:06:09 HPxw9400 kernel: Free swap  = 0kB
oct. 07 17:06:09 HPxw9400 kernel: Total swap = 8402908kB
 which is crazy. Immediately followed by:
oct. 07 17:06:11 HPxw9400 kernel: Out of memory: Killed process 3477 (montage) total-vm:53805684kB, anon-rss:17697676kB, file-rss:0kB, shmem-rss:11716372kB, UID:1000 pgtables:71336kB oom_score_adj:0
oct. 07 17:08:23 HPxw9400 kernel: nouveau 0000:18:00.0: DRM: core notifier timeout
 the last repeated many times. Which supports the idea that nouveau itself is suffering from lack of memory.

Your investigations point to a memory usage problem, which you are pursuing. Await your further conclusions.
Comment 23 kalagani kalagani 2020-10-09 12:30:33 CEST
Hi,

> Thank you for all your dogged research.
I like to understand if it can help the team...and I'm learning!

> Going back to comment 14:
> > On the PC without GUI, neither Ctrl+Alt+F2 or Ctrl+Alt+Bksp (once or twice)
> > were operating.
> Not being able to get to a virtual console (Ctrl+Alt+F2-7) or re-start X
> (Ctrl+Alt+Bksp ... twice) implies to me a problem deeper than the graphics.
Maybe, but then why did I not encounter these problems under MGA5?
(described in Description and Comment 12)

> * Can you say, from comment 0:
> > Nevertheless the PC remains accessible via ssh. Everything seems normal
> whether if you re-start X via SSH, the GUI + login screen re-appears?
Over ssh, I launched startx as user and as root: both failed!

> From comment 15:
> > But today, same conditions, the "core notifier timeout" occurs but without
> > loss of GUI!!! So, I write from the PC not with ssh!!!
> Again this points away from the video driver messages - rather, their cause
> - being the source of the problem. Perhaps it too is suffering from lack of
> memory, and these messages are another symptom of the memory problems you
> noted subsequently.
Same remark as above: under MGA5 with KDE and the Nvidia driver, I had none of the problems encountered under MGA7 with Plasma and the Nvidia driver, which were bad enough that I switched to XFCE and the Nouveau driver, only to run into this loss of GUI!

> * Can you say with confidence that the loss of the GUI definitely follows
> the use of (a) particular application(s), memory-hungry, notably VBox?
My test conditions are lightdm/XFCE/Nouveau with Thunderbird, Firefox, 1 or 2 terminals and VirtualBox launched. I am not able to confirm whether my impression is right that VirtualBox is the cause, or whether it is something else.

> In the last attachment (I think we have enough journals now, thank you) from
> comment 20, is:
> oct. 07 17:06:09 HPxw9400 kernel: Free swap  = 0kB
> oct. 07 17:06:09 HPxw9400 kernel: Total swap = 8402908kB
>  which is crazy. Immediately followed by:
> oct. 07 17:06:11 HPxw9400 kernel: Out of memory: Killed process 3477
> (montage) total-vm:53805684kB, anon-rss:17697676kB, file-rss:0kB,
> shmem-rss:11716372kB, UID:1000 pgtables:71336kB oom_score_adj:0
> oct. 07 17:08:23 HPxw9400 kernel: nouveau 0000:18:00.0: DRM: core notifier
> timeout
>  the last repeated many times. Which supports the idea that nouveau itself
> is suffering from lack of memory.
> 
> Your investigations point to a memory usage problem, which you are pursuing.
> Await your further conclusions.
The novelty is that with my script involving montage (Imagemagick) I am able to reproduce an oom-killer call despite 32+8GB.
Under MGA5 and MGA7 I also used this same script but not in the condition that generated the Out of memory.
A grep on journalctl (since the MGA7 installation) shows only these 2 calls to oom-killer
journalctl |grep oom-killer
oct. 07 17:06:02 HPxw9400 kernel: ifplugd invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
oct. 07 20:04:26 HPxw9400 kernel: gpm invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
Thus, the GUI losses encountered did not involve this particular case.
Nevertheless now I launch top to monitor in real time the memory consumption.

Yesterday (without script in use) no problems
even when the VirtualBoxVM  PID is 100% CPU (the second top at 22:10)
top - 15:27:19 up  6:03,  1 user,  load average: 1,35, 1,06, 1,02
Tâches: 221 total,   1 en cours, 220 en veille,   0 arrêté,   0 zombie
%Cpu0  : 10,6 ut, 11,6 sy,  0,0 ni, 76,8 id,  1,0 wa,  0,0 hi,  0,0 si,  0,0 st
%Cpu1  :  8,8 ut,  5,4 sy,  0,0 ni, 84,1 id,  1,7 wa,  0,0 hi,  0,0 si,  0,0 st
%Cpu2  :  7,9 ut, 19,1 sy,  0,0 ni, 72,3 id,  0,7 wa,  0,0 hi,  0,0 si,  0,0 st
%Cpu3  :  7,4 ut,  4,7 sy,  0,0 ni, 85,6 id,  2,3 wa,  0,0 hi,  0,0 si,  0,0 st
MiB Mem : 17,7/32165,1  [||||||||||||                                                       ]
MiB Éch :  0,0/8206,0   [                                                                   ]

  PID UTIL.     PR  NI    VIRT    RES    SHR S  %CPU  %MEM    TEMPS+ COM.                     
26260 patrick   20   0 5041728   2,3g   2,2g S  24,2   7,4 108:54.64 VirtualBoxVM             

free -h
              total       utilisé      libre     partagé tamp/cache   disponible
Mem:           31Gi       4,9Gi        16Gi       240Mi        10Gi        25Gi
Swap:         8,0Gi          0B       8,0Gi

top - 22:10:28 up 12:46,  1 user,  load average: 0,74, 0,59, 0,57
Tâches: 225 total,   2 en cours, 223 en veille,   0 arrêté,   0 zombie
%Cpu0  :  4,7 ut,  2,7 sy,  0,0 ni, 91,7 id,  1,0 wa,  0,0 hi,  0,0 si,  0,0 st
%Cpu1  :  4,4 ut,  3,1 sy,  0,0 ni, 89,5 id,  3,1 wa,  0,0 hi,  0,0 si,  0,0 st
%Cpu2  :  2,3 ut, 97,7 sy,  0,0 ni,  0,0 id,  0,0 wa,  0,0 hi,  0,0 si,  0,0 st
%Cpu3  : 10,0 ut,  3,0 sy,  0,0 ni, 83,6 id,  3,3 wa,  0,0 hi,  0,0 si,  0,0 st
MiB Mem : 18,5/32165,1  [||||||||||||                                                       ]
MiB Éch :  0,0/8206,0   [                                                                   ]

  PID UTIL.     PR  NI    VIRT    RES    SHR S  %CPU  %MEM    TEMPS+ COM.                     
26260 patrick   20   0 5053840   2,4g   2,2g S 104,0   7,5 225:48.23 VirtualBoxVM             
2

free -h
              total       utilisé      libre     partagé tamp/cache   disponible
Mem:           31Gi       5,1Gi        14Gi       217Mi        11Gi        25Gi
Swap:          8,0Gi          0B       8,0Gi
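
To keep a record of swap usage without watching top, the percentage can be computed from /proc/meminfo and logged periodically. The sketch below is an illustration only: swap_used_percent is a made-up name, and it relies on the SwapTotal and SwapFree fields (in kB) that the kernel exposes there.

```shell
# Read /proc/meminfo-style text on stdin and print the percentage of swap
# in use (truncated to an integer), or "no swap" if SwapTotal is 0.
swap_used_percent() {
    LC_ALL=C awk '
        $1 == "SwapTotal:" { total = $2 }
        $1 == "SwapFree:"  { free = $2 }
        END {
            if (total > 0) printf "%d\n", (total - free) * 100 / total
            else print "no swap"
        }
    '
}
```

A possible logging loop, one line per minute:
while sleep 60; do printf '%s swap used: %s%%\n' "$(date '+%F %T')" "$(swap_used_percent < /proc/meminfo)"; done >> swap.log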
Comment 24 Lewis Smith 2020-10-09 22:10:48 CEST
Thank you for this further information.

> The novelty is that with my script involving montage (Imagemagick)
> I am able to reproduce an oom-killer call despite 32+8GB.
> PS: montage is a tool from Imagemagick, I use it in the script to
> make a "contact sheet" (planche contact in french) of a photo directory.
> Yesterday (without script in use) no problems
> even when the VirtualBoxVM  PID is 100% CPU
I am wondering now about the 'montage' script; it looks as if VBox is in the clear. Imagemagick has certainly changed between MGA5 & 7.
* Can you please attach the script?
* Is it operating on the same image collection as under MGA5?
* Has the number of images dealt with by the script grown significantly?

Your use of 'top' was a good idea. Could you try it while running the script to see how the 'MiB Mem' (libre, utilisée) evolves - whether it stays more-or-less the same, or grows?
Also keep an eye on the %MEM for the script process(es) line(s) (Imagemagick ?)
---
Did you ever try the kernel suggestions from comment 11?
Comment 25 kalagani kalagani 2020-10-11 12:43:45 CEST
Hello,
> I am wondering now about the 'montage' script; it looks as if VBox is in the
> clear. Imagemagick has certainly changed between MGA5 & 7.
imagemagick-desktop-6.9.5.2-1.mga5
imagemagick-7.0.8.62-1.mga7.tainted

> * Can you please attach the script?
yes and 2 include files

> * Is it operating on the same image collection as under MGA5?
yes

> * Has the number of images dealt with by the script grown significantly?
no
By default the script recursively scans a "Photos" directory.
If a subdirectory is new or has been modified,
an index file is created (involving convert and montage, both from ImageMagick).
Also by default, directories that have nothing to do with pictures are excluded.
Nevertheless I can re-include some of these excluded directories, and this is what happened when oom-killer was called.
The re-included directory (Thunderbird's cache) contained 1839 images (cf. Comment 19),
whereas the biggest of the real "Photos" directories contains 402!
> 
> Your use of 'top' was a good idea. Could you try it while running the script
> to see how the 'MiB Mem' (libre, utilisée) evolves - whether it stays
> more-or-less the same, or grows?
> Also keep an eye on the %MEM for the script process(es) line(s) (Imagemagick
> ?)
To do this, I deleted all the index files in the "Photos" directory to force a complete re-creation, which lasted 258 min (4 h 18 min).
Result: even with VirtualBoxVM occasionally at 100% CPU,
the top bar graph never exceeded 23.9% / 9 bars for RAM,
while the swap always remained at 0!!!!
The PC never froze or lost its GUI, and journalctl does not show any core notifier timeout error.
> Did you ever try the kernel suggestions from comment 11?
Yes, I forgot to mention it because I don't really see the connection with alarms arriving hours after the launch...
The change (vga removed, GRUB_GFXMODE=1900x1200 and
GRUB_GFXPAYLOAD_LINUX=keep set)
has been in place since
oct. 06 09:12:37 kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-5.7.19-desktop-1.mga7 root=UUID=8ebeda33-9710-49a7-93a4-809d13e2809f ro splash quiet noiswmd resume=UUID=2fdb0460-b082-496e-b748-e904514ac886 audit=0

URL: https://www.mageialinux-online.org/forum/topic-27878-5+freeze-mga7.php#m274291 => https://www.dailymotion.com/video/x7wrcmv

Comment 26 kalagani kalagani 2020-10-11 12:50:29 CEST
Created attachment 11924 [details]
creeIndexPhotos.bash

My script to create a "contact sheet" for each photo directory
Comment 27 kalagani kalagani 2020-10-11 12:54:51 CEST
Created attachment 11925 [details]
communFunctions.include

File embedding functions shared by several scripts
Comment 28 kalagani kalagani 2020-10-11 12:55:55 CEST
Created attachment 11926 [details]
communDefine.include

File embedding define shared by several scripts
Comment 29 kalagani kalagani 2020-10-11 13:05:01 CEST
Created attachment 11927 [details]
example of contact sheet obtained with the script creeIndexPhotos
Comment 30 kalagani kalagani 2020-10-11 19:46:24 CEST
Hi,
in Comment 25 no problem was detected...
But the next day (Saturday), under the same conditions plus digikam running, I used my script several times (involving convert and montage).
At the end of the evening, with no problem so far, I noticed a slowdown when switching from one window to another (the script was not running). A quick look at journalctl showed the now well-known "core notifier timeout" errors.
Just as I had monitored memory with top, I then had the idea of monitoring the log with journalctl -f.
And to my surprise I was able to trigger "core notifier timeout" simply by clicking on any task in the taskbar!

So much so that I took a video of it
https://www.dailymotion.com/video/x7wrcmv
(also in URL at beginning)
Note how the clicks of the cursor at the bottom coincide with the errors appearing in the left terminal.
On the right you can also see the top terminal, which does not show anything special...
I didn't try to flood the system with "core notifier timeout" errors to see whether I would lose the GUI... disappointed, I went to bed...
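Watching the log live, as described above, can be narrowed to the timeout lines only. A minimal sketch follows; the function name number_timeouts and the "#N" prefix are hypothetical choices, and the match string is the message text quoted in this report.

```sh
# Number each "core notifier timeout" line read from standard input.
# The function name and the output layout are illustrative.
number_timeouts() {
    awk '/DRM: core notifier timeout/ { n++; printf "#%d %s\n", n, $0; fflush() }'
}
```

Fed with `journalctl -kf --no-hostname | number_timeouts`, each click that triggers the error would show up as a new numbered line, which makes the coincidence with GUI actions easier to see.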
Comment 31 Lewis Smith 2020-10-11 22:59:04 CEST
Sooner than me!
Once again, thank you for your painstaking research, and this intelligent testing & report - which does point the finger at X. In the light of which, I change tack and ask:

The next time you get a loss of GUI - but still have remote access - please do as quickly as possible (and post or attach the ps & top outputs):
$ ps -e
$ top
# journalctl -b --no-hostname > a_file
Confirm that Ctrl/Alt/F2-6, or Ctrl/Alt/Bksp/Bksp, has no effect on the crippled box.

Edit the saved journal file to note the exact point (as best as you can judge) when you lost the GUI. Then compress it with xz, and attach that, saying what you were doing at the time (applications, scripts). It is often difficult to pin down events in journals extracted retrospectively.

Also please compress & attach /var/log/Xorg.0.log

Regret asking similar things again. Trying to document it *when it happens*.
Aurelien is following this. Do you have any other ideas?
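The capture steps requested above could be wrapped in one function so they can be run quickly over ssh. This is only a sketch: the function name capture_state and the file names are hypothetical, and top is run in batch mode (-b -n1) so that its output can be saved to a file.

```sh
# Save a snapshot of system state into one directory for later attachment.
# capture_state and the file names are illustrative; run it as root over ssh
# so that journalctl can read the whole journal.
capture_state() {
    dir=${1:-gui-loss-$(date +%Y%m%d-%H%M%S)}
    mkdir -p "$dir" || return 1
    date > "$dir/captured_at.txt"
    # "|| :" keeps going if one tool is missing or fails
    ps -e > "$dir/ps.txt" 2>&1 || :
    top -b -n1 > "$dir/top.txt" 2>&1 || :
    journalctl -b --no-hostname > "$dir/journal.txt" 2>&1 || :
    cp /var/log/Xorg.0.log "$dir/" 2>/dev/null || :
    # Compress the large text files if xz is available
    if command -v xz >/dev/null 2>&1; then
        xz -f "$dir/journal.txt" "$dir/Xorg.0.log" 2>/dev/null || :
    fi
    echo "$dir"
}
```

Calling `capture_state` with no argument creates a timestamped directory in the current directory and prints its name; the journal would still need to be annotated by hand with the point where the GUI was lost.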
Comment 32 Aurelien Oudelet 2020-10-11 23:15:48 CEST
I have read all of this, and it makes me feel there is a memory leak somewhere, perhaps in the Nouveau driver, perhaps in the script.
It seems the OP lost GUI control by running out of memory.
Comment 33 kalagani kalagani 2020-10-12 22:28:46 CEST
Hi, sorry, I am writing this live... reporting a segfault in libgobject-2

All day long I "played" with my script, including rerunning the exceptional cases that involve oom-killer, which did get called.
Unlike the 2 "exceptional" cases already mentioned, this time there was no "core notifier timeout" error and no freeze.

I ran the script one last time, again in the exceptional case but with a memory limit in the montage command (montage -limit memory 16GiB...),
so oom-killer was not called.
...............................19:37
1054 2020-10-12 19:37:26 : time /home/patrick/Photos/scriptPhoto/creeIndexPhotos.bash

...............................20:01
turn off the screen, go to eat, 
2020-10-12 20:01:29 : h

...............................20:39
turn on the screen, log in and look at the result of the script
2020-10-12 20:39:58 : ls -rtl
2020-10-12 20:40:10 : more entries_000_montage.error
which contains these errors:
montage: unable to write pixel cache '/tmp/magick-12060i3U61rLnnOpK': No space left on device @ error/cache.c/WritePixelCachePixels/5830.
montage: unable to extend cache 'entries_000_INDEX_2070.jpg': No space left on device @ error/cache.c/OpenPixelCache/3888.
montage: unable to extend cache 'entries_000_INDEX_2070.jpg': No space left on device @ error/cache.c/OpenPixelCache/3888.
montage: Maximum supported image dimension is 65500 pixels `entries_000_INDEX_2070.jpg' @ error/jpeg.c/JPEGErrorHandler/343.
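The last error above refers to a 65500-pixel limit on JPEG dimensions. Whether a contact sheet stays within it is simple arithmetic: with COLS tiles per row, the sheet is ceil(COUNT / COLS) rows high. The sketch below uses the image counts mentioned in this thread (402 and 1839), but the 5 columns and the 200x200 tile size are purely hypothetical, since the real values depend on the options used by the script.

```sh
# Check whether a contact sheet of COUNT tiles laid out COLS per row would fit
# within the 65500-pixel JPEG dimension limit quoted in the error above.
# Prints WIDTHxHEIGHT and returns success only if both dimensions fit.
# Tile size and column count are hypothetical inputs.
sheet_fits_jpeg() {
    count=$1 cols=$2 tile_w=$3 tile_h=$4
    rows=$(( (count + cols - 1) / cols ))   # integer ceiling of count / cols
    width=$(( cols * tile_w ))
    height=$(( rows * tile_h ))
    echo "${width}x${height}"
    [ "$width" -le 65500 ] && [ "$height" -le 65500 ]
}
```

Under these assumed values, 402 images give 81 rows (1000x16200, within the limit), while 1839 images give 368 rows (1000x73600, over the limit), so a very large directory would need more columns, smaller tiles, or several sheets.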

and I discover a 1st "core notifier timeout".
Oct. 12 20:12:01 HPxw9400 kernel: nouveau 0000:18:00.0: DRM: core notifier timeout
The script ended at
ll entries_000_montage.error
-rwxr-xr-x 1 patrick patrick 575 oct. 12 20:19 entries_000_montage.error*
This means that this 1st "core notifier timeout" occurred while the script was running.
A cause-and-effect relationship?

From there, I went back to triggering this type of error by clicking on the icons in the taskbar (see previous video), but not only that: leaving emacs or manipulating Thunderbird, Firefox, Thunar and the Mageia Control Center (MCC) also produces it,
...with a segfault for the last one...
Oct. 12 21:20:05 HPxw9400 kernel: nouveau 0000:18:00.0: DRM: core notifier timeout
Oct. 12 21:20:14 HPxw9400 drakrpm [8564]: ### Program is exiting ###
Oct. 12 21:20:18 HPxw9400 drakrpm-update [32201]: ### Program is starting ###
Oct. 12 21:20:20 HPxw9400 kernel: nouveau 0000:18:00.0: DRM: core notifier timeout
Oct. 12 21:20:23 HPxw9400 kernel: nouveau 0000:18:00.0: DRM: core notifier timeout
Oct. 12 21:20:25 HPxw9400 drakrpm-update [32201]: opening the RPM database
Oct. 12 21:20:25 HPxw9400 kernel: nouveau 0000:18:00.0: DRM: core notifier timeout
Oct. 12 21:20:34 HPxw9400 kernel: nouveau 0000:18:00.0: DRM: core notifier timeout
Oct. 12 21:20:34 HPxw9400 drakrpm-update [32201]: opening the RPM database
Oct. 12 21:20:34 HPxw9400 drakrpm-update [32201]: opening the RPM database
Oct. 12 21:20:34 HPxw9400 drakrpm-update [32201]: opening the RPM database
Oct. 12 21:20:37 HPxw9400 sensord [1348]: Chip: k8temp-pci-00cb
Oct. 12 21:20:37 HPxw9400 sensord [1348]: Adapter: PCI adapter
Oct. 12 21:20:37 HPxw9400 sensord [1348]:   Core0 Temp: 40.0 C
Oct. 12 21:20:37 HPxw9400 sensord [1348]:   Core1 Temp: 41.0 C
Oct. 12 21:20:37 HPxw9400 sensord [1348]: Chip: nouveau-pci-1800
Oct. 12 21:20:37 HPxw9400 sensord [1348]: Adapter: PCI adapter
Oct. 12 21:20:37 HPxw9400 sensord[1348]: temp1: 42.0 C (limit = 95.0 C, hysteresis = 3.0 C)
Oct. 12 21:20:37 HPxw9400 sensord [1348]: Chip: k8temp-pci-00c3
Oct. 12 21:20:37 HPxw9400 sensord [1348]: Adapter: PCI adapter
Oct. 12 21:20:37 HPxw9400 sensord [1348]:   Core0 Temp: 40.0 C
Oct. 12 21:20:37 HPxw9400 sensord [1348]:   Core1 Temp: 40.0 C
Oct. 12 21:20:56 HPxw9400 kernel: nouveau 0000:18:00.0: DRM: core notifier timeout
Oct. 12 21:21:01 HPxw9400 kernel: nouveau 0000:18:00.0: DRM: core notifier timeout
Oct. 12 21:21:01 HPxw9400 drakrpm-update [32201]: ### Program is exiting ###
Oct. 12 21:21:39 HPxw9400 drakconf[6353]: modified file /etc/mcc.conf
Oct. 12 21:21:39 HPxw9400 drakconf[6353]: ### Program is exiting ###
Oct. 12 21:21:39 HPxw9400 kernel: drakconf[6353]: segfault at 51 ip 00007ff97a770130 sp 00007ffee7c6d300 error 4 in libgobject-2.0.so.0.6000.2[7ff97a758000+31000].
Oct. 12 21:21:39 HPxw9400 kernel: Code: 89 44 24 5c 41 f6 44 24 18 10 74 56 89 c1 48 8b 05 3d 45 03 00 48 85 c0 74 48 48 89 da eb 0b 0f 1f 00 48 8b 00 48 85 c0 74 38 <48> 3b 50 08 75 f2 3b 48 10 75 ed 8b 74 24 58 3b 70 14 75 e4 c7 40
Oct. 12 21:21:41 HPxw9400 kernel: nouveau 0000:18:00.0: DRM: core notifier timeout

Switching windows or workspaces is slow, but there is no loss of GUI.
top does not show anything abnormal in RAM or swap,
and neither does Xorg.0.log.
Comment 34 kalagani kalagani 2020-10-12 23:11:29 CEST
Created attachment 11931 [details]
journalctl --no-hostame -b
Comment 35 kalagani kalagani 2020-10-12 23:13:32 CEST
Created attachment 11932 [details]
Xorg.0.log 2020-10-12
Comment 36 kalagani kalagani 2020-10-13 11:26:33 CEST
Hello,
after writing Comment 33 I shut down the PC, not as usual with the "Eteindre" (Shut down) button in the top right corner, but by logging out with the "Mageia->Déconnexion" (Log out) button on the left...
and then I lost the GUI!
I tried the:
> Confirm that Ctrl/Alt/F2-6, or Ctrl/Alt/Bksp/Bksp, has no effect on the crippled box.
I confirm: no effect!
Via ssh I didn't see anything abnormal, except that there was another Xorg log... is that normal?
I restarted lightdm.service without getting the GUI back.
So, as root, I shut the PC down with shutdown -h now...

And this morning, looking at the journal, I discovered "X11 connection broke" errors and Nouveau errors mixed with Xorg.

> Edit the saved journal file to note the exact point (as best as you can judge)
I marked this with POINT I JUDGE TO LOSS GUI in the newly attached journal; just after it you can see
oct. 12 23:53:09 kdeinit5[29214]: kdeinit5: Fatal IO error: client killed
oct. 12 23:53:10 at-spi-bus-launcher[9974]: X connection to :0 broken (explicit kill or server shutdown).
oct. 12 23:53:09 klauncher[29215]: The X11 connection broke (error 1). Did the X11 server die?
oct. 12 23:53:09 kactivitymanagerd[29186]: The X11 connection broke (error 1). Did the X11 server die?
oct. 12 23:53:10 kglobalaccel5[29195]: The X11 connection broke (error 1). Did the X11 server die?
oct. 12 23:53:10 unknown[9991]: mate-screensaver: Fatal IO error 11 (Ressource temporairement non disponible) on X server :0.

Then the mixed errors, such as
oct. 13 00:26:10 kernel: nouveau 0000:18:00.0: Xorg[3305]: failed to idle channel 3 [Xorg[3305]]
occur after the remote ssh login.

Also, below are the commands typed over ssh (ending with su for the shutdown)
  968  2020-10-13 00:04:47 : top
  969  2020-10-13 00:05:10 : more /var/log/Xorg.0.log
  970  2020-10-13 00:06:31 : ll /var/log/Xorg.0.log
  971  2020-10-13 00:06:49 : more /var/log/Xorg.0.log.old 
  972  2020-10-13 00:07:20 : cd memos
  973  2020-10-13 00:07:26 : cd pbInstallMageia7/
  974  2020-10-13 00:07:54 : ls -rtl
  975  2020-10-13 00:09:15 : more /var/log/Xorg.0.log > Xorg.0.log.deconnect2020-10-12.log
  976  2020-10-13 00:12:50 : Ctrl/Alt/Bksp/Bksp
  977  2020-10-13 00:13:04 : systemctl status
  978  2020-10-13 00:15:38 : systemctl is-system-running
  979  2020-10-13 00:16:04 : systemctl |grep -i failed
  980  2020-10-13 00:16:41 : systemctl status network.service
  981  2020-10-13 00:17:30 : journalctl -no-hostname -b
  982  2020-10-13 00:17:38 : journalctl --no-hostname -b
  983  2020-10-13 00:19:30 : ll
  984  2020-10-13 00:20:00 : ls -rtl
  985  2020-10-13 00:20:19 : journalctl --no-hostname -xe
  986  2020-10-13 00:21:30 : ls -rtl
  987  2020-10-13 00:22:08 : journalctl --no-hostname -b > journalctl-b.deconnect2020-10-12.log
  988  2020-10-13 00:22:27 : journalctl --no-hostname -b
  989  2020-10-13 00:23:19 : systemctl status
  990  2020-10-13 00:23:50 : systemctl status |grep -i xorg
  991  2020-10-13 00:25:12 : systemctl status lightdm.service
  992  2020-10-13 00:25:42 : systemctl restart lightdm.service
  993  2020-10-13 00:29:08 : systemctl status lightdm.service
  994  2020-10-13 00:29:46 : systemctl status network.service
  995  2020-10-13 00:30:53 : systemctl restart network.service
  996  2020-10-13 00:31:16 : journalctl -xe
  997  2020-10-13 00:32:01 : systemctl status network.service
  998  2020-10-13 00:32:28 : top
  999  2020-10-13 00:34:06 : ps -e
 1000  2020-10-13 00:35:02 : su -
Commands 992, 995 and 1000 are marked in the journal with COMMAND FROM SSH
Comment 37 kalagani kalagani 2020-10-13 11:30:50 CEST
Created attachment 11933 [details]
journalctl --no-hostame -b -1 with loss GUI

Modified with marking:
more journalctl-b1lossGUI2020-10-12.log | grep -E "POINT|COMMAND "
POINT I JUDGE TO LOSS GUI
COMMAND FROM SSH REMOTE LOGIN
COMMAND FROM SSH with sudo= 992  2020-10-13 00:25:42 : systemctl restart lightdm.service
COMMAND FROM SSH with sudo= 995  2020-10-13 00:30:53 : systemctl restart network.service
COMMAND FROM SSH with su 1000  2020-10-13 00:35:02 : su -
Comment 38 Lewis Smith 2020-10-13 16:04:49 CEST
This was written *before* c36
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Thank you for the last report and journal/log attachments. The journal particularly has things of interest.

[The video said it was private, so could not view it]
[Quickly: "drakconf[6353]: segfault" when it exits is regrettably common...]

CC'ing Dave Hodgins in case he has any ideas.
Similarly the kernel/drivers team.
For the last journal attached, there are 2 out-of-memory (oom) in 15.54. For the rest, start at 20.00.

I would be tempted to constantly monitor memory usage - when the programs of interest are running (which the script caters for, so you can leave it running all the time: it only outputs when an interesting program is running) - with a script along the lines:-
----------------------------------------------------------
#!/bin/sh
# Script to monitor defined processes at defined interval
# To save the output, use $ ./scriptname | tee logfile

# Define here programs of interest as shown by 'ps' with leading space
# This is to eliminate spurious other hits in the 'ps' output
PROGS=' <prog1>| <prog2>| <etc>'

clear
while true
do
# Only log when something is running we are interested in
# grep's own entry + argument always shows, so cut it out
 if ps ax | grep -v grep | grep -E "$PROGS" > /dev/null
 then
  echo
  date
  ps ax | grep -v grep | grep -E "$PROGS"
# Limit top O/P (via head), otherwise it lists all processes
# You can adjust the 'head' parameter to give an exact screenful
  echo
  top -b -n1 | head -n19
 fi
# Set the interval here, seconds
 sleep 60
done
-------------------------------------------------------------
I have tried it, it works. Run on a spare terminal|virtual console as:
 $ ./scriptname | tee logfile
to see & preserve the output.

One can wonder whether this obsession with memory is relevant. Could it be *video memory* which was mentioned in other threads on Nouveau+nVidia ?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
P.S. Bug 27319 is about Xserver crashing, not the same thing I know, but it produced a patched x11-server-xorg currently in core/updates_testing pending release: x11-server-xorg-1.20.9-1.1.mga7
See https://bugs.mageia.org/show_bug.cgi?id=27319#c11 onwards.
This includes several patches, so is worth trying even if it makes no difference. It should be out soon.

CC: (none) => davidwhodgins, kernel

Comment 39 kalagani kalagani 2020-10-13 16:12:08 CEST
Created attachment 11934 [details]
Snapshot XFCE logout session involving: "The X11 connection broke (error 1). Did the X11 server die?"

Snapshot XFCE showing:
The PC is shut down not directly with the "Shutdown" button, but by first logging the user out and then choosing "Shutdown".
This is because shutting down directly with the "Shutdown" button does not save the current session for the next start, whereas the second method does!
Comment 40 kalagani kalagani 2020-10-13 16:18:10 CEST
Hi, for information
Out of curiosity I turned off the PC as in the screenshot Comment 39,
this time no GUI loss but the same X11 error messages:

oct. 13 13:17:02 lightdm[4335]: Error opening audit socket: Protocol not supported
oct. 13 13:17:02 lightdm[6376]: pam_unix(lightdm:session): session closed for user patrick
oct. 13 13:17:02 systemd-logind[1093]: Session 2 logged out. Waiting for processes to exit.
oct. 13 13:17:02 lightdm[6376]: pam_kwallet5(lightdm:session): pam_kwallet5: pam_sm_close_session
oct. 13 13:17:02 lightdm[6376]: pam_kwallet(lightdm:session): pam_kwallet: pam_sm_close_session
oct. 13 13:17:02 lightdm[6376]: pam_kwallet5(lightdm:setcred): pam_kwallet5: pam_sm_setcred
oct. 13 13:17:02 lightdm[6376]: pam_kwallet(lightdm:setcred): pam_kwallet: pam_sm_setcred
oct. 13 13:17:02 at-spi-bus-launcher[27178]: X connection to :0 broken (explicit kill or server shutdown).
oct. 13 13:17:02 unknown[27195]: mate-screensaver: Fatal IO error 11 (Ressource temporairement non disponible) on X server :0.
oct. 13 13:17:02 polkitd[1434]: Unregistered Authentication Agent for unix-session:2 (system bus name :1.74, object path /org/mate/Pol>
oct. 13 13:17:02 klauncher[7607]: The X11 connection broke (error 1). Did the X11 server die?
oct. 13 13:17:02 kglobalaccel5[7602]: The X11 connection broke (error 1). Did the X11 server die?
oct. 13 13:17:02 kactivitymanagerd[7594]: The X11 connection broke (error 1). Did the X11 server die?
oct. 13 13:17:02 kdeinit5[7606]: kdeinit5: Fatal IO error: client killed
oct. 13 13:17:02 acpid[1069]: client 4345[0:0] has disconnected
oct. 13 13:17:02 acpid[1069]: client connected from 19272[0:0]
Comment 41 Aurelien Oudelet 2020-10-13 16:24:40 CEST
Hi,

No, you use XFCE but with some Plasma5 services. When XFCE starts its logout routine, it does not handle the Plasma5 services, and when the X11 server dies, they complain.

Lightdm seems to handle the logout correctly and waits for the remaining processes to quit. See second journal line in comment 40.
Comment 42 kalagani kalagani 2020-10-13 17:00:44 CEST
Hello Aurelien,

> No, you use XFCE but with some Plasma5 services. When XFCE starts its logout
> routine, it does not handle the Plasma5 services, and when the X11 server dies,
> they complain.
> 
> Lightdm seems to handle the logout correctly and waits for the remaining
> processes to quit. See second journal line in comment 40.
I understand. Until now I had never shut down the PC like this; it was always with the "Eteindre" (Shut down) button.
So, when I found the PC without a GUI, it was after a long period of inactivity; normally the login box should have been presented to me, but it was not: just a gray screen, which is what I call GUI loss.

So, having changed the shutdown method, I wondered whether this had revealed a hidden problem with shutting down via the "Shutdown" button.
Comment 43 kalagani kalagani 2020-10-13 17:50:29 CEST
Hi,
 
> [The video said it was private, so could not view it]
sorry, now, it is public...

> For the last journal attached, there are 2 out-of-memory (oom) in 15.54. For
> the rest, start at 20.00.
Yes, but there is only one call to oom-killer at 15:54, and none at 20:00!
I knew there was going to be an out-of-memory condition because I had set up my script for it: working on a directory of ~2400 photos when normally it doesn't exceed 402.
Before this oom there was no core notifier timeout, and the first one happened about 4 h later:
oct. 12 20:12:01 kernel: nouveau 0000:18:00.0: DRM: core notifier timeout
However, my script had been launched again (without the oom condition) and was running at that moment, as I described in Comment 33.

> I would be tempted to constantly monitor memory usage - when the programs of
> interest are running (which the script caters for, so you can leave it
> running all the time: it only outputs when an interesting program is
> running) - with a script along the lines:-
Thanks for your script!
Since it is about PIDs and top: I have a top.log file (top > top.log) related to the same trace, but refused because it's over 1000k, I can truncate the not useful beginning to attach it.

> One can wonder whether this obsession with memory is relevant. Could it be
> *video memory* which was mentioned in other threads on Nouveau+nVidia ?
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
I think the same as you.
Before this problem with Nouveau/XFCE, I was on Nvidia/Plasma, but it was worse, hence my switch to Nouveau/XFCE, which is less bad! See Comment 12

> P.S. Bug 27319 is about Xserver crashing, not the same thing I know, but it
> produced a patched x11-server-xorg currently in core/updates_testing pending
> release: x11-server-xorg-1.20.9-1.1.mga7
> See https://bugs.mageia.org/show_bug.cgi?id=27319#c11 onwards.
> This includes several patches, so is worth trying even if it makes no
> difference. It should be out soon.
I read it, currently I am using x11-server-xorg-1.20.9-1.mga7
Comment 44 Aurelien Oudelet 2020-10-16 17:34:37 CEST
An update to x11-server is on its way.
Feel free to add remarks here.

Assigning this to Kernel and Drivers Maintainers.
I think we have done our job in trying to narrow down the issue.

Trying to sum up:
This system with M5 was OK.
M6 was skipped.
M7 has issues:
- Plasma + nvidia nonfree => instabilities.
- Plasma + Nouveau driver => no freeze but sometimes, GUI is lost.
- XFCE + Nouveau => idem

Graphic card is a NVIDIA G80GL [Quadro FX 5600].

We know that Nvidia cards have several issues with Linux: no good documentation, closed-source drivers.
What I don't understand is: if M5 ran OK, which driver was used?
If it was Nvidia nonfree, which version was it?

Keywords: NEEDINFO => (none)
Assignee: bugsquad => kernel
CC: kernel => (none)

Comment 45 kalagani kalagani 2020-10-16 19:00:39 CEST
Hello Aurelien,
I take the liberty of correcting your summary:

> Trying to sum up:
> This system with M5 was OK -> with KDE + nvidia nonfree, no problems
> M6 was skipped.
> M7 has issues:
> - Plasma + nvidia nonfree => instabilities. -> also freezes
> - Plasma + Nouveau driver => not tested because unusable due to too many artifacts, see Comment 12 for screenshots
> - XFCE + Nouveau => no freeze but sometimes, GUI is lost.
Comment 46 Lewis Smith 2020-10-18 21:21:25 CEST
(In reply to kalagani kalagani from comment #43)
> I have a top.log file (top > top.log) related
> to the same trace, but refused because it's over 1000k, I can truncate the
> not useful beginning to attach it.
Not only truncate it, but if you do attach it:
- *annotate it* with interleaved comments about what is happening. It is very difficult relating attached journals etc to the events that matter at the time.
- then use 'xz' to compress it:
 $ xz <filename> 
and upload the compressed file which ends in .xz
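Truncating the uninteresting beginning of a top log before compressing it can be done by timestamp. This sketch assumes the log was produced by top in batch mode (top -b), so that each snapshot starts with a plain "top - HH:MM:SS" header line, and that the log does not cross midnight; the function name trim_top_log is hypothetical.

```sh
# Keep only the snapshots of a batch-mode top log taken at or after HH:MM:SS.
# Relies on each snapshot starting with a "top - HH:MM:SS" header line.
# trim_top_log is an illustrative name.
trim_top_log() {
    awk -v start="$1" '
        /^top - [0-9][0-9]:[0-9][0-9]:[0-9][0-9]/ { keep = ($3 >= start) }
        keep
    ' "$2"
}
```

For example, `trim_top_log 19:00:00 top.log > top-trimmed.log` followed by `xz top-trimmed.log` would leave a smaller top-trimmed.log.xz to annotate and upload.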
Comment 47 kalagani kalagani 2020-10-19 18:44:30 CEST
Hello,
> Not only truncate it, but if you do attach it:
> - *annotate it* with interleaved comments about what is happening. It is
> very difficult relating attached journals etc to the events that matter at
> the time.
Sorry, but I stopped the top log too early (it ends at top - 22:56:25)!
I just realized this by comparing its end time with the start time of my annotations in the journalctl (POINT I JUDGE TO LOSS GUI oct. 12 23:53:06), see attachment 11933 [details] in Comment 37.
So I think this top log is useless.
Comment 48 kalagani kalagani 2020-10-27 16:12:57 CET
Hi,
I upgraded from x11-server-xorg-1.20.9-1 to x11-server-xorg-1.20.9-1.1
and in /etc/default/grub, still without vga=791,
went back to
GRUB_GFXMODE=1024x768x32
GRUB_GFXPAYLOAD_LINUX=text
instead of
GRUB_GFXMODE=1900x1200
GRUB_GFXPAYLOAD_LINUX=keep

Result: core notifier timeouts again,
discovered (~19:13) through slowdowns when switching between windows, without loss of GUI.
I then launched top > top.log,
but the first notifier timeout is at 17:52 (see also the attached journalctl.log),
with none between that one and the discovery...
Comment 49 kalagani kalagani 2020-10-27 16:15:14 CET
Created attachment 11959 [details]
journalctl -b with new x11-server-xorg-1.20.9-1.1
Comment 50 kalagani kalagani 2020-10-27 16:19:50 CET
Created attachment 11960 [details]
top > top.log with new x11-server-xorg-1.20.9-1.1

top.log launched on discovering the slowdown when switching windows, but the first core notifier timeout error is earlier...
Comment 51 kalagani kalagani 2021-04-30 20:05:43 CEST
Hi,
for information, today still the same:
avril 30 19:27:27 kernel: nouveau 0000:18:00.0: DRM: core notifier timeout
avril 30 19:30:20 kernel: nouveau 0000:18:00.0: DRM: core notifier timeout
avril 30 19:37:44 kernel: nouveau 0000:18:00.0: DRM: core notifier timeout
avril 30 19:38:14 kernel: nouveau 0000:18:00.0: DRM: core notifier timeout
avril 30 19:38:47 kernel: nouveau 0000:18:00.0: DRM: core notifier timeout
avril 30 19:38:58 kernel: nouveau 0000:18:00.0: DRM: core notifier timeout
avril 30 19:39:02 kernel: nouveau 0000:18:00.0: DRM: core notifier timeout
avril 30 19:39:29 kernel: nouveau 0000:18:00.0: DRM: core notifier timeout
avril 30 19:39:31 kernel: nouveau 0000:18:00.0: DRM: core notifier timeout
avril 30 19:39:33 kernel: nouveau 0000:18:00.0: DRM: core notifier timeout
avril 30 19:39:37 kernel: nouveau 0000:18:00.0: DRM: core notifier timeout
avril 30 19:40:04 kernel: nouveau 0000:18:00.0: DRM: core notifier timeout
avril 30 19:40:09 kernel: nouveau 0000:18:00.0: DRM: core notifier timeout
avril 30 19:40:20 kernel: nouveau 0000:18:00.0: DRM: core notifier timeout
avril 30 19:40:24 kernel: nouveau 0000:18:00.0: DRM: core notifier timeout
avril 30 19:40:28 kernel: nouveau 0000:18:00.0: DRM: core notifier timeout

but also at other times (since I last wrote);
it seems to happen when I come out of the screen saver,
with or without VirtualBox running!

Current configuration:
uname -a
Linux HPxw9400 5.10.27-desktop-1.mga7 #1 SMP Wed Mar 31 00:16:43 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

rpm -qa |grep nouveau
x11-driver-video-nouveau-1.0.16-3.mga7
lib64drm_nouveau2-2.4.102-1.mga7

inxi -F
...
Graphics:  Device-1: NVIDIA G80GL [Quadro FX 5600] driver: nouveau v: kernel 
           Display: x11 server: Mageia X.org 1.20.11 driver: nouveau,v4l unloaded: fbdev,modesetting,vesa 
           resolution: 1920x1200~60Hz 
           OpenGL: renderer: NV50 v: 3.3 Mesa 20.2.3
...
kalagani kalagani 2021-04-30 20:06:48 CEST

CC: (none) => kalagani

Comment 52 Aurelien Oudelet 2021-07-06 13:17:04 CEST
Mageia 7 has been EOL since July 1st 2021.
There will be no further bug fixes for this release.

You are encouraged to upgrade to Mageia 8 as soon as possible.

@reporter, if this bug still applies to Mageia 8, please let us know.

@packager, if you work on the Mageia 7 version of your package, please check whether the issue is also present in the Mageia 8 package. In that case, please fix the Mageia 8 version instead.

This bug report will be closed as OLD if there is no further notice by September 1st 2021.
Comment 53 kalagani kalagani 2021-07-20 18:02:19 CEST
Hello Aurelien,
for information, still on Mageia 7: DRM core notifier errors from the Nouveau driver still occur from time to time.
It seems that coming out of the screen saver triggers the first core notifier error; others then follow until the screen freezes.
This phenomenon seems to be amplified, though not systematically, when VirtualBox is running.
My latest problem:
journalctl -b -6 --no-hostname|grep -E "Reboot --|core notifier|Kernel command line|virtual"
-- Reboot --
juil. 15 09:25:32 kernel: Booting paravirtualized kernel on bare hardware
juil. 15 09:25:32 kernel: Kernel command line: BOOT_IMAGE=/boot/vmlinuz-5.10.46-desktop-1.mga7 root=UUID=8ebeda33-9710-49a7-93a4-809d13e2809f ro splash quiet noiswmd resume=UUID=2fdb0460-b082-496e-b748-e904514ac886 audit=0
juil. 15 09:25:37 kernel: input: HP WMI hotkeys as /devices/virtual/input/input6
juil. 15 09:25:50 dkms-autorebuild.sh[806]: virtualbox (6.1.22-1.mga7): Already installed on this kernel.
juil. 15 11:33:11 kernel: nouveau 0000:18:00.0: DRM: core notifier timeout
juil. 15 12:16:53 kernel: nouveau 0000:18:00.0: DRM: core notifier timeout
juil. 15 12:17:11 kernel: nouveau 0000:18:00.0: DRM: core notifier timeout
juil. 15 12:17:59 kernel: nouveau 0000:18:00.0: DRM: core notifier timeout
juil. 15 12:18:25 kernel: nouveau 0000:18:00.0: DRM: core notifier timeout
juil. 15 12:18:31 kernel: nouveau 0000:18:00.0: DRM: core notifier timeout
juil. 15 12:19:57 kernel: nouveau 0000:18:00.0: DRM: core notifier timeout
juil. 15 12:20:09 kernel: nouveau 0000:18:00.0: DRM: core notifier timeout
juil. 15 12:22:25 kernel: nouveau 0000:18:00.0: DRM: core notifier timeout
juil. 15 12:22:29 kernel: nouveau 0000:18:00.0: DRM: core notifier timeout
juil. 15 12:24:25 kernel: nouveau 0000:18:00.0: DRM: core notifier timeout
juil. 15 12:24:34 kernel: nouveau 0000:18:00.0: DRM: core notifier timeout
juil. 15 12:24:38 kernel: nouveau 0000:18:00.0: DRM: core notifier timeout
juil. 15 12:32:41 kernel: nouveau 0000:18:00.0: DRM: core notifier timeout
juil. 15 12:32:48 kernel: nouveau 0000:18:00.0: DRM: core notifier timeout
juil. 15 12:32:56 kernel: nouveau 0000:18:00.0: DRM: core notifier timeout
juil. 15 12:33:07 kernel: nouveau 0000:18:00.0: DRM: core notifier timeout
juil. 15 12:33:12 kernel: nouveau 0000:18:00.0: DRM: core notifier timeout
juil. 15 12:33:15 kernel: nouveau 0000:18:00.0: DRM: core notifier timeout
juil. 15 12:33:59 kernel: nouveau 0000:18:00.0: DRM: core notifier timeout
juil. 15 12:34:09 kernel: nouveau 0000:18:00.0: DRM: core notifier timeout
juil. 15 12:34:14 kernel: nouveau 0000:18:00.0: DRM: core notifier timeout
juil. 15 12:34:21 kernel: nouveau 0000:18:00.0: DRM: core notifier timeout
juil. 15 12:34:24 kernel: nouveau 0000:18:00.0: DRM: core notifier timeout
juil. 15 12:34:28 kernel: nouveau 0000:18:00.0: DRM: core notifier timeout
Comment 54 Marja Van Waes 2021-09-07 14:11:21 CEST
Hi bug reporter and hi assignee and others involved,

Please reopen this bug report if it is still valid for Mageia 8 or 9(cauldron), and change "Version:" in the upper left of this report accordingly.

This report is being closed as OLD because it was filed against Mageia 7, for which  support ended on June 30th 2021.

Thanks,
Marja

Resolution: (none) => OLD
Status: NEW => RESOLVED

Comment 55 kalagani kalagani 2022-01-20 17:34:14 CET
Created attachment 13095 [details]
nouveau core notifier timeout with following trap
Comment 56 kalagani kalagani 2022-01-20 17:41:03 CET
Hello Marja,
under Mageia 7 with XFCE and the Nouveau driver I carried on despite the slowdowns, but as the freezes resumed,
I switched to Mageia 8 with XFCE/Cinnamon and the Nouveau driver,
and the freezes occur there too.

Neither VirtualBox nor any personal script is running.
What is new is the multiple TRAP "[ cut here ]" traces following the core notifier timeout.
Ctrl+Alt+F2 has no effect; only a remote ssh session allows a shutdown.

I attached a log showing some core notifier timeouts with TRAP, produced by:
journalctl --since 13:46 --until "15:47:31" > 2022-01-20_journalctlTrapsNouveauSince13-46toSSHshutdown.log

Happy new year despite everything...

Status comment: x11-driver-video-nouveau-1.0.16-3.mga7 => x11-driver-video-nouveau-1.0.16-3.mga7-> x11-driver-video-nouveau-1.0.17-1.mga8
Version: 7 => 8

Comment 57 Morgan Leijström 2022-01-20 20:14:57 CET
Reopening for now.

Driver maintainer to tell if it is better to start a new bug instead.

CC: (none) => fri
Resolution: OLD => (none)
Status: RESOLVED => REOPENED

Comment 58 kalagani kalagani 2022-04-07 17:47:24 CEST
Hello,
I am posting here to say that the problem continues;
e.g. for the month of April,
out of 11 boots, 5 ended in freezes
-- Reboot --
avril 01 18:25:09 HPxw9400 kernel: nouveau 0000:18:00.0: DRM: core notifier timeout
avril 01 18:25:13 HPxw9400 kernel: nouveau 0000:18:00.0: DRM: core notifier timeout
-- Reboot --
avril 02 23:02:08 HPxw9400 kernel: nouveau 0000:18:00.0: DRM: core notifier timeout
avril 02 23:02:12 HPxw9400 kernel: nouveau 0000:18:00.0: DRM: core notifier timeout
-- Reboot --
avril 03 23:09:29 HPxw9400 kernel: nouveau 0000:18:00.0: DRM: core notifier timeout
avril 03 23:09:33 HPxw9400 kernel: nouveau 0000:18:00.0: DRM: core notifier timeout
avril 03 23:09:36 HPxw9400 kernel: nouveau 0000:18:00.0: DRM: core notifier timeout
-- Reboot --
-- Reboot --
-- Reboot --
-- Reboot --
-- Reboot --
-- Reboot --
avril 06 17:19:39 HPxw9400 kernel: nouveau 0000:18:00.0: DRM: core notifier timeout
avril 06 18:28:16 HPxw9400 kernel: nouveau 0000:18:00.0: DRM: core notifier timeout
avril 06 18:28:38 HPxw9400 kernel: nouveau 0000:18:00.0: DRM: core notifier timeout
avril 06 18:34:46 HPxw9400 kernel: nouveau 0000:18:00.0: DRM: core notifier timeout
avril 06 18:36:23 HPxw9400 kernel: nouveau 0000:18:00.0: DRM: core notifier timeout
-- Reboot --
avril 06 20:36:28 HPxw9400 kernel: nouveau 0000:18:00.0: DRM: core notifier timeout
avril 06 23:02:54 HPxw9400 kernel: nouveau 0000:18:00.0: DRM: core notifier timeout
avril 06 23:03:10 HPxw9400 kernel: nouveau 0000:18:00.0: DRM: core notifier timeout
-- Reboot --
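The "5 out of 11" figure can be derived mechanically from a filtered journal like the one above. A minimal sketch, assuming the input contains only "-- Reboot --" markers and "core notifier timeout" lines (for instance from journalctl piped through grep -E "Reboot --|core notifier"); the function name is hypothetical.

```sh
# Read a filtered journal on stdin and print "<boots> <boots with timeouts>".
# Each "-- Reboot --" marker starts a new boot; a boot is counted once no
# matter how many timeouts it contains.
# count_boots_with_timeouts is an illustrative name.
count_boots_with_timeouts() {
    awk '
        /-- Reboot --/ { boots++; flagged = 0; next }
        /core notifier timeout/ {
            if (boots == 0) boots = 1      # lines before the first marker
            if (!flagged) { hits++; flagged = 1 }
        }
        END { printf "%d %d\n", boots + 0, hits + 0 }
    '
}
```

Applied to the listing above, this yields 11 markers and 5 boots containing at least one timeout. A boot with timeouts is not necessarily a boot that froze, so the second number is an upper bound on freezes rather than an exact count.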
Comment 59 kalagani kalagani 2022-04-30 18:29:04 CEST
Created attachment 13231 [details]
journalctl --since "2022-04-01" --until "2022-04-30" --no-hostname |grep -E "drivers/gpu/drm/nouveau|- Reboot|DRM: base|DRM: core"
Comment 60 kalagani kalagani 2022-04-30 18:31:11 CEST
Hello,
the attached log covers the month of April only, produced by the command
journalctl --since "2022-04-01" --until "2022-04-30" --no-hostname |grep -E "drivers/gpu/drm/nouveau|- Reboot|DRM: base|DRM: core" > journalctl2022-04grepRebootnouveauDRM.log

In this log, sometimes but not always, the DRM fault comes with a WARNING at
drivers/gpu/drm/nouveau/dispnv50/disp.c:213 nv50_dmac_wait+0x1e1/0x230 [nouveau]
or
drivers/gpu/drm/nouveau/nvkm/engine/fifo/channv50.c:85 nv50_fifo_chan_engine_fini+0x224/0x270 [nouveau]

Configuration:
rpm -qa |grep nouveau
x11-driver-video-nouveau-1.0.17-1.mga8
lib64drm_nouveau2-2.4.110-1.mga8
Comment 61 kalagani kalagani 2022-06-19 19:40:43 CEST
Hello,
just so you know, since I switched to IceWM instead of Xfce, no more freezes.

The difference between these 2 desktops is that no screensaver/locker is launched with IceWM.
I already suspected this feature, since freezes were often seen after a period of inactivity.
Surprise: it is not an Xfce screensaver that is launched during Xfce sessions, but the Cinnamon one.
Indeed, I had also installed Cinnamon alongside Xfce.

Under Xfce, journalctl often shows
cinnamon-screensaver: Fatal IO error 11
alone, or before or after
nouveau 0000:18:00.0: DRM: core notifier timeout
Under IceWM there is no trace of cinnamon-screensaver, since no screensaver is run,
and no core notifier timeout error.
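The comparison between the two desktops could be put into numbers by counting both messages per boot. A small sketch, using the two message strings quoted above; the function name tally_events and the output layout are hypothetical.

```sh
# Count screensaver I/O errors and nouveau timeouts in journal text on stdin.
# tally_events is an illustrative name.
tally_events() {
    awk '
        /cinnamon-screensaver: Fatal IO error/ { s++ }
        /DRM: core notifier timeout/           { t++ }
        END { printf "screensaver_errors=%d timeouts=%d\n", s, t }
    '
}
```

Running `journalctl -b --no-hostname | tally_events` in an Xfce session and again in an IceWM session would give comparable figures; both counts at zero under IceWM would be consistent with the observation above, though it would not by itself prove that the screensaver is the cause.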

