Bug 28026

Summary: xfce4 session occasionally freezes
Product: Mageia Reporter: Juergen Harms <juergen.harms>
Component: RPM Packages Assignee: Jani Välimaa <jani.valimaa>
Status: RESOLVED INVALID QA Contact:
Severity: normal    
Priority: Normal CC: davidwhodgins, fri, ouaurelien, shybluenight
Version: Cauldron   
Target Milestone: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Source RPM: xfce4-session-4.16.0-2.mga8.src.rpm CVE:
Status comment:
Attachments: Contents of .xsession errors (obtained via ssh from another system)
.xsession_errors after another freeze
inxi -F as requested in comment 4
/var/log/Xorg.0.log immediately after freeze, before reboot.
.xsession_errors immediately after freeze, before reboot

Description Juergen Harms 2021-01-06 12:51:26 CET
Description of problem:

On my (fully updated) cauldron system, xfce sessions occasionally freeze (no response to keyboard or mouse-button actions - everything else apparently working). The only way back to normal is to reboot (power-cycle). This happens sporadically, apparently at increasing intervals: originally I sometimes had several incidents per day, while today's incident happened after more than a week without problems.

While the session is frozen, the system can be accessed via ssh from another system to look at logs. I found nothing in dmesg or /var/log/messages that I could understand as an explanation, but .xsession errors each time shows that there have been problems - apparently not from the xfce session itself, but at some lower level. A possible explanation could be that the xfce session freezes as a consequence, at the point where it should launch a corresponding dialogue with the user.

Version-Release number of selected component (if applicable): xfce4-session-4.16


How reproducible: happens at random


Steps to Reproduce:
1.
2.
3.

There is a report in the new Xfce bug reporting system at gitlab that might have a similar origin:

https://gitlab.xfce.org/xfce/xfce4-panel/-/issues/374

When I added my comment to that incident, the cause evidently was some problem of the power manager. Another freeze happened while I installed some update packages and the mirror had some failure. With today's incident, .xsession errors contains several error messages, but no clear indication (see attachment).

I realise that this report does not provide much evidence that could help to clarify and correct this problem.
But I think it is important to - at least - document this issue.
Comment 1 Juergen Harms 2021-01-06 12:54:54 CET
Created attachment 12186 [details]
Contents of .xsession errors (obtained via ssh from another system)
Morgan Leijström 2021-01-06 13:12:22 CET

CC: (none) => fri

Comment 2 Juergen Harms 2021-01-06 19:26:17 CET
Created attachment 12187 [details]
.xsession_errors after another freeze
Comment 3 Juergen Harms 2021-01-06 19:28:43 CET
I just had my 2nd freeze of today, adding a dump of .xsession-errors. Both freezes of today happened while I was reading the online web page of my journal (NZZ)
Comment 4 Aurelien Oudelet 2021-01-07 23:13:33 CET
Hi, thanks for reporting this.

I don't know how you installed Cauldron. (From Beta2 classic install ISO ? from updating M7? from the xfce live iso ?)
Also, .xsession_errors file does not matter when there is a total system freeze.

Note that XFCE has been updated to its latest upstream version since the M8 beta 2 ISOs were released, but packages were pushed as long as they are built. It is possible that a necessary package is missing on your system, having been uninstalled while another one was upgraded.


I suggest you:

1) inxi -F
to see your hardware specifications; please add the output here as an attachment.

2) Even wait for real Mageia 8 RC1 iso to wipe your / system partition and reinstall a good set of packages.

CC: (none) => ouaurelien

Comment 5 Juergen Harms 2021-01-08 11:11:07 CET
Created attachment 12196 [details]
inxi -F  as requested in comment 4
Comment 6 Juergen Harms 2021-01-08 11:35:20 CET
 > I don't know how you installed Cauldron. (From Beta2 classic install ISO ? from updating M7? from the xfce live iso ?)

Beta 1 classic install ISO (cdrom), plus update packages installed as they arrive. I verified that my list of xfce packages corresponds to the list that was on the mirror when I submitted the bug report.

> Also, .xsession_errors file does not matter when there is a total system freeze.
It is not a total system freeze, but a freeze of all session I/O (keyboard and mouse I/O dead). If the freeze were total, access to the log files via ssh would not be possible.

> ... but packages were pushed as long as they are built. ...
    Using Cauldron only makes sense if new update packages are installed as soon as they become available - which I do

I suggest you:

1) inxi -F
    done.
    But I have these freezes both on my PC and a laptop, both with totally different hardware specifications; to avoid confusion, I limited the data included in this bugzilla report to data from the PC (retrieved via ssh from the laptop).

2) Even wait for real Mageia 8 RC1 iso to wipe your / system partition and reinstall a good set of packages.
will do, waiting for RC1.
With clean installs, my system partition is always totally wiped out.
Comment 7 Aurelien Oudelet 2021-01-08 11:44:17 CET
OK, so the freeze is rather in X session than a total system one. Good.

So, when such freeze appears, can you give us the system log from:
# journalctl -f

if you can catch it over a ssh connection.
If there is a video driver issue, we can even see it.

.xsession_errors file is for graphical application.

X doesn't log here.

Also, please attach here the /var/log/X.0.org.log file of the frozen X session.
If rebooted, it will be renamed with a 1 instead of 0 in file name.
Comment 8 Juergen Harms 2021-01-11 09:07:44 CET
I decided to stop maintaining a Mageia OS partition, there is not much sense in keeping this bug open
Comment 9 Lewis Smith 2021-01-11 20:35:09 CET
This is sad. It would be nicer to pin it down. There are plenty of people who use Xfce routinely, so the problem is not general.
And you could try LXDE, which is very similar in appearance & use to Xfce (I use both, and often wonder which one I am actually using).

Thank you for your own investigations; and the system info.

> While the session is frozen, the system can be accessed via ssh from
> another system
Did you ever try accessing a virtual console, Ctl/Alt/F2-6 ?
And if that worked, whether going back to the GUI Ctl/Alt/F1 had any effect?

> The only way back to normal is to reboot (power-cycle).
Re-starting X usually works in such cases, & is easier & faster:
 Ctrl/Alt/Bksp/Bksp

CC: (none) => lewyssmith

Comment 10 Aurelien Oudelet 2021-01-17 17:16:06 CET
Reporter, could you please reply to the previous question? If you don't reply within two weeks from now, I will have to close this bug as OLD. Thank you.

Keywords: (none) => NEEDINFO

Comment 11 Juergen Harms 2021-01-25 16:25:50 CET
> Reporter, could you please reply
Sorry, working with Cauldron has become quite painful - I am not much on Mageia any more.

Previous question:
> Did you ever try accessing a virtual console, Ctl/Alt/F2-6 ?
> And if that worked, whether going back to the GUI Ctl/Alt/F1 had any effect?

Would be great - but: how do I type Ctrl/Alt/something on a system with a frozen keyboard (and mouse)?

I realize that I posted this as a bug on XFCE. I now have doubts whether the bug is primarily an XFCE issue - or whether a lower level problem just makes itself seen via XFCE. Reason for this doubt:

1. Significant log entries only appear in .xsession_errors and are generated by  applications, not directly by XFCE.

2. Doing production usage on fully customized XFCE environments on Fedora and Debian never produces a freeze. Why drop XFCE under these conditions? It's the smoothness of the synapses in my fingers that makes me reluctant.

Re /var/log/X.0.org.log:
I am confused: do you mean /var/log/Xorg.0.log (no /var/log/X.0.org.log on my Mageia filesystems) - and, old log files get an "old" suffix, 0, 1 etc specifying a specific display; is it really Xorg.0.log that you want?

I will switch back to my Mageia partition and create a copy of /var/log/X.0.org.log as soon as the next freeze happens (and as long as the related damage is not so serious that immediate repair is more urgent than dumping Xorg logs) - waiting for a suitable occasion will take some time.

I don't mind if you simply close the bug. But, as said in comment 9, there is an objective interest in clarifying the topic - that is also the reason why I posted this bug in the absence of confirmation from other users.
Comment 12 Lewis Smith 2021-01-25 20:43:02 CET
Thank you for returning to this.
> how do I type Ctrl/Alt/something on a system with a frozen keyboard
I did say 'try'. If it does nothing, it is easy to say so.
Similarly with Ctrl/Alt/Bksp/Bksp .

Re the Xorg log file, doubtless /var/log/Xorg.0.log was the one.
If it is very big, better to make a copy and compress *that* with (say) xz.
Comment 13 Juergen Harms 2021-01-25 23:11:17 CET
Created attachment 12263 [details]
/var/log/Xorg.0.log immediately after freeze, before reboot.
Comment 14 Juergen Harms 2021-01-25 23:12:24 CET
Created attachment 12264 [details]
.xsession_errors immediately after freeze, before reboot
Comment 15 Juergen Harms 2021-01-25 23:34:03 CET
Waiting for a new freeze was much shorter than I had anticipated. I added attachments with xz-compressed copies of /var/log/Xorg.0.log, and also of .xsession_errors to provide a consistent view. The freeze happened on the machine with the properties documented in the 3rd attachment (inxi, 2021-01-08).

The copies were drawn via ssh from another machine while I/O of the target machine was still in its frozen state (hence Xorg.0.log and not ... .old). The freeze happened while I was using emacs for editing a perl file.
Comment 16 Juergen Harms 2021-01-25 23:40:44 CET
Sorry, I missed copying the output of journalctl - will do this at the next freeze.
Comment 17 Lewis Smith 2021-01-26 21:44:53 CET
Thank you for the log information. That should do.

"(EE) event21 - Logitech USB Keyboard: client bug: event processing lagging behind by 12ms, your system is too slow"
(repeated n times)
"(EE) event21 - Logitech USB Keyboard: WARNING: log rate limit exceeded (5 msgs per 60min). Discarding future messages."
Looks relevant.
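
The "(EE)" lines quoted above are how Xorg flags errors in its log. As a minimal sketch (not something that was used in this report), a few lines of Python could pull such entries out of a copy of /var/log/Xorg.0.log and count repeats. The function names are made up for illustration, and the optional leading "[ seconds ]" timestamp is an assumption about the log layout.

```python
import re
from collections import Counter

# Matches lines flagged (EE), with or without a leading "[  123.456]" timestamp
# (the timestamp layout is an assumption, see the note above).
EE_LINE = re.compile(r"^(?:\[\s*[\d.]+\]\s*)?\(EE\)\s*(.*)$")


def error_lines(log_text):
    """Return the message part of every line flagged (EE) in the log text."""
    messages = []
    for line in log_text.splitlines():
        match = EE_LINE.match(line.strip())
        if match:
            messages.append(match.group(1))
    return messages


def count_errors(log_text):
    """Count how often each distinct (EE) message occurs."""
    return Counter(error_lines(log_text))
```

For example, count_errors(open("Xorg.0.log").read()).most_common(5) would list the five most frequent error messages in a local copy of the log.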

Assigning to wally for Xfce4; but it may be more general - Xorg; pass it on where you see fit.

Keywords: NEEDINFO => (none)
Assignee: bugsquad => jani.valimaa
CC: lewyssmith => (none)

Comment 18 Juergen Harms 2021-01-29 09:55:15 CET
Thank you, nice progress.

Just now I had my next freeze - this time after a mouse event (hitting a GtkButton) (Logitech M325 Mouse). I pulled corresponding dumps and will keep them available - up to the bug team to say if you want me to provide them as attachments
Comment 19 Chris B 2021-01-29 10:53:37 CET
Jürgen, to me this looks like your mouse and/or keyboard are losing power.
Autosuspend mode for the usb ports? Powermanager settings? Tlp?
Batteries low of the mouse?
A usb hub maybe? Or a usb-c to usb adapter?

Not sure if this is an xfce bug. But you can try to disable in the xfce settings.
Settings - Settings and Startup - Power Manager.

Or a setting in the bios/efi firmware, relating to the power of the usb ports?

CC: (none) => shybluenight

Comment 20 Juergen Harms 2021-01-29 11:29:49 CET
Yes, that is a possible explanation. I have already planned to explore this (e.g. connect the keyboard directly to the machine, rather than via hub, use an alternate keyboard ...). However, peripherals as the cause are unlikely, because
- freezes occur at random both on my PC and my laptop
- freezes don't occur when I use an OS partition with Debian or Fedora (with identical, script-generated XFCE customization) - but don't take "don't" too literally where incidents arrive only every couple of days.

I will post if I find something significant. Thanks for your help.
Comment 21 Juergen Harms 2021-02-03 08:56:15 CET
I have finally completed all the tests for/against local problems that I could imagine (one change at a time, I did not try permutations or combinations):
- changed the battery of my mouse
- connecting the keyboard and the mouse dongle directly to USB plugs on the computer rather than via a hub
- using another USB keyboard

All tests negative: the freezes still happen, at random, at irregular but longish intervals (several days on average); writing this, yet another alternative comes to mind: use another mouse, not the M325 wireless one.

Being somewhat repetitive: the following arguments contribute to / against the likelihood that this problem is / is not local:

- I have not observed a single freeze when I was running these machines under Debian and Fedora

- no other Cauldron user has reported hitting this issue, and I am not the only one using Cauldron in production (do others have the same kind of non-ending sessions?)

- freezes happen on machines with radically different architecture (a powerful PC and a Laptop)

- but these 2 machines have one thing in common: the way XFCE is set up, using a script

- but precisely the same script also sets up XFCE on my Fedora and Debian OS partitions

Probably the best approach is now to wait until Mageia-8 has been released and usage goes to the common public.
Comment 22 Chris B 2021-02-03 10:08:40 CET
Seems that it is mageia specific.

Is it a Logitech unifying receiver? Is solaar installed, special for these kind of receivers?

It could be an issue with the way mageia is detecting and configuring hardware
(drakx, udev ...). That would be the main difference to Debian and Fedora.

(I had a minor hardware problem with Mageia 7 (and only with Mageia) on a laptop and the initializing of its touchpad, randomly it wouldn't detect the touchpad at boot time, only after suspend and resume it would detect it. Magically that is solved with Mageia 8 ;-)
Comment 23 Juergen Harms 2021-02-03 18:03:04 CET
> Is it a Logitech unifying receiver? Is solaar installed, special for these kind of receivers?

It is a Logitech M325 with a unifying receiver (and yes, solaar is installed) - your comment is motivation to make yet another test with a wired mouse ... ughhh
Comment 24 Juergen Harms 2021-02-14 21:32:29 CET
Now I have also made the test to replace the wireless mouse by a wired mouse. Worked so long, that I already thought that it might really be the mouse.

But I just had a typical freeze - so it is not the mouse.
Comment 25 Dave Hodgins 2021-02-14 23:30:22 CET
Install htop, open a terminal, use "su -" to become root, run htop, press f6
(sort by), select STATE, and leave the terminal such that enough of it is
visible to see the columns S (state) and the command.

I expect that when the mouse freezes, there will be one or more commands shown
in the D (device wait) state. We need to id which commands are causing that,
and then see what can be done to minimize the impact. Hopefully it will be
commands that can be disabled.
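
Since the desktop is unusable when this matters, the same check can also be done non-interactively over ssh. A minimal sketch, assuming Python 3 is available on the machine: it reads the state field from /proc/<pid>/stat, the same per-process state that ps and htop display in their S column. The function names are hypothetical. Kernel threads such as kworker have their own entries in /proc and are included; individual threads of user processes (under /proc/<pid>/task) are not.

```python
import os


def parse_stat(stat_text):
    """Split one /proc/<pid>/stat record into (pid, command, state)."""
    # The command is wrapped in parentheses and may itself contain spaces
    # or ')', so cut at the LAST ')' instead of splitting on whitespace.
    open_paren = stat_text.index("(")
    close_paren = stat_text.rindex(")")
    pid = int(stat_text[:open_paren])
    command = stat_text[open_paren + 1:close_paren]
    state = stat_text[close_paren + 1:].split()[0]
    return pid, command, state


def blocked_tasks(proc_root="/proc"):
    """Return (pid, command) for every task in state D (uninterruptible wait)."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as handle:
                pid, command, state = parse_stat(handle.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited meanwhile, or record unreadable
        if state == "D":
            found.append((pid, command))
    return sorted(found)
```

Calling blocked_tasks() a few times, some seconds apart, would show whether the same commands stay in the D state or only pass through it briefly.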

CC: (none) => davidwhodgins

Comment 26 Juergen Harms 2021-02-15 09:45:16 CET
Thanks, that makes good sense. And since top and htop can be run via ssh, it is practicable even with a frozen mouse - will do, but will take time waiting for events.
Comment 27 Juergen Harms 2021-02-15 22:54:59 CET
The freezes happen both on my laptop and the desktop - to keep things simple, so far I did not report on laptop freezes. But just now a freeze occurred on my laptop, an occasion to rapidly apply the suggestion made by Dave Hodgins. The frozen process (state D) had PID 1715 (PPID is 1701 = /usr/sbin/lightdm). I used ps -l for the benefit of being able to copy/paste the command from my console window:

/usr/libexec/Xorg :0 -seat seat0 -auth /var/run/lightdm/root/:0 -nolisten tcp vt1 -novtswitch

The other remarkable fact is that the process appears to loop: cpu percentage varies around 95%
Comment 28 Morgan Leijström 2021-02-16 09:43:26 CET
Not being an expert on this, but it looks like it would be interesting to try another login manager.
Comment 29 Juergen Harms 2021-02-16 13:22:40 CET
Yes, that started as a dull exercise to distinguish between noise and not noise - it starts to become interesting.

Different display manager: there I need expert advice. lxdm would appear to be a simple alternative, but I would like to avoid losing time finding out how to get lxdm used (though I found several google hits, enough to make an initial try).

For the moment, I wait for the next incident on the desktop machine to verify whether the scenario is similar to that on the laptop.
Comment 30 Morgan Leijström 2021-02-16 13:58:09 CET
In MCC tab "Start" there is an icon for Display Manager, where you can select amongst installed ones.

So first install one you want, then select it there, reboot :)
Comment 31 Juergen Harms 2021-02-17 11:13:31 CET
Just now my desktop got stuck - precisely the same situation as on the laptop: a process with status D, identical contents of the triggering command line.

Having a closer look at ps -lA, I also found a kworker process (not shown by htop)  with the D status, its command line is 
[kworker/7:1H+events_highpri]

I am perfectly willing now to try using lxdm - but shouldn't priority go into exploring why lightdm is upset?
Comment 32 Chris B 2021-02-17 11:45:02 CET
No need to install a different display manager.
You can start xfce4 without any DM.

At the grub kernel command line (type 'e' to get it when starting your machine),
add '3' (without the quotes), press ctrl x to start, log in as user, then type:
startxfce4

Make sure, lightdm is not running, not started. Htop or top.

See what happens.

Several QA Mageia users use xfce and lightdm, nobody so far reported this problem.

Your 'special' script you are running to configure xfce might be interesting as well.

Also, a fresh installation, without any personal scripts for xfce, could be an option. The xfce version in M8 btw is xfce 4.16.x, fully gtk3, no more gtk2.
Comment 33 Dave Hodgins 2021-02-17 16:00:46 CET
The kworker thread, visible in htop if you press the "K" (note: uppercase)
to toggle showing threads on/off is a kernel thread that handles i/o operations
for a device.

It's stuck in the device wait state as the kernel is waiting for some i/o
operation to complete. When the kernel is stuck, it can not respond to
user space applications.

It may be due to swappiness. See https://rudd-o.com/linux-and-free-software/tales-from-responsivenessland-why-linux-feels-slow-and-how-to-fix-that

It may be due to partition alignment.  The kernel reads/writes 4KB per logical
i/o. Older disk drives used 512 byte logical and physical sectors, so 8 sectors
per 4KB block.

Newer hard drives either use 512 byte logical sectors with 4KB physical sectors
or 4KB logical sectors with 4KB physical sectors. There were some really bad
drives at the beginning of the new drives that lie to the kernel and claim to
be using 512 byte physical sectors even though they really use 4KB physical
sectors. This was done for compatibility with windows software that wasn't
ready to handle the 4KB sector sizes. The problem with those drives is that
when the kernel writes a 4KB block in an i/o request, if those 8 512 byte
sectors overlap two 4KB blocks in the hard drive, the firmware in the hard
drive (much slower than a cpu) has to translate the one write request into
two reads, a merge of the updated 4KB write with the two 4KB sectors, and two
writes. This drastically slows down writes that are not aligned on 4KB boundaries.

Diskdrake uses 1MB boundaries for partitions (a multiple of 4KB), but not all
partitioning software does that. Depending on what software was used, the
partitions may not be aligned on 4KB boundaries. The command
"sfdisk -luS /dev/sda" will show the start sector of each partition for that
drive. If any of those sectors start at a number that is not evenly divisible
by 8, that's a problem (with the exception of the Extended partition which it's
technically ok not to have aligned, since it's just a sector with a partition
table).
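
The divisible-by-8 rule can be written out as a small calculation. This is a sketch assuming 512-byte logical sectors and 4KB physical sectors as described above. The start sectors used in the comments (2048 and 63) are illustrative values only, not taken from the reporter's disks: 2048 sectors is 1MB, and 63 is a start sector commonly produced by older DOS-era partitioning tools.

```python
LOGICAL_SECTOR = 512     # bytes per logical sector (the unit sfdisk reports with -uS)
PHYSICAL_SECTOR = 4096   # bytes per physical sector on the drives described above
SECTORS_PER_BLOCK = PHYSICAL_SECTOR // LOGICAL_SECTOR  # 8


def is_aligned(start_sector):
    """True if a partition starting at this logical sector sits on a 4KB boundary."""
    return start_sector % SECTORS_PER_BLOCK == 0


def physical_sectors_touched(start_sector, length_bytes=4096):
    """Number of 4KB physical sectors covered by one write of length_bytes."""
    first_byte = start_sector * LOGICAL_SECTOR
    last_byte = first_byte + length_bytes - 1
    return last_byte // PHYSICAL_SECTOR - first_byte // PHYSICAL_SECTOR + 1


# Illustrative values (assumptions, see above):
#   is_aligned(2048) is True,  physical_sectors_touched(2048) == 1
#   is_aligned(63)   is False, physical_sectors_touched(63)   == 2
```

In the misaligned case each 4KB write straddles two physical sectors, which is what forces the drive firmware into the read, merge and write sequence described above.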

Another cause of device waits is bad hardware, such as a sata cable that has
a dirty connector forcing the kernel to retry operations several times to
get a successful read/write. Though it does work, it's slow. That should be
visible in dmesg output. There are also some sata controllers that are known
to be very poor, though I'm not clear why.

Whether it's due to swappiness, partition alignment, or bad hardware, this is
not a software problem. It's system tuning and/or hardware fixing/replacing.
Comment 34 Morgan Leijström 2021-02-17 16:20:29 CET
The fact that Juergen sees this on two systems speaks against a hardware fault
(or it is a low-probability coincidence)
Comment 35 Morgan Leijström 2021-02-17 16:20:53 CET
The fact that Juergen sees this on two systems speaks against a hardware fault
(or it is a low-probability coincidence)
Comment 36 Juergen Harms 2021-02-17 20:12:47 CET
I think that a wise decision would be to close this bug now - to be reopened, or re-filed in case other users hit this kind of problem, and preferably if means are found  to reproduce the events that trigger the freeze by explicit action. I had originally opened the bug because such unexplained problems should not be left simply hanging around. There is now a clear idea on the mechanism.
Comment 37 Dave Hodgins 2021-02-17 23:03:04 CET
If both systems have similar spinning rust drives, ram/swap usage, and
application usage, I'm not surprised by both systems showing the same problem.

The swappiness default settings could be changed. They are a tradeoff between
the best settings for servers, and the best settings for desktop systems.
Currently it's set roughly in the middle, slightly in favour of desktops.
Mageia is intended to be useful for both servers and desktops, so I'm not in
favour of changing the defaults, but that would be up to our kernel admins.
That may be a factor that exacerbates the problem, but is not the cause.

The cause of device waits is hardware, and hardware only. There is nothing
that can be done in software to solve the waits. They can be reduced by
system tuning (swappiness settings, removing any services that are not deemed
essential, etc.), but that is up to the system admin as what's suitable for
one user is not going to be the same as what's suitable for another.
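
For anyone wanting to check the current setting before tuning, the kernel exposes the value in /proc/sys/vm/swappiness. A minimal sketch in Python follows; the function names are hypothetical, and the value 10 in the comment is only an example, not a recommendation made in this report.

```python
def parse_swappiness(text):
    """Convert the contents of the swappiness file to an integer."""
    return int(text.strip())


def read_swappiness(path="/proc/sys/vm/swappiness"):
    """Read the current vm.swappiness value as exposed by the kernel."""
    with open(path) as handle:
        return parse_swappiness(handle.read())


# Changing the value needs root, e.g. from a shell:
#   sysctl vm.swappiness=10
# (example value only; the change is lost at reboot unless it is also
# stored in a sysctl configuration file)
```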

My recommendation is to get an ssd drive, and only use the spinning rust drives
for bulk storage.  The difference in speed with an ssd drive, is impressive.

I have the same problem with the device waits with an AMD/ATI SB7x0/SB8x0/SB9x0 SATA Controller [AHCI mode], and a WDC WD10EZEX-00RKKA0 hard drive. I now use
it for bulk storage (iso images, video files, etc.). Everything else I keep on
ssd drives which do not have the problem to anywhere near the same extent.
It still happens, but much less often and for much shorter periods of time,
and is only noticeable if I have htop running.

As per comment 36, closing as invalid, however if future reports do come in,
they should be closed as a duplicate of this bug. This bug should not be
reopened.

Status: NEW => RESOLVED
Resolution: (none) => INVALID

Comment 38 Juergen Harms 2021-02-18 19:29:25 CET
Both laptop and desktop have their root partition on SSD devices, but some shared data resides on a hard drive and might have been accessed - difficult to verify post festum.