17574 – Reboot from login or desktop takes a long time ("watchdog failed to stop")

Bug 17574 - Reboot from login or desktop takes a long time ("watchdog failed to stop")

Summary: Reboot from login or desktop takes a long time ("watchdog failed to stop")

Status:	RESOLVED FIXED

Alias:	None

Product:	Mageia
Classification:	Unclassified
Component:	RPM Packages (show other bugs)
Version:	Cauldron
Hardware:	x86_64 Linux

Priority:	Normal Severity: normal
Target Milestone:	---
Assignee:	Colin Guthrie
QA Contact:

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2016-01-22 00:00 CET by Len Lawrence
Modified:	2016-05-11 22:22 CEST (History)
CC List:	8 users (show)

See Also:
Source RPM:	systemd
CVE:
Status comment:

Attachments
journalctl of long shutdown (55.06 KB, text/plain) 2016-02-14 12:24 CET, Marja Van Waes	Details
View All Add an attachment (proposed patch, testcase, etc.)

Description Len Lawrence 2016-01-22 00:00:18 CET

Description of problem:
Using the classic DVD iso the system takes several minutes to closedown and restart.  The Plymouth splashscreen is displayed with five bubbles then four.  Ctrl-Alt-F1 shows the message "watchdog failed to stop" but in the end the system does reboot; presumably whatever was stalling it timed out.

Version-Release number of selected component (if applicable):
mga6-dev1

How reproducible:
Same behaviour for three DEs and the sddm login screen.

Steps to Reproduce:
1.
2.
3.


Reproducible: 

Steps to Reproduce:

Comment 1 Frank Griffin 2016-01-22 00:45:50 CET

Yes, this has been happening in cauldron for a while now.  I can't remember the last time a reboot didn't involve the SysRq keys - E causes a flurry of activity, but I is what puts it to bed.

Colin, could we have a quick tutorial on how to collect diagnostics for these types of hang ?

Comment 2 Bit Twister 2016-01-22 00:45:50 CET

There are at least 3 players in that problem, users, several services, and finally sddm. I opened bug 17530 - mga6: plymouth-reboot is delaying reboots for that one.

For the user, disabling kwallet and setting KillUserProcesses=yes in /etc/systemd/logind.conf seems to help.

For the services, I found shutting the main ones down before issuing reboot/shutdown/halt.

Here is the current list for my install in my shutdown_servers script.
 _servers="mythbackend named postfix httpd dovecot stunnel mysqld chronyd dhcpd"

CC: (none) => bittwister2

Comment 3 Colin Guthrie 2016-01-22 10:18:28 CET

Generally speaking you can ensure the debug shell is active, (boot with "systemd.debug-shell" on the kernel command line), this allows you to switch to tty9.

When things are hanging, try switching there and seeing if there are any specific jobs listed (systemctl list-jobs).

I suspect it could be the bug I'm struggling to find time to get to the bottom of that relates to .socket unit jobs clogging up the system (or at least related to it).

Comment 4 Frank Griffin 2016-01-22 13:55:16 CET

At least in my experience, these delays happen during shutdown, not startup.  Once you do Alt-SysRq-E, it looks like journalctl was the blocker, and the ones that -I takes down appear to be unmounts of NFS drives, maybe because the network is already down ?

Is any of the stuff that shows up on tty when you shutdown with ESC logged anywhere ?

Comment 5 Frank Griffin 2016-01-22 14:28:36 CET

Some more shutdown info...

As shutdown starts, CUPS seems to take quite a while to shut down (but eventually does).  The next thing that blocks is "Session c2 for ftg" (ftg is the user who initiated the reboot from plasma).  If I'm reading this right, this will only time out in 1.5 minutes (1.5 minutes ?!), and it looks like that's what happens.

Finally, there's the unmounting of the NFS partitions which appear to have timeouts of 3 minutes (3 minutes ?! Really ?).

After -I, there are some messages about systemd-journald not having a restart timeout so it is being restarted.

Jüri Ivask 2016-01-22 16:05:09 CET

CC: (none) => jyri2000

Comment 6 Jüri Ivask 2016-01-26 08:29:20 CET

Made a log af last boot. Here the delay in poweroff process is 1 min 20 sec.
According place in the log:

jaan  25 17:32:46 localhost systemd[1]: Received SIGRTMIN+20 from PID 11008 (plymouthd).
jaan  25 17:34:08 localhost systemd[1]: session-c5.scope: Stopping timed out. Killing.
jaan  25 17:34:08 localhost systemd[1]: Stopped Session c5 of user acrux.
jaan  25 17:34:08 localhost systemd[1]: session-c5.scope: Unit entered failed state.
jaan  25 17:34:08 localhost systemd[1]: session-c2.scope: Stopping timed out. Killing.
jaan  25 17:34:08 localhost systemd[1]: Stopped Session c2 of user acrux.
jaan  25 17:34:08 localhost systemd[1]: session-c2.scope: Unit entered failed state.
jaan  25 17:34:08 localhost systemd[1]: Removed slice User Slice of acrux.
jaan  25 17:34:08 localhost systemd[1]: Stopping Login Service...

So after successfull shutting down plymouthd the delay occures with stopping these sessions c5 and c2 of user acrux...

Comment 7 Marja Van Waes 2016-02-14 12:24:46 CET

Created attachment 7455 [details]
journalctl of long shutdown

I've seen user sessions taking long to stop while shutting down, too, including last time I shut down:

feb 14 00:13:20 cldrn_64 systemd[1]: session-c2.scope: Stopping timed out. Killing.

last night, the two lines with the longest time between them in journalctl output were:

feb 14 00:12:09 cldrn_64 systemd[1]: Received SIGRTMIN+20 from PID 30202 (plymouthd).
feb 14 00:13:20 cldrn_64 systemd[1]: mandi.service: State 'stop-sigterm' timed out. Killing.


all red lines in journalctl output were:

feb 14 00:11:52 cldrn_64 login[29550]: pam_systemd(login:session): Failed to release session: Connection reset by peer

feb 14 00:11:53 cldrn_64 nmbd[5728]: [2016/02/14 00:11:53.038574,  0] ../source3/nmbd/nmbd.c:58(terminate)

feb 14 00:11:53 cldrn_64 nmbd[5728]:   Got SIGTERM: going down...

feb 14 00:11:51 cldrn_64 systemd-logind[790]: Failed to abandon session scope: Transport endpoint is not connected

feb 14 00:11:52 cldrn_64 systemd-logind[790]: Failed to abandon session scope: Transport endpoint is not connected

feb 14 00:11:54 cldrn_64 systemd[1]: Failed unmounting /RedHD2.

feb 14 00:11:55 cldrn_64 cupsd[1327]: Unable to communicate with avahi-daemon: An unexpected D-Bus error occured


Attaching full shutdown log

CC: (none) => marja11

Comment 8 Marja Van Waes 2016-02-14 12:26:52 CET

@ Colin

Assigning to you, since you were already looking into it.

Feel free to reassign when needed :-)

Assignee: bugsquad => mageia

Comment 9 Marja Van Waes 2016-02-14 12:28:12 CET

and assigning to systemd :-D

Component: Release (media or process) => RPM Packages
Source RPM: (none) => systemd

Comment 10 Marja Van Waes 2016-02-14 12:36:23 CET

after commenting in this report, i shutdown my cauldron and it was amazingly fast!

i didn't really look which updstes i had, only remomber a kernel

oh, and i may have rebooted earlier today, after some updates


so.... did this get fixed for all and for every reboot or shutdown?

Comment 11 Marja Van Waes 2016-03-07 11:01:26 CET

(In reply to Marja van Waes from comment #10)
> after commenting in this report, i shutdown my cauldron and it was amazingly
> fast!
> 

> 
> so.... did this get fixed for all and for every reboot or shutdown?

The issue returned, probably within a week after I wrote this. For me, it is always that every Plasma5 session that was started from sddm has to reach a time out before being killed (on a side note: logging out and back into Plasma5, the previous session keeps "State=closing" for ever, which is the same state as for sddm). 

Anyway: shutting down:
mrt 07 09:06:55 cldrn_64 systemd[1438]: Reached target Shutdown.
mrt 07 09:06:55 cldrn_64 systemd[1438]: Starting Exit the Session...
mrt 07 09:06:55 cldrn_64 sddm[972]: Authentication error: "Process crashed"
mrt 07 09:06:55 cldrn_64 sddm[972]: Auth: sddm-helper crashed (exit code 15)
mrt 07 09:06:55 cldrn_64 sddm[972]: Authentication error: "Process crashed"
mrt 07 09:06:55 cldrn_64 sddm[972]: Auth: sddm-helper exited with 15
mrt 07 09:06:55 cldrn_64 sddm[972]: Socket server stopping...
mrt 07 09:06:55 cldrn_64 sddm[972]: Socket server stopped.
mrt 07 09:06:55 cldrn_64 sddm[972]: Display server stopping...
mrt 07 09:06:55 cldrn_64 systemd-logind[886]: Failed to abandon session scope: Connection reset by peer
mrt 07 09:06:55 cldrn_64 systemd-logind[886]: Failed to abandon session scope: Transport endpoint is not connected

..........

mrt 07 09:08:25 cldrn_64 systemd[1]: session-c2.scope: Stopping timed out. Killing.
mrt 07 09:08:25 cldrn_64 systemd[1]: Stopped Session c2 of user marja.

Of course, the delay is n times 90 seconds if n times a Plasma5 session was started

If I can still not reproduce this using XFCE and if this persists after the updates neoclust it pushing, then I'll file a separate bug report for this issue and let it block this one.

However: should it then be assigned to neoclust or to coling?

CC: (none) => neoclust

Comment 12 Jüri Ivask 2016-03-17 15:08:58 CET

Seems to be fixed?
Shutdown or reboot from Plasma 5 desktop is very fast again...

Comment 13 Bit Twister 2016-03-17 16:25:55 CET

Must be nice. I will agree time to login prompt has improved. Guessing because of a reduction in some timeout value.

Plymouth start manager is still crashing on first attempt. Once that times out and gets restarted the actual login is still slow. vb=vritualbox.
        boot  login
mga5vb  0:11  0:21
mga6vb  0:22  0:34
mga6    0:47  1:51   real hardware

Comment 14 Len Lawrence 2016-05-09 16:58:04 CEST

Should this bug be closed?  Does anybody still see this.  Not evident here any longer.

Comment 15 Bit Twister 2016-05-09 17:12:29 CEST

(In reply to Len Lawrence from comment #14)
> Should this bug be closed?  Does anybody still see this.  Not evident here
> any longer.

Boot is still failing.
 journalctl | grep plymouth
May 09 09:23:29  systemd[1]: Starting Show Plymouth Reboot Screen...
May 09 09:23:30  systemd[1]: Started Show Plymouth Reboot Screen.
May 09 09:24:03  dracut: Starting plymouth daemon
May 09 09:24:03  systemd[1]: Starting Show Plymouth Boot Screen...
May 09 09:24:04  plymouth[540]: error: unexpectedly disconnected from boot status daemon
May 09 09:24:04  systemd[1]: plymouth-start.service: Main process exited, code=killed, status=11/SEGV
May 09 09:24:04  systemd[1]: Failed to start Show Plymouth Boot Screen.
May 09 09:24:04  systemd[1]: plymouth-start.service: Unit entered failed state.
May 09 09:24:04  systemd[1]: plymouth-start.service: Failed with result 'signal'.
May 09 09:24:06  systemd[1]: Starting Tell Plymouth To Write Out Runtime Data...
May 09 09:24:06  systemd[1]: Started Tell Plymouth To Write Out Runtime Data.

Comment 16 Rémi Verschelde 2016-05-10 14:18:02 CEST

(In reply to Bit Twister from comment #15)
> (In reply to Len Lawrence from comment #14)
> > Should this bug be closed?  Does anybody still see this.  Not evident here
> > any longer.
> 
> Boot is still failing.
>  journalctl | grep plymouth

I'd suggest to open a new bug report for this. Personally I can't reproduce the issue in comment 0 anymore, even though I did have it back in January, so as far as I'm concerned this bug is fixed.

Status: NEW => RESOLVED
Resolution: (none) => FIXED
Summary: Reboot from login or desktop takes a long time => Reboot from login or desktop takes a long time ("watchdog failed to stop")

Comment 17 Barry Jackson 2016-05-11 22:22:12 CEST

(In reply to Rémi Verschelde from comment #16)
Personally I can't reproduce
> the issue in comment 0 anymore, even though I did have it back in January,
> so as far as I'm concerned this bug is fixed.

Likewise.

CC: (none) => zen25000

Note You need to log in before you can comment on or make changes to this bug.