Bug 16647 - 6_s1: systemd hangs forever when stopping a service
Summary: 6_s1: systemd hangs forever when stopping a service
Status: RESOLVED FIXED
Alias: None
Product: Mageia
Classification: Unclassified
Component: RPM Packages (show other bugs)
Version: Cauldron
Hardware: x86_64 Linux
Priority: Normal critical
Target Milestone: ---
Assignee: Colin Guthrie
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2015-08-25 12:51 CEST by Bit Twister
Modified: 2016-08-31 15:28 CEST (History)
2 users (show)

See Also:
Source RPM: systemd-225-1.mga6.src.rpm
CVE:
Status comment:


Attachments

Description Bit Twister 2015-08-25 12:51:14 CEST
Description of problem:

systemd hangs forever when stopping a service

journalctl will show   Looping too fast. Throttling execution a little

Version-Release number of selected component (if applicable):


How reproducible: always


Steps to Reproduce:
1. click up a terminal
2. su - root
3. xterm journalctl -fa --nopager
4. urpmi  --auto --auto-update

I can not remember if problem showed up after reboot or not but when you see
Looping too fast. Throttling execution a little
anytime a service package install does a stop/restart.


I added /^systemd/ to /etc/urpmi/skip.list to prevent update and started over.
Problem did not happen when running systemd-217


Reproducible: 

Steps to Reproduce:
David Walser 2015-08-25 23:26:44 CEST

Assignee: bugsquad => mageia

Bit Twister 2015-09-05 16:57:06 CEST

Source RPM: systemd-224-3.mag6.x86_64.src.rpm => systemd-225-1.mga6.src.rpm

Comment 1 Bit Twister 2015-09-07 14:50:08 CEST
installing lib64atspi-gir2.0-2.17.90-2.mga6.x86_64.rpm lib64atspi0-2.17.90-2.mga6.x86_64.rpm lib64atk-gir1.0-2.16.0-2.mga6.x86_64.rpm lib64atk1.0-devel-2.16.0-2.mga6.x86_64.rpm at-spi2-core-2.17.90-2.mga6.x86_64.rpm colord-1.2.11-2.mga6.x86_64.rpm lib64atk1.0_0-2.16.0-2.mga6.x86_64.rpm atk1.0-common-2.16.0-2.mga6.x86_64.rpm lib64colord2-1.2.11-2.mga6.x86_64.rpm lib64atk-bridge2.0_0-2.17.90-1.mga6.x86_64.rpm lib64sane1-1.0.24-11.mga6.x86_64.rpm from /var/cache/urpmi/rpms
Preparing...                     #################
 440/2077: at-spi2-core          ###############################
 441/2077: lib64atspi0           #############################
 442/2077: lib64sane1            ###########################
 443/2077: lib64colord2          ##############################
 444/2077: colord                #######################
^Cwarning: %post(colord-1.2.11-2.mga6.x86_64) scriptlet failed, signal 2

Waited hours, before Ctrl c.

# systemctl status colord
  colord.service - Manage, Install and Generate Color Profiles
   Loaded: loaded (/usr/lib/systemd/system/colord.service; static)
   Active: inactive (dead)

colord.service, httpd, cups are normally disabled on my setup to speed shutdown.

Default run level is three
Comment 2 Colin Guthrie 2015-09-07 14:53:11 CEST
Yeah, there is definitely an issue in cauldron. There will be 226 coming very soon, and I'll debug more thoroughly then!
Comment 3 Bit Twister 2015-10-05 03:03:45 CEST
Temporary workaround is to abort urpmi when systemd starts spooling
"Looping too fast. Throttling execution a little" into the journal and reboot.
Comment 4 Colin Guthrie 2015-10-08 15:00:22 CEST
FWIW, there was an issue related to enabling/disabling units that had both systemd native units and legacy sysvinit scripts shipped (the latter being unnecessary in Mageia).

This is now fixed in newer chkconfig. I thought there was a bug report about it, but I can't seem to find it now - the was the closest one I found, so figured I'd mention it here.

Running 227 here so will push it shortly (needs new util-linux which I've also updated but not yet pushed).
Comment 5 Bit Twister 2015-10-08 16:54:24 CEST
(In reply to Colin Guthrie from comment #4)
> FWIW, there was an issue related to enabling/disabling units that had both
> systemd native units and legacy sysvinit scripts shipped (the latter being
> unnecessary in Mageia).
> 
> This is now fixed in newer chkconfig. I thought there was a bug report about
> it, but I can't seem to find it now - the was the closest one I found, so
> figured I'd mention it here.
> 
> Running 227 here so will push it shortly (needs new util-linux which I've
> also updated but not yet pushed).

Can't wait. Guessing it is not fixing the problem but will fix a symptom.
I hit control c to get past each systemd restart which indicates it is not just chkconfig scripts. Then again I bow to your knowledge of what is going on.

If it runs in your setup, turn it loose in cauldron and I'll run another mga5->mag6 upgrade.
Comment 6 Bit Twister 2015-10-13 22:01:35 CEST
last 2 rounds of updates no longer hang. Could be because install procedure changed. Now running step 19 in bug 16894 because my guess was systemd hanging because dbus never returned completion and systemd went into looping wait.

Status: NEW => RESOLVED
Resolution: (none) => FIXED

Comment 7 Colin Guthrie 2015-10-13 22:20:54 CEST
It could certainly have been the chkconfig thing. Either way, glad it's working now. I'm sure I'll have new problems for you soon tho' :)
Comment 8 Colin Guthrie 2015-10-16 15:13:56 CEST
So, FTR, I've also been seeing some problems with the "Looping too fast" thing recently. In systemd v227 it seems to show up for me pretty regularly related to jobs for starting and stopping various socket units related to rfkill switch handling. This has knock on effects and even blocks suspend, reboot, automounts and other types of jobs.

I reported my issue here:
https://github.com/systemd/systemd/issues/1579

but I suspect the root cause is the same as https://github.com/systemd/systemd/issues/1505 which has apparently existed since v225 but has become easier to trigger in v227.

Reverting an old commit (see the latter link) seems to "fix" things for me, although that's not the proper "fix" long term.
Comment 9 Bit Twister 2015-10-16 16:29:27 CEST
(In reply to Colin Guthrie from comment #8)
>
 but I suspect the root cause is the same as
> https://github.com/systemd/systemd/issues/1505 which has apparently existed
> since v225 but has become easier to trigger in v227.
> 
> Reverting an old commit (see the latter link) seems to "fix" things for me,
> although that's not the proper "fix" long term.

I can agree. plugging "Looping too fast. Throttling execution a little" into
https://encrypted.google.com/advanced_search seems to indicate to me there is a basic missed signal logic problem that have been there for awhile.

Also would not hurt for it to back off to a ~5 second rate instead of every second just to conserve journal disk space.

I have seen several intermittent service shutdown timeouts and have resorted to shutting down those services prior a system reboot/shutdown.

I have been trying to hook my shutdown_services script into the shutdown service chain with no luck so far. Snippet of my script follows:


            #****************************
            #* shutdown these apps first
            #****************************
            
    _servers="mythbackend named postfix httpd dovecot stunnel mysqld chronyd dhcpd"
    
    pkill -f thunderbird     # helps dovecot shutdown faster

    for _serv in  $_servers ; do
      _state=$(systemctl is-active $_serv.service 2> /dev/null)
      if [ $? -eq 0 ] ; then
        if [ "$_state" = "active" ] ; then
          echo "Shutting down $_serv"
          systemctl stop $_serv
          _state=$(systemctl is-active $_serv.service 2> /dev/null)
          while [ "$_state" != "inactive" ] ; do
            sleep 1
            _state=$(systemctl is-active $_serv.service 2> /dev/null)
          done
        fi
      fi
    done
Comment 10 Nikita Krupenko 2015-12-08 18:36:34 CET
I have this with systemd hands issue on Cauldron. This may be caused by suspend/resume. Should this report be reopened?

CC: (none) => krnekit

Comment 11 Colin Guthrie 2015-12-10 16:02:07 CET
(In reply to Nikita Krupenko from comment #10)
> I have this with systemd hands issue on Cauldron. This may be caused by
> suspend/resume. Should this report be reopened?

I certainly see it affecting suspend and resume but don't think it's specifically *caused* by it. Do you get the issue with "looping too fast" being logged every second in the journal?

When this happens can you do a "systemctl list-jobs" and let me know what it contains (I suspect it'll have some .socket units in there, specifically the rfkill socket jobs).

If so, I'm actively tracking this upstream.
Comment 12 Nikita Krupenko 2015-12-10 16:13:28 CET
Yes, I see this in journal. I'll try your command next time I'll
encounter this issue.

Ok, so I reopen it.

Status: RESOLVED => REOPENED
Resolution: FIXED => (none)

Comment 13 Nikita Krupenko 2015-12-12 14:03:47 CET
Hi! Here is the list of jobs:

  JOB UNIT                              TYPE  STATE  
55225 NetworkManager-dispatcher.service start waiting
55224 systemd-rfkill.socket             stop  waiting

2 jobs listed.

Network is working and connected through ethernet. The wi-fi is turned off right now.
Comment 14 Nikita Krupenko 2015-12-12 14:08:07 CET
Also, from journal:

Dec 12 13:49:34 localhost.localdomain systemd[1]: sys-devices-virtual-misc-rfkill.device: Job sys-devices-virtual-misc-rfkill.device/start timed out.
Dec 12 13:49:34 localhost.localdomain systemd[1]: Timed out waiting for device sys-devices-virtual-misc-rfkill.device.
Dec 12 13:49:34 localhost.localdomain systemd[1]: Dependency failed for Load/Save RF Kill Switch Status.
Dec 12 13:49:34 localhost.localdomain systemd[1]: systemd-rfkill.service: Job systemd-rfkill.service/start failed with result 'dependency'.
Dec 12 13:49:34 localhost.localdomain systemd[1]: systemd-rfkill.socket: Unit is bound to inactive unit sys-devices-virtual-misc-rfkill.device. Stopping, too.
Dec 12 13:49:34 localhost.localdomain systemd[1]: sys-devices-virtual-misc-rfkill.device: Job sys-devices-virtual-misc-rfkill.device/start failed with result 'timeout'.

And then it starts "looping too fast".

This happened after I resume suspended laptop.
Comment 15 Colin Guthrie 2015-12-14 10:52:53 CET
Yeah definitely the same issue I'm getting too. Trying to nail it but keep missing upstream folks.

It's something to do with reloading systemd (i.e. if you never call systemctl reload it doesn't seem to trigger - this will often happen as a result of installing/updating a package or enabling/disabling a unit etc).

Anyway it's being tracked here:

https://github.com/systemd/systemd/issues/1579

Not sure if it's Mageia Only, so will have to try some "purer" packages at some point (i.e. no downstream patches)
Comment 16 Nikita Krupenko 2015-12-14 13:04:02 CET
Yes, I usually found, that systemd hangs when installing new kernel (with broadcom-wl module) and post scriptlet hangs too.
Comment 17 Colin Guthrie 2015-12-14 13:38:32 CET
When this hang occurs it can (usually) be cleared out by running the following command in another window:

systemctl list-jobs | grep -vi JOB | sort -u | grep rfkill.*waiting | cut -d' ' -f1 | xargs systemctl cancel

Rubbish work around, but I try and keep my system in this state a while so I can do various debugging (none of which is useful it seems) and this at least lets me carry on.
Comment 18 Nikita Krupenko 2015-12-14 14:26:11 CET
I'll try this next time.
Usually, it makes unable for my system to reboot/power off (and sleep too) as it is done now by systemd, so I'm using SysRq keys for this :)
Comment 19 Colin Guthrie 2015-12-14 14:49:25 CET
FWIW, a better workaround is to mask the relevant units. You lose the loading/saving of rfkill status but that's hardly a major headache compared to the bug!

systemctl mask systemd-rfkill.service
systemctl mask systemd-rfkill.socket
systemctl mask sys-devices-virtual-misc-rfkill.device

At some point you'll want to unmask them again, but this will make general operation much nicer in the mean time!
Florian Hubold 2015-12-14 21:40:07 CET

CC: (none) => doktor5000

Comment 20 Bit Twister 2016-07-06 20:17:31 CEST
Problem no longer occurs.

Colin will have to decide if this bug is resolved.

Summary: mga6: systemd hangs forever when stopping a service => 6_s1: systemd hangs forever when stopping a service

Comment 21 Colin Guthrie 2016-08-31 15:28:04 CEST
Seems my rfkill issue also has a fix upstream now too (which I've applied). This could very easily have been a completely separate issue, so if you think it's fixed and I think the rfkill issue is fixed, then let's call this bug fixed!

Status: REOPENED => RESOLVED
Resolution: (none) => FIXED


Note You need to log in before you can comment on or make changes to this bug.